Introduction:
=============

This document provides information on the implementation side
design decisions and the blueprint of the implementation itself.


Directory structure:
====================

* /etc/guerillabackup: This is the default configuration directory.
  * config: This is the main GuerillaBackup configuration file.
    Settings can be overridden e.g. in unit configuration files.
  * keys: This directory is the default backup encryption key
    location. Currently this is the home directory of a GnuPG
    key store.
  * lib-enabled: This directory is included in the site-path by
    default. Add symbolic links to include specific Python packages
    or machine/organisation specific code.
  * units: The units directory contains the enabled backup data
    generation units. To enable a unit, a symbolic link to the
    unit definition file has to be created. The name of the symbolic
    link has to consist only of letters and numbers. For units with
    an associated configuration file named "[unitname].config",
    configuration parameters from the main configuration file
    can be overridden within the unit-specific configuration.
* /var/lib/guerillabackup: This directory is usually only readable
  by root user unless transfer agents with different UID are configured.
  * data: where backuped data from local backups is stored, usually
    by the default sink.
  * state: State persistency directory for all backup procedures.
  * state/generators/[UnitName]: File or directory to store state
    data for a given backup unit.
  * state/agents: Directory to store additional information of
    local backup data processing or remote transfer agents.
* /run/guerillabackup: This directory is used to keep data, only
  needed while guerillabackup tools are running. This data can
  be discarded on reboot.
  * transfer.socket: Default socket location for "gb-transfer-service".


Library functions:
==================

* Configuration loading:

Configuration loading happens in 2 stages:

  * Loading of the main configuration.
  * Loading of a component/module specific overlay configuration.
    This is allows tools to perform modularised tasks, e.g. a backup
    generator processing different sources, to apply user-defined
    configuration alterations to the configuration of a single unit.
    The overlay configuration is then merged with the main configuration.

Defaults have to be set in the main configuration. A tool may
refuse to start when required default values are missing in the
configuration.


Backup Generator:
=================

* Process pipelines:

Pipeline implementation is designed to support both operating
system processes using only file descriptors for streaming,
pure Python processes, that need to be run in a separate thread
or are polled for normal operation and a mixture of both, i.e.
only one of input or output is an operating system pipe. Because
of the last case, a pipeline instance may behave like a synchronous
component at the beginning until the synchronous input was processed
and the end of input data was reached. From that moment on till
all processing is finished it behaves like an asynchronous component.

  * State handling:

A pipeline instance may only change its state when the doProcess()
or isRunning() method is called. It is forbidden to invoke doProcess()
on an instance not running any more. Therefore after isRunning()
returned true, it is save to call doProcess().

  * Error handling:

Standard way to get processing errors is by calling the doProcess
method, even when process is asynchronous. On error, the method
should always return the same error message for a broken process
until stop() is called.

One error variant is, that operating system processes did not
read all input from their input pipes and some data remains in
buffers. This error has to be reported to the caller either from
doProcess() or stop(), whatever comes first. The correct detection
of unprocessed input data may fail, if a downstream component
is stopped while the upstream is running and writing data to
a pipe after the checks.

  * Pipeline element implementation:

    * DigestPipelineElement:

A synchronous element creating a checksum of all data passing
through it.

    * GpgEncryptionPipelineElement:

A pipeline element returning a generic OSProcessPipelineExecutionInstance
to perform encryption using GnuPG.

    * OSProcessPipelineElement:

This pipeline element will create an operating system level process
wrapped in a OSProcessPipelineExecutionInstance to perform the
transformation. The instance may be fully asynchronous when it
is connected only by operating system pipes but is synchronous
when connected via stdin/stdout data moving.


Policy Based Data Synchronisation:
==================================

To support various requirements, e.g. decentralised backup generation
with secure secure spooling, asynchronous transfers, a the "gb-transfer-service",
a component for synchronisation is required, see "doc/Design.txt"
section "Synchronisation" for design information.

The implementation of "gb-transfer-service" orchestrates all components
required related to following functional blocks:

* ConnectorService: This service provides functions to establish
  connectivity to other "gb-transfer-service" instances. Currently
  only "SocketConnectorService" together with protocol handler
  "JsonStreamServerProtocolRequestHandler" is supported. The
  service has to care about authentication and basic service
  access authorisation.

* Policies: Policies define, how the "gb-transfer-service" should
  interact with other "gb-transfer-service" instances. There are
  two types of policies, "ReceiverTransferPolicy" for incoming
  transfers and "SenderTransferPolicy" for transmitting data.
  See "ReceiverStoreDataTransferPolicy", "SenderMoveDataTransferPolicy",
  for currently supported policies.

* Storage: A storage to store, fetch and delete StorageBackupDataElements.
  Some storages may support storing of custom annotation data
  per element. This can then be used in policies to perform policy
  decisions, e.g. to prioritise sending of files according to tags.
  Current storage implementation is "DefaultFileStorage".

* TransferAgent: The agent keeps track of all current connections
  created via the ConnectorService. It may control load balancing
  between multiple connections. Current available agent implementation
  is "SimpleTransferAgent".


Classes and interfaces:

* ClientProtocolInterface:

Classes implementing this interface are passed to the TransferAgent
by the ConnectorService to allow outbound calls to the other agent.

* ConnectorService:

A service to establish in or outbound connections to an active
TransferAgent. Implementation will vary depending on underlying
protocol, e.g. TCP, socket, ... and authentication type, which
is also handled by the ConnectorService.

* DefaultFileStorage:

This storage implementation stores all relevant information on
the filesystem, supporting locking and extra attribute handling.
It uses the element name to create the storage file names, appending
"data", "info" or "lock" to it for content, meta information
storage and locking. Extra attribute data is stored in by using
the attribute name as file extension. Thus extensions from above
but also ones containing dashes or dots are not allowed.

* JsonStreamServerProtocolRequestHandler:

This handler implements a minimalistic JSON protocol to invoke
ServerProtocolInterface methods. See "doc/Design.txt" section
"Transmission protocol" for protocol design information.

* ReceiverStoreDataTransferPolicy:

This class defines a receiver policy, that attempts to fetch all
data elements offered by the remote transfer agent.

* ReceiverTransferPolicy:

This is the common superinterface of all receiver transfer policies.

* ServerProtocolInterface:

This is the server side protocol adaptor to be provided to the
transfer service to forward remote requests to the local SenderPolicy.

* SenderMoveDataTransferPolicy(SenderTransferPolicy):

This is a simple sender transfer policy just advertising all resources
for transfer and removing them or marking them as transferred as
soon as remote side confirms successful transfer. A file with a
mark will not be offered for download any more.
  * applyPolicy(): deletes the file when transfer was successful.

* SenderTransferPolicy:

This is the common superinterface of all sender side transfer
policies. A policy implementation may require to adjust the internal
state after data was transferred.
  * queryBackupDataElements(): return an iterator over all elements
    eligible for transfer by the current policy. The query may
    support remote side supplied query data for optimisation.
    This should of course only be used when the remote side knows
    the policy.
  * applyPolicy(): update internal state after data transfer
    was rejected, attempted or even successful.

* SocketConnectorService:

This is currently the only ConnectorService available. It accepts
incoming connections on a local UNIX socket. Authentication and
socket access authorisation has to be handled UNIX permissions
or integration with other tools, e.g. "socat". For each incoming
connection it uses a "JsonStreamServerProtocolRequestHandler"
protocol handler.

* TransferAgent:

This class provides the core functionality for in and outbound
transfers. It has a single sender or receiver transfer policy
or both attached. Protocol connections are attached to it using
a ConnectorService. The agent does not care about authentication
any more: everything relevant for authorisation has to be provided
by the ConnectorService and stored to the TransferContext.


gb-storage-tool:
============:

The workflow described in "Design.txt" is implemented as such:

* Load the master configuration and all included configurations:

This really creates just the "StorageConfig" and validates that
directories or files exist referenced by the configuration exist.
It also loads the "StorageStatus" (if any) but does not validate
any entries inside it.

A subconfiguration is only loaded after initialisation of the
parent configuration was completed.

See "StorageTool.loadConfiguration", "StorageConfig.__init__".

* Locate the storage data and meta information files in each
  configuration data directory:

This step is done for the complete configuration tree with main
configuration as root and all included subconfiguration branches
and leaves. Starting with the most specific (leaf) configurations,
all files in the data directory are listed, associated with the
current configuration loading it. After processing of leaves
the same is done for the branch configuration including the leaves,
thus adding files not yet covered by the leaves and branches.

When all available storage files are known, they are grouped
into backup data elements (data and metadata) before grouping
also elements from the same source.

See "StorageTool.initializeStorage", "StorageConfig.initializeStorage".

* Check for applicable but yet unused policy templates and apply
  them to files matching the selection pattern:
