4.1. Concepts#

Algorithms are executed at the (vantage6-)node. The node receives a computation task from the vantage6-server. The node will then retrieve the algorithm, execute it and return the results to the server.

Algorithms are shared using Docker images which are stored in a Docker image registry which is accessible to the nodes. In the following sections we explain the fundamentals of algorithm containers.

  1. Input & output Interface between the node and algorithm container

  2. Wrappers Library to simplify and standardized the node-algorithm input and output

  3. Child containers Creating subtasks from an algorithm container

  4. Networking Communicate with other algorithm containers and the vantage6-server

  5. Cross language Cross language data serialization

  6. Package & distribute Packaging and shipping algorithms

4.1.1. Input & output#

The algorithm runs in an isolated environment within the data station (node). As it is important to limit the connectivity and accessability for obvious security reasons. In order for the algorithm to do its work, it is provided with several resources.

Note

This section describes the current process. Keep in mind that this is subjected to be changed. For more information, please see this Github

Environment variables#

The algorithms have access to several environment variables, see Environment variables. These can be used to locate certain files or to add local configuration settings into the container.

Table 4.1 Environment variables#

Variable

Description

INPUT_FILE

path to the input file. The input file contains the user defined input for the algorithms.

TOKEN_FILE

Path to the token file. The token file contains a JWT token which can be used to access the vantage6-server. This way the algorithm container is able to post new tasks and retrieve results.

TEMPORARY_FOLDER

Path to the temporary folder. This folder can be used to store intermediate results. These intermediate results are shared between all containers that have the same run_id. Algorithm containers which are created from an algorithm container themselves share the same run_id.

HOST

Contains the URL to the vantage6-server.

PORT

Contains the port to which the vantage6-server listens. Is used in combination with HOST and API_PATH.

API_PATH

Contains the api base path from the vantage6-server.

[*]_DATABASE_URI

Contains the URI of the local database. The * is replaced by the key specified in the node configuration file.

Note

Additional environment variables can be specified in the node configuration file using the algorithm_env key. These additional variables are forwarded to all algorithm containers.

File mounts#

The algorithm container has access to several file mounts.

Input

The input file contains the user defined input. The user specifies this when a task is created.

Output

The algorithm should write its output to this file. When the docker container exits the contents of this file will be send back to the vantage6-server.

Token

The token file contains a JWT token which can be used by the algorithm to communicate with the central server. The token can only be used to create a new task with the same image, and is only valid while the task has not yet been completed.

Temporary directory

The temporary directory can be used by an algorithm container to share files with other algorithm containers that:

  • run on the same node

  • have the same run_id

Algorithm containers that origin from another container (a.k.a master container or parent container) share the same run_id. i.o. if a user creates a task a new run_id is assigned.

The paths to these files and directories are stored in the environment variables, which we will explain now.

4.1.2. Wrappers#

The algorithm wrapper simplifies and standardizes the interaction between algorithm and node. The client libraries and the algorithm wrapper are tied together and use the same standards. The algorithm wrapper:

  • reads the environment variables and file mounts and supplies these to your algorithm.

  • provides an entrypoint for the docker container

  • allows to write a single algorithm for multiple types of data sources

The wrapper is language specific and currently we support Python and R. Extending this concept to other languages is not so complex.

../_images/algorithm_wrapper.png

Fig. 4.1 The algorithm wrapper handles algorithm input and output.#

Federated function#

The signature of your function has to contain data as the first argument. The method name should have a RPC_ prefix. Everything that is returned by the function will be written to the output file.

Python:

def RPC_my_algorithm(data, *args, **kwargs):
    pass

R:

RPC_my_algorithm <- function(data, ...) {
}

Central function#

It is quite common to have a central part of your federated analysis which orchestrates the algorithm and combines the partial results. A common pattern for a central function would be:

  1. Request partial models from all participants

  2. Obtain the partial models

  3. Combine the partial models to a global model

  4. (optional) Repeat step 1-3 until the model converges

It is possible to run the central part of the analysis on your own machine, but it is also possible to let vantage6 handle the central part. There are several advantages to letting vantage6 handle this:

  • You don’t have to keep your machine running during the analysis

  • You don’t need to use the same programming language as the algorithm in case a language specific serialization is used in the algorithm

Note

Central functions also run at a node and not at the server.

In contrast to the federated functions, central functions are not prefixed. The first argument needs to be client and the second argument needs to be data. The data argument contains the local data and the client argument provides an interface to the vantage6-server.

Warning

The argument data is not present in the R wrapper. This is a consistency issue which will be solved in a future release.

Example central function in Python
def main(client, data, *args, **kwargs):
   # Run a federated function. Note that we omnit the
   # RPC_ prefix. This prefix is added automatically
   # by the infrastructure
   task = client.create_new_task(
      {
         "method": "my_algorithm",
         "args": [],
         "kwargs": {}
      },
      organization_ids=[...]
   )

    # wait for the federated part to complete
    # and return
    results = wait_and_collect(task)

    return results

Example central function in R
main <- function(client, ...) {
    # Run a federated function. Note that we omnit the
    # RPC_ prefix. This prefix is added automatically
    # by the infrastructure
    result <- client$call("my_algorithm", ...)

    # Optionally do something with the results

    # return the results
    return(result)
}

Different wrappers#

The docker wrappers read the local data source and supplies this to your functions in your algorithm. Currently CSV and SPARQL for Python and a CSV wrapper for R is supported. Since the wrapper handles the reading of the data, you need to rebuild your algorithm with a different wrapper to make it compatible with a different type of data source. You do this by updating the CMD directive in the dockerfile.

CSV wrapper (Python)

...
CMD python -c "from vantage6.tools.docker_wrapper import docker_wrapper; docker_wrapper('${PKG_NAME}')"

CSV wrapper (R)

...
CMD Rscript -e "vtg::docker.wrapper('$PKG_NAME')"

SPARQL wrapper (Python)

...
CMD python -c "from vantage6.tools.docker_wrapper import sparql_wrapper; sparql_wrapper('${PKG_NAME}')"

Parquet wrapper (Python)

...
CMD python -c "from vantage6.tools.docker_wrapper import parquet_wrapper; parquet_wrapper('${PKG_NAME}')"

Data serialization#

TODO

4.1.3. Mock client#

TODO

4.1.4. Child containers#

When a user creates a task, one or more nodes spawn an algorithm container. These algorithm containers can create new tasks themselves.

Every algorithm is supplied with a JWT token (see Input & output). This token can be used to communicate with the vantage6-server. In case you use a algorithm wrapper, you simply can use the supplied Client in case you use a Central function.

A child container can be a parent container itself. There is no limit to the amount of task layers that can be created. It is common to have only a single parent container which handles many child containers.

../_images/container_hierarchy.png

Fig. 4.2 Each container can spawn new containers in the network. Each container is provided with a unique token which they can use to communicate to the vantage6-server.#

The token to which the containers have access supplies limited permissions to the container. For example, the token can be used to create additional tasks, but only in the same collaboration, and using the same image.

4.1.5. Networking#

The algorithm container is deployed in an isolated network to reduce their exposure. Hence, the algorithm it cannot reach the internet. There are two exceptions:

  1. When the VPN feature is enabled on the server all algorithm containers are able to reach each other using an ip and port over VPN.

  2. The central server is reachable through a local proxy service. In the algorithm you can use the HOST, POST and API_PATH to find the address of the server.

Note

We are working on a whitelisting feature which allows a node to configure addresses that the algorithm container is able to reach.

VPN connection#

Algorithm containers can expose one or more ports. These ports can then be used by other algorithm containers to exchange data. The infrastructure uses the Dockerfile from which the algorithm has been build to determine to which ports are used by the algorithm. This is done by using the EXPOSE and LABEL directives.

For example when an algorithm uses two ports, one port for communication com and one port for data exchange data. The following block should be added to you algorithm Dockerfile:

# port 8888 is used by the algorithm for communication purposes
EXPOSE 8888
LABEL p8888 = "com"

# port 8889 is used by the algorithm for data-exchange
EXPOSE 8889
LABEL p8889 = "data"

Port 8888 and 8889 are the internal ports to which the algorithm container listens. When another container want to communicate with this container it can retrieve the IP and external port from the central server by using the result_id and the label of the port you want to use (com or data in this case)

4.1.6. Cross language#

Because algorithms are exchanged through Docker images they can be written in any language. This is an advantage as developers can use their preferred language for the problem they need to solve.

Warning

The wrappers are only available for R and Python, so when you use different language you need to handle the IO yourself. Consult the Input & Output section on what the node supplies to your algorithm container.

When data is exchanged between the user and the algorithm they both need to be able to read the data. When the algorithm uses a language specific serialization (e.g. a pickle in the case of Python or RData in the case of R) the user needs to use the same language to read the results. A better solution would be to use a type of serialization that is not specific to a language. For our wrappers we use JSON for this purpose.

Note

Communication between algorithm containers can use language specific serialization as long as the different parts of the algorithm use the same language.

4.1.7. Package & distribute#

Once the algorithm is completed it needs to be packaged and made available for retrieval by the nodes. The algorithm is packaged in a Docker image. A Docker image is created from a Dockerfile, which acts as blue-print. Once the Docker image is created it needs to be uploaded to a registry so that nodes can retrieve it.

Dockerfile#

A minimal Dockerfile should include a base image, injecting your algorithm and execution command of your algorithm. Here are several examples:

Example Dockerfile
# python3 image as base
FROM python:3

# copy your algorithm in the container
COPY . /app

# maybe your algorithm is installable.
RUN pip install /app

# execute your application
CMD python /app/app.py

Example Dockerfile with Python wrapper

When using the Python Wrappers, the Dockerfile needs to follow a certain format. You should only change the PKG_NAME value to the Python package name of your algorithm.

# python vantage6 algorithm base image
FROM harbor.vantage6.ai/algorithms/algorithm-base

# this should reflect the python package name
ARG PKG_NAME="v6-summary-py"

# install federated algorithm
COPY . /app
RUN pip install /app

ENV PKG_NAME=${PKG_NAME}

# Tell docker to execute `docker_wrapper()` when the image is run.
CMD python -c "from vantage6.tools.docker_wrapper import docker_wrapper; docker_wrapper('${PKG_NAME}'

Note

When using the python wrapper your algorithm file needs to be installable. See here for more information on how to create a python package.


Example Dockerfile with R wrapper

When using the R Wrappers, the Dockerfile needs to follow a certain format. You should only change the PKG_NAME value to the R package name of your algorithm.

# The Dockerfile tells Docker how to construct the image with your algorithm.
# Once pushed to a repository, images can be downloaded and executed by the
# network hubs.
FROM harbor2.vantage6.ai/base/custom-r-base

# this should reflect the R package name
ARG PKG_NAME='vtg.package'

LABEL maintainer="Main Tainer <m.tainer@vantage6.ai>"

# Install federated glm package
COPY . /usr/local/R/${PKG_NAME}/

WORKDIR /usr/local/R/${PKG_NAME}
RUN Rscript -e 'library(devtools)' -e 'install_deps(".")'
RUN R CMD INSTALL --no-multiarch --with-keep.source .

# Tell docker to execute `docker.wrapper()` when the image is run.
ENV PKG_NAME=${PKG_NAME}
CMD Rscript -e "vtg::docker.wrapper('$PKG_NAME')"

Note

Additional Docker directives are needed when using direct communication between different algorithm containers, see Networking.

Build & upload#

If you are in the folder containing the Dockerfile, you can build the project as follows:

docker build -t repo/image:tag .

The -t indicated the name of your image. This name is also used as reference where the image is located on the internet. If you use Docker hub to store your images, you only specify your username as repo followed by your image name and tag: USERNAME/IMAGE_NAME:IMAGE_TAG. When using a private registry repo should contain the URL of the registry also: e.g. harbor2.vantage6.ai/PROJECT/IMAGE_NAME:TAG.

Then you can push you image:

docker push repo/image:tag

Now that is has been uploaded it is available for nodes to retrieve when they need it.

Signed images#

It is possible to use the Docker the framework to create signed images. When using signed images, the node can verify the author of the algorithm image adding an additional protection layer.

Dockerfile

  • Build project

  • CMD

  • Expose