The project pipeline

This section covers how to develop, run, and test your code to ensure it will work end-to-end within the secure framework.

Project pipelines

The ehrQL documentation describes how to make an action which generates dummy datasets based on the instructions defined in your dataset_definition.py script. These dummy datasets are the basis for developing the analysis code that will eventually be passed to the server to run on real datasets. The code can be written and run on your local machine using whatever development setup you prefer (e.g., developing R in RStudio). However, it's important to ensure that this code will run successfully in OpenSAFELY's secure environment too, using the specific language and package versions that are installed there. To do this, you should use the project pipeline.

The project pipeline, defined entirely in a project.yaml file, is a system for executing your code as a series of actions. Each action is a discrete analytical step within the analysis, and may depend on previous actions.

The primary purpose of the pipeline is to specify the execution order for all your code, so that it can be automatically run and tested from start to finish, both on dummy data and on the live database in the secure environment, with an identical software configuration in each case. Arranging your code like this also has several other advantages:

The pipeline knows if outputs for given actions already exist, and by default skips running them if so. This greatly speeds up the debugging cycle when testing against live data.
In production, actions that can be executed in parallel are automatically run in parallel.
Thinking about your analysis in terms of actions makes it more readable and therefore easier to review and test. For example, being explicit about what the inputs and outputs of each action are ensures you don't overwrite files by accident.
The pipeline forces you to declare which outputs may be more or less disclosive.
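
The points about dependencies and parallel execution can be illustrated with a short Python sketch. This is not how OpenSAFELY schedules jobs internally. It only assumes that each action lists the actions it needs, and uses the standard library's graphlib module to group actions into batches whose members do not depend on each other. All action names here are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each action maps to the set of actions it needs.
needs = {
    "generate_dataset": set(),
    "describe_cohort": {"generate_dataset"},
    "run_model": {"generate_dataset"},
    "make_report": {"describe_cohort", "run_model"},
}


def parallel_batches(needs):
    """Group actions into batches; actions in one batch do not depend on each other."""
    sorter = TopologicalSorter(needs)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are circular
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make the output stable
        batches.append(ready)
        sorter.done(*ready)
    return batches


for number, batch in enumerate(parallel_batches(needs), start=1):
    print(number, batch)
# 1 ['generate_dataset']
# 2 ['describe_cohort', 'run_model']
# 3 ['make_report']
```

In this sketch, describe_cohort and run_model land in the same batch because neither needs the other, which is the situation in which actions could be run in parallel.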

`project.yaml` format

The project pipeline is defined in a single file, project.yaml, which lives in the repository's root directory. It is written using a configuration format called YAML, which uses indentation to indicate groupings of related variables.

A simple example of a project.yaml is as follows:

```
version: "4.0"

actions:
  generate_dataset:
    run: ehrql:v1 generate-dataset analysis/dataset_definition.py --output output/dataset.csv.gz
    outputs:
      highly_sensitive:
        dataset: output/dataset.csv.gz

  run_model:
    run: stata-mp:latest analysis/model.do
    needs: [generate_dataset]
    outputs:
      moderately_sensitive:
        model: models/cox-model.txt
        figure: figures/survival-plot.png
```
This example declares the pipeline version, and two actions: generate_dataset and run_model.

You only need to change version if you want to take advantage of features of newer versions of the pipeline framework.

The generate_dataset action will create the highly sensitive dataset.csv.gz dataset. It will be dummy data when run locally, and will be based on real data from the OpenSAFELY database when run in the secure environment. The run_model action will run a Stata script called model.do, using the dataset.csv.gz created by the previous action. It will output two moderately sensitive files, cox-model.txt and survival-plot.png, which can be checked and released if appropriate.

Every project.yaml requires a version and an actions section. In general, actions are composed as follows:

Each action must be named using a valid YAML key (you won't go wrong with letters, numbers, and underscores) and must be unique.
Each action must include a run key which includes an officially-supported command and a version (which at present is usually just latest).
- The ehrql command has the same options as described in the ehrQL reference.
- The python, r, and stata-mp commands provide a locked-down execution environment that can take one or more inputs which are passed to the code.
Each action must include an outputs key with at least one output, classified as either highly_sensitive or moderately_sensitive.
- highly_sensitive outputs are considered potentially highly-disclosive, and are never intended for publishing outside the secure environment. This includes all data at the pseudonymised patient-level. Outputs labelled highly_sensitive will not be visible to researchers.
- moderately_sensitive outputs should never include patient-level data, only data that is considered non-disclosive. This includes aggregated patient-data outputs such as summary tables, summary statistics and the outputs from statistical models. For a full list, check the allowed file types subsection. The appropriate statistical disclosure controls should have been applied to these files. They are copied to the secure review area (otherwise known as Level 4).
- Outputs should be listed on separate lines, each with a unique 'key', but related outputs can be combined using a wildcard (*). When using a wildcard, it is extremely important to ensure that no highly_sensitive outputs are included. For example:
```
    outputs:
      moderately_sensitive:
        table: output/summary_results.txt
        survival_figure: output/figures/survival-plot.png
        time_series_figures: output/figures/time_series_*.png
```
- Keys serve only as a human-readable description of the outputs, and are ignored when the job is run.
Each action can include a needs key which specifies a list of actions (contained within square brackets and separated by commas) that must have run for it to run successfully. When an action runs, the outputs of all its needs actions are copied to its working directory. The actions listed under needs can be defined anywhere in the project.yaml, but it's more readable if they are defined above the action that needs them.
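
The rules above can be summarised as a small checker. This is a minimal sketch, not the validation that the OpenSAFELY tooling performs: it assumes the project.yaml has already been parsed into a plain Python dict, and the function name check_project and its messages are hypothetical. It cannot check that action names are unique, because a dict has already collapsed duplicate keys.

```python
SENSITIVITY_LEVELS = ("highly_sensitive", "moderately_sensitive")


def check_project(project):
    """Return a list of problems found in a parsed project.yaml (a plain dict)."""
    problems = []
    for section in ("version", "actions"):
        if section not in project:
            problems.append(f"missing top-level '{section}' section")
    actions = project.get("actions", {})
    for name, action in actions.items():
        if "run" not in action:
            problems.append(f"{name}: missing 'run' key")
        outputs = action.get("outputs", {})
        unknown = sorted(set(outputs) - set(SENSITIVITY_LEVELS))
        if unknown:
            problems.append(f"{name}: unknown output level(s) {unknown}")
        if not any(outputs.get(level) for level in SENSITIVITY_LEVELS):
            problems.append(f"{name}: needs at least one output")
        for needed in action.get("needs", []):
            if needed not in actions:
                problems.append(f"{name}: needs undefined action '{needed}'")
    return problems


# The example pipeline from this section, written as the dict a YAML parser would give.
example = {
    "version": "4.0",
    "actions": {
        "generate_dataset": {
            "run": "ehrql:v1 generate-dataset analysis/dataset_definition.py --output output/dataset.csv.gz",
            "outputs": {"highly_sensitive": {"dataset": "output/dataset.csv.gz"}},
        },
        "run_model": {
            "run": "stata-mp:latest analysis/model.do",
            "needs": ["generate_dataset"],
            "outputs": {
                "moderately_sensitive": {
                    "model": "models/cox-model.txt",
                    "figure": "figures/survival-plot.png",
                }
            },
        },
    },
}

# A deliberately broken variant: no outputs, and a dependency that is not defined.
broken = {
    "version": "4.0",
    "actions": {
        "run_model": {
            "run": "stata-mp:latest analysis/model.do",
            "needs": ["generate_dataset"],
            "outputs": {},
        }
    },
}

print(check_project(example))  # []
print(check_project(broken))
```

For the broken variant, the sketch reports that run_model has no outputs and that it needs an action which is not defined.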

When writing and running your pipeline, note that:

All file paths must be declared relative to the repository's root directory. For example, use outputs/figures/, not C:/users/elvis/documents/myrepo/outputs/figures.
File paths are case-sensitive as everything is run inside a Linux Docker container.
The location of each action's output is determined by the underlying code that the action invokes, not by the value of the outputs configuration. The purpose of outputs is to label the disclosivity of each output and to indicate that it should be stored securely; any outputs not labelled will not be saved.
Each action is run in its own isolated environment in a temporary working directory. This means that all the necessary libraries and data must be imported within the script for each action. For R users, this essentially means that R is restarted for each action.
If one or more dependencies of an action have not been run (i.e., their outputs do not exist), these dependency actions will be run first. If a dependency has changed but has not been re-run (so its outputs are not up-to-date with the changes), the dependency will not be re-run automatically, and the dependent action will be run using the out-of-date outputs.
The ordering of columns may not be consistent between the dummy data and the TPP/EMIS backend. You should avoid referring to columns by their integer positions and instead use the index / column names. Using names is more robust to different versions of ehrQL and also avoids problems caused by integer positions changing as columns are added or removed.
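
The last point, about column ordering, is easiest to see with a small example. The following Python sketch uses two hypothetical extracts (the column names patient_id, age and sex are made up for illustration) that hold the same data with the columns in a different order, standing in for dummy data and backend output. Reading by name gives the same answer for both; reading by position does not.

```python
import csv
import io

# Two hypothetical extracts with the same columns in a different order.
dummy_csv = "patient_id,age,sex\n1,34,F\n2,61,M\n"
backend_csv = "patient_id,sex,age\n1,F,34\n2,M,61\n"


def ages_by_name(text):
    """Read the 'age' column by name, so column order does not matter."""
    return [int(row["age"]) for row in csv.DictReader(io.StringIO(text))]


def ages_by_position(text):
    """Fragile: assumes 'age' is always the second column."""
    rows = list(csv.reader(io.StringIO(text)))
    return [row[1] for row in rows[1:]]  # skip the header row


print(ages_by_name(dummy_csv), ages_by_name(backend_csv))
# [34, 61] [34, 61]
print(ages_by_position(dummy_csv), ages_by_position(backend_csv))
# ['34', '61'] ['F', 'M']
```

The same principle applies in R or Stata: select variables by name rather than by their position in the dataset.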

Running your code locally

Whilst you can develop and run code locally using your own installations of R, Stata or Python, it's important to check that your code will also run successfully in an execution environment identical to the one used for the real data.

The opensafely run command will execute one or more actions according to the project.yaml. To see its options, type opensafely run --help.

For opensafely run to work:

You need to have both Python and Docker installed.
The Docker daemon must be running on your machine:
- For Windows users using Docker Desktop, there should be a Docker icon in your system tray.
- For Mac users using Docker Desktop, there should be a Docker icon in the top status bar.

To run the first action in the example above, using dummy data, you can use:

opensafely run generate_dataset

This will generate the dataset.csv.gz file as explained in the ehrQL documentation.

To run the second action you can use:

opensafely run run_model

It will create the two files as specified in the analysis/model.do script.

To force the dependencies to be run, you can use, for example, opensafely run run_model --force-run-dependencies (or -f for short). This ensures that both the run_model and generate_dataset actions are run, even if dataset.csv.gz already exists.
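
The effect of forcing dependencies can be sketched in Python as follows. This is an illustration of the behaviour described in this section, not the actual logic of opensafely run. The function actions_to_run and its parameters are hypothetical, and it assumes that the action you ask for is always run, while a dependency is skipped when its outputs already exist unless dependencies are forced.

```python
def actions_to_run(needs, requested, existing_outputs, force_dependencies=False):
    """Return the actions that would run, dependencies first.

    needs: dict mapping each action to the list of actions it needs
    requested: the action named on the command line
    existing_outputs: set of actions whose outputs already exist
    Assumption: the requested action itself is always run.
    """
    ordered = []
    seen = set()

    def visit(action, is_requested):
        if action in seen:
            return
        seen.add(action)
        reusable = action in existing_outputs and not force_dependencies
        if reusable and not is_requested:
            return  # outputs exist, so this dependency is not run again
        for dependency in needs.get(action, []):
            visit(dependency, False)
        ordered.append(action)

    visit(requested, True)
    return ordered


needs = {"generate_dataset": [], "run_model": ["generate_dataset"]}

# dataset.csv.gz already exists: only the requested action runs
print(actions_to_run(needs, "run_model", {"generate_dataset"}))
# ['run_model']

# the same situation with dependencies forced
print(actions_to_run(needs, "run_model", {"generate_dataset"}, force_dependencies=True))
# ['generate_dataset', 'run_model']

# no outputs exist yet: the dependency runs first
print(actions_to_run(needs, "run_model", set()))
# ['generate_dataset', 'run_model']
```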

To run all actions, you can use a special run_all action which is created for you (no need to define it in your project.yaml):

opensafely run run_all

Each time an action is run, logging information about your run will be put into the metadata/ folder. If any of your actions fail, you may find clues here as to why.

The exact steps that occur when each job is run locally are as follows:

A new, empty temporary directory for the job is created
Any files in the local repo that do not match the output patterns in the project.yaml are copied into the temporary folder
Any output files from the job's dependencies are copied into the temporary folder
The job is run
All the files matching the specified output patterns are copied into the local repo
The log files for the job are saved into the metadata/ directory
The temporary directory is deleted
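
The steps above can be sketched in Python. This is a simplified model under stated assumptions, not the real implementation: the job is a plain Python function standing in for the Docker container, output patterns are matched with the standard library's fnmatch (whose wildcard rules may differ from the real ones), the log file name is assumed to follow the action name, and all function and parameter names are hypothetical.

```python
import fnmatch
import shutil
import tempfile
from pathlib import Path


def matches_any(relative_path, patterns):
    """True if the path matches at least one output pattern (fnmatch-style)."""
    return any(fnmatch.fnmatch(relative_path, pattern) for pattern in patterns)


def copy_file(source_root, target_root, relative_path):
    """Copy one file between directory trees, creating parent folders as needed."""
    target = target_root / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source_root / relative_path, target)


def files_under(root):
    """Relative POSIX-style paths of all files below root."""
    return [p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()]


def run_job_locally(repo, name, job, job_patterns, all_patterns, dependency_outputs):
    """Run one job; all_patterns covers the outputs of every action in the project."""
    repo = Path(repo)
    # Step 1: a new, empty temporary directory (step 7: deleted when the block exits)
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        # Step 2: copy repo files that do not match any output pattern
        for relative in files_under(repo):
            if not matches_any(relative, all_patterns):
                copy_file(repo, workdir, relative)
        # Step 3: copy the output files of this job's dependencies
        for relative in dependency_outputs:
            copy_file(repo, workdir, relative)
        # Step 4: run the job (a function here, a container in reality)
        log_text = job(workdir)
        # Step 5: copy files matching this job's output patterns back to the repo
        for relative in files_under(workdir):
            if matches_any(relative, job_patterns):
                copy_file(workdir, repo, relative)
        # Step 6: save the log into metadata/ (file name is an assumption)
        log_path = repo / "metadata" / f"{name}.log"
        log_path.parent.mkdir(exist_ok=True)
        log_path.write_text(log_text)
```

One consequence visible in this sketch is that a file the job writes without a matching output pattern never leaves the temporary directory, which is why unlabelled outputs are not saved.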

Running your code with GitHub Actions

Every time you create a pull request to merge a development branch onto the main remote branch, GitHub will automatically run a series of tests on the code. Specifically, the tests check that your codelists are up-to-date and that run_all completes successfully. Depending on your settings, you may receive email notifications about the results of these tests. You can view the tests, including any errors or failures, by going to the pull request page on GitHub and clicking the checks tab.

You can re-run these tests by clicking the re-run jobs button.

Running your code on the server

To run code for real in the production environment, use the jobs site.

Accessing outputs

After your project has been executed via the jobs site, its outputs will be stored on a secure server.

Users with permission to access Level 4 can view output files that are labelled as moderately sensitive; they can also view automatically created log files of the run for debugging purposes.

For security reasons, the outputs will be in a different directory than if you had run the project locally. For the TPP backend, outputs labelled moderately_sensitive in the project.yaml will be saved in D:/Level4Files/workspaces/<NAME_OF_YOUR_WORKSPACE>. These outputs can be reviewed on the server and released if they are deemed non-disclosive.

Outputs labelled highly_sensitive are not visible.

If you have Level 3 access

No data should ever be published from the Level 3 server. Access is only for permitted users, for the purpose of debugging problems in the secure environment.

Highly sensitive outputs can be seen in E:/high_privacy/workspaces/<WORKSPACE_NAME>. This includes a directory called metadata, containing log files for each action, e.g. generate_dataset.log and run_model.log.

Moderately sensitive outputs can be seen in E:/FILESFORL4/workspaces/<WORKSPACE_NAME>.