Tutorial

This tutorial walks through the installation of the phenopacket-tools command-line interface (CLI) application, then gives an overview of converting phenopackets from v1 to the current v2 format and of the validation functionality, including custom validation rules.

We show phenopacket-tools functionality within a Unix shell because Unix is the predominant computational environment in bioinformatics. We use several Unix-specific concepts, such as I/O redirection, to demonstrate the use of phenopacket-tools in a process pipeline, and we declare environment variables and command aliases to reduce the amount of boilerplate code. Note that phenopacket-tools is a cross-platform tool and will work in Windows shells with the appropriate adjustments.

Setup

Phenopacket-tools is written in Java 17 and requires Java 17 or newer to run. We distribute the CLI application as a ZIP archive that contains an executable Java Archive (JAR) file and several examples used in this tutorial.

Prerequisites

Java 17 or newer must be present on your $PATH. Run the following to check the availability and version of Java on your machine:

java -version

The command should print output similar to the following:

openjdk version "17" 2021-09-14
OpenJDK Runtime Environment (build 17+35-2724)
OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing)
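
The version check can also be scripted, for example at the top of a setup script. The following is a minimal sketch and not part of phenopacket-tools: java_major_version is a hypothetical helper that extracts the major version from a line of the java -version output, and the parsing assumes the usual version "..." format shown above.

```shell
# Hypothetical helper: print the major Java version found in one line of
# `java -version` output. Handles the legacy scheme ("1.8.0_292" -> 8)
# as well as the current one ("17.0.2" -> 17).
java_major_version() {
  version=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
  case "${version}" in
    1.*) version=${version#1.} ;;
  esac
  # Keep the leading digits only.
  printf '%s\n' "${version%%[!0-9]*}"
}

# `java -version` writes to stderr, hence the 2>&1 redirection.
if command -v java >/dev/null 2>&1; then
  major=$(java_major_version "$(java -version 2>&1 | grep 'version "' | head -n 1)")
  if [ "${major:-0}" -ge 17 ]; then
    echo "Java ${major} found, OK"
  else
    echo "Java ${major:-unknown} found, but 17 or newer is required" >&2
  fi
else
  echo "java was not found on \$PATH" >&2
fi
```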

Download phenopacket-tools

A prebuilt distribution ZIP file is available for download from the release section of our GitHub repository. Use your favorite web browser to download the ZIP archive with the latest release 1.0.0-RC3 and unpack the archive into a folder of your choice.

Set up alias

In general, Java command-line applications are invoked as java -jar executable.jar. This invocation is verbose, so we shorten it by defining an alias.

Let’s define a command alias for phenopacket-tools. Assuming the distribution ZIP was unpacked into the phenopacket-tools-cli-1.0.0-RC3 directory, run the following to set up and check the command alias:

alias pxf="java -jar $(pwd)/phenopacket-tools-cli-1.0.0-RC3/phenopacket-tools-cli-1.0.0-RC3.jar"
pxf --help
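
Note that Bash does not expand aliases in non-interactive scripts by default. If you want to reuse the commands of this tutorial in a script, a shell function is one possible alternative. The sketch below assumes the same unpacked directory as above; PXF_JAR is a hypothetical variable name chosen for this example, not something phenopacket-tools reads.

```shell
# Remove a previously defined alias, otherwise the shell would expand it
# while parsing the function definition below.
unalias pxf 2>/dev/null || true

# PXF_JAR is a hypothetical variable name; point it at the JAR file
# from the unpacked distribution.
PXF_JAR="${PXF_JAR:-$(pwd)/phenopacket-tools-cli-1.0.0-RC3/phenopacket-tools-cli-1.0.0-RC3.jar}"

# A function with the same name as the alias; "$@" forwards all arguments unchanged.
pxf() {
  java -jar "${PXF_JAR}" "$@"
}
```

The function is called exactly like the alias, e.g. pxf --help.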

Note

From now on, we will use the pxf alias instead of the longer form.

Set up examples directory

We will demonstrate phenopacket-tools functionality using a collection of example phenopackets that are bundled in the distribution ZIP file. The folder with the phenopacket collection resides next to the phenopacket-tools JAR file and has the following structure:

examples
  |- convert
    \ - Schreckenbach-2014-TPM3-II.2.json
  |- phenopackets
    \ - retinoblastoma.json
  \- validate
    | - ...
    \ - ...

To reduce the amount of boilerplate code in the following sections, let’s define an environment variable to point to the example phenopacket directory:

examples=path/to/examples

Make sure you set the variable to the actual path in your environment.
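
Since all following commands depend on this variable, it may be worth checking that it points to the right place. A minimal sketch, assuming the folder structure shown above; check_examples_dir is a hypothetical helper:

```shell
# Hypothetical helper: check that a directory contains the three
# sub-folders of the examples folder (convert, phenopackets, validate).
check_examples_dir() {
  dir=$1
  for sub in convert phenopackets validate; do
    if [ ! -d "${dir}/${sub}" ]; then
      printf 'Missing folder: %s\n' "${dir}/${sub}" >&2
      return 1
    fi
  done
  printf 'Examples found in %s\n' "${dir}"
}

# Guarded, so that the snippet does nothing when the variable is not set yet.
if [ -n "${examples:-}" ]; then
  check_examples_dir "${examples}"
fi
```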

Note

See Example phenopackets for detailed info of the example phenopackets.

Set up autocompletion

For convenience, phenopacket-tools offers autocompletion in the Bash and Zsh shells: pressing the TAB key completes the command or its options.

Run the following to enable the autocompletion for the tutorial session:

source <(pxf generate-completion)

Note

See Command-line interface to set up autocompletion that lasts beyond the current shell session.

Convert

Version 1 of the GA4GH Phenopacket schema was released in 2019 to elicit community feedback. In response to this feedback, the schema was extended and refined, and version 2 was released in 2021 and published in 2022 by the International Organization for Standardization (ISO). The convert command of phenopacket-tools converts version 1 phenopackets into version 2.

For the purpose of this tutorial, we will first convert a single v1 phenopacket and then 384 v1 phenopackets published by Robinson et al., 2020[1].

Convert single phenopacket

Due to differences between phenopacket versions 1 and 2, there are two ways to convert v1 phenopackets into the v2 format. Briefly, the conversion either assumes that the Variants are causal with respect to a Disease of the v1 phenopacket, or skips conversion of Variants altogether. The logic is controlled with the --convert-variants CLI option, and the Variants can be converted only if the v1 phenopacket has one Disease.

Note

See the Converting v1 Phenopackets section for more information.

Let’s convert an example v1 phenopacket Schreckenbach-2014-TPM3-II.2.json to v2 format:

pxf convert ${examples}/convert/Schreckenbach-2014-TPM3-II.2.json > Schreckenbach-2014-TPM3-II.2.v2.json

The example phenopacket represents a case report with several variants that are causal with respect to the disease. Therefore, we can use --convert-variants to convert Variants into a v2 Interpretation element:

pxf convert --convert-variants ${examples}/convert/Schreckenbach-2014-TPM3-II.2.json \
  > Schreckenbach-2014-TPM3-II.2.v2-with-variants.json
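
To see the effect of --convert-variants, we can check whether the two output files contain an interpretations field, which is where v2 phenopackets store Interpretation elements. The sketch below is a plain text search with grep, not a JSON parser, and has_interpretations is a hypothetical helper. We would expect the field only in the file created with --convert-variants.

```shell
# Hypothetical helper: succeed if the file mentions an "interpretations" field.
has_interpretations() {
  grep -q '"interpretations"' "$1"
}

for f in Schreckenbach-2014-TPM3-II.2.v2.json \
         Schreckenbach-2014-TPM3-II.2.v2-with-variants.json; do
  # Skip files that have not been created yet.
  [ -f "${f}" ] || continue
  if has_interpretations "${f}"; then
    echo "${f}: interpretations present"
  else
    echo "${f}: no interpretations"
  fi
done
```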

A real-life example

Let’s convert 384 individuals described in published case reports with Human Phenotype Ontology terms, causal genetic variants, and OMIM disease identifiers.

Let’s start by downloading the phenopacket dataset, which is available from Zenodo[2], and extracting the archive content into a folder named v1:

curl -L -o phenopackets.v1.zip https://zenodo.org/record/3905420/files/phenopackets.zip
unzip -d v1 phenopackets.v1.zip
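
Before converting, it may be useful to confirm that the archive was extracted as expected. A minimal sketch, assuming that every JSON file in the archive is a phenopacket; count_json_files is a hypothetical helper:

```shell
# Hypothetical helper: count the *.json files below a folder.
count_json_files() {
  # `tr` removes the padding that some `wc` implementations add.
  find "$1" -type f -name '*.json' | wc -l | tr -d ' '
}

if [ -d v1 ]; then
  printf 'Found %s v1 phenopackets (384 expected)\n' "$(count_json_files v1)"
fi
```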

Now, we convert all v1 phenopackets and store the results in JSON format in a new folder v2:

# Create a folder for the v2 phenopackets.
mkdir -p v2

# Convert the phenopackets.
# Note: the loop assumes that the file paths contain no whitespace.
for pp in $(find v1 -name "*.json"); do
  pp_name=$(basename "${pp}")
  pxf convert --convert-variants "${pp}" > "v2/${pp_name}"
done

printf "Converted %s phenopackets\n" $(ls v2/ | wc -l)
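
The loop above relies on word splitting of the find output and does not report failures. A more defensive variant could look like the sketch below. It assumes that pxf convert exits with a non-zero status when the conversion fails, which we have not verified here, and that pxf is available as a command, function, or alias defined beforehand. convert_all is a hypothetical helper. Note that files with the same name in different sub-folders would still overwrite each other in the output folder.

```shell
# Hypothetical helper: convert every *.json file found under the source
# folder ($1) and write the results into the destination folder ($2).
convert_all() {
  src_dir=$1
  dst_dir=$2
  mkdir -p "${dst_dir}" || return 1

  find "${src_dir}" -type f -name '*.json' | {
    converted=0
    failed=0
    # Read one path per line, so that spaces in file names are preserved.
    while IFS= read -r pp; do
      out="${dst_dir}/$(basename "${pp}")"
      # stdin is detached so that the command cannot consume the file list.
      if pxf convert --convert-variants "${pp}" > "${out}" < /dev/null; then
        converted=$((converted + 1))
      else
        failed=$((failed + 1))
        rm -f "${out}"
        printf 'Failed to convert %s\n' "${pp}" >&2
      fi
    done
    printf 'Converted %d phenopackets, %d failed\n' "${converted}" "${failed}"
    # The function succeeds only if no conversion failed.
    [ "${failed}" -eq 0 ]
  }
}

# Usage: convert_all v1 v2
```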

Validate

The validate command of phenopacket-tools validates correctness of phenopackets, families and cohorts. This section outlines usage of the off-the-shelf validators available in the CLI application.

We will describe each validation and show example validation errors with proposed solutions in a table. The validation examples use Phenopackets, but the validation functionality is available for all top-level Phenopacket Schema elements, including Cohort and Family.

Validation is implemented for v2 phenopackets only. v1 phenopackets must be converted to v2 prior to running validation.

Base validation

First, let’s check if the phenopackets meet the base requirements, as described by the Phenopacket Schema. All phenopackets, regardless of their aim or scope, must meet these requirements to be valid.

Note

See Base validation workflow for more details.

All required fields must be present

The BaseValidator checks that all required fields are present:

pxf validate ${examples}/validate/base/missing-fields.json

The validator will find 3 errors and emit 3 CSV lines with the following issues:

Validation error                                                  Solution
----------------------------------------------------------------  ---------------------------------------------------------------------
‘id’ is missing but it is required                                Add the phenopacket ID
‘subject.id’ is missing but it is required                        Add the subject ID
‘phenotypicFeatures[0].type.label’ is missing but it is required  Add the label attribute into the type of the first phenotypic feature

Note

The validate command reports errors in CSV format, so the validation results can easily be stored in a CSV file using output stream redirection. Use the -H | --include-header option to include a header with validation metadata.
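
As an illustration of the redirection, the sketch below stores the validation results in a file and reports how many lines were written. validate_to_csv is a hypothetical helper; it assumes that pxf is callable as in the rest of this tutorial. For missing-fields.json we would expect 3 lines, one per error. When the header is included, the header lines are counted as well.

```shell
# Hypothetical helper: the first argument is the CSV file to write,
# the remaining arguments are passed to `pxf validate` unchanged.
validate_to_csv() {
  out_csv=$1
  shift
  pxf validate "$@" > "${out_csv}"
  # Arithmetic expansion strips the padding that some `wc` implementations add.
  lines=$(( $(wc -l < "${out_csv}") ))
  printf '%d line(s) written to %s\n' "${lines}" "${out_csv}"
}

# Usage:
#   validate_to_csv missing-fields.csv "${examples}/validate/base/missing-fields.json"
```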

All ontologies are well-defined

Phenopacket Schema relies heavily on ontologies and ontology concepts. The MetaData element lists the ontologies used in the particular phenopacket. To ensure data traceability, Phenopacket Schema requires a phenopacket to contain a Resource with ontology metadata, such as version and IRI, for each ontology whose concepts are used in the phenopacket.

The MetaDataValidator checks if the MetaData has an ontology Resource for every ontology concept used:

pxf validate ${examples}/validate/base/missing-resources.json

The validator points out the absence of the NCBITaxon definition:

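Validation error                                                    Solution
------------------------------------------------------------------  --------------------------------------------------------------
No ontology corresponding to ID ‘NCBITaxon:9606’ found in MetaData  Add a Resource element with NCBITaxon definition into MetaData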
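
To make the solution concrete, the sketch below prints what such a Resource might look like in JSON. The field names follow the Resource element of the Phenopacket Schema, while the version value is only a placeholder that should be replaced with the version actually used. ncbitaxon_resource is a hypothetical helper, present only to keep the example in shell.

```shell
# Hypothetical helper: print an example Resource for NCBITaxon.
# The "version" value is a placeholder, not a recommendation.
ncbitaxon_resource() {
  cat <<'EOF'
{
  "id": "ncbitaxon",
  "name": "NCBI organismal classification",
  "url": "http://purl.obolibrary.org/obo/ncbitaxon.owl",
  "version": "2021-06-10",
  "namespacePrefix": "NCBITaxon",
  "iriPrefix": "http://purl.obolibrary.org/obo/NCBITaxon_"
}
EOF
}

ncbitaxon_resource
```

The printed object would go into the resources array of the metaData element of the phenopacket.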

Custom validation rules

Projects or consortia can enforce specific requirements by designing a custom JSON schema. For instance, a rare disease project may require the presence of several elements that are not required by the default schema:

  1. Subject (proband being investigated)

  2. At least one PhenotypicFeature element, with HPO terms used for phenotypic features

  3. Time at last encounter (sub-element of subject), representing the age of the proband

Phenopacket-tools ships with a JSON schema for enforcing the above requirements. The schema is located next to the phenopacket examples for this section at examples/validate/custom-json-schema/hpo-rare-disease-schema.json.

Using the custom JSON schema via the --require option points out issues in the 4 example phenopackets:

pxf validate --require ${examples}/validate/custom-json-schema/hpo-rare-disease-schema.json \
  ${examples}/validate/custom-json-schema/marfan.no-subject.json \
  ${examples}/validate/custom-json-schema/marfan.no-phenotype.json \
  ${examples}/validate/custom-json-schema/marfan.not-hpo.json \
  ${examples}/validate/custom-json-schema/marfan.no-time-at-last-encounter.json

Validation error                                                             Solution
---------------------------------------------------------------------------  --------------------------------------------------
‘subject’ is missing but it is required                                      Add the Subject element
‘phenotypicFeatures’ is missing but it is required                           Add at least one PhenotypicFeature
‘phenotypicFeatures[0].type.id’ does not match the regex pattern ^HP:\d{7}$  Use Human Phenotype Ontology in PhenotypicFeatures
‘subject.timeAtLastEncounter’ is missing but it is required                  Add the time at last encounter field
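
The third error comes from the regular expression ^HP:\d{7}$ that the custom schema applies to term IDs. The same pattern can be tried out in the shell, which may help when preparing term IDs. A minimal sketch: is_hpo_id is a hypothetical helper, and [0-9] is used in place of \d because POSIX extended regular expressions do not define \d.

```shell
# Hypothetical helper: succeed if the argument looks like an HPO term ID,
# i.e. "HP:" followed by exactly seven digits.
is_hpo_id() {
  printf '%s\n' "$1" | grep -Eq '^HP:[0-9]{7}$'
}

for id in HP:0001166 MONDO:0007947 HP:123; do
  if is_hpo_id "${id}"; then
    echo "${id}: matches"
  else
    echo "${id}: does not match"
  fi
done
```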

Note

See Custom validation for more details.

Phenotype validation

Phenopacket-tools offers a validator for checking logical consistency of clinical abnormalities in the phenopacket. The validator assumes Human Phenotype Ontology (HPO) is used to represent the clinical abnormalities and the phenotype validation requires the HPO file to work.

Note

The examples below assume that the latest HPO in JSON format has been downloaded to hp.json. Get the HPO JSON from HPO releases.

Note

See Phenotype validators for more details.

Phenopackets use non-obsolete term IDs

The HpoPhenotypeValidator points out if the phenopacket contains obsolete HPO terms:

pxf validate --hpo hp.json ${examples}/validate/phenotype-validation/marfan.obsolete-term.json

It turns out that marfan.obsolete-term.json uses the obsolete ID HP:0002631 instead of the primary ID HP:0002616 for Aortic root aneurysm:

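Validation error                                                                   Solution
---------------------------------------------------------------------------------  -------------------------------------------
Using obsolete id (HP:0002631) instead of current primary id (HP:0002616) in id-C  Replace the obsolete ID with the primary ID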
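
One way to apply the proposed solution is a textual replacement of the ID. A minimal sketch using sed; replace_obsolete_id is a hypothetical helper, and the replacement is purely textual, so the term label in the phenopacket may still need a manual check.

```shell
# Hypothetical helper: write a copy of the phenopacket ($1) to a new file ($2),
# with the obsolete ID replaced by the primary ID.
replace_obsolete_id() {
  sed 's/HP:0002631/HP:0002616/g' "$1" > "$2"
}

# Usage:
#   replace_obsolete_id "${examples}/validate/phenotype-validation/marfan.obsolete-term.json" marfan.fixed.json
#   pxf validate --hpo hp.json marfan.fixed.json
```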

The annotation-propagation rule is not violated

Under the annotation propagation rule, an annotation with a term implies annotation with all of its ancestors. It is therefore a logical error to use both a term and its ancestor (e.g. Arachnodactyly and Abnormality of finger) for annotation of a single item. When choosing HPO terms for phenotypic features, the most specific terms should be used for the observed clinical features. In contrast, the least specific terms should be used for the excluded clinical features. There is one exception to these rules: a term and its ancestor can co-exist in the phenopacket if the ancestor term is observed and the descendant term is excluded (e.g. a phenopacket with present Aortic aneurysm but excluded Aortic root aneurysm, see marfan.valid.json).

The HpoAncestryValidator checks that the annotation propagation rule is not violated:

pxf validate --hpo hp.json \
  ${examples}/validate/phenotype-validation/marfan.annotation-propagation-rule.json \
  ${examples}/validate/phenotype-validation/marfan.valid.json

Validation error                                                               Solution
-----------------------------------------------------------------------------  ------------------------
Phenotypic features of id-C must not contain both an observed term             Remove the ancestor term
(Aortic root aneurysm, HP:0002616) and an observed ancestor (Aortic aneurysm,
HP:0004942)

Annotation of organ systems

We can validate presence of annotation for specific organ systems in a phenopacket.

Using the term IDs of the top-level HPO terms, we can validate annotation of Eye, Cardiovascular, and Respiratory organ systems in 3 phenopackets of toy Marfan syndrome patients:

pxf validate --hpo hp.json \
  --organ-system HP:0000478 --organ-system HP:0001626 --organ-system HP:0002086 \
  ${examples}/validate/organ-systems/marfan.all-organ-system-annotated.json \
  ${examples}/validate/organ-systems/marfan.missing-eye-annotation.json \
  ${examples}/validate/organ-systems/marfan.no-abnormalities.json

Note

Organ system validation requires the HPO. See Phenotype validation for more details about getting the HPO file.

The HpoOrganSystemValidator will point out one error in the marfan.missing-eye-annotation.json phenopacket:

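Validation error                                                    Solution
------------------------------------------------------------------  --------------------------------------------
Missing annotation for Abnormality of the eye [HP:0000478] in id-C  Annotate the eye or exclude any abnormality.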
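
When many organ systems are checked, repeating the --organ-system option by hand becomes tedious. The sketch below builds the repeated options from a list of term IDs; organ_system_opts is a hypothetical helper and relies on the term IDs containing no whitespace.

```shell
# Hypothetical helper: print "--organ-system <ID>" for each given term ID.
organ_system_opts() {
  for id in "$@"; do
    printf '%s %s ' --organ-system "${id}"
  done
}

# Usage (the command substitution is left unquoted on purpose,
# so that the shell splits it into separate arguments):
#   pxf validate --hpo hp.json \
#     $(organ_system_opts HP:0000478 HP:0001626 HP:0002086) \
#     "${examples}/validate/organ-systems/marfan.missing-eye-annotation.json"
```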

Note

See Organ system validation for more details.

That’s it, you made it to the end of the phenopacket-tools tutorial! We set up the command-line application and covered the conversion and validation functionality. The next section provides an in-depth explanation of the CLI functionality.