Tutorial
This tutorial walks through the installation of the phenopacket-tools command-line interface (CLI) application. It then provides an overview of converting phenopackets from v1 to the current v2 format and of the validation functionality, including custom validation rules.
We demonstrate phenopacket-tools in a Unix shell because Unix is the predominant computational environment in bioinformatics. We use several Unix-specific concepts, such as I/O redirection, to demonstrate the use of phenopacket-tools in a process pipeline, and we declare environment variables and command aliases to reduce the amount of boilerplate code. Note that phenopacket-tools is a cross-platform tool and will work in Windows shells with the appropriate adjustments.
Setup
Phenopacket-tools is written in Java 17 and requires Java 17 or newer to run. We distribute the CLI application as a ZIP archive containing an executable Java Archive (JAR) file and several examples used in this tutorial.
Prerequisites
Java 17 or newer must be present on your $PATH. Run the following to check the availability and version of Java on your machine:
java -version
The command should print output similar to the following:
openjdk version "17" 2021-09-14
OpenJDK Runtime Environment (build 17+35-2724)
OpenJDK 64-Bit Server VM (build 17+35-2724, mixed mode, sharing)
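For scripted setups, the version check can be automated. The following is a minimal sketch and not part of phenopacket-tools: java_major and require_java17 are hypothetical helper names. The sketch assumes that the version is the quoted string in the `java -version` output, and that releases before Java 9 use the legacy 1.x scheme (e.g. 1.8.0_292 is Java 8).

```shell
# Hypothetical helper: print the major Java version for a version string.
# Handles the legacy "1.8.0_292" scheme and the modern "17" / "17.0.2" scheme.
java_major() {
  v=$1
  case "$v" in
    1.*) v=${v#1.} ;;   # legacy scheme: drop the leading "1."
  esac
  # Keep only the leading run of digits.
  printf '%s\n' "${v%%[!0-9]*}"
}

# Hypothetical helper: succeed only if Java 17 or newer is on the PATH.
require_java17() {
  command -v java >/dev/null 2>&1 || { echo "java not found on PATH" >&2; return 1; }
  # `java -version` prints to stderr; the version is the quoted string.
  ver=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  major=$(java_major "$ver")
  [ "${major:-0}" -ge 17 ] || { echo "Java 17 or newer is required, found: ${ver:-unknown}" >&2; return 1; }
}
```

Running `require_java17 && echo "Java is ready"` prints the confirmation only when a suitable Java is available.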
Download phenopacket-tools
A prebuilt distribution ZIP file is available for download from the release section of our GitHub repository. Use your favorite web browser to download the ZIP archive with the latest release 1.0.0-RC3 and unpack the archive into a folder of your choice.
Set up alias
In general, Java command line applications are invoked as java -jar executable.jar. However, such an incantation is a bit too verbose, and we can shorten it by defining an alias.
Let’s define a command alias for phenopacket-tools. Assuming the distribution ZIP was unpacked into the phenopacket-tools-cli-1.0.0-RC3 directory within the current working directory, run the following to set up and check the command alias:
alias pxf="java -jar $(pwd)/phenopacket-tools-cli-1.0.0-RC3/phenopacket-tools-cli-1.0.0-RC3.jar"
pxf --help
Note
From now on, we will use the pxf alias instead of the longer form.
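Bash does not expand aliases in non-interactive scripts by default, so a shell function may be a better fit if you plan to reuse the tutorial commands in a script. The following is a sketch of such a function; PXF_JAR is a hypothetical variable name, and its default assumes the same unpack location as the alias.

```shell
# Hypothetical variable: location of the phenopacket-tools JAR (adjust to your setup).
PXF_JAR="${PXF_JAR:-$(pwd)/phenopacket-tools-cli-1.0.0-RC3/phenopacket-tools-cli-1.0.0-RC3.jar}"

# Remove a previously defined pxf alias, if any, so it cannot interfere
# with the function definition below.
unalias pxf 2>/dev/null || true

# Function equivalent of the alias; "$@" forwards all arguments unchanged.
pxf() {
  java -jar "${PXF_JAR}" "$@"
}
```

With the function in place, `pxf --help` is invoked in the same way as with the alias.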
Set up examples directory
We will demonstrate phenopacket-tools functionality using a collection of example phenopackets that are bundled in the distribution ZIP file. The folder with the phenopacket collection resides next to the phenopacket-tools JAR file and has the following structure:
examples
|- convert
|  \- Schreckenbach-2014-TPM3-II.2.json
|- phenopackets
|  \- retinoblastoma.json
\- validate
   |- ...
   \- ...
To reduce the amount of boilerplate code in the following sections, let’s define an environment variable to point to the example phenopacket directory:
examples=path/to/examples
Make sure you set the variable to the actual path in your environment.
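A quick sanity check can catch a mistyped path early. The sketch below uses a hypothetical helper, check_examples, which only verifies that the three top-level folders of the examples directory exist.

```shell
# Hypothetical helper: succeed only if the directory given as $1
# contains the subfolders used in this tutorial.
check_examples() {
  for sub in convert phenopackets validate; do
    if [ ! -d "$1/$sub" ]; then
      echo "Missing folder: $1/$sub" >&2
      return 1
    fi
  done
}
```

For instance, `check_examples "${examples}" && echo "examples folder found"` prints the confirmation only if the variable points to the right place.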
Note
See Example phenopackets for detailed information about the example phenopackets.
Set up autocompletion
For convenience, phenopacket-tools offers autocompletion of commands and options with the TAB key in the Bash and Zsh shells.
Run the following to enable the autocompletion for the tutorial session:
source <(pxf generate-completion)
Note
See the Command-line interface section to learn how to make the autocompletion last beyond the current shell session.
Convert
Version 1 of the GA4GH Phenopacket schema was released in 2019 to elicit community feedback. In response to this feedback, the schema was extended and refined. Version 2 was released in 2021 and published in 2022 by the International Organization for Standardization (ISO). The convert command of phenopacket-tools converts version 1 phenopackets into version 2.
For the purpose of this tutorial, we will first convert a single v1 phenopacket and then 384 v1 phenopackets published by Robinson et al., 2020[1].
Convert single phenopacket
Due to differences between phenopacket versions 1 and 2, there are two ways to convert v1 phenopackets into the v2 format. Briefly, the conversion either assumes that the Variants are causal with respect to a Disease of the v1 phenopacket, or skips conversion of Variants altogether. The logic is controlled with the --convert-variants CLI option, and the Variants can be converted only if the v1 phenopacket has exactly one Disease.
Note
See the Converting v1 Phenopackets section for more information.
Let’s convert an example v1 phenopacket Schreckenbach-2014-TPM3-II.2.json to v2 format:
pxf convert ${examples}/convert/Schreckenbach-2014-TPM3-II.2.json > Schreckenbach-2014-TPM3-II.2.v2.json
The example phenopacket represents a case report with several variants that are causal with respect to the disease. Therefore, we can use --convert-variants to convert the Variants into a v2 Interpretation element:
pxf convert --convert-variants ${examples}/convert/Schreckenbach-2014-TPM3-II.2.json \
> Schreckenbach-2014-TPM3-II.2.v2-with-variants.json
A real-life example
Let’s convert 384 individuals described in published case reports with Human Phenotype Ontology terms, causal genetic variants, and OMIM disease identifiers.
Let’s start by downloading and unpacking the phenopacket dataset.
The phenopacket dataset is available for download from Zenodo[2]. We download the archive and extract its content into a folder named v1:
curl -o phenopackets.v1.zip https://zenodo.org/record/3905420/files/phenopackets.zip
unzip -d v1 phenopackets.v1.zip
Now, we convert all v1 phenopackets and store the results in JSON format in a new folder v2:
# Create a folder for the v2 phenopackets.
mkdir -p v2
# Convert the phenopackets. The for loop assumes file paths without whitespace.
for pp in $(find v1 -name "*.json"); do
  pp_name=$(basename "${pp}")
  pxf convert --convert-variants "${pp}" > "v2/${pp_name}"
done
printf "Converted %s phenopackets\n" $(ls v2/ | wc -l)
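For larger datasets it can help to keep track of phenopackets that fail to convert. The following sketch wraps the loop above in a hypothetical helper, convert_all, that reads paths line by line and writes failures to a log file. It assumes that pxf exits with a non-zero status when a conversion fails, which we have not verified here, and that pxf is available as an alias or function when convert_all is defined.

```shell
# Hypothetical helper: convert all v1 phenopackets found under $1 into folder $2.
# Paths of phenopackets that fail to convert are written to the log file $3.
convert_all() {
  src=$1
  dst=$2
  log=${3:-convert-failures.txt}
  mkdir -p "$dst"
  : > "$log"   # start with an empty log
  find "$src" -type f -name '*.json' | while IFS= read -r pp; do
    out="$dst/$(basename "$pp")"
    # Assumption: pxf exits with a non-zero status when the conversion fails.
    # stdin is detached so that pxf cannot consume the list of paths.
    if ! pxf convert --convert-variants "$pp" < /dev/null > "$out"; then
      printf '%s\n' "$pp" >> "$log"
      rm -f "$out"   # do not keep partial output
    fi
  done
}
```

Calling `convert_all v1 v2` reproduces the conversion above, and an empty convert-failures.txt indicates that no phenopacket was skipped.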
Validate
The validate command of phenopacket-tools validates correctness of phenopackets, families and cohorts. This section outlines usage of the off-the-shelf validators available in the CLI application.
We will describe each validation and show example validation errors with proposed solutions in a table. The validation examples use Phenopackets, but the validation functionality is available for all top-level Phenopacket Schema elements, including Cohort and Family.
The validation is implemented for v2 phenopackets only. The v1 phenopackets must be converted to v2 prior to running validation.
Base validation
First, let’s check if the phenopackets meet the base requirements, as described by the Phenopacket Schema. All phenopackets, regardless of their aim or scope, must pass this requirement to be valid.
Note
See Base validation workflow for more details.
All required fields must be present
The BaseValidator checks that all required fields are present:
pxf validate ${examples}/validate/base/missing-fields.json
The validator will find 3 errors and emit 3 CSV lines with the following issues:
Validation error | Solution |
---|---|
‘id’ is missing but it is required | Add the phenopacket ID |
‘subject.id’ is missing but it is required | Add the subject ID |
‘phenotypicFeatures[0].type.label’ is missing but it is required | Add the label attribute into the type of the first phenotypic feature |
Note
The validate command reports errors in CSV format, so the validation results can be easily stored in a CSV file by using output stream redirection. Use the -H | --include-header option to include a header with validation metadata.
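As a sketch of the redirection described in the note, the hypothetical helper validate_to_csv below stores the report of one or more phenopackets in a CSV file, including the metadata header. The helper name and the report file name are illustrative only.

```shell
# Hypothetical helper: validate the given phenopackets and store the report in a CSV file.
# Usage: validate_to_csv report.csv phenopacket.json [more.json ...]
validate_to_csv() {
  report=$1
  shift
  # -H adds the header with validation metadata;
  # stdout is redirected into the report file.
  pxf validate -H "$@" > "$report"
}
```

For example, `validate_to_csv report.csv ${examples}/validate/base/missing-fields.json` writes the validation results into report.csv.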
All ontologies are well-defined
The Phenopacket Schema relies heavily on the use of ontologies and ontology concepts. The MetaData element lists the ontologies used in the particular phenopacket. To ensure data traceability, the Phenopacket Schema requires the phenopacket to contain a Resource with ontology metadata, such as version and IRI, for each ontology whose concepts are used.
The MetaDataValidator checks if the MetaData has an ontology Resource for all used ontology concepts:
pxf validate ${examples}/validate/base/missing-resources.json
The validator points out the absence of NCBITaxon definition:
Validation error | Solution |
---|---|
No ontology corresponding to ID ‘NCBITaxon:9606’ found in MetaData | Add a Resource element with NCBITaxon definition into MetaData |
Custom validation rules
Projects or consortia can enforce specific requirements by designing a custom JSON schema. For instance, a rare disease project may require presence of several elements that are not required by the default schema:
Subject (proband being investigated)
At least one PhenotypicFeature element and using HPO terms for phenotypic features
Time at last encounter (sub-element of subject), representing the age of the proband
Phenopacket-tools ships with a JSON schema for enforcing the above requirements.
The schema is located next to the phenopacket examples for this section at examples/validate/custom-json-schema/hpo-rare-disease-schema.json.
Using the custom JSON schema via the --require option will point out issues in the 4 example phenopackets:
pxf validate --require ${examples}/validate/custom-json-schema/hpo-rare-disease-schema.json \
${examples}/validate/custom-json-schema/marfan.no-subject.json \
${examples}/validate/custom-json-schema/marfan.no-phenotype.json \
${examples}/validate/custom-json-schema/marfan.not-hpo.json \
${examples}/validate/custom-json-schema/marfan.no-time-at-last-encounter.json
Validation error | Solution |
---|---|
‘subject’ is missing but it is required | Add the Subject element |
‘phenotypicFeatures’ is missing but it is required | Add at least one PhenotypicFeature |
‘phenotypicFeatures[0].type.id’ does not match the regex pattern | Use Human Phenotype Ontology in PhenotypicFeatures |
‘subject.timeAtLastEncounter’ is missing but it is required | Add the time at last encounter field |
Note
See Custom validation for more details.
Phenotype validation
Phenopacket-tools offers a validator for checking logical consistency of clinical abnormalities in the phenopacket. The validator assumes Human Phenotype Ontology (HPO) is used to represent the clinical abnormalities and the phenotype validation requires the HPO file to work.
Note
The examples below assume that the latest HPO in JSON format has been downloaded to hp.json. Get the HPO JSON from HPO releases.
Note
See Phenotype validators for more details.
Phenopackets use non-obsolete term IDs
The HpoPhenotypeValidator points out if the phenopacket contains obsolete HPO terms:
pxf validate --hpo hp.json ${examples}/validate/phenotype-validation/marfan.obsolete-term.json
It turns out that marfan.obsolete-term.json uses the obsolete ID HP:0002631 instead of the primary ID HP:0002616 for Aortic root aneurysm:
Validation error | Solution |
---|---|
Using obsolete id (HP:0002631) instead of current primary id (HP:0002616) in id-C | Replace the obsolete ID with the primary ID |
The annotation-propagation rule is not violated
Due to the annotation propagation rule, it is a logical error to use both a term and its ancestor (e.g. Arachnodactyly and Abnormality of finger) for annotation of a single item. When choosing HPO terms for phenotypic features, the most specific terms should be used for the observed clinical features. In contrast, the least specific terms should be used for the excluded clinical features.
There is one exception to these rules: a term and its ancestor can co-exist in the phenopacket if the parent term is observed and the child term is excluded (e.g. a phenopacket with present Aortic aneurysm but excluded Aortic root aneurysm, see marfan.valid.json).
The HpoAncestryValidator checks that the annotation propagation rule is not violated:
pxf validate --hpo hp.json \
${examples}/validate/phenotype-validation/marfan.annotation-propagation-rule.json \
${examples}/validate/phenotype-validation/marfan.valid.json
Validation error | Solution |
---|---|
Phenotypic features of id-C must not contain both an observed term (Aortic root aneurysm, HP:0002616) and an observed ancestor (Aortic aneurysm, HP:0004942) | Remove the ancestor term |
Annotation of organ systems
We can validate presence of annotation for specific organ systems in a phenopacket.
Using the term IDs of the top-level HPO terms, we can validate annotation of Eye, Cardiovascular, and Respiratory organ systems in 3 phenopackets of toy Marfan syndrome patients:
pxf validate --hpo hp.json \
--organ-system HP:0000478 --organ-system HP:0001626 --organ-system HP:0002086 \
${examples}/validate/organ-systems/marfan.all-organ-system-annotated.json \
${examples}/validate/organ-systems/marfan.missing-eye-annotation.json \
${examples}/validate/organ-systems/marfan.no-abnormalities.json
Note
Organ system validation requires the HPO. See the Phenotype validation section for more details about getting the HPO file.
The HpoOrganSystemValidator will point out one error in the marfan.missing-eye-annotation.json phenopacket:
Validation error | Solution |
---|---|
Missing annotation for Abnormality of the eye [HP:0000478] in id-C | Annotate the eye or exclude any abnormality. |
Note
See Organ system validation for more details regarding the organ system validation.
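When many organ systems are checked, repeating the --organ-system option by hand becomes tedious. The sketch below uses a hypothetical helper, organ_system_flags, that prints one --organ-system option per HPO term ID. It relies on the shell splitting the unquoted command substitution into words, which is safe here because HPO term IDs contain no whitespace.

```shell
# Hypothetical helper: print "--organ-system" and the term ID,
# each on its own line, for every HPO term ID given as an argument.
organ_system_flags() {
  for id in "$@"; do
    printf '%s\n%s\n' --organ-system "$id"
  done
}
```

The helper can then be used as `pxf validate --hpo hp.json $(organ_system_flags HP:0000478 HP:0001626 HP:0002086) ${examples}/validate/organ-systems/marfan.missing-eye-annotation.json`.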
That’s it, you made it to the end of the phenopacket-tools tutorial! We set up the command-line application and covered the conversion and validation functionality. The next section provides an in-depth explanation of the CLI functionality.