Command-line interface
The phenopacket-tools command-line interface (CLI) converts and validates the top-level elements of the Phenopacket Schema. Here we describe how to set up the CLI application in Linux, Mac, and Windows environments.
Note
phenopacket-tools is written in Java 17 and requires Java 17 or newer to run.
We distribute phenopacket-tools in a ZIP archive. The application requires no special installation procedure if Java 17 or newer is available in your environment.
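To confirm the Java requirement before unpacking the archive, the output of java -version can be parsed. The sketch below is only an illustration: java_major is a hypothetical helper name, not part of phenopacket-tools, and it assumes the runtime prints its version in the usual version "..." form.

```shell
# Extract the major version from the text printed by `java -version`.
# Handles both the legacy "1.8.0_292" scheme and the modern "17.0.2" scheme.
java_major() {
  version=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  case "$version" in
    1.*) version=${version#1.}; printf '%s\n' "${version%%.*}" ;;
    *)   printf '%s\n' "${version%%[.+-]*}" ;;
  esac
}

if command -v java >/dev/null 2>&1; then
  # `java -version` writes to standard error, hence the 2>&1.
  major=$(java_major "$(java -version 2>&1)")
  if [ "${major:-0}" -ge 17 ]; then
    echo "Java $major found, phenopacket-tools should run"
  else
    echo "Java ${major:-unknown} found, but Java 17 or newer is required" >&2
  fi
else
  echo "java was not found on the PATH" >&2
fi
```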
Setup
Most users should download the distribution ZIP file with a precompiled JAR file from the phenopacket-tools release page. However, it is also possible to build the JAR from sources.
Download
The phenopacket-tools JAR is provided as a ZIP file, available for download from the releases section of our GitHub repository.
The ZIP file contains an executable JAR file along with a README file and example phenopackets required to run the setup and the tutorial.
Build from source code
The source code is available in our GitHub repository. There are 2 requirements for building the app from sources:
Java Development Kit (JDK) 17 or newer must be present in the environment and the $JAVA_HOME variable must point to the JDK’s location. See Setting JAVA_HOME for more details regarding setting up $JAVA_HOME on Windows, Mac, and Linux.
phenopacket-tools uses several open-source Java libraries and a working internet connection is required to download the libraries.
Run the following commands to check out the source code and to build the application:
git clone https://github.com/phenopackets/phenopacket-tools
cd phenopacket-tools
git checkout tags/1.0.0-RC3
./mvnw -Prelease package
If the build completes, a ZIP archive “phenopacket-tools-cli-1.0.0-RC3-distribution.zip” is created in the phenopacket-tools-cli/target directory. Use the archive in the same way as the archive downloaded from phenopacket-tools releases.
Set up alias and autocompletion
In this optional step, we set up an alias and autocompletion for the phenopacket-tools command-line application. The autocompletion works thanks to the awesome Picocli library and is available in the Bash and ZSH Unix shells.
Let’s set up the alias first. To reiterate the tutorial Set up alias section, Java command line applications are invoked as java -jar executable.jar. However, such an incantation is a bit too verbose and we can shorten it by defining an alias.
Assuming the distribution ZIP was unpacked into the phenopacket-tools-cli-1.0.0-RC3 directory, let’s run the following to set up the alias:
alias pxf="java -jar $(pwd)/phenopacket-tools-cli-1.0.0-RC3/phenopacket-tools-cli-1.0.0-RC3.jar"
pxf --help
Now the autocompletion. The autocompletion can simplify using the CLI options by completing the command or option after pressing the TAB key. To enable the autocompletion, make sure the alias for pxf is set up correctly and run:
source <(pxf generate-completion)
The pxf generate-completion command generates the autocompletion script and source uses it to set up the completion. However, the autocompletion will last only for the duration of the current shell session.
To make the autocompletion permanent, store the script file and add the alias and the sourcing to your .bashrc or .bash_profile file:
echo "### Install phenopacket-tools" >> ~/.bashrc
# Quote the whole alias definition so that it reaches .bashrc as a single, valid alias
echo "alias pxf='java -jar $(pwd)/phenopacket-tools-cli-1.0.0-RC3/phenopacket-tools-cli-1.0.0-RC3.jar'" >> ~/.bashrc
pxf generate-completion > pxf-completion.sh
echo "source $(pwd)/pxf-completion.sh" >> ~/.bashrc
Warning
The autocompletion only works if the alias is set to pxf. Other alias values will not work.
Commands
The command-line interface provides the following commands:
examples - generate examples of the top-level elements
convert - convert top-level elements from v1 to v2 format
validate - validate semantic and syntactic correctness of top-level Phenopacket schema elements
Before we dive into the commands, let’s discuss some common concepts shared by all CLI commands.
Common concepts
We designed the CLI with the aim of making it as easy to use as possible. As a result, the phenopacket-tools commands share several common design principles:
The input data can be provided either via the standard input OR as a list of positional parameters.
The input data format is provided using the -f | --format option. phenopacket-tools supports phenopackets in JSON, YAML, or protobuf formats. In the absence of an explicit data format, phenopacket-tools makes an educated guess.
The output is written in the input data format.
The top-level element type of the input data is indicated by the -e | --element option. In line with the Phenopacket Schema, the commands support phenopacket, family, or cohort elements.
The output is written into the standard output stream. Progress, warnings, and errors are reported into standard error.
The CLI operates in a silent mode by default; only warnings and errors are reported. Use -v to increase the verbosity; the -v option can be specified multiple times (e.g. -vvv).
We discuss the common concepts further at the relevant places of the next sections.
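Because results go to standard output and diagnostics go to standard error, the two streams can be captured separately. The following is a minimal sketch that combines the options listed above; convert_family_with_log is a hypothetical helper name and the file names are placeholders. It assumes pxf resolves to the CLI, for instance through the alias from the setup section. Note that Bash does not expand aliases in non-interactive scripts unless the expand_aliases shell option is enabled.

```shell
# Convert one family file, keeping the converted element and the diagnostics apart.
# Usage: convert_family_with_log INPUT OUTPUT LOG
convert_family_with_log() {
  # stdout carries the converted element (-> OUTPUT),
  # stderr carries progress, warnings, and errors (-> LOG).
  # -f and -e state the format and element type explicitly; -vv raises the verbosity.
  pxf convert -f json -e family -vv "$1" > "$2" 2> "$3"
}

# Example call (placeholder file names):
# convert_family_with_log family.v1.json family.v2.json convert.log
```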
examples - generate phenopacket examples
The examples command writes example phenopackets (including family and cohort examples) into a provided base directory. Starting from a base directory, the examples are written into three sub-folders:
base
|- phenopackets
|- families
\- cohorts
The examples command accepts an optional -o | --output argument. By default, the examples are placed into the current directory.
The following command writes the examples into the path/to/examples directory:
pxf examples -o path/to/examples
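After generating the examples, it can be handy to check that the three sub-folders were populated. The helper below is a sketch under that assumption; count_examples is a hypothetical name, and it relies only on the folder layout shown above, not on any particular file names.

```shell
# Report how many files each example sub-folder contains.
# Usage: count_examples BASE_DIR
count_examples() {
  base=$1
  for sub in phenopackets families cohorts; do
    if [ -d "$base/$sub" ]; then
      n=$(find "$base/$sub" -type f | wc -l)
      # $((n)) strips the padding that some `wc` implementations add.
      echo "$sub: $((n))"
    else
      echo "$sub: missing" >&2
    fi
  done
}

# Example call:
# pxf examples -o path/to/examples && count_examples path/to/examples
```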
convert - convert top-level elements from v1 to v2 format
The convert command converts a phenopacket, family, or a cohort from v1 to v2 format of Phenopacket Schema.
Usage
Let’s assume we have an example phenopacket phenopacket.v1.json, family family.v1.json, and cohort cohort.v1.json.
We can convert a v1 phenopacket into v2 by running:
cat phenopacket.v1.json | pxf convert > phenopacket.v2.json
Phenopacket-tools makes an educated guess to determine if the input is in JSON, protobuf, or YAML format. The current format guessing implementation is, however, naïve and can fail, e.g. when parsing a gzipped JSON file. Turn the format guessing off by providing the -f | --format option:
# Explicit JSON input
cat phenopacket.v1.json | pxf convert -f json > phenopacket.v2.json
# Explicit protobuf input
cat phenopacket.v1.pb | pxf convert -f pb > phenopacket.v2.pb
The -f | --format option accepts one of the following 3 values: {json, pb, yaml}.
By default, the output is written in the format of the input data.
However, we can override this by using the --output-format option:
cat phenopacket.v1.json | pxf convert --output-format pb > phenopacket.v2.pb
The --output-format option takes the same values as --format: {json, pb, yaml}.
The convert command expects to receive a phenopacket by default. However, it can also convert the other top-level elements of the Phenopacket schema: family and cohort. Use the -e | --element option to indicate if the input is a family or a cohort:
cat family.v1.json | pxf convert -e family > family.v2.json
cat cohort.v1.json | pxf convert -e cohort > cohort.v2.json
We can convert one or more items at a time by passing the paths to the input files as positional parameters. If a single parameter is provided, STDIN is ignored and the conversion proceeds in the same way as in the examples above. The command also accepts two or more files as positional parameters for bulk conversion. To perform the bulk conversion, the -O | --output-directory option must be provided to set the directory for writing the converted phenopackets.
For instance:
pxf convert -O converted phenopacket.a.v1.json phenopacket.b.v1.json
converts the input phenopackets and stores the results in the converted folder. The converted files will be stored under the same names.
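The single-file and bulk modes described above can be wrapped in one helper. This is a sketch only: convert_all is a hypothetical name, pxf is assumed to resolve to the CLI, and creating the output directory up front is a precaution rather than a documented requirement. Since the converted files keep their names, the output directory should differ from the input directory.

```shell
# Convert any number of v1 files into OUTPUT_DIR.
# Usage: convert_all OUTPUT_DIR FILE...
convert_all() {
  out_dir=$1
  shift
  if [ "$#" -eq 0 ]; then
    echo "convert_all: no input files given" >&2
    return 1
  fi
  mkdir -p "$out_dir"
  if [ "$#" -eq 1 ]; then
    # One input: the result goes to standard output, so redirect it ourselves.
    pxf convert "$1" > "$out_dir/$(basename "$1")"
  else
    # Two or more inputs: -O names the directory for the converted files.
    pxf convert -O "$out_dir" "$@"
  fi
}

# Example call:
# convert_all converted phenopacket.a.v1.json phenopacket.b.v1.json
```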
validate - validate Phenopacket Schema elements
The validate command checks a phenopacket, family, or cohort against the base requirements imposed by the Phenopacket Schema as well as additional user-defined constraints.
Briefly, to meet the base requirements, the phenopacket must be well formatted (a valid Protobuf message, JSON document, etc.) and meet the requirements of the Phenopacket Schema: all REQUIRED attributes are set (e.g. phenopacket.id and phenopacket.meta_data), and MetaData includes a Resource for all ontology concepts.
The validation can include a number of additional steps, as required by a project or a consortium. Phenopacket-tools offers several off-the-shelf validators and the CLI uses the validators in the validation workflow if the required resources are present.
Usage
The validate command can validate one or more phenopacket files provided either via standard input or as positional parameters. Results are written into the standard output in CSV format, with an optional header containing the validation metadata. The header lines start with # and contain the phenopacket-tools version, the date and time of validation, and the list of validators that were run. The header is followed by a row with column names and the individual validation results.
Base validation example
Let’s demonstrate the base validation usage using a few examples. A phenopacket can be validated from a stream:
cat phenopacket.json | pxf validate
or as a positional parameter:
pxf validate phenopacket.json
Use -H | --include-header to include the validation metadata in the output and store the results in a file:
pxf validate -H phenopacket.json > phenopacket.validation.csv
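Tools that read CSV do not always skip comment lines, so the metadata header may need to be removed before further processing. The sketch below relies only on the fact stated above that header lines start with #; strip_validation_header is a hypothetical helper name.

```shell
# Print a validation report without its metadata header (lines starting with '#').
# Reads the named files, or standard input when no argument is given.
strip_validation_header() {
  sed '/^#/d' "$@"
}

# Example calls:
# strip_validation_header phenopacket.validation.csv
# pxf validate -H phenopacket.json | strip_validation_header
```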
Custom validation example
On top of the base validation, phenopacket-tools supports validation using custom requirements. See the Custom validation section to learn how to define a custom JSON schema.
The CLI can be provided with one or more JSON schema documents using the --require option:
pxf validate --require custom-schema.json phenopacket.json
Phenotype validation
Phenopacket-tools includes off-the-shelf validators for pointing out annotation errors in phenopackets that use the Human Phenotype Ontology (HPO) to represent clinical findings of the subjects. Based on an HPO file, the validators check for the presence of obsolete or unknown ontology concepts and for violations of the annotation propagation rule.
The CLI will automatically add the phenotype validation steps into the validation workflow if the path to an HPO JSON file is provided via the --hpo option:
pxf validate --hpo hp.json phenopacket.json
Note
The bulk validation where phenopackets are provided as positional parameters is much faster since the HPO graph parsing, a computationally expensive operation, is done only once.
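In practice this means passing all files to a single invocation instead of looping over them. A minimal sketch, assuming pxf resolves to the CLI; validate_with_hpo is a hypothetical helper name and the file names are placeholders.

```shell
# Validate many phenopackets against one HPO file in a single invocation.
# Usage: validate_with_hpo HPO_JSON FILE...
validate_with_hpo() {
  hpo=$1
  shift
  # One call for all inputs, so the HPO file is parsed only once.
  # A per-file loop (for f in ...; do pxf validate --hpo "$hpo" "$f"; done)
  # would repeat the expensive parsing for every file.
  pxf validate --hpo "$hpo" "$@"
}

# Example call:
# validate_with_hpo hp.json phenopackets/*.json > validation.csv
```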
Organ system validation
It can be desirable to check annotation of specific organ systems in the phenopacket. Phenopacket-tools can validate annotation of specific organ systems by using the corresponding top-level HPO concepts, such as Eye, Cardiovascular, or Respiratory organ systems.
The organ systems are provided using the -s | --organ-system option:
pxf validate --hpo hp.json \
-s HP:0000478 \
-s HP:0001626 \
-s HP:0002086 \
phenopacket.json
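When the list of organ systems varies between runs, the repeated -s options can be assembled from a list of HPO term IDs. The helper below is a sketch; validate_organ_systems is a hypothetical name and pxf is assumed to resolve to the CLI. It produces the same command line as the example above.

```shell
# Validate a phenopacket for annotation of the given organ systems.
# Usage: validate_organ_systems HPO_JSON PHENOPACKET HPO_TERM_ID...
validate_organ_systems() {
  hpo=$1
  input=$2
  shift 2
  # Turn each remaining argument into a "-s TERM" pair. POSIX sh has no arrays,
  # so the pairs are appended to "$@" and the original terms are shifted away.
  n=$#
  for term in "$@"; do
    set -- "$@" -s "$term"
  done
  shift "$n"
  pxf validate --hpo "$hpo" "$@" "$input"
}

# Example call:
# validate_organ_systems hp.json phenopacket.json HP:0000478 HP:0001626 HP:0002086
```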
Note
The organ system validation requires an HPO file to run.