The Resilience Project – Software
The Resilience Project – Software
The Resilience Project offers two software tools to the community to scan their sequencing or genotyping data for potential unexpected heroes.
You can use these two Java tools to scan VCF and other files for 1) exact allele matches against the UHP core allele panel as well as 2) early terminations in a list of genes; from version 2 onwards, formats such as GFF, CNS, and samtools pileup calls will be supported in additio to VCF.
There is one Java Archive (.jar) for each to the two tasks, searching for exact alleles and searching for early terminations (nonsense or frameshift):
- SearchForUnexpectedHeroes.jar (v1.16 Mar-14-2014)
- SearchForUnexpectedHeroes2.jar (v2.00b Mar-14-2014)
- SearchForEarlyTermination.jar (v1.07 Aug-07-2014)
All tools can be run using Java version 1.6 or above (see below for example calls) and do not need to be unpacked. The archives contain all necessary panel and gene information; however, if you would like to edit them, use the following two files as a baseline:
- panel.vcf — SHP allele panel
- genes.list — SHP gene/disease panel
- The allele panel is also available with HG18 coordinates:
- panel_hg18.vcf
“SearchForEarlyTermination” can work off of both the VCF and the gene list. “SearchForUnexpectedHeroes” will consider the panel.vcf only.
Note that in both cases, certain annotations are mandatory. In genes.list, the columns for gene symbol, phenotype, inheritance, chromosome, gene start and end position have to be filled in. The panel.vcf requires the chromosomal location and alleles of the variant (chr, pos, ref, alt), the phenotype the inheritance (as INFO fields PHEN and INH, respectively.)
VCF input files
The tools can work on single-sample, multi-sample, and summarized VCF files. For a format description of VCF, see http://www.1000genomes.org/node/101.
Single sample VCF files have two extra columns after the INFO column, namely FORMAT and a column quantifying data on each variant for this sample. The data column typically has a header matching the sample name. Multi-sample VCF files have one column describing the FORMAT, and then one column per sample using the keys defined in FORMAT to describe each sample for each position quantitatively. In both cases of single- and multi-sample VCF files, the sample column(s) would typically show the called genotype (usually using a field called “GT” and listed in the FORMAT column) and the genotype likelihood (“GL”) but might contain read depth (DP), alternate allele counts (AC), among others. Our tools make use of the GT information, comparing it to the required inheritance for each disease allele (autosomal recessive or dominant, X-linked). A match with an autosomal-recessive disease allele will be reported only for samples that have been called homozygous for the alternate allele (GT has a value of “1/1”). The example at the top of this page http://www.1000genomes.org/node/101 shows a multi-sample VCF file for three samples named NA00001, NA00002, and NA00003; per sample, the file shows quantitative data describing the genotype (GT), genotype quality (GQ), read depth (DP), and haplotype quality (HQ).
Summarized VCF files lack information on specific samples, but rather sum up allele counts and frequencies over all samples, potentially separated by population. An example for a summarized VCF file comes from the 1000 Genomes Project: ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz. This file summarizes, per observed variant site, the allele frequencies across four populations, but contains no sample-specific data on any of the 1092 individuals used. The latter are contained in files that are provided by chromosome, for instance, ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz. SearchForUnexpectedHeroes can work with both files; however, if you use the first file, the output of SearchForUnexpectedHeroes will only tell you that there might be a candidate in your dataset; it cannot tell you which sample(s) is the candidate or give you GT/GQ measures. Still, it will be valuable to know whether there is a candidate in your (summarized) data, so that you can then follow up and look into the details for the specific variant site or sites.
We highly recommend that you compress and index your VCF files using bgzip and tabix, respectively. These tools come in on archive that you can obtain from http://sourceforge.net/projects/samtools/files/tabix/. See http://samtools.sourceforge.net/tabix.shtml for a description. To compress a VCF file, run
bgzip myvariantsfile.vcf
which will create a file myvariantsfile.vcf.gz. Note that while bgzip and gzip share a common file extension (.gz), they are not the same. Gzip is installed by default on many (Unix/Linux) machines. The subsequent run of tabix will only work properly with a bgzipped file, as will our tools! Run tabix like so:
tabix -f -p vcf myvariantsfile.vcf.gz
which will produce an index file myvariantsfile.vcf.gz.tbi. Subsequently, when calling our tools, you will need to specify the filename (and path, if necessary) of the .gz file, not the .gz.tbi file. However, both files must reside in the same folder.
GFF input files
Since version 2, GFF is supported; see http://useast.ensembl.org/info/website/upload/gff.html for an description of this format.
For faster access, you can compress and index your GFF files (similar to VCF described above); the command for tabix would be
tabix -f -p gff myvariantsfile.gff.gz
To run SearchForUnexpectedHeroes2 on GFF, add the command line parameter --gff to each call.
SearchForUnexpectedHeroes
To see a list of command line options, run
$ java -jar SearchForUnexpectedHeroes.jar SearchForUnexpectedHeroes Scans VCF files for unexpected hero candidates. For large VCF files (more than a few hundred variants) it is recommended to use a bgzipped file together with a tabix index. ScanVcf <vcf-file> | -i <list> | -d <folder> Parameters: <vcf-file> - VCF file, format spec 4.0 or above; accepts .vcf and .vcf.gz; for zipped files, a tabix index named .vcf.gz.tbi is required -i <list> - reads the full paths and names of VCF(.gz) files from this file; -d <dir> - scans all VCF files (.vcf and .vcf.gz) in the given directory Any combination of multiple -i, -d, and <vcf-file> is supported -p <vcf> - read the UHP core allele panel from this VCF file -o <file> - write output to the specified file; format: VCF -v <int> - verbosity --chr - use this option if the prefix of the chromosome name in your file is 'chr'; note that 'chr' should be spelled in exactly the same capitalization as used in your files: Chr, CHR, CHROM are all valid but are used case-sensitively -v <int> - verbosity Supported from version 2 onwards: --gff - Input file is in GFF format; default: VCF --unsorted - Positions in the input file are not sorted by chr (1..22,X,Y,MT) and start; default: sorted; VCF files have to be sorted in any case --DP <i> - minimum coverage required at any given site to be considered --GQ <f> - minimum required genotype quality
You can run SearchForUnexpectedHeroes against a single VCF file, multiple VCF files, all VCF files found in a folder, a file containing a list of VCF files, or any combination. A simple example to scan a single VCF file would look like
java -jar SearchForUnexpectedHeroes.jar myvariantsfile.vcf.gz
If you do not specify a panel VCF using the -p option, SearchForUnexpectedHeroes will use the internal, “hard-coded” allele panel. Should you want to use your own panel to scan the data against, or want to use the dmpanel.vcf provided by us that contains mutations from HGMD, use the -p switch to define another panel, which has to be a VCF file:
java -jar SearchForUnexpectedHeroes.jar -p myownpanel.vcf myvariantsfile.vcf.gz
SearchForEarlyTermination
To see a list of command line options, run
$ java -jar SearchForEarlyTermination.jar ScanVcfForEarlyTermination Scans VCF files for unexpected hero candidates by checking for deleterious mutations (nonsense and frameshift) in disease-associated genes. For large VCF files (more than a few hundred variants) it is recommended to use a bgzipped file together with a tabix index. ScanVcf <vcf-file> | -i <list> | -d <folder> Parameters: <vcf-file> - VCF file, format spec 4.0 or above; accepts .vcf and .vcf.gz; for zipped files, a tabix index named .vcf.gz.tbi is required -i <list> - reads the full paths and names of VCF(.gz) files from this file; -d <dir> - scans all VCF files (.vcf and .vcf.gz) in the given directory Any combination of multiple -i, -d, and <vcf-file> is supported -p <vcf> - read your own UHP allele panel from this VCF file -g <file> - read your own list of genes from this file -o <file> - write output to the specified file; format: VCF -chr - use this option if the prefix of the chromosome name in your VCF is 'chr' -v <int> - verbosity
SearchForEarlyTermination searches the genomic regions specified in genes.list or panel.vcf in the given VCF file and looks for any variant coding for a nonsense mutation or a frameshift. To do so, the VCF file needs to be annotated with genes and each variant’s effect (missense, nonsense, frameshift, intron, splice site, etc.). We recommend to use the tool snpEff (available at http://snpeff.sourceforge.net/) for this task. snpEff also takes a VCF file as input and produces a new VCF file which add aforementioned annotations to the VCF’s INFO column (“EFF” field).
SearchForEarlyTermination currently supports the following kinds of annotations for variant impacts:
- snpEff: adds a field ‘EFF=’ to the INFO column in VCF files; this will show, per available transcript, the impact in a form like
EFF=INTRON(MODIFIER|||||HPS4|retained_intron|CODING|ENST00000485842|4|1), STOP_GAINED(HIGH|NONSENSE|Cga/Tga|R241*|721|HPS4|protein_coding|CODING|ENST00000398141|8|1) - UK10K uses the CSQ field in the INFO column:
CSQ=ENST00000341426:NADK:inframe_insertion:1309-1336:437-446: EEEEEEEEEG>EEEEEEEEEEG+ENST00000341991:NADK:inframe_insertion:1309-1336:437-446: EEEEEEEEEG>EEEEEEEEEEG+ENST00000342348:...;... - EVS/ESP use GL and FG fields in the INFO columns to denote the gene and functional impact, respectively:
GL=PEX26;...;FG=NM_017929.5:missense,NM_001199319.1:missense,NM_001127649.2:missense;...
To run snpEff, you will need to download the snpEff tool itself (a Java archive just like our two tools) from here: http://snpeff.sourceforge.net/download.html. Unzip the downloaded archive into a folder of your choice.
You will also need snpEff’s copy of a transcript model (they use ENSEMBL). To obtain a copy, run
java -jar snpEff.jar download GRCh37.69
which will download a transcript model mapped to the latest copy of the human reference genome (for our purposes, equal to HG19). You can now run snpEff on your VCF file to produce an annotated version for our SearchForEarlyTermination tool.
java -Xmx4g -jar snpEff.jar eff -v GRCh37.66 myvariantsfile.vcf > myvariantsfile.annotated.vcf
To give you an idea about the expected execution time of snpEff: on a typical machine, they can annotate the almost 40 millions variants in the 1000 Genomes in about 15 minutes (see http://snpeff.sourceforge.net/1kg.html for an overview).
Don’t forget to bgzip and tabix the resulting VCF file. You should then see myvariantsfile.annotated.vcf.gz and a file myvariantsfile.annotated.vcf.gz.tbi in your folder. On those files, you will then be able to run “SearchForEarlyTermination” using
java -jar SearchForEarlyTermination.jar myvariantsfile.annotated.vcf.gz
Output
The output for both tools with be a human-readable summary of variants found in the VCF file(s) and –where available– a listing of all samples that exhibited said variant. The output fill mention the phenotype, gene, variant, and the inheritance pattern for your reference. Here is an example of SearchForUnexpectedHeroes run on the 1000 Genomes data for chromosome 3, which contains per-individual data:
Command:
java -jar SearchForUnexpectedHeroes.jar ALL.chr6.snps_indels_svs.genotypes.vcf.gz
Output:
Matches for 21-Hydroxylase-Deficient Congenital Adrenal Hyperplasia, Classic, CYP21A2 c.719T>A p.M240K 6:32007593T>A (rs6476) are Sample NA19095 in file ALL.chr6.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz, genotype 1|1, matches panel entry 21-Hydroxylase-Deficient Congenital Adrenal Hyperplasia, Classic, inheritance: AR
showing you that there was one match against our default panel, namely for 21-Hydroxylase-Deficient Congenital Adrenal Hyperplasia, in the gene CYP21A2. The sample ‘NA19095′ in the 1000 Genomes data appears to be a homozygous carrier of the specified genetic variant, rendering them a candidate.
In case your VCF contained only summarized data and no sample-specific information (see above), the output will list possible candidate matches, similar to this example, which originates from a run of SearchForUnexpectedHeroes on the 1000 Genomes’ ALL.wgs.phase1_release_v3.20101123.snps_indels_sv.sites.vcf.gz file:
#WARN found a match but no sample data for this position: 1:40557070-40557070:T>A PPT1:p.R122W (AR) #VCF entry for the potential match: 1 40557070 rs137852695 T A 100 PASS AA=T;AC=3;AF=0.0014;AN=2184;AVGPOST=1.0000;ERATE=0.0003;EUR_AF=0.0040;LDAF=0.0014;RSQ=0.9844;SNPSOURCE=EXOME;THETA=0.0004;VT=SN #WARN found a match but no sample data for this position: 1:76198337-76198337:G>A ACADM:p.E43K (E18K) (AR) #VCF entry for the potential match: 1 76198337 rs147559466 G A 100 PASS SNPSOURCE=LOWCOV,EXOME;AA=G;AN=2184;AC=6;THETA=0.0005;VT=SNP;LDAF=0.0028;RSQ=0.9792;ERATE=0.0003;AVGPOST=0.9999;AF=0.0027;EUR_AF=0.01 #WARN found a match but no sample data for this position: 1:76226846-76226846:A>G ACADM:p.K329E (K304E) (AR) #VCF entry for the potential match: 1 76226846 rs77931234 A G 100 PASS AVGPOST=1.0000;SNPSOURCE=LOWCOV,EXOME;AN=2184;VT=SNP;AA=A;LDAF=0.0023;AC=5;RSQ=1.0000;THETA=0.0010;ERATE=0.0002;AF=0.0023;EUR_AF=0.01
It would now be up to the user to follow up and obtain per-sample data and interpret the potential matches. In this example, some interesting data still comes with the output, namely allele frequencies for different populations, giving insights into possible carriers of heterozygous and homozygous variants.
VCF output
You can also obtain the list of all matches in a single VCF file. Use the -o <file> command line parameter to specify the name of the output file. VCF entries in the output file will reflect the information from the unexpected hero core allele panel (position, phenotype, gene) and show the matching sample names and source file names. Here is an example output:
##fileformat=VCFv4.1 ##INFO=<ID=AGE,Number=1,Type=Integer,Description="Age of onset: 1, 2, or 3"> ##INFO=<ID=CLIN,Number=A,Type=String,Description="Clinical annotation"> ##INFO=<ID=CYT,Number=A,Type=String,Description="Cytoband"> ##INFO=<ID=DCAT,Number=A,Type=String,Description="Disease category"> ##INFO=<ID=EXCL,Number=A,Type=String,Description="Flags entries to be excluded from further analyses, for any reason"> ##INFO=<ID=EV,Number=1,Type=Integer,Description="Evidence: 0, 1, or 2"> ##INFO=<ID=FLAG,Number=A,Type=String,Description="Flags incorrect or irrelevant variants: mild phenotype, likely polymorphism, etc. Should be excluded from further analyses."> ##INFO=<ID=GENE,Number=A,Type=String,Description="Gene symbol, gene affected by this variant"> ##INFO=<ID=GID,Number=1,Type=Integer,Description="Entrez Gene ID"> ##INFO=<ID=HGVSC,Number=A,Type=String,Description="CDS change, HGVS formatted"> ##INFO=<ID=HGVSP,Number=A,Type=String,Description="Protein change, HGVS formatted"> ##INFO=<ID=INH,Number=A,Type=String,Description="Inheritance: AR or AD"> ##INFO=<ID=MCAT,Number=1,Type=Integer,Description="Mutation category: 1=SNV, 2=small indel, 3=large indel, 4=nucleotide repeat, 5=CNV"> ##INFO=<ID=NUC,Number=A,Type=String,Description="cDNA reference"> ##INFO=<ID=PHEN,Number=A,Type=String,Description="Phenotype"> ##INFO=<ID=PNT,Number=1,Type=Integer,Description="Penetrance: 1, 2, or 3"> ##INFO=<ID=PRV,Number=A,Type=String,Description="Prevalence"> ##INFO=<ID=RF,Number=A,Type=String,Description="Literature reference, usually a PubMed ID"> ##INFO=<ID=SEV,Number=1,Type=Integer,Description="Severity: 1, 2, or 3"> ##REF=<ID=DEL,Description="Imprecise deletion, or deletion of unknown or unspecified sequence"> ##ALT=<ID=INS,Description="Imprecise insertion, or insertion of unknown or unspecified sequence"> ##FILTER=<ID=EXCL,Description="Flags entries to be excluded from further analysis, for any reason; details can be found in the INFO column for the EXCL key"> ##FILTER=<ID=FLAG,Description="Flags incorrect or irrelevant variants: mild phenotype, likely polymorphism, etc. Details can be found in the INFO column for the FLAG key"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype, 0/1 for INH=AD, 1/1 otherwise"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE 17 78078341 rs199951626 T G 100 PASS PHEN=Glycogen Storage Disease Type II / Pompe Disease;DCAT=MB;MCAT=1;PNT=3;PRV="1:14,000 in AA;1:100,000 in European";INH=AR;SEV=2;AGE=2;GENE=GAA;GID=2548;NUC=NM_000152.3;HGVSC=c.-32-13T>G (IVS1-13T>G);CYT=17q25.3;CLIN=GR;EV=0 GT 1/1 #sample=UK10K_ABDN5050487, file=UK10K_ABDN5050487.REL-2013-04-20.vcf.gz
We would be greatful if you could simply share the output you obtained with us, or gat back with questions as to how to run the tool on your data and how to interpret the results. Please feel free to contact Dr. Jörg Hakenberg at Mt Sinai, joerg.hakenberg(a)mssm.edu.