Channel: selectvariants — GATK-Forum
Viewing all 365 articles

Can SelectVariants be used to limit VCF files by interval list

Hello, I would like to subset a VCF file to keep only a few specific regions of the whole genome. I know some of your tools accept an interval list to restrict the region analyzed. Do you have a tool, or are you aware of one, that would let me do this quickly from an interval list or something similar? I could write a little script myself, but I figure that subsetting and printing a specific genomic region of interest from a VCF file has to be a problem GATK has already solved.

Thanks for your help!
~Sean
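
For readers who do want the "little script" route mentioned above, here is a minimal sketch in plain Python. It is not GATK code. It assumes intervals are written as chr:start-stop (or chr:pos), that the VCF is tab-delimited, and that a record is kept when its POS falls inside an interval; the function names are made up for illustration.

```python
def parse_intervals(lines):
    """Parse 'chr:start-stop' or 'chr:pos' strings into (chrom, start, stop) tuples."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        chrom, _, span = line.rpartition(":")
        start, _, stop = span.partition("-")
        intervals.append((chrom, int(start), int(stop or start)))
    return intervals


def subset_vcf(vcf_lines, intervals):
    """Yield header lines plus records whose POS lies inside any interval."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if any(c == chrom and s <= int(pos) <= e for c, s, e in intervals):
            yield line
```

This scans every interval for every record, which is fine for a handful of regions but would need an index or sorted sweep for large interval lists.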


SelectVariants for an array

My indel calling VCF has the following information:

##INFO=<ID=N_MQ,Number=2,Type=Float,Description="In NORMAL: average mapping quality of consensus indel-supporting reads/reference-supporting reads">

so one example of my indels is:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  exom_aladaz
chr1    54690639        .       G       GCCC    .       .       N_AC=0,0;N_DP=19;N_MM=0.0,0.36842105;N_MQ=0.0,33.894737;N_NQSBQ=0.0,31.80368;N_NQSMM=0.0,0.0;N_SC=0,0,19,0;SOMATIC;T_AC=6,6;T_DP=11;T_MM=0.5,0.2;T_MQ=44.5,29.0;T_NQSBQ=30.85,33.36;T_NQSMM=0.016666668,0.0;T_SC=6,0,5,0  GT:GQ   0/1:0

If I want to filter indel calls with N_MQ < 30 for both the consensus indel-supporting reads and the reference-supporting reads, how should I write the SelectVariants expression for a value such as N_MQ=0.0,33.894737?
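
To make the two-valued field concrete, here is a small Python sketch (not a SelectVariants expression) that splits an INFO string, reads both N_MQ values, and keeps a record only when every value is at least 30. The "every value" rule is an assumption about what the poster wants; swap all for any to change it. The function names are hypothetical.

```python
def info_values(info, key):
    """Return the comma-separated values of one INFO key as floats, or None if absent."""
    for field in info.split(";"):
        name, sep, value = field.partition("=")
        if sep and name == key:
            return [float(v) for v in value.split(",")]
    return None


def passes_n_mq(info, threshold=30.0):
    """True when N_MQ is present and every one of its values is >= threshold."""
    values = info_values(info, "N_MQ")
    return values is not None and all(v >= threshold for v in values)
```

With the example record above, N_MQ parses to two floats and the record fails the check because the first value is 0.0.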

SelectVariants

Hello,

Can you use SelectVariants with a combined VCF to produce a new VCF containing only the variants present in a particular sample? For example, could you select the de novo mutations out of a combined family VCF?

Thanks

Kath

SelectVariants and discordance

Greetings GATK team!

I hope I'm not making a duplicate question here, but I couldn't find anything regarding this in the forum.

Basically, what I want to do is to use SelectVariants to filter against another call set, but I do not want to be as strict as using -discordance (i.e. 100% discordance rate between the two call sets). I want to say for example: "filter call set A against variants that occur in >90% of call set B".

Is there a way to do this with JEXL expressions maybe?

Kind regards

Release notes for GATK version 2.2

GATK release 2.2 was released on October 31, 2012. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Base Quality Score Recalibration

  • Improved the algorithm around homopolymer runs to use a "delocalized context".
  • Massive performance improvements that allow these tools to run efficiently (and correctly) in multi-threaded mode.
  • Fixed bug where the tool failed for reads that begin with insertions.
  • Fixed bug in the scatter-gather functionality.
  • Added new argument to enable emission of the .pdf output file (see --plot_pdf_file).

Unified Genotyper

  • Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
  • The genotyper no longer emits the Strand Bias (SB) annotation by default. Use the --computeSLOD argument to enable it.
  • Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
  • Fixed annotations (AD, FS, DP) that were miscalculated when run on a Reduce Reads processed bam.
  • Fixed bug for the general ploidy model that occasionally caused it to choose the wrong allele when there are multiple possible alleles to choose from.
  • Fixed bug where the inbreeding coefficient was computed at monomorphic sites.
  • Fixed edge case bug where we could abort prematurely in the special case of multiple polymorphic alleles and samples with drastically different coverage.
  • Fixed bug in the general ploidy model where it wasn't counting errors in insertions correctly.
  • The FisherStrand annotation is now computed both with and without filtering low-qual bases (we compute both p-values and take the maximum one - i.e. least significant).
  • Fixed annotations (particularly AD) for indel calls; previous versions didn't accurately bin reads into the reference or alternate sets correctly.
  • Generalized ploidy model now handles reference calls correctly.

Haplotype Caller

  • Massive runtime performance improvement for multi-allelic sites; -maxAltAlleles now defaults to 6.
  • Massive runtime performance improvement to the HMM code which underlies the likelihood model of the HaplotypeCaller.
  • Added the ability to automatically down-sample out low grade contamination from the input bam files using the --contamination_fraction_to_filter argument; by default the value is set at 0.05 (5%).
  • Now requires at least 10 samples to merge variants into complex events.

Variant Annotator

  • Fixed annotations for indel calls; previous versions either didn't compute the annotations at all or did so incorrectly for many of them.

Reduce Reads

  • Fixed several bugs where certain reads were either dropped (fully or partially) or registered as occurring at the wrong genomic location.
  • Fixed bugs where in rare cases N bases were chosen as consensus over legitimate A,C,G, or T bases.
  • Significant runtime performance optimizations; the average runtime for a single exome file is now just over 2 hours.

Variant Filtration

  • Fixed a bug where DP couldn't be filtered from the FORMAT field, only from the INFO field.

Variant Eval

  • AlleleCount stratification now supports records with ploidy other than 2.

Combine Variants

  • Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
  • Now outputs the first non-missing QUAL, not the maximum.

Select Variants

  • Fixed bug where the AD field was not handled properly. We now strip the AD field out whenever the alleles change in the combined file.
  • Removed the -number argument because it gave biased results.

Validate Variants

  • Added option to selectively choose particular strict validation options.
  • Fixed bug where mixed genotypes (e.g. ./1) would incorrectly fail.
  • Improved the error message around unused ALT alleles.

Somatic Indel Detector

  • Fixed several bugs, including missing AD/DP header lines and putting annotations in correct order (Ref/Alt).

Miscellaneous

  • New CPU "nano" parallelization option (-nct) added GATK-wide (see docs for more details about this cool new feature that allows parallelization even for Read Walkers).
  • Fixed raw HapMap file conversion bug in VariantsToVCF.
  • Added GATK-wide command line argument (-maxRuntime) to control the maximum runtime allowed for the GATK.
  • Fixed bug in GenotypeAndValidate where it couldn't handle both SNPs and indels.
  • Fixed bug where VariantsToTable did not handle lists and nested arrays correctly.
  • Fixed bug in BCF2 writer for case where all genotypes are missing.
  • Fixed bug in DiagnoseTargets when intervals with zero coverage were present.
  • Fixed bug in Phase By Transmission when there are no likelihoods present.
  • Fixed bug in fasta .fai generation.
  • Updated and improved version of the BadCigar read filter.
  • Picard jar remains at version 1.67.1197.
  • Tribble jar remains at version 110.

Filtering VCF files

I have used the UnifiedGenotyper to call variants on a set of ~2400 genes (TruSeq Illumina data) from 28 different samples mapped against a preliminary draft genome. I do not have a defined set of SNPs or INDELs to use in recalibration via VQSR.

While the raw VCF has plenty of very high QUAL scores, not a single call has PASS in the FILTER field; all are ".". If I use SelectVariants to filter the VCF on high QUAL or DP values, or a combination, the FILTER field remains "." for the returned variants.

Am I doing something wrong, or is the raw file telling me that none of the variant calls are meaningful, in spite of their high QUAL values?

Is there a "best practices" way to go about filtering such a dataset when VQSR can't be employed? If so, I haven't found it.

Select indels with length smaller than 10.

How can I select indels with length smaller than 10 bp from a VCF file?

I tried

java -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fa --variant INDEL.vcf -o INDEL_maxLenght10.vcf -select 'vc.getIndelLengths().0 < 10'

but the output still contains all the Indels, also the ones larger than 10 bp.
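
As a cross-check outside GATK, the length test can be sketched in a few lines of Python. This assumes indel length means the absolute difference between the REF length and each ALT length, and that a record is kept only when all of its ALT alleles are under the limit; both are assumptions, and the function names are illustrative.

```python
def indel_lengths(ref, alt):
    """Absolute length difference between REF and each comma-separated ALT allele."""
    return [abs(len(a) - len(ref)) for a in alt.split(",")]


def keep_short_indels(vcf_lines, max_len=10):
    """Yield headers and records whose indel alleles are all shorter than max_len."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if all(n < max_len for n in indel_lengths(fields[3], fields[4])):
            yield line
```

Comparing the output of a script like this with the SelectVariants output can show whether the JEXL expression is being evaluated at all.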

Selecting all variants that overlap a specific position

Hi,

I have a VCF file of indels, and an interval file of single-base positions I am interested in. I would like to select all of the variants in the file that overlap the positions I'm looking at. Using SelectVariants with my interval list returns nothing, because the indels are not contained within any of the intervals. I don't want to add an arbitrary amount of padding to my intervals, because I don't want to include nearby variants that don't actually overlap my sites.

Is there a way to do this easily in GATK?
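
The overlap test described here can be written down directly. The sketch below is plain Python, not a GATK feature, and it assumes a variant occupies the reference span POS through POS + len(REF) - 1. Under that assumption a deletion covers every deleted base, while an insertion covers only its anchor base. The names are hypothetical.

```python
def overlaps(pos, ref, site):
    """True if a 1-based site lies within the span POS..POS+len(REF)-1."""
    return pos <= site <= pos + len(ref) - 1


def select_overlapping(vcf_lines, sites):
    """Keep records that overlap a site; sites maps chromosome -> set of positions."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos, _id, ref = line.split("\t")[:4]
        if any(overlaps(int(pos), ref, s) for s in sites.get(chrom, ())):
            yield line
```

Because the span comes from each record's own REF allele, no padding of the intervals is needed.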


De novo quality scores

Hello,

I was just wondering if anyone uses GATK's SelectVariants walker to call de novo mutations (Mendelian violations) and, if so, what -mvq cut-off do they use? My data is exome sequencing with a large range of read depths - from mean target coverage of 14X to >50X.

Thanks,

Kath

SelectVariants '-select' example and syntax

Just noticed that SelectVariants produces "ERROR MESSAGE: Invalid argument value '>' at position 10" when I use the '-select' parameter with the syntax given in your third example [-select "QD > 10.0"].
I'm using GATK 2.6-5, java/1.7.0_25.
It worked well without the whitespace in the expression [-select "QD>10.0"]. Hope that's not just me ;-)
Also, does the example miss the line continuation character just before the '-select' option?

No worries, many thanks for the great tool!
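
The message quoted above would be consistent with the expression reaching GATK as separate tokens rather than one argument, although the post does not confirm the cause. One way to rule the shell out is to build the argument list explicitly, as in this Python sketch. The flags are the ones already used in this thread; the file names are placeholders.

```python
def select_command(expression, reference="ref.fa",
                   variant="input.vcf", output="output.vcf"):
    """Build an argument list in which the JEXL expression is a single element."""
    return [
        "java", "-jar", "GenomeAnalysisTK.jar",
        "-T", "SelectVariants",
        "-R", reference,
        "--variant", variant,
        "-o", output,
        "-select", expression,  # stays one argument even with spaces inside
    ]
```

Passing the list returned by select_command("QD > 10.0") to subprocess.run avoids any shell word splitting, so the whitespace inside the expression is preserved.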

Identify Compound Heterozygotes Using Select Variants

Hello Team,

Is there a best practice for finding compound heterozygotes using GATK? I can easily find recessive pattern variants using the SelectVariants tool, however, I have not been able to find a way to select compound hets. I am sure the Broad has a straightforward way. Is it somehow integrated into the GATK tool set?

I have considered using something like Gemini, but I would prefer to keep my tool use to as few product lines as possible.

Thanks for your help!
Sean

Selecting homozygote calls using SelectVariants

Hi,

Is there a way to select ONLY homozygote calls from a vcf file using SelectVariants?

I understand that the GT for homozygotes is 1/1, whereas the genotype for heterozygotes is 0/1. But when I use the following command, I get an empty VCF file (i.e. only a header):

$ java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta --variant input.vcf -select 'GT == "1/1";' -o output.vcf

I'd appreciate your help on this matter.

Thanks!

Sagi
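
To pin down what "only homozygote calls" means at the record level, here is a plain Python sketch that reads the GT sub-field of one sample column. It assumes the poster wants homozygous-variant genotypes (1/1, 2/2, phased or unphased) and not 0/0; that reading, and the function names, are assumptions.

```python
def is_hom_var(gt):
    """True for a called genotype whose alleles are identical and non-reference."""
    alleles = gt.replace("|", "/").split("/")
    return "." not in alleles and len(set(alleles)) == 1 and alleles[0] != "0"


def select_hom_var(vcf_lines, sample_index=0):
    """Yield headers and records where the chosen sample is homozygous variant."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        gt_slot = fields[8].split(":").index("GT")
        gt = fields[9 + sample_index].split(":")[gt_slot]
        if is_hom_var(gt):
            yield line
```

The GT value lives inside the per-sample column, not in INFO, which is why the sketch looks it up through the FORMAT column.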

Using SelectVariants to select for multiple expressions

Hi,

I am using both GATK's UnifiedGenotyper and samtools mpileup as callers.

I've used CombineVariants in order to merge the two sets into a single .vcf file as follows:

java -Xmx4g -jar GenomeAnalysisTK.jar -T CombineVariants -R reference.fasta --variant:GATK GATK.vcf --variant:samtools samtools.vcf -o GATK_samtools.union.vcf -genotypeMergeOptions PRIORITIZE -priority GATK,samtools --filteredrecordsmergetype KEEP_UNCONDITIONAL

Now, I would like to select all calls that were called by both callers, regardless of whether they've been filtered or not.

From opening the GATK_samtools.union.vcf file, I understand that I need to select for the following expressions:

set=Intersection
set=FilteredInAll
set=filterInGATK-samtools

(I was also wondering why I don't get an expression like 'filterInsamtools-GATK'? does this have anything to do with the PRIORITIZE command?)

So... I've been trying to run the following with no luck (i.e. the output .vcf file doesn't contain any variants, but rather only the header):

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta --variant GATK_samtools.union.vcf -select 'set == "Intersection"; -select 'set == "FilteredInAll";' -select 'set == "filterInGATK-samtools";' -o GATK_samtools.overlap.vcf

I've also tried the following, but in this case the output contains only the 'set=Intersection' variants, without the rest:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta --variant GATK_samtools.union.vcf -select 'set == 'Intersection';'FilteredInAll';'filterInGATK-samtools'" -o GATK_samtools.overlap.vcf

I'd appreciate any help on this.

Thanks!

Sagi
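
The selection being attempted is a logical OR over three values of the set annotation. As a reference for the intended result, here is a plain Python sketch (not a JEXL expression) that keeps a record when its set value is any of the three listed above. The function names are hypothetical.

```python
WANTED = {"Intersection", "FilteredInAll", "filterInGATK-samtools"}


def set_annotation(info):
    """Return the value of the 'set' key in an INFO string, or None if absent."""
    for field in info.split(";"):
        name, sep, value = field.partition("=")
        if sep and name == "set":
            return value
    return None


def called_by_both(vcf_lines, wanted=WANTED):
    """Yield headers and records whose set annotation is one of the wanted values."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        info = line.rstrip("\n").split("\t")[7]
        if set_annotation(info) in wanted:
            yield line
```

Counting the records this keeps gives a number to compare against whatever the SelectVariants run returns.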

Filtering Variants using the Format Column + JEXL Oddities

Hi Team,
I have a VCF which I'd like to filter by variant frequency. The problem is, my frequencies are percentages rather than decimals. Is there a workaround in JEXL which allows it to parse the '%' operator as a percentage (or ignore it entirely) rather than considering the field a string upon seeing the modulo operator?
The VCF also has two sample columns after the FORMAT column (a normal and a tumor). Is it possible to drill down into these using just the genotypeFilterExpression/genotypeFilterName flags, or must I do something else?

Thanks,
Eric T Dawson
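
One workaround that sidesteps the '%' problem entirely is to convert the values before filtering. The Python sketch below does that conversion and pulls one FORMAT key out of every sample column. The key name FREQ used in the usage note is an assumption, since the post does not name the field, and the function names are illustrative.

```python
def percent_to_fraction(text):
    """Convert '12.5%' to 0.125; values without a '%' are parsed unchanged."""
    text = text.strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    return float(text)


def sample_values(record, key):
    """Return the raw value of one FORMAT key for every sample column of a record."""
    fields = record.rstrip("\n").split("\t")
    slot = fields[8].split(":").index(key)
    return [sample.split(":")[slot] for sample in fields[9:]]
```

For a record with a normal and a tumor column, [percent_to_fraction(v) for v in sample_values(record, "FREQ")] yields one decimal frequency per sample, which can then be compared numerically.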

Filtering on allele balance by sample using jexl

Hello,

I would like to filter my variants using the SelectVariants walker but it throws an error when I try to filter on allele balance by sample. The jexl expression I use is:

vc.getGenotype("sample").getAB()>=0.25

error is:
unknown, ambiguous or inaccessible method getAB

Is there any way of filtering on this parameter?

Best wishes,

Kath
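
If the per-sample AD (allelic depth) field is present, allele balance can be derived from it outside JEXL. The Python sketch below assumes allele balance means ref / (ref + alt) computed from the AD counts; that definition is an assumption and may differ from the annotation the poster has in mind.

```python
def allele_balance(ad):
    """Reference fraction from an AD string such as '5,15'; None if unavailable."""
    try:
        counts = [int(c) for c in ad.split(",")]
    except ValueError:
        return None  # missing value such as '.'
    total = sum(counts)
    return counts[0] / total if total else None


def passes_allele_balance(ad, minimum=0.25):
    """True when allele balance can be computed and is at least the minimum."""
    ab = allele_balance(ad)
    return ab is not None and ab >= minimum
```

Records whose AD is missing or has zero total depth return None here and therefore fail the threshold, which is a design choice the reader may want to reverse.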


Use VariantFiltration/SelectVariants tool to choose SNPs matching a position

Hi,

I was wondering if you could use the toolkit to generate a separate VCF file containing only SNPs that are found at a predetermined chromosome and base pair position. I have a plink file which I want to convert back to VCF format and it seems unbelievably hard to do so I thought this may be a good way to get around that problem?

I am aware that vcftools offers this function with the "--positions" option; however, for some reason I am getting far more variants than I listed, and I can see nothing obviously wrong with my listed positions or my VCF file.

Thanks in advance,
Danica
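
For comparison with the vcftools result, an exact-match lookup is easy to sketch in Python. It assumes the position list has one "chrom pos" pair per line, that chromosome names are spelled the same way in both files, and that a SNP means REF and every ALT allele are single bases. The function names are hypothetical.

```python
def load_positions(lines):
    """Parse whitespace-separated 'chrom pos' lines into a set of (chrom, pos) pairs."""
    positions = set()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            positions.add((parts[0], int(parts[1])))
    return positions


def is_snp(ref, alt):
    """True when REF and every ALT allele are single bases."""
    return len(ref) == 1 and all(len(a) == 1 and a != "." for a in alt.split(","))


def select_snps_at(vcf_lines, positions):
    """Yield headers and SNP records whose (CHROM, POS) is in the position set."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        if (chrom, int(pos)) in positions and is_snp(ref, alt):
            yield line
```

Since several records can share one position (for example a SNP and an indel), getting more records than listed positions is possible even when the match itself is exact.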

SelectVariants - Error

Hi,

I am trying to subset a few samples from a VCF using the following command

java -Xmx8g -jar GenomeAnalysisTK-2.6-4-g3e5ff60/GenomeAnalysisTK.jar -T SelectVariants -R Homo_sapiens_assembly19.fasta -V Input.vcf -sf samples.txt -o Out.vcf

Getting the error

ERROR MESSAGE: Key Indel_FS,Indel_QD found in VariantContext field FILTER at 1:985458 but this key isn't defined in the VCFHeader.  We require all VCFs to have complete VCF headers by default.

I checked the Input.vcf and can find that the following lines exists in the VCF

##FILTER=<ID=Indel_FS,Description="FS>200.0">
##FILTER=<ID=Indel_QD,Description="QD<2.0">

I'm not sure why I get this error.

Any help much appreciated..
Many Thanks,
Tinu
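
The message names a single key, "Indel_FS,Indel_QD". One reading, which the post does not confirm, is that the FILTER column joins the two names with a comma, whereas the VCF format separates multiple filter names with semicolons. The Python sketch below lists FILTER entries that do not match any ##FILTER ID once the column is split on semicolons; the function names are illustrative.

```python
import re


def header_filter_ids(vcf_lines):
    """Collect the IDs declared on ##FILTER header lines."""
    ids = set()
    for line in vcf_lines:
        match = re.match(r"##FILTER=<ID=([^,>]+)", line)
        if match:
            ids.add(match.group(1))
    return ids


def undefined_filter_keys(vcf_lines):
    """Return (chrom, pos, key) for FILTER entries that the header does not define."""
    lines = list(vcf_lines)
    known = header_filter_ids(lines) | {"PASS", "."}
    problems = []
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        for key in fields[6].split(";"):
            if key not in known:
                problems.append((fields[0], int(fields[1]), key))
    return problems
```

A comma-joined value shows up in the result as one unknown key, matching the wording of the error, while the semicolon-separated form passes.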

exclude uncalled variants using SelectVariants

How can I remove the variants whose genotype is "./."?

Thanks a lot!
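
The question leaves open whether a record should go when any sample is "./." or only when all samples are. The plain Python sketch below supports both through a flag. It assumes GT is the first FORMAT sub-field and is not a SelectVariants option; the names are hypothetical.

```python
def is_no_call(sample_field):
    """True when every allele in the sample's GT sub-field is '.'."""
    gt = sample_field.split(":")[0]
    return all(allele == "." for allele in gt.replace("|", "/").split("/"))


def drop_uncalled(vcf_lines, require_all=True):
    """Drop records where all samples (or any sample) are no-calls."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        samples = line.rstrip("\n").split("\t")[9:]
        flags = [is_no_call(s) for s in samples]
        uncalled = bool(flags) and (all(flags) if require_all else any(flags))
        if not uncalled:
            yield line
```

With require_all=True only fully uncalled sites are removed; with require_all=False a single missing genotype is enough to drop the record.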

Error while running SelectVariants

Hi,

I am running SelectVariants on a VCF file to remove indels longer than 50 bp. I am getting the following error:

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.4-7-g5e89f01):
ERROR
ERROR Please visit the wiki to see if this is a known problem
ERROR If not, please post the error, with stack trace, to the GATK forum
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Line: chr1 2277268 rs140545544 CCACA C,CCACACA 515.33 PASS . GT:GQ:DP:PL:AD 2/2:5:21:.,.,381,.,5,0:1,12 1/1:45:21:723,45,0,.,.,.:0,15
ERROR ------------------------------------------------------------------------------------------

I am reporting the error for version 2.4 because it shows the actual line with the problem.

Error with version 2.7

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.7-4-g6f46d11):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Error parsing line: org.broad.tribble.readers.LineIteratorImpl@1b6dc6c9, for input source: /home/gaurav/Filter/Arabian_INDELS_NEXOME.vcf
ERROR ------------------------------------------------------------------------------------------

Error Stack trace for version 2.4

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.RuntimeException:
Line: chr1 2277268 rs140545544 CCACA C,CCACACA 515.33 PASS . GT:GQ:DP:PL:AD 2/2:5:21:.,.,381,.,5,0:1,12 1/1:45:21:723,45,0,.,.,.:0,15
at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:68)
at org.broad.tribble.TribbleIndexedFeatureReader$QueryIterator.readNextRecord(TribbleIndexedFeatureReader.java:342)
at org.broad.tribble.TribbleIndexedFeatureReader$QueryIterator.next(TribbleIndexedFeatureReader.java:297)
at org.broad.tribble.TribbleIndexedFeatureReader$QueryIterator.next(TribbleIndexedFeatureReader.java:261)
at org.broadinstitute.sting.gatk.refdata.utils.FeatureToGATKFeatureIterator.next(FeatureToGATKFeatureIterator.java:60)
at org.broadinstitute.sting.gatk.refdata.utils.FeatureToGATKFeatureIterator.next(FeatureToGATKFeatureIterator.java:42)
at org.broadinstitute.sting.gatk.iterators.PushbackIterator.next(PushbackIterator.java:65)
at org.broadinstitute.sting.gatk.iterators.PushbackIterator.element(PushbackIterator.java:51)
at org.broadinstitute.sting.gatk.refdata.SeekableRODIterator.next(SeekableRODIterator.java:223)
at org.broadinstitute.sting.gatk.refdata.SeekableRODIterator.next(SeekableRODIterator.java:66)
at org.broadinstitute.sting.utils.collections.RODMergingIterator$Element.next(RODMergingIterator.java:72)
at org.broadinstitute.sting.utils.collections.RODMergingIterator.next(RODMergingIterator.java:111)
at org.broadinstitute.sting.gatk.datasources.providers.RodLocusView.next(RodLocusView.java:122)
at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$MapDataIterator.next(TraverseLociNano.java:173)
at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$MapDataIterator.next(TraverseLociNano.java:154)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:271)
at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:145)
at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:100)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:283)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)
Caused by: java.lang.NumberFormatException: For input string: "."
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:481)
at java.lang.Integer.valueOf(Integer.java:582)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.decodeInts(AbstractVCFCodec.java:703)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:664)
at org.broadinstitute.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:114)
at org.broadinstitute.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:131)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:309)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:241)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:220)
at org.broadinstitute.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:46)
at org.broad.tribble.AsciiFeatureCodec.decode(AsciiFeatureCodec.java:65)
... 25 more

No error stack trace with version 2.7

I tried to validate the VCF file using ValidateVariants too, but I get the same error.
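
The stack trace above ends in a NumberFormatException for the input string "." inside an integer-list decoder, and the reported line has PL values that mix "." with numbers (for example .,.,381,.,5,0). To find every such record before rerunning GATK, a scan like the Python sketch below may help. It assumes a tab-delimited file and treats any multi-value PL containing "." as suspect; the function name is hypothetical.

```python
def suspect_int_lists(vcf_lines, keys=("PL",)):
    """Yield (chrom, pos, sample_index, key, value) for lists that contain '.'."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no sample columns to inspect
        fmt = fields[8].split(":")
        for index, sample in enumerate(fields[9:]):
            for key, value in zip(fmt, sample.split(":")):
                parts = value.split(",")
                if key in keys and len(parts) > 1 and "." in parts:
                    yield (fields[0], int(fields[1]), index, key, value)
```

Run on the line from the error message, this reports the PL field of both samples.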

Release notes for GATK version 2.8

GATK 2.8 was released on December 6, 2013. Highlights are listed below. Read the detailed version history overview here: http://www.broadinstitute.org/gatk/guide/version-history

Note that this release is relatively smaller than previous ones. We are working hard on some new tools and frameworks that we are hoping to make available to everyone for our next release.


Unified Genotyper

  • Fixed bug where indels in very long reads were sometimes being ignored and not used by the caller.

Haplotype Caller

  • Improved the indexing scheme for gVCF outputs using the reference calculation model.
  • The reference calculation model now works with reduced reads.
  • Fixed bug where an error was being generated at certain homozygous reference sites because the whole assembly graph was getting pruned away.
  • Fixed bug for homozygous reference records that aren't GVCF blocks and were being treated incorrectly.

Variant Recalibrator

  • Disable tranche plots in INDEL mode.
  • Various VQSR optimizations in both runtime and accuracy. Some particular details include: for very large whole genome datasets with over 2M variants overlapping the training data, randomly downsample the training set that gets used to build; annotations are ordered by the difference in means between known and novel instead of by their standard deviation; removed the training set quality score threshold; now uses 2 Gaussians by default for the negative model; the numBad argument has been removed and the cutoffs are now chosen by the model itself by looking at the LOD scores.

Reduce Reads

  • Fixed bug where mapping quality was being treated as a byte instead of an int, which caused high MQs to be treated as negative.

Diagnose Targets

  • Added calculation for GC content.
  • Added an option to filter the bases based on their quality scores.

Combine Variants

  • Fixed bug where annotation values were parsed as Doubles when they should be parsed as Integers due to implicit conversion; submitted by Michael McCowan.

Select Variants

  • Changed the behavior for PL/AD fields when it encounters a record that has lost one or more alternate alleles: instead of stripping them out these fields now get fixed.

Miscellaneous

  • SplitSamFile now produces an index with the BAM.
  • Length metric updates to QualifyMissingIntervals.
  • Provide close methods to clean up resources used while creating AlignmentContexts from BAM file regions; submitted by Brad Chapman.
  • Picard jar updated to version 1.104.1628.
  • Tribble jar updated to version 1.104.1628.
  • Variant jar updated to version 1.104.1628.