brief introduction
In the analysis of tumor genome data, copy number variation analysis is one of the common analysis points. CNVkit is one of the tools used for genome sequencing WGS and WES to call CNV. It is written based on python. The CNV results are relatively reliable and easy to operate.
Download and install
conda is recommended to solve the problem of environment dependency:
conda activate wes conda install cnvkit cnvkit.py --help
usage method
The official website tutorial of the tool has been written in detail. Please see: https://cnvkit.readthedocs.io/en/stable/index.html github: https://github.com/etal/cnvkit You can directly use the following code to call the somatic CNV of tumor samples:
cnvkit.py batch 5.gatk/Tumor*_bqsr.bam \ --normal 5.gatk/Normal*_bqsr.bam \ --output-reference ./8.cnv/cnvkit/my_reference.cnn \ --output-dir ./8.cnv/cnvkit \ --targets ${bed} \ --fasta ${GENOME} \ -p ${nthread} \ --method hybrid \ --annotate ${refFlat} \ --drop-low-coverage \ --scatter --diagram \ 1>./8.cnv/cnvkit/cnvkit.log 2>&1
Where -- targets {bed} specifies the bed file. For WES data, that is, the capture interval-- fasta {GENOME} specifies the reference genome fasta file, and - p {nthread} specifies the number of threads. The -- annotate {refFlat} parameter is optional and has no effect. It specifies the annotation file and can be downloaded with the following code:
wget -c http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz wget -c http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refFlat.txt.gz
The seg file can be extracted from the cns results. Here are two methods: one is provided by cnvkit, and the other is the awk command written by yourself.
# Method 1 cnvkit.py export seg *bqsr.cns -o gistic.seg sed 's/_bqsr//' gistic.seg # Method 2 awk '{print FILENAME"\t"$0}' *bqsr.cns | grep -v chromosome |sed 's/_bqsr.cns//g' | awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$8"\t"$6}' >final.seg
The obtained SEG file can be loaded into IGV for visualization, and IGV will recognize the file suffix seg.
However, it should be noted that the last step of cnvkit is call.
CNVkit subcommand call
The last step of cnvkit is call, which can correct the number of copies based on B allele frequency, tumor ploidy, tumor purity, patient gender and other indicators (not necessary). The newly generated cns file is listed as cn, that is, the absolute copy number copy number: https://cnvkit.readthedocs.io/en/stable/pipeline.html#call
Based on B allele frequency
If it is based on the B allele frequency, it is necessary to provide the VCF file. The tumor and normal pairing are provided. In the document, the author gives the VCF obtained by mutect, and then carry out certain processing: https://cnvkit.readthedocs.io/en/stable/fileformats.html#vcf , after processing, run with the following command:
cnvkit.py call Sample.cns -y -v Sample.vcf -o Sample.call.cns
However, the problem is that the vcf of mutect is somatic mutation, which is not in line with conventional cognition. It should be germline mutation. This point has been raised on GitHub, but the reply given by the author remains unchanged: https://github.com/etal/cnvkit/issues/601

image.png
In fact, when running mutect2, you can choose to keep the germline site. Just add these two parameters -- genotype germline sites true -- genotype PON sites true when running mutet2. Example code:
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" Mutect2 \ -R ${GENOME} \ -I ./5.gatk/${tumor}_bqsr.bam \ -I ./5.gatk/${normal}_bqsr.bam \ -normal ${normal} \ -L ${interval} \ --panel-of-normals ${pon} \ --germline-resource ${gnomad} \ --genotype-germline-sites true \ --genotype-pon-sites true \ --f1r2-tar-gz ./6.mutect/${id}_f1r2.tar.gz \ -O ./6.mutect/${id}_mutect2.vcf.gz \ 1>./6.mutect/${id}_mutect.log 2>&1 $GATK --java-options "-Xmx20G -Djava.io.tmpdir=./tmp" FilterMutectCalls \ -R ${GENOME} \ -V ./6.mutect/${tumor}_mutect2.vcf.gz \ -O ./6.mutect/${tumor}_somatic.vcf.gz \ 1>>./6.mutect/${tumor}_mutect.log 2>&1
Process the output vcf file. A code processing sentence is provided here, where config is normal and tumor pairing:
$ cat config Normal1 Tumor1 Normal2 Tumor2 Normal3 Tumor3 Normal4 Tumor4 ....
What is needed is the vcf files marked with FilterMutectCalls. Note that only the filtering sites are marked, but there is no real filtering. These vcf files need to be processed. The following processing methods are summarized according to the methods given in the cmvkit document and personal understanding. They may not be very mature. Please read the cnvkit document carefully:
cat config | while read line do arr=(${line}) normal=${arr[0]} tumor=${arr[1]} zcat 6.mutect/${id}_somatic.vcf.gz | \ awk 'BEGIN{FS="\t";OFS="\t"} {if($0~"^#") {print $0} else {if($7!="PASS"){$7="REJECT"} print $0} }' > 8.cnv/cnvkit/${id}_cnvkit.vcf sed -i '2 i ##FILTER=<ID=REJECT,Description="Not somatic due to normal call frequency or phred likelihoods: tumor: 35, normal 35.">' 8.cnv/cnvkit/${tumor}_cnvkit.vcf sed -i "2 i ##PEDIGREE=<Derived=${tumor},Original=${normal}>" 8.cnv/cnvkit/${tumor}_cnvkit.vcf cnvkit.py call 8.cnv/cnvkit/${tumor}_bqsr.cns \ -v 8.cnv/cnvkit/${tumor}_cnvkit.vcf \ --drop-low-coverage \ -o 8.cnv/cnvkit/${tumor}_call.cns done
Based on tumor purity and tumor ploidy
The tumor purity evaluated by Clinicopathology can be used for correction:
cnvkit.py call Sample.cns -y -m clonal --purity 0.65 --ploidy 2 -o Sample.call.cns
You can also use other tools to calculate the results, such as PureCN, THetA2, PyClone or BubbleTree. The tutorial of THetA2 is given in the document: https://cnvkit.readthedocs.io/en/stable/heterogeneity.html#tumor-heterogeneity If purecn is used, you can refer to: https://bioconductor.org/packages/devel/bioc/vignettes/PureCN/inst/doc/Quick.html#52_Recommended_CNVkit_usage Of course, in addition to the tools given by the author, the results of Absolute or Sequenza can also be used. It should be noted that the - M / - method parameter in the above code specifies the call method. There are three kinds: threshold, clone and none
- Threshold: calculate the absolute copy number cn column based on the threshold. Is the default parameter, and the default threshold is: - t=-1.1,-0.4,0.3,0.7
If log2 value is up to | Copy number |
---|---|
-1.1 | 0 |
-0.4 | 1 |
0.3 | 2 |
0.7 | 3 |
... | ... |
- clonal: using the given tumor purity and tumor ploidy, convert the log2 value to the absolute copy number, and then simply round the absolute copy value to an integer. The method is applicable to germ line samples and high-purity tumor samples (such as cell lines)
- none: do not calculate cn
In addition, sex chromosomes can also be corrected based on gender, but the code is not shown here because it is troublesome to batch process this parameter. Finally, the calibration is correct, and the drawing visualization can be carried out based on the final results:
cat config | while read line do arr=(${line}) normal=${arr[0]} tumor=${arr[1]} cnvkit.py scatter \ 8.cnv/cnvkit/${tumor}_bqsr.cnr \ -s 8.cnv/cnvkit/${tumor}_call.cns \ --title ${tumor} --segment-color red \ --y-max 2 --y-min -2 -m 3 \ -v 8.cnv/cnvkit/${tumor}_cnvkit.vcf \ -o 8.cnv/cnvkit/${tumor}_scatter.png \ cnvkit.py diagram \ 8.cnv/cnvkit/${tumor}_bqsr.cnr \ -s 8.cnv/cnvkit/${tumor}_call.cns \ --title ${tumor} \ -o 8.cnv/cnvkit/${tumor}_diagram.pdf \ done
