Construction of Nr database sub database

Posted by dc277 on Thu, 17 Feb 2022 05:52:07 +0100

After genome assembly or transcriptome differential gene expression analysis, annotation of the results is an important step. How to annotate and how to obtain accurate annotation results?

Nr database, fully known as non redundantprotein sequence database, refers to the protein database containing the sequences in GenPept, Swissprot, PIR, PDF, PDB and RefSeq Database.

(1) Database download

Personal download methods are divided into graphical interface download and command line download.

Software installation required:

# blast
tar -zxvf ncbi-blast-2.12.0+-x64-linux.tar.gz
echo 'PATH=your_path/blast/ncbi-blast-2.12.0+/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

# aspera
wget -c
tar -zxvf ibm-aspera-connect_4.1.0.46-linux_x86_64.tar.gz
# Automatic installation
sh ibm-aspera-connect_4.1.0.46-linux_x86_64.tar.gz
echo 'PATH=~/.aspera/connect/bin/:$PATH' >> ~/.bashrc
source ~/.bashrc

1. Manual download, that is, download under the graphical interface

[note] the data downloaded here is in fasta format, not pre formatted (i.e. Nr database built in advance)

The steps are:

(1) Log in NCBI official website

(2) Click Download

(3) Click FTP

(4) Enter the corresponding path:

(5) Right click to copy nr.gz and nr.gz Link to MD5

(6) Select Xunlei and other tools to download

2. Command line interface download

1. aspera

1.1 download nr.gz

ascp -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -l 200M -k 1 --overwrite=diff -QT ./
# Description of parameters used:
-i   # Use public key authentication and specify the private key file. 
     # The storage location of the key is generally ~ / aspera/connect/etc/
-l   # Set the target transfer rate in Kbps. Default: 10000
     # That is, set the maximum transmission rate
-k   # The optional parameter is {0 | 1 | 2 | 3}, that is, breakpoint continuation, which is generally set to 1
     # Enable resuming partially transferred files. (Default: 0). Please note that this must be specified for your first transfer; otherwise, it will not work for subsequent transfers.
     # "- k 1 specific description": - k 1 if the transfer stops because of connection loss or other issues the K option calls the transfer to resume from where it left off her than restarting the entire transfer over 
     ### Description of other parameters:
     # 0 Always retransfer the entire file.
     # 1 Check file attributes and resume if the current and original attributes match.
     # 2 Check file attributes and do a sparse file checksum; resume if the current and original attributes/checksums match.
     # 3 Check file attributes and do a full file checksum; resume if the current and original attributes/checksums match.
--overwrite=diff   # The default setting is diff, which means that if the currently downloaded file is inconsistent with the previously downloaded file, it will be overwritten
-Q   # Enable fair (-Q) or trickle (-QQ) transfer policy. Use the -l and -m to set the target and minimum rates.
-T   # Disable encryption for maximum throughput.

1.2 download the preformatted version

for i in {00..55}
   echo "ascp -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -l 200M -k 1 --overwrite=diff -QT${i}.tar.gz ./" >>
   echo "ascp -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh -l 200M -k 1 --overwrite=diff -QT${i}.tar.gz.md5 ./" >>
ParaFly -c -CPU 1 &

# Check whether the downloaded database is complete
for i in {00..55};do md5sum -c nr.${i}.tar.gz.md5;done

2. Download the built-in script of blast +

Version: version 2.12.0+

Download object: preformatted database

perl ~/biosoft/blast/ncbi-blast-2.12.0+/bin/ --blastdb_version 5 --decompress nr --source gcp --passive

# Description of parameters used:
--blastdb_version  # Specify which BLAST database version to download (default: 5). 
                   # Supported values: 4, 5
--decompress       # Downloads, decompresses the archives in the current working directory, and deletes the downloaded archive to save disk space, while preserving the archive checksum files (default: false).
--source           # to download BLAST databases from (default: auto-detect closest location). Supported values: ncbi, aws, or gcp.
--passive          # Use passive FTP, useful when behind a firewall or working in the cloud (default: true). 
                   # To disable passive FTP, configure this option as follows: --passive no

3.wget Download

Download object: preformatted database

# website:
# Write your own shell script to download
for i in {00..55}
  echo "wget -c${i}.tar.gz ./" >>
  echo "wget -c${i}.tar.gz.md5 ./" >>

ParaFly -c -CPU 10 2>err &

(2) Extracting taxid from corresponding species: Taking plants as an example

Software installation required:

# taxonkit
conda install -c bioconda taxonkit
# csvtk
conda install csvtk

NCBI uses taxid to divide species, which forms the database NCBI Taxonomy.

1. Query the taxid of the corresponding species and groups

A. NCBI official website manual query method

1. Enter the NCBI official website and select "Taxonomy" in the database

2. Enter "Plants" to get the corresponding query results (specifically land plants, i.e. terrestrial Plants. The green plant community covers a wider range)

B. taxdump.tar.gz query

grep -P "\|\s+[pP]lant\w*\s*\|" ~/.taxonkit/names.dmp

C. taxonkit query

echo Embryophyta | taxonkit name2taxid --show-rank

2. Obtain the sequence contained in the corresponding species group (accessionID acquisition)

1. Manually download from NCBI official website

(1) Select "Protein" for the database. The query format is txidxxxxx [organization] or txidxxxxx [org].

[annotation] add "[]" to represent the whole to which the corresponding taxid belongs.

(2) The query results are as follows:

2. Use taxonkit to obtain the access ID of the corresponding species

NCBI classification data & taxi access ID conversion file needs to be downloaded

wget -c
wget -c
md5sum -c taxdump.tar.gz.md5 # Check whether the downloaded database is complete
mkdir ~/.taxonkit
tar zxf taxdump.tar.gz -C ~/.taxonkit

wget -c
wget -c
md5sum -c prot.accession2taxid.gz.md5
# Get the taxids of the corresponding class group
taxonkit list --ids 3193 --indent "" > plants.taxid.txt

# Get the access ID of the corresponding class group
pigz -dc prot.accession2taxid.gz | csvtk grep -t -f taxid -P plants.taxid.txt | csvtk cut -t -f accession.version,taxid | sed 1d > plant.acc2taxid.txt && cut -f 1 plant.acc2taxid.txt > plant.acc.txt &

3. Use the blast related script to obtain the access ID

To use blast's own script, to obtain the access ID of the corresponding species, you need to install the program edirect command lineutility in advance. The link to install manual is: -t 3193 > plants.blast.taxids.txt
# After the above taxids list is obtained, the access ID can be obtained by combining csvtk

3. Construction of nr sub Library of corresponding species group

[annotation] the sub database is constructed here. Take the preformed NR database as an example

# Unzip the downloaded preformed NR database
for i in {00..55}
  tar -zxvf nr.${i}.tar.gz -C ../nr_preformatted_20211122_decompress

1.blastdb_ Building blast sub library with alias tool

blastdb_aliastool -seqidlist sequence.seq -db nr -out nr_plant -title nr_plant
# sequence.seq is the access list of the corresponding class group

The result is an index file: nr_plant.pal.

2. Extract the corresponding group sequence and construct the local nr database sub database by ourselves

Method 1:

blastdbcmd -db nr -entry_batch sequence.seq -out - | pigz -c >

Method 2:

# Extract all sequences in the preformed NR database
blastdbcmd -db nr -dbtype prot -entry all -outfmt "%f" -out - | pigz -c > nr.fa.gz

time cat <(echo) <(pigz -dc nr.fa.gz) \
    | perl -e 'BEGIN{ $/ = "\n>"; <>; } while(<>){s/>$//;  $i = index $_, "\n"; $h = substr $_, 0, $i; $s = substr $_, $i+1; if ($h !~ />/) { print ">$_"; next; }; $h = ">$h"; while($h =~ />([^ ]+ .+?) ?(?=>|$)/g){ $h1 = $1; $h1 =~ s/^\W+//; print ">$h1\n$s";} } ' \
    | seqkit grep -f 3193.acc.txt -o nr.3193.fa.gz

[interesting phenomenon]

After the extraction is completed, blastdbcmd. Is found nr.3193.fa.gz and nr.3193 fa. GZ has different file sizes (first 13G, last 4G). After checking the number of sequences, it is found that the number of sequences contained in the two result files is the same. The reason for the above results is the result extracted by the official software and there is too much redundant information in the sequence header.

seqkit stats
file           format  type       num_seqs        sum_len  min_len  avg_len  max_len  FASTA   Protein  19,326,820  7,670,152,256        6    396.9   23,040

seqkit stats nr.3193.fa.gz
file           format  type       num_seqs        sum_len  min_len  avg_len  max_len
nr.3193.fa.gz  FASTA   Protein  19,326,820  7,670,152,256        6    396.9   23,040

If the downloaded nr.gz (fasta format), that is, the NR database is not built in advance, you need to use makeblastdb or diamond makedb locally to build the local database

reference material