Sambamba: process your BAM data faster!
For very large (>100 GB) BAM files, sorting is not only slow (often taking a day or more), the run can also fail with errors like the one below. In my testing Sambamba performs much better and saves a lot of time. As I work with more and more data, even seemingly simple tasks take a long time; it is not just a matter of data volume!
The formatting here is not great; this post is also available at http://blog.csdn.net/msw521sg/article/details/65938377
[bam_sort_core] merging from 3288 files...
[E::hts_open_format] fail to open file 'CS_RNA_seq.sorted.bam.tmp.1020.bam'
[bam_merge_core] fail to open file CS_RNA_seq.sorted.bam.tmp.1020.bam
1. Download link
https://github.com/lomereiter/sambamba/releases
2. Installation
Just copy the binary to a directory on your PATH.
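For example (a minimal sketch; the version number and archive name below are illustrative, so check the releases page for the current ones):
$ wget https://github.com/lomereiter/sambamba/releases/download/v0.6.6/sambamba_v0.6.6_linux.tar.bz2
$ tar xjf sambamba_v0.6.6_linux.tar.bz2
# copy to any directory on your PATH
$ cp sambamba_v0.6.6 /usr/local/bin/sambamba
$ sambamba    # running it without arguments prints the usage if the install worked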
3. Usage
4. view
sambamba-view - tool for extracting information from SAM/BAM/CRAM files
sambamba view [OPTIONS] <input.bam | input.sam | input.cram> [region1 […]]
sambamba view allows you to efficiently filter SAM/BAM/CRAM files for alignments satisfying various conditions, as well as access the SAM header and information about reference sequences. To make these data readily available for consumption by scripts in Perl/Python/Ruby, JSON output is provided.
By default, the tool expects a BAM file as input. To read CRAM, specify -C; to read SAM, specify -S|--sam-input. The tool does NOT try to guess the file format from the extension. Beware that when reading SAM, the tool will skip tags that don't conform to the SAM/BAM specification and set invalid fields to their default values.
FILTERING
Filtering can be done in two ways. First, you can specify a condition with the -F option, using a special filtering language described at
https://github.com/lomereiter/sambamba/wiki/%5Bsambamba-view%5D-Filter-expression-syntax
Second, if you have an indexed BAM file, several regions can be specified as well. The syntax for regions is the same as in samtools: chr:beg-end where beg and end are 1-based start and end of a closed-end interval on the reference chr.
JSON
An alignment record's JSON representation is a hash with keys 'qname', 'flag', 'rname', 'pos', 'mapq', 'cigar', 'rnext', 'pnext', 'tlen', 'seq', 'qual', and 'tags', e.g.
{"qname":"EAS56_57:6:190:289:82","flag":69,"rname":"chr1","pos":100,"mapq":0,"cigar":"*","rnext":"=","pnext":100,"tlen":0,"seq":"CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA","qual":[27,27,27,22,27,27,27,26,27,27,27,27,27,27,27,27,23,26,26,27,22,26,19,27,26,27,26,26,26,26,26,24,19,27,26],"tags":{"MF":192}}
The JSON representation mimics the SAM format, except that base qualities are given as an array of integers.
Postprocessing JSON output is best accomplished with https://stedolan.github.io/jq/
The output is one line per read; to build a proper JSON array, pipe the output into jq --slurp.
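For instance, a couple of quick jq one-liners (a sketch using the ex1_header.bam example file from the EXAMPLES section below; jq must be installed):
$ sambamba view --format=json ex1_header.bam | jq -r '.qname'    # one read name per line
$ sambamba view --format=json ex1_header.bam | jq --slurp 'length'    # total number of records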
OPTIONS
-F, --filter=FILTER Set custom filter for alignments.
-f, --format=FORMAT Specify output format. FORMAT must be one of sam, bam, cram, or json (in lowercase). Default is SAM.
-h, --with-header Print SAM header before reads. This is always done for BAM output.
-H, --header Print only the SAM header to STDOUT. If FORMAT is sam or bam, the text version of the header is printed; otherwise a JSON object is written.
-I, --reference-info Output to STDOUT reference sequence names and lengths in JSON (see EXAMPLES).
-L, --regions=BEDFILE Intersect a file with regions specified in the BED file.
-c, --count Output to STDOUT only the number of matching records; the -hHI options are ignored.
-v, --valid Output only valid reads.
-S, --sam-input Specify that the input is a SAM file (default is BAM for all operations).
-C, --cram-input Specify that the input is in CRAM format.
-p, --show-progress Show progressbar in STDERR. Works only for BAM files, and with no regions specified, i.e. only when reading the full file.
-l, --compression-level=COMPRESSION_LEVEL Set compression level for BAM output, a number from 0 to 9.
-o, --output-filename=FILENAME Specify output filename (by default everything is written to STDOUT).
-t, --nthreads=NTHREADS Number of threads to use.
EXAMPLES
Print basic reference sequence information:
$ sambamba view --reference-info ex1_header.bam
[{"name":"chr1","length":1575},{"name":"chr2","length":1584}]
Count reads with mapping quality not less than 50:
$ sambamba view -c -F "mapping_quality >= 50" ex1_header.bam
3124
Count properly paired reads overlapping 100..200 on chr1:
$ sambamba view -c -F "proper_pair" ex1_header.bam chr1:100-200
39
Output header in JSON format:
$ sambamba view --header --format=json ex1_header.bam
{"format_version":"1.3","rg_lines":[],"sq_lines":[{"sequence_length":1575,"species":"","uri":"","sequence_name":"chr1","assembly":"","md5":""},{"sequence_length":1584,"species":"","uri":"","sequence_name":"chr2","assembly":"","md5":""}],"sorting_order":"coordinate","pg_lines":[]}
5. sort
sambamba-sort - tool for sorting BAM files
sambamba sort [OPTIONS] <input.bam>
BAM files can be sorted either in 'coordinate' order or in 'qname' order.
The former means sorting the file by (integer) reference ID and, within each reference, sorting the corresponding reads by start coordinate.
In 'qname' order, reads are sorted lexicographically by their names.
sambamba sort performs an external stable sort on the input file: it reads the source BAM file in chunks that fit into memory, sorts them and writes them to a temporary directory, then merges them. After merging, the temporary files are removed automatically.
Both sorting orders are supported. The default is 'coordinate', because that is the order required for building an index later. To switch to 'qname' order, use the -n|--sort-by-name flag.
OPTIONS
-m, --memory-limit=LIMIT
Sets an upper bound for used memory. However, this is very approximate. The default memory limit is 2 GB. Increasing it allows larger chunk sizes and fewer I/O seeks, improving overall performance. LIMIT must be a number with an optional suffix specifying the unit of measurement. The following suffixes are recognized: K, KiB, KB, M, MiB, MB, G, GiB, GB.
--tmpdir=TMPDIR
Use TMPDIR to output sorted chunks. Default behaviour is to use system temporary directory.
-o, --out=OUTPUTFILE Output file name. If not provided, the result is written to a file with .sorted.bam extension.
-n, --sort-by-name Sort by read name lexicographically
-N, --natural-sort Sort by read name using natural order (like samtools)
-F, --filter=FILTER Keep only reads that pass FILTER. More on that in sambamba-view documentation.
-l, --compression-level=COMPRESSION_LEVEL
Compression level to use for sorted BAM, from 0 (known as uncompressed BAM in samtools) to 9.
-u, --uncompressed-chunks
Write sorted chunks as uncompressed BAM. The default behaviour is to write them with compression level 1, because that reduces time spent on I/O, but in some cases this option can give better speed. Note, however, that the disk space needed for sorting will typically be 3-4 times more than without this option.
-p, --show-progress Show a wget-like progressbar in STDERR (in fact, two of them one after another: first for sorting, then for merging).
-t, --nthreads=NTHREADS Number of threads to use.
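Going back to the large-file scenario from the introduction, a sort invocation might look like this (a sketch; the file names, memory limit, thread count, and tmpdir are illustrative and should be tuned to your machine):
$ sambamba sort -m 8GB -t 8 --tmpdir=/data/tmp -o CS_RNA_seq.sorted.bam CS_RNA_seq.bam
# or sort by read name instead:
$ sambamba sort -n -t 8 -o CS_RNA_seq.namesorted.bam CS_RNA_seq.bam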
6. merge
sambamba-merge - tool for merging several BAM files into one
sambamba merge [OPTIONS] <output.bam> <input1.bam> <input2.bam> […]
DESCRIPTION
sambamba merge is used for merging several sorted BAM files into one. The sorting order of all the files must be the same, and it is maintained in the output file.
SAM headers are merged automatically, as in the Picard merging tool.
OPTIONS
-t, --nthreads=NTHREADS
Specify number of threads to use for compression/decompression tasks.
-l, --compression-level=COMPRESSION_LEVEL
Specify compression level of output file as a number from 0 to 9
-H, --header Output only merged header to stdout in SAM format (e.g. for debugging purposes). Other options are ignored.
-p, --show-progress Show progressbar in STDERR.
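For example, merging per-lane sorted BAMs into one (a sketch; file names and thread count are illustrative):
$ sambamba merge -t 8 merged.bam lane1.sorted.bam lane2.sorted.bam lane3.sorted.bam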
7. slice
sambamba-slice - copying a slice of BAM file
sambamba slice OPTIONS <input.bam> region
DESCRIPTION
Outputs reads overlapping specified region into new BAM file. (Default destination is STDOUT.) Input file must be coordinate-sorted and indexed.
While the same can be done with sambamba view, that would be much slower due to a lot of compression and decompression work. Instead of this naive method, sambamba slice leverages knowledge of the BAM file structure and recompresses only a few BGZF blocks at the beginning and end of the region, while the BGZF blocks in the middle are copied as is. As such, this tool doesn't offer any options for the number of threads or the compression level; most of the time is spent on I/O operations.
OPTIONS
-o, --output-filename=OUTPUTFILE
Name of output file
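For example (a sketch; the file name and region are illustrative; the input must be coordinate-sorted and indexed, and sambamba index, not covered in this post, builds the required .bai index):
$ sambamba index input.sorted.bam
$ sambamba slice -o chr1_region.bam input.sorted.bam chr1:1000000-2000000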
8. flagstat
sambamba-flagstat - getting flag statistics from BAM file
sambamba flagstat OPTIONS <input.bam>
DESCRIPTION
Outputs some statistics drawn from read flags.
The first line contains the numbers of QC-passed and QC-failed reads. Then come pairs of numbers, the former for QC-passed reads and the latter for QC-failed ones:
duplicates
mapped reads (plus percentage relative to the numbers from the first line)
reads with 'is_paired' flag set
paired reads which are first mates
paired reads which are second mates
paired reads with 'proper_pair' flag set (plus percentage relative to the numbers of QC-passed/failed reads with 'is_paired' flag set)
paired reads where both mates are mapped
paired reads where mate is mapped to a different chromosome
the same as previous but mapping quality is not less than 5
OPTIONS
-t, --nthreads=NTHREADS
Specify number of threads to use for BAM decompression
-p, --show-progress
Show progressbar in STDERR
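For example (a sketch; the file name and thread count are illustrative):
$ sambamba flagstat -t 8 input.sorted.bam > input.flagstat.txt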
9. markdup
sambamba-markdup - finding duplicate reads in BAM file
sambamba markdup OPTIONS <input.bam> <output.bam>
Marks (by default) or removes duplicate reads. For determining whether a read is a duplicate or not, the same criteria as in Picard are used.
OPTIONS
-r, --remove-duplicates
remove duplicates instead of just marking them
-t, --nthreads=NTHREADS
number of threads to use
-l, --compression-level=N
specify compression level of the resulting file (from 0 to 9)
-p, --show-progress
show progressbar in STDERR
--tmpdir=TMPDIR
specify directory for temporary files; default is /tmp
--hash-table-size=HASHTABLESIZE
size of hash table for finding read pairs (default is 262144 reads); will be rounded down to the nearest power of two; should be > (average coverage) * (insert size) for good performance
--overflow-list-size=OVERFLOWLISTSIZE
size of the overflow list where reads, thrown away from the hash table, get a second chance to meet their pairs (default is 200000 reads); increasing the size reduces the number of temporary files created
--io-buffer-size=BUFFERSIZE
controls sizes of two buffers of BUFFERSIZE megabytes each, used for reading and writing BAM during the second pass (default is 128)
BUGS
External sort is not implemented, so memory consumption grows by about 2 GB per 100M reads. Check that you have enough RAM before running the tool.
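Given the RAM caveat above, a typical invocation might look like this (a sketch; file names, thread count, and tmpdir are illustrative):
# mark duplicates only:
$ sambamba markdup -t 8 --tmpdir=/data/tmp input.sorted.bam input.markdup.bam
# or remove them instead:
$ sambamba markdup -r -t 8 --tmpdir=/data/tmp input.sorted.bam input.dedup.bam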