SAM 文件学习笔记

Cover image

The SAM Format Specification

SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. If present, the header must be prior to the alignments. Header lines start with ‘@’, while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and variable number of optional fields for flexible or aligner specific information.

An example

示例
示例

Terminologies and Concepets/名词术语和概念

  • Template
  • Segment
  • Read
  • Linear alignment 线性比对
  • Chimeric alignment 嵌合比对
  • Read alignment
  • Multiple mapping
  • 1-based coordinate system
  • 0-based coordinate system
  • Phred scale

The header section/头部注释部分

header lines match /^@[A-Z][A-Z](\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/

The alignment section/比对结果部分

mandatory fields/必需字段

Col Field Description
1 QNAME 比对片段的(template)的编号
2 FLAG 位标识,templatemapping情况的数字表示,每一个数字代表一种比对情况,这里的值是符合情况的数字相加总和
3 RNAME 参考序列的编号,如果注释中对SQ-SN进行了定义,这里必须和其保持一致,另外对于没有mapping上的序列,这里是‘*’
4 POS 比对上的位置,注意是从1开始计数,没有比对上,此处为0
5 MAPQ MAPpingQualitymapping的质量
6 CIAGR 简要比对信息表达式(CompactIdiosyncraticGappedAlignmentReport),其以参考序列为基础,使用数字加字母表示比对结果,比如3S6M1P1I4M,前三个碱基被剪切去除了,然后6个比对上了,然后打开了一个缺口,有一个碱基插入,最后是4个比对上了,是按照顺序的
7 RNEXT 下一个片段比对上的参考序列的编号,没有另外的片段,这里是‘*’,同一个片段,用‘=’
8 PNEXT 下一个片段比对上的位置,如果不可用,此处为0
9 TLEN TemplateLengthTemplate的长度,最左边得为正,最右边的为负,中间的不用定义正负,不分区段(single-segment)的比对上,或者不可用时,此处为0
10 SEQ 序列片段的序列信息,如果不存储此类信息,此处为’*‘,注意CIGAR中M/I/S/=/X对应数字的和要等于序列长度
11 QUAL queryQUALity序列的质量信息,格式同FASTQ一样
12 OPT 可选字段(optionalfields),格式如:TAG:TYPE:VALUE,其中TAG有两个大写字母组成,每个TAG代表一类信息,每一行一个TAG只能出现一次,TYPE表示TAG对应值的类型,可以是字符串、整数、字节、数组等
  1. QNAME (Query template NAME) string [!-?A-~]{1,254}
  2. FLAG int [0,216-1]
  3. RNAME (Reference sequence NAME of the alignment) string *|[!-()+-<>-~][!-~]*
  4. POS (1-based leftmost mapping POSition of the first matching base. ) int [0,231-1]
  5. MAPQ (MAPping Quality) int [0,28-1]
  6. CIGAR(CIGAR string) string *|([0-9]+[MIDNSHPX=])+
  7. RNEXT( Reference sequence name of the primary alignment of the NEXT read in the template) string *|=|[!-()+-<>-~][!-~]*
  8. PNEXT(Position of the primary alignment of the NEXT read in the template) int [0,231-1]
  9. TLEN(signed observed Template LENgth.) int [-231+1,231-1]
  10. SEQ(segment SEQuence) string *|[A-Za-z=.]+
  11. QUAL(ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ format)) string [!-~]+

optional fields/可选字段

All optional fields follow the TAG:TYPE:VALUE format where TAG is a two-character string that matches /[A-Za-z][A-Za-z0-9]/. Each TAG can only appear once in one alignment line. A TAG containing lowercase letters are reserved for end users. In an optional field, TYPE is a single case-sensitive letter which defines the format of VALUE:

参考:

◀ Pileup Format 学习笔记Github 万圣节彩蛋 ▶