SAM File Study Notes

The SAM Format Specification

SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of an optional header section and an alignment section. If present, the header must precede the alignments. Header lines start with ‘@’, while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and a variable number of optional fields for flexible or aligner-specific information.

An example


Terminology and Concepts

  • Template
  • Segment
  • Read
  • Linear alignment
  • Chimeric alignment
  • Read alignment
  • Multiple mapping
  • 1-based coordinate system
  • 0-based coordinate system
  • Phred scale
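
Two of the terms above lend themselves to a small worked example. The sketch below assumes the usual definition of the Phred scale (Q = -10 * log10(p), where p is the error probability) and the common convention that a 1-based interval is closed while a 0-based interval is half-open. The function names are made up for illustration.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality Q into an error probability p."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability p into a Phred-scaled quality Q."""
    return -10 * math.log10(p)


def one_based_to_zero_based(start: int, end: int) -> tuple:
    """Convert a 1-based closed interval [start, end] into a
    0-based half-open interval [start - 1, end)."""
    return start - 1, end


if __name__ == "__main__":
    print(phred_to_error_prob(30))         # roughly 0.001
    print(one_based_to_zero_based(1, 10))  # (0, 10)
```

Both intervals in the example cover the same ten bases; only the way the start is counted changes.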

The header section

Header lines match /^@[A-Z][A-Z](\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$/ or /^@CO\t.*/
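
As a rough illustration, the two patterns above can be checked with Python's `re` module. This is only a minimal sketch: the helper name `is_valid_header_line` and the sample lines are invented for the example, and the check covers syntax only, not the meaning of individual header tags.

```python
import re

# The two patterns quoted above, written as Python raw strings.
HEADER_RE = re.compile(r"^@[A-Z][A-Z](\t[A-Za-z][A-Za-z0-9]:[ -~]+)+$")
COMMENT_RE = re.compile(r"^@CO\t.*")


def is_valid_header_line(line: str) -> bool:
    """Return True if the line looks like a syntactically valid SAM header line."""
    line = line.rstrip("\n")
    return bool(HEADER_RE.match(line) or COMMENT_RE.match(line))


if __name__ == "__main__":
    samples = [
        "@HD\tVN:1.6\tSO:coordinate",   # TAB-separated TAG:VALUE pairs
        "@CO\tany free text",           # comment line
        "@HD VN:1.6",                   # spaces instead of TABs
    ]
    for sample in samples:
        print(repr(sample), is_valid_header_line(sample))
```

The third sample fails because the fields are separated by spaces rather than TABs.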

The alignment section

mandatory fields

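Col  Field  Description
2    FLAG   Bitwise flag: a numeric encoding of how the template mapped. Each bit value stands for one mapping condition, and the field holds the sum of the values for all conditions that apply.
5    MAPQ   MAPping Quality: the quality of the mapping.
6    CIGAR  CIGAR string (Compact Idiosyncratic Gapped Alignment Report). Describes the alignment relative to the reference sequence as a series of count-plus-letter operations, read in order. For example, 3S6M1P1I4M means: the first 3 bases are soft-clipped, the next 6 are aligned, then comes 1 padding position (a gap), then 1 inserted base, and finally 4 aligned bases.
9    TLEN   Template LENgth. The value is signed: positive for the leftmost segment, negative for the rightmost segment, and undefined in sign for any segment in the middle. It is 0 for a single-segment template or when the information is unavailable.
11   QUAL   Query QUALity: the base quality string, in the same format as in FASTQ.
12   OPT    Optional fields, in the format TAG:TYPE:VALUE. TAG is a two-character string, and each TAG stands for one kind of information and may appear only once per line. TYPE gives the type of the value, which can be a string, an integer, a byte, an array, and so on.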
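
The FLAG and CIGAR descriptions above can be sketched in code. The bit values and the CIGAR operation letters below are the ones I recall from the SAM specification. The short names in `FLAG_BITS` and all function names are my own labels for illustration, not official identifiers.

```python
import re
from typing import List, Tuple

# Bit value -> short label (labels are informal, chosen for this example).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_template",
    0x80: "last_in_template",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

CIGAR_OP_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def decode_flag(flag: int) -> List[str]:
    """List the conditions whose bit values add up to the FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs, in order."""
    if cigar == "*":  # CIGAR unavailable
        return []
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_OP_RE.findall(cigar)]


def query_length(ops: List[Tuple[int, str]]) -> int:
    """Bases consumed from the read (M, I, S, =, X)."""
    return sum(n for n, op in ops if op in "MIS=X")


def reference_length(ops: List[Tuple[int, str]]) -> int:
    """Bases consumed from the reference (M, D, N, =, X)."""
    return sum(n for n, op in ops if op in "MDN=X")


if __name__ == "__main__":
    print(decode_flag(99))
    ops = parse_cigar("3S6M1P1I4M")
    print(ops, query_length(ops), reference_length(ops))
```

For FLAG 99 (1 + 2 + 32 + 64) the decoder reports a paired, properly paired read whose mate is on the reverse strand and which is the first segment in its template. For 3S6M1P1I4M the read contributes 14 bases while the alignment spans 10 reference bases, because clipping, padding, and insertion do not consume the reference.
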
  1. QNAME (Query template NAME) string
  2. FLAG (bitwise FLAG) int
  3. RNAME (Reference sequence NAME of the alignment) string
  4. POS (1-based leftmost mapping POSition of the first matching base) int
  5. MAPQ (MAPping Quality) int
  6. CIGAR (CIGAR string) string
  7. RNEXT (Reference sequence name of the primary alignment of the NEXT read in the template) string
  8. PNEXT (Position of the primary alignment of the NEXT read in the template) int
  9. TLEN (signed observed Template LENgth) int
  10. SEQ (segment SEQuence) string
  11. QUAL (ASCII of base QUALity plus 33, same as the quality string in the Sanger FASTQ format) string
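
The eleven fields listed above map naturally onto a small parser. The following is a minimal sketch rather than a full SAM reader: the function names are hypothetical, the sample line is made up for the example, and no validation beyond the column count and integer conversion is attempted.

```python
from typing import Dict, List, Union

FIELD_NAMES = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_alignment_line(line: str) -> Dict[str, Union[int, str, List[str]]]:
    """Split one alignment line into its 11 mandatory fields plus raw optional fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(FIELD_NAMES):
        raise ValueError(f"expected at least 11 fields, got {len(columns)}")
    record: Dict[str, Union[int, str, List[str]]] = {}
    for name, value in zip(FIELD_NAMES, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    # Anything after column 11 is an optional TAG:TYPE:VALUE field.
    record["OPT"] = columns[11:]
    return record


def decode_qual(qual: str) -> List[int]:
    """Turn a QUAL string (ASCII of base quality plus 33) into Phred scores."""
    if qual == "*":  # quality not stored
        return []
    return [ord(ch) - 33 for ch in qual]


if __name__ == "__main__":
    line = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*\tNM:i:1"
    rec = parse_alignment_line(line)
    print(rec["QNAME"], rec["POS"], rec["CIGAR"], rec["OPT"])
    print(decode_qual("II5!"))  # [40, 40, 20, 0]
```

Keeping the optional fields as raw strings in `OPT` leaves their interpretation to a separate step, since their number and content vary from line to line.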


optional fields

All optional fields follow the TAG:TYPE:VALUE format, where TAG is a two-character string that matches
/[A-Za-z][A-Za-z0-9]/. Each TAG can appear only once in an alignment line. TAGs containing lowercase
letters are reserved for end users. In an optional field, TYPE is a single case-sensitive letter that defines the
format of VALUE.
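
A short sketch of how such fields might be parsed follows. The TYPE letters handled here (A, i, f, Z, H, B) are the ones I recall from the SAM specification and are not listed in these notes, so treat them as an assumption. The function names and sample fields are hypothetical.

```python
import re
from typing import Any, Dict, List, Tuple

TAG_RE = re.compile(r"[A-Za-z][A-Za-z0-9]")


def parse_optional_field(field: str) -> Tuple[str, Any]:
    """Parse one TAG:TYPE:VALUE field into (tag, converted value)."""
    # Split at most twice: VALUE itself may contain ':' characters.
    tag, type_code, value = field.split(":", 2)
    if not TAG_RE.fullmatch(tag):
        raise ValueError(f"invalid TAG: {tag!r}")
    if type_code == "i":        # signed integer
        return tag, int(value)
    if type_code == "f":        # floating-point number
        return tag, float(value)
    if type_code in ("A", "Z"): # single character / printable string
        return tag, value
    if type_code == "H":        # byte array in hex
        return tag, bytes.fromhex(value)
    if type_code == "B":        # numeric array: subtype letter, then values
        subtype, *items = value.split(",")
        convert = float if subtype == "f" else int
        return tag, [convert(item) for item in items]
    raise ValueError(f"unknown TYPE: {type_code!r}")


def parse_optional_fields(fields: List[str]) -> Dict[str, Any]:
    """Parse all optional fields of a line, rejecting repeated TAGs."""
    result: Dict[str, Any] = {}
    for field in fields:
        tag, value = parse_optional_field(field)
        if tag in result:
            raise ValueError(f"TAG {tag!r} appears more than once")
        result[tag] = value
    return result


if __name__ == "__main__":
    print(parse_optional_fields(["NM:i:1", "RG:Z:grp:1", "ZB:B:c,1,-2,3"]))
```

The duplicate check in `parse_optional_fields` mirrors the rule that each TAG may appear only once per alignment line.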

