haiferry的个人博客分享 http://blog.sciencenet.cn/u/haiferry

博文

python Consensus and Profile

已有 3183 次阅读 2015-11-24 21:26 |系统分类:科研笔记

A matrix is arectangular table of values divided into rows and columns. An m×n matrixhas m rows and n columns. Given a matrix A, wewrite Ai,j to indicate the value found at the intersection ofrow i and column j.

Say that we have a collection of DNA strings, all having thesame length n. Their profile matrix isa 4×n matrix P inwhich P1,j represents the number of times that 'A' occurs inthe jthposition ofone of the strings, P2,j represents the number of times that C occursin the jth position, and so on (see below).

A consensus string c isa string of length n formed from our collection by taking the mostcommon symbol at each position; the jth symbol of c thereforecorresponds to the symbol having the maximum value in the j-th column ofthe profile matrix. Of course, there may be more than one most common symbol,leading to multiple possible consensus strings.


A T C C A G C T


G G G C A A C T


A T G G A T C T

DNA Strings

A A G C A A C C


T T G G A A C T


A T G C C A T T


A T G G C A C T


A    5 1 0 0 5 5 0 0

Profile

C    0 0 1 4 2 0 6 1


G    1 1 6 3 0 1 0 0


T    1 5 0 0 0 1 1 6

Consensus

A T G C A A C T

Given: Acollection of at most 10 DNA strings of equallength (at most 1 kbp)in FASTA format.

Return: Aconsensus string and profile matrix for the collection. (If several possibleconsensus strings exist, then you may return any one of them.)

Sample Dataset

>Rosalind_1

ATCCAGCT

>Rosalind_2

GGGCAACT

>Rosalind_3

ATGGATCT

>Rosalind_4

AAGCAACC

>Rosalind_5

TTGGAACT

>Rosalind_6

ATGCCATT

>Rosalind_7

ATGGCACT

Sample Output

ATGCAACT

A: 5 1 0 0 5 5 0 0

C: 0 0 1 4 2 0 6 1

G: 1 1 6 3 0 1 0 0

T: 1 5 0 0 0 1 1 6

针对以上案例,我写了如下代码:

#!/usr/bin/python

f=open("data1.txt","r")

f1=f.readlines()

titleID=[]

sequence=[]

for i in f1:

  i.strip()

  if i.startswith(">"):

      titleID.append(i)

  else:

      sequence.append(i)

b=zip(*sequence)

j=0

g=0

h=0

k=0

l=0

A=[]

C=[]

G=[]

T=[]

while j<len(b):

  for c in b[j]:

     if c=='A':

        g+=1

     elif c=='C':

        h+=1

     elif c=='G':

        k+=1

     else:

        l+=1

  A.append(g)

  C.append(h)

  G.append(k)

  T.append(l)

  g=0

  h=0

  k=0

  l=0

  j+=1

H=zip(A,C,G,T)

consensus=[]

m=0

while m<len(H):

  if max(H[m])==H[m][0]:

      consensus.append('A')

  elif max(H[m])==H[m][1]:

       consensus.append('C')

  elif max(H[m])==H[m][2]:

       consensus.append('G')

  elif max(H[m])==H[m][3]:

       consensus.append('T')

  m+=1

print consensus

print 'A: ', A

print 'G: ', G

print 'C: ', C

print 'T: ', T




https://wap.sciencenet.cn/blog-2887147-938260.html

上一篇:R语言做散点图
下一篇:python 字符串替换
收藏 IP: 159.226.67.*| 热度|

0

该博文允许注册用户评论 请点击登录 评论 (0 个评论)

数据加载中...

Archiver|手机版|科学网 ( 京ICP备07017567号-12 )

GMT+8, 2024-3-29 05:33

Powered by ScienceNet.cn

Copyright © 2007- 中国科学报社

返回顶部