haiferry's personal blog http://blog.sciencenet.cn/u/haiferry


Counting GC content

Views: 2643 | 2015-11-4 16:17 | Category: Research notes

Computing GC Content

Problem

 

The GC content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC content.
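As a quick check of this definition, here is a minimal Python sketch. The helper names `gc_content` and `reverse_complement` are chosen for illustration and do not come from the original post; the sketch assumes uppercase strings over A, C, G and T only.

```python
def gc_content(seq):
    """Return the GC content of a DNA string as a percentage."""
    if not seq:
        return 0.0  # assumption: treat an empty string as 0% rather than raising
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (uppercase A, C, G, T only)."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))


print(gc_content("AGCTATAG"))                      # 37.5
print(gc_content(reverse_complement("AGCTATAG")))  # 37.5
```

Complementing swaps G with C and A with T, so the number of G/C symbols is unchanged, which is why both calls print the same value.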

 

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

 

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
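The FASTA layout described above can be read with a small loop. The following is only a sketch: `parse_fasta` is a hypothetical helper name, and the IDs and sequences in `example` are made up for illustration. The key point is that a sequence may be wrapped over several lines, so every line up to the next '>' is joined into one string.

```python
def parse_fasta(lines):
    """Map each label (text after '>') to its sequence, joining wrapped lines."""
    records = {}
    label = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            label = line[1:]  # a new record starts here
            records[label] = []
        elif label is not None:
            records[label].append(line)  # one more piece of the current sequence
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up records, for illustration only.
example = [
    ">Rosalind_0001",
    "ACGT",
    "TTGA",
    ">Rosalind_0002",
    "GGCC",
]
print(parse_fasta(example))
# {'Rosalind_0001': 'ACGTTTGA', 'Rosalind_0002': 'GGCC'}
```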

 

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

 

Return: The ID of the string having the highest GC content, followed by the GC content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

 

Sample Dataset

 

>Rosalind_6404

CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC

TCCCACTAATAATTCTGAGG

>Rosalind_5959

CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT

ATATCCATTTGTCAGCAGACACGC

>Rosalind_0808

CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC

TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output

 

Rosalind_0808

60.919540
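The sample output can be reproduced with a short, self-contained Python sketch. The function name `highest_gc` is hypothetical, and the 0.001 tolerance is taken from the problem statement. The sketch assumes every record holds at least one base.

```python
import math

SAMPLE = """\
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
"""


def highest_gc(fasta_text):
    """Return (label, GC percentage) of the record with the highest GC content."""
    seqs = {}
    label = None
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if line.startswith(">"):
            label = line[1:]
            seqs[label] = ""
        elif line and label is not None:
            seqs[label] += line  # join wrapped sequence lines
    # GC percentage per record; empty records are skipped to avoid dividing by zero.
    scores = {
        name: 100.0 * sum(1 for b in seq if b in "GC") / len(seq)
        for name, seq in seqs.items()
        if seq
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


label, gc = highest_gc(SAMPLE)
print(label)        # Rosalind_0808
print("%.6f" % gc)  # 60.919540
# Compare with the sample answer using the 0.001 absolute tolerance.
print(math.isclose(gc, 60.919540, rel_tol=0.0, abs_tol=0.001))  # True
```

Rosalind_0808 has 53 G/C symbols out of 87, and 53/87 is about 0.609195, which gives the 60.919540 shown in the sample output.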

I solved the problem above with the following code:

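#!/usr/bin/env python3

# Read the FASTA records into a dict: sequence ID -> full sequence.
# A sequence may span several lines, so each line is appended to the current record.
seqs = {}
seq_id = None
with open("data.txt", "r") as f:
    for line in f:
        line = line.strip()  # strip() returns a new string, so keep the result
        if not line:
            continue
        if line.startswith(">"):
            seq_id = line[1:]  # drop the leading '>'
            seqs[seq_id] = ""
        elif seq_id is not None:
            seqs[seq_id] += line

# Compute the GC content of each sequence as a percentage.
gc = {}
for seq_id, seq in seqs.items():
    count = 0  # reset the counter for each sequence
    for base in seq:
        if base == "G" or base == "C":
            count += 1
    gc[seq_id] = 100.0 * count / len(seq) if seq else 0.0

# Sort the dictionary by value using an anonymous lambda function: the part
# before the colon is the parameter, the part after it is the return value.
# sorted() takes two arguments here: gc.items() returns the dictionary's
# (key, value) pairs, and key= selects what to compare. reverse defaults to
# False, so the record with the highest GC content ends up last.
ranked = sorted(gc.items(), key=lambda item: item[1])

print(ranked[-1][0])
print("%.6f" % ranked[-1][1])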
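The sort-by-value step can be tried in isolation. The sketch below uses made-up IDs and percentages, for illustration only. It also shows `max()` with the same `key`, which returns the top pair directly when the full ordering is not needed.

```python
# Hypothetical GC percentages keyed by made-up IDs, for illustration only.
gc_by_id = {"seq_a": 37.5, "seq_b": 50.0, "seq_c": 25.0}

# sorted() with key= orders the (ID, value) pairs by value, ascending by default.
ranked = sorted(gc_by_id.items(), key=lambda item: item[1])
print(ranked)      # [('seq_c', 25.0), ('seq_a', 37.5), ('seq_b', 50.0)]
print(ranked[-1])  # ('seq_b', 50.0)

# max() with the same key finds the top pair without sorting everything.
top = max(gc_by_id.items(), key=lambda item: item[1])
print(top)         # ('seq_b', 50.0)
```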




https://wap.sciencenet.cn/blog-2887147-933321.html
