I have this data from sequencing a bacterial community. I know some basic Python and I'm working through the codecademy tutorial. For practical purposes, please treat OTU as another word for "species".
Here is a sample of the raw data:
OTU ID OTU Sum Lineage
591820 1083 k__Bacteria; p__Fusobacteria; c__Fusobacteria; o__Fusobacteriales; f__Fusobacteriaceae; g__u114; s__
532752 517 k__Bacteria; p__Fusobacteria; c__Fusobacteria; o__Fusobacteriales; f__Fusobacteriaceae; g__u114; s__
218456 346 k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Alcaligenaceae; g__Bordetella; s__
590248 330 k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Alcaligenaceae; g__; s__
343284 321 k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Comamonadaceae; g__Limnohabitans; s__
The data consist of three things: a reference number for the species, how many of that species are in the sample, and the taxonomy of said species.
What I'm trying to do is add up all the times a sequence is found for each taxonomic family (designated as f__x in the data).
Here is an example of the desired output:
f__Fusobacteriaceae 1600
f__Alcaligenaceae 676
f__Comamonadaceae 321
This isn't for a class. I started learning Python a few months ago, so I'm at least able to look up any suggestions. I know how to do it the slow way (copy and paste in Excel), so this is for future reference.
Answer 0 (score: 1)
If the lines in the file really look like that, you can do:
from collections import defaultdict
import re

nums = defaultdict(int)
with open("file.txt") as f:
    for line in f:
        items = line.split(None, 2)  # Split twice on any whitespace
        # Skip the header line; data lines start with a numeric OTU ID
        if items[0].isdigit():
            # Pull the f__Family token out of the lineage string
            key = re.search(r"f__\w+", items[2]).group(0)
            nums[key] += int(items[1])
Result:
>>> nums
defaultdict(<type 'int'>, {'f__Comamonadaceae': 321, 'f__Fusobacteriaceae': 1600,
'f__Alcaligenaceae': 676})
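Note that re.search() returns None for a row whose family field is blank (the sample data already has a blank g__ field, so a blank f__ is plausible); if that can happen in the full data set, a small guard inside the loop avoids an AttributeError. A minimal sketch, assuming the same items/nums variables as above:

match = re.search(r"f__\w+", items[2])
if match:  # skip rows where the family field is blank, e.g. "f__;"
    nums[match.group(0)] += int(items[1])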
Answer 1 (score: 1)
Another solution, using collections.Counter:
from collections import Counter

counter = Counter()
with open('data.txt') as f:
    # Skip header line
    next(f)
    for line in f:
        # Strip line of extraneous whitespace
        line = line.strip()
        # Only process non-empty lines
        if line:
            # Split by consecutive whitespace, into 3 chunks (2 splits)
            otu_id, otu_sum, lineage = line.split(None, 2)
            # Split the lineage tree into a list of nodes
            lineage = [node.strip() for node in lineage.split(';')]
            # Extract the family node (assuming there's only one)
            family = [node for node in lineage if node.startswith('f__')][0]
            # Increase the count for this family by `otu_sum`
            counter[family] += int(otu_sum)

for family, count in counter.items():
    print("%s %s" % (family, count))
For details on the None argument (which makes the split match runs of consecutive whitespace), see the documentation for str.split().
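For instance, splitting one of the data rows above with a maxsplit of 2 keeps the whole lineage in one piece (a quick illustration, not part of the original answer):

row = "591820 1083 k__Bacteria; p__Fusobacteria; c__Fusobacteria; o__Fusobacteriales; f__Fusobacteriaceae; g__u114; s__"
otu_id, otu_sum, lineage = row.split(None, 2)  # split on runs of whitespace, at most twice
print(otu_id)    # 591820
print(otu_sum)   # 1083
print(lineage)   # the whole remainder, starting with k__Bacteria; p__Fusobacteria; ...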
Answer 2 (score: 0)
Take all your raw data and process it first, by which I mean structure it, then use the structured data to do whatever you want. If you have GBs of data you could use elasticsearch: in that case, feed in your raw data, query for the string f_*, fetch all the matching entries, and add them up.
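A minimal pure-Python sketch of that structure-first idea (the file name data.txt and the record field names are assumptions for illustration, not something this answer specifies):

# Parse each row into a small record first, then aggregate over the records
records = []
with open("data.txt") as f:
    next(f)  # skip the header line
    for line in f:
        if not line.strip():
            continue
        otu_id, otu_sum, lineage = line.split(None, 2)
        records.append({
            "otu_id": otu_id,
            "count": int(otu_sum),
            "taxa": [t.strip() for t in lineage.split(";")],
        })

# With the data structured, summing per family is a simple pass over the records
totals = {}
for rec in records:
    for taxon in rec["taxa"]:
        if taxon.startswith("f__"):
            totals[taxon] = totals.get(taxon, 0) + rec["count"]
print(totals)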
Answer 3 (score: 0)
This is quite doable with basic Python. Keep the Library Reference under your pillow, because you'll be referring to it often.
You'll probably end up doing something like this (I'll write it the longer, more readable way; there are ways to compress the code and do the job faster).
# Open up a file handle
file_handle = open('myfile.txt')

# Discard the header line
file_handle.readline()

# Make a dictionary to store sums
sums = {}

# Loop through the rest of the lines
for line in file_handle.readlines():
    # Strip off the pesky newline at the end of each line.
    line = line.strip()
    # Put each white-space delimited ... whatever ... into items of a list.
    line_parts = line.split()
    # Get the first column
    reference_number = line_parts[0]
    # Get the second column, convert it to an integer
    sum = int(line_parts[1])
    # Loop through the taxonomies (the rest of the 'columns' separated by whitespace)
    for taxonomy in line_parts[2:]:
        # Skip it if it doesn't start with 'f_'
        if not taxonomy.startswith('f_'):
            continue
        # Remove the pesky semi-colon
        taxonomy = taxonomy.strip(';')
        if taxonomy in sums:
            sums[taxonomy] += sum
        else:
            sums[taxonomy] = sum

# All done, do some fancy reporting. We'll leave sorting as an exercise to the reader.
for taxonomy in sums.keys():
    print("%s %d" % (taxonomy, sums[taxonomy]))