Question

The program I'm using outputs a tab-delimited file that looks like this:

marker A B C
Bin_1  1 2 1
marker C G H B T
Bin_2  3 1 1 1 2
marker B H T Z Y A C
Bin_3  1 1 2 1 3 4 5

I want to fix it so that it looks like this:

marker A B C G H T Y Z
Bin_1  1 2 1 0 0 0 0 0
Bin_2  0 1 3 1 1 2 0 0
Bin_3  4 1 5 0 1 2 3 1

This is what I have so far:

import pandas as pd
from collections import OrderedDict

df = pd.read_csv('markers.txt', header=None, sep='\t')
x = list(map(list, df.values))  # list() so len() and indexing work on Python 3
list_of_dicts = []
s = 0  # index of the current marker (header) row
e = 1  # index of the matching bin (value) row
while e < len(x):
    new_dict = OrderedDict(zip(x[s], x[e]))
    list_of_dicts.append(new_dict)
    s += 2
    e += 2

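My initial plan was to convert these to dictionaries, then do some kind of counting and rebuild the DataFrame, but that seems to take a lot of time and memory for such a simple task. Any suggestions for a better way to approach this?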

Answer 1

import pandas as pd

# Split every non-empty line into fields; lines alternate header / values
with open('markers.txt') as f:
    lines = [l.split() for l in f if l.strip()]
dicts = {b[0]: pd.Series(dict(zip(m[1:], b[1:])))
         for m, b in zip(lines[::2], lines[1::2])}
pd.concat(dicts).unstack(fill_value=0)

       A  B  C  G  H  T  Y  Z
Bin_1  1  2  1  0  0  0  0  0
Bin_2  0  1  3  1  1  2  0  0
Bin_3  4  1  5  0  1  2  3  1
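
To see why the `unstack` call above yields the wide table, it may help to look at the intermediate result. The following is only a small sketch: the names `per_bin`, `long` and `wide` are made up for illustration, and the values are typed in by hand from the first two bins in the question rather than read from a file.

```python
import pandas as pd

# One Series per bin, indexed by that bin's markers (hand-typed sample values).
per_bin = {
    "Bin_1": pd.Series({"A": 1, "B": 2, "C": 1}),
    "Bin_2": pd.Series({"C": 3, "G": 1, "H": 1, "B": 1, "T": 2}),
}

# Concatenating a dict of Series gives one long Series with a two-level
# index: the dict key (bin name) on the outside, the marker on the inside.
long = pd.concat(per_bin)

# unstack() moves the inner level (marker) into the columns; combinations
# that never occurred are filled with 0 instead of NaN.
wide = long.unstack(fill_value=0)
print(wide)
```

Passing `fill_value=0` at this step means no separate `fillna` call is needed afterwards.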

Answer 2

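The key insight is that when you "append" DataFrames, the result is a DataFrame whose columns are the union of the inputs' columns, with NaN (or whatever you choose) filling the gaps. So: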

$ cat test.py
import pandas as pd

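frames = []
with open('/tmp/foo.tsv') as markers:
    while True:
        line = markers.readline()
        if not line:
            break
        columns = line.strip().split('\t')
        data = markers.readline().strip().split('\t')
        frames.append(pd.DataFrame(data=[data], columns=columns))

# DataFrame.append was removed in pandas 2.0; pd.concat builds the same
# union of columns (sort=True keeps the columns in alphabetical order)
frame = pd.concat(frames, sort=True)
frame = frame.fillna(0)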

print(frame)
$ python test.py < /tmp/foo.tsv
   A  B  C  G  H  T  Y  Z marker
0  1  2  1  0  0  0  0  0  Bin_1
0  0  1  3  1  1  2  0  0  Bin_2
0  4  1  5  0  1  2  3  1  Bin_3

If you aren't using pandas anywhere else, this may (or may not) be overkill. But if you're already using it, I think it's perfectly reasonable.
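
The union-of-columns behaviour this answer relies on can be seen in isolation with two tiny frames. This is a minimal sketch, not part of the original answer: the frames `first` and `second` are hand-built from the first two bins in the question, and it uses `pd.concat`, which aligns frames on their column labels.

```python
import pandas as pd

# Two one-row frames with overlapping but different columns
# (values taken from the first two bins in the question).
first = pd.DataFrame([[1, 2, 1]], columns=["A", "B", "C"], index=["Bin_1"])
second = pd.DataFrame(
    [[3, 1, 1, 1, 2]], columns=["C", "G", "H", "B", "T"], index=["Bin_2"]
)

# pd.concat aligns on column labels; cells with no source value become NaN.
# sort=True orders the resulting columns alphabetically.
combined = pd.concat([first, second], sort=True)

# Replace the NaN gaps with 0 and go back to an integer dtype.
filled = combined.fillna(0).astype(int)
print(filled)
```

Because NaN forces the affected columns to float, the `astype(int)` at the end is there only to get whole numbers back once the gaps are filled.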

Answer 3

Not the most elegant thing in the world, but...

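# df is the frame read in the question, assumed to hold each whole line in column 0
headers = df.iloc[::2][0].apply(lambda x: x.split()[1:])
data = df.iloc[1::2][0].apply(lambda x: x.split()[1:])

# Build one Series per bin, indexed by that bin's markers
result = []
for h, d in zip(headers.values, data.values):
    result.append(pd.Series(d, index=h))
pd.concat(result, axis=1).fillna(0).T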

    A  B  C  G  H  T  Y  Z
0  1  2  1  0  0  0  0  0
1  0  1  3  1  1  2  0  0
2  4  1  5  0  1  2  3  1

Answer 4

Why not massage the data into a dict as you read it, and then construct the DataFrame:

>>> with open(...) as f:
...     d = {}
...     for marker, bins in zip(f, f):  # zip(f, f) yields consecutive line pairs
...         z = zip(marker.split(), bins.split())
...         _, bin = next(z)  # first pair is ('marker', bin name)
...         d[bin] = dict(z)
>>> pd.DataFrame(d).fillna(0).T
       A  B  C  G  H  T  Y  Z
Bin_1  1  2  1  0  0  0  0  0
Bin_2  0  1  3  1  1  2  0  0
Bin_3  4  1  5  0  1  2  3  1

If you really need the column axis name:

>>> pd.DataFrame(d).fillna(0).rename_axis('marker').T
marker  A  B  C  G  H  T  Y  Z
Bin_1   1  2  1  0  0  0  0  0
Bin_2   0  1  3  1  1  2  0  0
Bin_3   4  1  5  0  1  2  3  1
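
For anyone who wants to try the dict-based idea without a file on disk, here is a self-contained sketch. It makes a few assumptions that are not in the answer above: the sample data is pasted into an `io.StringIO` as a stand-in for the real file, the counts are converted to `int`, and `pd.DataFrame.from_dict(..., orient="index")` is used in place of the `.T` transpose. The names `sample`, `rows` and `table` are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for the real file: the sample data from the question, tab-separated.
sample = (
    "marker\tA\tB\tC\n"
    "Bin_1\t1\t2\t1\n"
    "marker\tC\tG\tH\tB\tT\n"
    "Bin_2\t3\t1\t1\t1\t2\n"
    "marker\tB\tH\tT\tZ\tY\tA\tC\n"
    "Bin_3\t1\t1\t2\t1\t3\t4\t5\n"
)

rows = {}
with io.StringIO(sample) as f:
    # zip(f, f) pulls two consecutive lines per iteration: header, then values.
    for header_line, value_line in zip(f, f):
        names = header_line.split()
        values = value_line.split()
        # names[0] is the literal "marker"; values[0] is the bin name.
        rows[values[0]] = {n: int(v) for n, v in zip(names[1:], values[1:])}

# orient="index" makes each bin a row; missing markers become NaN, then 0.
table = (
    pd.DataFrame.from_dict(rows, orient="index")
    .fillna(0)
    .astype(int)
    .sort_index(axis=1)  # alphabetical marker order, as in the desired output
)
print(table)
```

Only one pair of lines is held in memory at a time while reading, which speaks to the memory concern raised in the question.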

Using pandas to sort results that are output on every two lines
