How to reformat nested dicts into a long-format Pandas DataFrame

Time: 2017-09-25 09:06:49

Tags: python pandas matplotlib

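I am trying to create a box plot with multiple data series and categories, like this.

The data I have is spread over several files, each containing one series (e.g. 'high', 'low'). For each file I have a few thousand lines, each holding a tuple of a string and an int, for example: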

('HHFRVEHAVAEGAK', '3')
('MPHGYDTQVGER', '3')
('MPHGYDTQVGER', '3')
('MPHGYDTQVGER', '3')
('KYNYVAMDTEFPGVVARPIGEFR', '3')
('KYNYVAMDTEFPGVVARPIGEFR', '3')
('KYNYVAMDTEFPGVVARPIGEFR', '3')
('IKEEAVKEKSPSLGK', '3')
('ALLHTVTSILPAEPEAE', '2')
('VAVPTGPTPLDSTPPGGAPHPLTGQEEARAVEK', '5')

I want to plot the distribution of character occurrences in these sequences.

import collections
from ast import literal_eval

import pandas as pd

class MyObj(object):

    __slots__ = ['name', 'seqs', 'charges']

    def __init__(self, name, tuples):
        self.name = name
        self.seqs = set()

        seqs, zs = zip(*tuples)
        self.seqs.update(seqs)
        #self.charges = collections.Counter(zs)
        self.charges = zs

data = {}
inf = ['high_corr.txt', 'low_corr.txt']
names = ['high', 'low']
for i, somefile in enumerate(inf):
    with open(somefile, 'r') as f:
        entries = [literal_eval(line.strip()) for line in f]
        index = names[i] if names else f"File{i}"
        data[index] = MyObj(index, entries)

def getCounts(seq):
    c = collections.Counter(seq)
    return {aa: c[aa] for aa in seq}

d = {name: [getCounts(s) for s in pc.seqs] for name, pc in data.items()} # <- tried dict comprehension as well
df = pd.DataFrame.from_dict(d, orient='index')
df = df.transpose()

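So once the files have been read, I get something like this: (screenshot of the resulting DataFrame)

As you can see, I can't get the individual characters out; each cell is treated as a dict, so nothing gets plotted.

Is there a way to break these letters out and have them as a third column, as in the example in the linked question? To reiterate, what I want to achieve is a box plot with the letters on the x-axis and two boxes (high and low) drawn for each letter.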

1 Answer:

Answer 0 (score: 0)

I'm not sure this is the best way, but a list comprehension is one possibility:

import string

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Simulate your data: three dicts of random per-letter counts for each series
def random_counts():
    return {k: np.random.randint(1, 27) for k in string.ascii_uppercase}

d = {'high': [random_counts() for _ in range(3)],
     'low': [random_counts() for _ in range(3)]}
df = pd.DataFrame(d)
print(df.head())

# “Unpivots” your data
l = [(col, letter, count) 
     for col, series in df.items() 
     for _, dd in series.to_dict().items() 
     for letter, count in dd.items()]
new_df = pd.DataFrame(l)
new_df.columns = ['variable', 'letter', 'count']
print(new_df.head())

# Boxplot with seaborn
sns.boxplot(x='letter',y='count',data=new_df,hue='variable')
plt.show()
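If you would rather lean on pandas itself than on a nested comprehension, the same reshaping can probably be done with `pd.concat` and `DataFrame.melt`. This is only a sketch against simulated data; `random_counts`, `wide` and `long_df` are names made up for the illustration, and the `dropna` step assumes that letters missing from a sequence should simply be left out.

```python
import string

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def random_counts():
    # Hypothetical helper: one random count (1..26) per uppercase letter.
    return {k: int(rng.integers(1, 27)) for k in string.ascii_uppercase}

d = {'high': [random_counts() for _ in range(3)],
     'low': [random_counts() for _ in range(3)]}

# One wide frame per series (rows = sequences, columns = letters),
# stacked with the series name as the outer index level.
wide = pd.concat({name: pd.DataFrame(dicts) for name, dicts in d.items()},
                 names=['variable'])

# melt turns the letter columns into (letter, count) rows; letters that
# never occur in a sequence show up as NaN and are dropped.
long_df = (wide.reset_index(level='variable')
               .melt(id_vars='variable', var_name='letter', value_name='count')
               .dropna(subset=['count']))

print(long_df.head())
```

`long_df` has the same three columns as `new_df` above, so it can be handed to `sns.boxplot` in the same way.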

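For the bigger problem you describe here, I think it would be better to "unpivot" the data before building the DataFrame, i.e. to use a list comprehension instead of a dict comprehension on the line you commented. I don't have your data, so I can only guess that it would look something like this: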

d = [(name, letter, count)
     for name, pc in data.items()
     for s in pc.seqs
     for letter, count in getCounts(s).items()]
df = pd.DataFrame(d)
df.columns = ['variable', 'letter', 'count']
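
Since I can't run this against your files, here is a small self-contained sketch of the same idea using a few of the tuples from your question. Which sequence goes into 'high' and which into 'low' is made up purely for illustration, and `collections.Counter` stands in for your `getCounts`.

```python
import collections

import pandas as pd

# (sequence, charge) tuples copied from the question; the split into
# 'high' and 'low' is an assumption for the sake of the example.
entries = {
    'high': [('HHFRVEHAVAEGAK', '3'), ('MPHGYDTQVGER', '3'),
             ('MPHGYDTQVGER', '3')],
    'low': [('IKEEAVKEKSPSLGK', '3'), ('ALLHTVTSILPAEPEAE', '2')],
}

# One row per (series, letter, count). The set mirrors MyObj.seqs in the
# question, so duplicated sequences are only counted once.
rows = [(name, letter, count)
        for name, tuples in entries.items()
        for seq in sorted({s for s, _ in tuples})
        for letter, count in collections.Counter(seq).items()]

df = pd.DataFrame(rows, columns=['variable', 'letter', 'count'])
print(df.head())
```

The resulting frame is already in long format, so `sns.boxplot(x='letter', y='count', hue='variable', data=df)` should draw one pair of boxes per letter.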