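I am trying to create a box plot with multiple data series and categories, like the example in the linked question.
My data is spread over several files, each containing one series (e.g. 'high', 'low'). Each file has a few thousand lines, and each line is a tuple of a string and an int, for example: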
('HHFRVEHAVAEGAK', '3')
('MPHGYDTQVGER', '3')
('MPHGYDTQVGER', '3')
('MPHGYDTQVGER', '3')
('KYNYVAMDTEFPGVVARPIGEFR', '3')
('KYNYVAMDTEFPGVVARPIGEFR', '3')
('KYNYVAMDTEFPGVVARPIGEFR', '3')
('IKEEAVKEKSPSLGK', '3')
('ALLHTVTSILPAEPEAE', '2')
('VAVPTGPTPLDSTPPGGAPHPLTGQEEARAVEK', '5')
I want to plot the distribution of character occurrences in these sequences.
import collections
from ast import literal_eval

import pandas as pd

class MyObj(object):
    __slots__ = ['name', 'seqs', 'charges']

    def __init__(self, name, tuples):
        self.name = name
        self.seqs = set()
        seqs, zs = zip(*tuples)
        self.seqs.update(seqs)
        #self.charges = collections.Counter(zs)
        self.charges = zs
data = {}
inf = ['high_corr.txt', 'low_corr.txt']
names = ['high', 'low']
for i, somefile in enumerate(inf):
    with open(somefile, 'r') as f:
        entries = [literal_eval(line.strip()) for line in f]
    index = names[i] if names else f"File{i}"
    data[index] = MyObj(index, entries)
def getCounts(seq):
    c = collections.Counter(seq)
    return {aa: c[aa] for aa in seq}
d = {name: [getCounts(s) for s in pc.seqs] for name, pc in data.items()} # <- tried dict comprehension as well
df = pd.DataFrame.from_dict(d, orient='index')
df = df.transpose()
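To make the problem concrete, here is a minimal reproduction. It is only a sketch: the `seqs` dict is a hypothetical stand-in for the parsed files, the split of sequences between 'high' and 'low' is arbitrary, and `wide` and `cell_types` are names introduced just for this illustration.

```python
import collections

import pandas as pd

# Hypothetical stand-in for the parsed files (two sequences per series,
# copied from the sample rows above; the high/low split is arbitrary).
seqs = {'high': ['HHFRVEHAVAEGAK', 'MPHGYDTQVGER'],
        'low': ['IKEEAVKEKSPSLGK', 'ALLHTVTSILPAEPEAE']}

# Same construction as above: one dict of letter counts per sequence.
d = {name: [dict(collections.Counter(s)) for s in ss]
     for name, ss in seqs.items()}
wide = pd.DataFrame.from_dict(d, orient='index').transpose()

# Collect the Python type of every cell in the frame.
cell_types = {type(v).__name__ for v in wide.to_numpy().ravel()}
print(wide)
print(cell_types)
```

Each cell of `wide` holds a whole dict of counts rather than a single number, so there is no numeric column for a box plot to use.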
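As you can see, I can't pull the individual characters out: each cell is treated as a dict, so nothing gets plotted.
Is there a way to break these letters out and have them as a third column, as in the example in the linked question? To reiterate, what I want is a box plot with the letters on the x-axis and two boxes per letter (high and low).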
Answer 0 (score: 0)
I'm not sure this is the best way, but a list comprehension is one possibility:
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Simulate your data: three dicts of random letter counts per series
def random_counts():
    return dict(zip(string.ascii_uppercase, np.random.randint(1, 27, size=26)))

d = {'high': [random_counts() for _ in range(3)],
     'low': [random_counts() for _ in range(3)]}
df = pd.DataFrame(d)
print(df.head())
# "Unpivots" your data: one (series, letter, count) row per dict entry
l = [(col, letter, count)
     for col, series in df.items()
     for _, dd in series.to_dict().items()
     for letter, count in dd.items()]
new_df = pd.DataFrame(l)
new_df.columns = ['variable', 'letter', 'count']
print(new_df.head())
# Boxplot with seaborn
sns.boxplot(x='letter',y='count',data=new_df,hue='variable')
plt.show()
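For the larger problem you describe here, I think it may be better to "unpivot" the data before building the DataFrame, i.e. use a list comprehension instead of a dict comprehension on the line you commented. I don't have your data, so I can only guess that it would look something like this: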
d = [(name, letter, count)
     for name, pc in data.items()
     for s in pc.seqs
     for letter, count in getCounts(s).items()]  # .items() yields (letter, count) pairs
df = pd.DataFrame(d)
df.columns = ['variable', 'letter', 'count']
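Since I can't run this against your files, here is a self-contained sketch of the same idea. It assumes the parsed data boils down to a mapping from series name to a set of sequences; `seqs_by_series`, `to_long_rows` and `long_df` are hypothetical names, and the assignment of the sample sequences to 'high' and 'low' is arbitrary.

```python
import collections

import pandas as pd

# Hypothetical stand-in for the parsed files: series name -> set of sequences
# (sequences copied from the sample rows in the question).
seqs_by_series = {
    'high': {'HHFRVEHAVAEGAK', 'MPHGYDTQVGER'},
    'low': {'IKEEAVKEKSPSLGK', 'ALLHTVTSILPAEPEAE'},
}

def to_long_rows(seqs_by_series):
    """Return one (series, letter, count) tuple per letter per sequence."""
    return [(name, letter, count)
            for name, seqs in seqs_by_series.items()
            for seq in sorted(seqs)  # sorted only to make the row order stable
            for letter, count in collections.Counter(seq).items()]

# Long-form frame: one row per (series, letter, count) observation.
long_df = pd.DataFrame(to_long_rows(seqs_by_series),
                       columns=['variable', 'letter', 'count'])
print(long_df.head())
```

A frame in this shape can be passed to the same `sns.boxplot(x='letter', y='count', data=long_df, hue='variable')` call as above. Note that a letter only gets a row for the sequences in which it actually occurs, so sequences without that letter do not contribute a zero.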