Question

I have the following DataFrame:

import pandas as pd

df = pd.DataFrame({
    "ClusterID" : [1,2,2,1,3],
    "Genes" : ['foo','qux','bar','cux','fii'],
})

It looks like this:

   ClusterID Genes
0          1   foo
1          2   qux
2          2   bar
3          1   cux
4          3   fii

What I want to do is convert it into a dictionary of lists:

{ '1': ['foo','cux'],
  '2': ['qux','bar'],
  '3': ['fii']}

How can I do that?

Answer 1

You can use groupby and apply with tolist, then Series.to_dict:

import pandas as pd

df = pd.DataFrame({
    "ClusterID" : [1,2,2,1,3],
    "Genes" : ['foo','qux','bar','cux','fii'],
})
print(df)
   ClusterID Genes
0          1   foo
1          2   qux
2          2   bar
3          1   cux
4          3   fii

s = df.groupby('ClusterID')['Genes'].apply(lambda x: x.tolist())
print(s)
ClusterID
1    [foo, cux]
2    [qux, bar]
3         [fii]
Name: Genes, dtype: object

print(s.to_dict())
{1: ['foo', 'cux'], 2: ['qux', 'bar'], 3: ['fii']}
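
Note that the keys above are integers, while the question asks for string keys. One possible way to get string keys with the same groupby approach is to group on a string-cast copy of the column. This is only a sketch; the name `result` is introduced here for illustration and is not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "ClusterID": [1, 2, 2, 1, 3],
    "Genes": ['foo', 'qux', 'bar', 'cux', 'fii'],
})

# Group by the ClusterID values cast to str, so the group labels
# (and therefore the dictionary keys) are strings.
result = df.groupby(df["ClusterID"].astype(str))["Genes"].apply(list).to_dict()
print(result)
```

Casting before grouping leaves the original ClusterID column in df unchanged.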

Answer 2

dct = {x:df.Genes[df.ClusterID == x].tolist() for x in set(df.ClusterID)}
# dct == {1: ['foo','cux'], 2: ['qux','bar'], 3: ['fii']}

Since your ClusterID column contains integer values, so will your dictionary keys. If you want the keys to be strings, just use the str function:

dct = {str(x):df.Genes[df.ClusterID == x].tolist() for x in set(df.ClusterID)}

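Here we use a dictionary comprehension. The expression set(df.ClusterID) gives us the set of unique values in that column (a set is fine here because the order of the keys does not matter for this purpose). df.Genes[df.ClusterID == x] gives us the values in the Genes column for the rows whose ClusterID equals x. Calling tolist() converts the resulting pandas.Series into a list.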
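
To make the pieces described above concrete, here is a small sketch that evaluates each sub-expression on its own for a single value, x = 2. The names `unique_ids`, `mask`, and `genes_for_2` are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({
    "ClusterID": [1, 2, 2, 1, 3],
    "Genes": ['foo', 'qux', 'bar', 'cux', 'fii'],
})

# The unique ClusterID values that become the dictionary keys.
unique_ids = set(df.ClusterID)

# Boolean mask: True for rows whose ClusterID equals 2.
mask = df.ClusterID == 2

# Select the matching Genes values and convert the Series to a plain list.
genes_for_2 = df.Genes[mask].tolist()

print(unique_ids)
print(genes_for_2)
```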

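So this dictionary comprehension loops over each unique value in the ClusterID column and stores the list of Genes values corresponding to that value under that key in the dictionary.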
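
The comprehension builds one boolean mask per unique ClusterID value. As an alternative sketch (not part of the original answer), the same dictionary can be built in a single pass over the rows with collections.defaultdict; string keys are used here to match the output requested in the question:

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({
    "ClusterID": [1, 2, 2, 1, 3],
    "Genes": ['foo', 'qux', 'bar', 'cux', 'fii'],
})

# Walk the two columns in parallel and append each gene
# to the list stored under its (string) cluster ID.
grouped = defaultdict(list)
for cluster_id, gene in zip(df["ClusterID"], df["Genes"]):
    grouped[str(cluster_id)].append(gene)

# Convert back to a plain dict for the final result.
dct = dict(grouped)
print(dct)
```

Whether this is faster than the comprehension or the groupby approach likely depends on the size of the data, so it is worth timing on your own DataFrame.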

Convert a two-column pandas DataFrame to a dictionary of lists, with the first column as the key
