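How can I convert a 2D dictionary (a dictionary of lists) into a DataFrame or a presence/absence matrix, where the values in the lists become the columns and the keys become the row names? I want to tally the values in each list and organize them into a matrix.
This is what I have tried so far, without success: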
values = set()
for genome, info in dict_cluster.items():
    for v in info:
        #t = [genome, ([v for v in info])]
        t = [genome, v]
    print pd.DataFrame(t)
Input:
A ['arylpolyene', 'hserlactone', 'hserlactone', 'nrps', 'siderophore', 't1pks-nrps', 'terpene', 'thiopeptide', 'transatpks-nrps']
B ['hserlactone', 'hserlactone-arylpolyene', 'nrps', 'siderophore', 'thiopeptide']
C ['nrps', 'nrps', 'nrps', 'siderophore', 't1pks-nrps', 't1pks-nrps']
D ['nrps', 'siderophore', 't1pks-nrps']
Desired output:
   arylpolyene  siderophore  hserlactone-arylpolyene  transatpks-nrps  terpene  thiopeptide  hserlactone  nrps  t1pks-nrps
A            1            1                        0                1        1            1            2     1           1
B            0            1                        1                0        0            1            1     1           0
C            0            1                        0                0        0            0            0     3           2
D            0            1                        0                0        0            0            0     1           1
My output looks like this:
                 0
0  GCF_900068895.1
1  transatpks-nrps
                 0
0  GCA_002415165.1
1      thiopeptide
                 0
0  GCA_000367685.2
1       t1pks-nrps
                 0
0  GCA_002732135.1
1       t1pks-nrps
Answer 0 (score: 1)
Use Counter with a dictionary comprehension and pass the result to the DataFrame constructor:
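from collections import Counter
import pandas as pd

# d is the dictionary of lists from the question (defined in full below).
# Each Counter becomes one column; transpose so that the keys become rows.
df = pd.DataFrame({k: Counter(v) for k, v in d.items()}).T.fillna(0).astype(int)
print(df)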
   arylpolyene  hserlactone  hserlactone-arylpolyene  nrps  siderophore  \
A            1            2                        0     1            1
B            0            1                        1     1            1
C            0            0                        0     3            1
D            0            0                        0     1            1

   t1pks-nrps  terpene  thiopeptide  transatpks-nrps
A           1        1            1                1
B           0        0            1                0
C           2        0            0                0
D           1        0            0                0
Edit:
For indicator (presence/absence) values, use MultiLabelBinarizer:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

d = {'A': ['arylpolyene', 'hserlactone', 'hserlactone', 'nrps', 'siderophore', 't1pks-nrps', 'terpene', 'thiopeptide', 'transatpks-nrps'],
     'B': ['hserlactone', 'hserlactone-arylpolyene', 'nrps', 'siderophore', 'thiopeptide'],
     'C': ['nrps', 'nrps', 'nrps', 'siderophore', 't1pks-nrps', 't1pks-nrps'],
     'D': ['nrps', 'siderophore', 't1pks-nrps']}

mlb = MultiLabelBinarizer()
# fit_transform returns one 0/1 row per list; classes_ holds the sorted labels.
df = pd.DataFrame(mlb.fit_transform(d.values()), columns=mlb.classes_, index=list(d.keys()))
print(df)
   arylpolyene  hserlactone  hserlactone-arylpolyene  nrps  siderophore  \
A            1            1                        0     1            1
B            0            1                        1     1            1
C            0            0                        0     1            1
D            0            0                        0     1            1

   t1pks-nrps  terpene  thiopeptide  transatpks-nrps
A           1        1            1                1
B           0        0            1                0
C           1        0            0                0
D           1        0            0                0
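If scikit-learn is not installed, the same indicator matrix can probably be derived from the count matrix instead. The following is a minimal sketch, not part of the original answer; the names `counts` and `presence` are illustrative, and it assumes that any count above zero should be reported as 1.

```python
from collections import Counter

import pandas as pd

d = {
    'A': ['arylpolyene', 'hserlactone', 'hserlactone', 'nrps', 'siderophore',
          't1pks-nrps', 'terpene', 'thiopeptide', 'transatpks-nrps'],
    'B': ['hserlactone', 'hserlactone-arylpolyene', 'nrps', 'siderophore', 'thiopeptide'],
    'C': ['nrps', 'nrps', 'nrps', 'siderophore', 't1pks-nrps', 't1pks-nrps'],
    'D': ['nrps', 'siderophore', 't1pks-nrps'],
}

# Count matrix, built the same way as in the first part of the answer.
counts = pd.DataFrame({k: Counter(v) for k, v in d.items()}).T.fillna(0).astype(int)

# Any count above zero becomes 1 (present); zero stays 0 (absent).
presence = (counts > 0).astype(int)
print(presence)
```

The comparison `counts > 0` yields a boolean frame, and `astype(int)` turns it into 0/1 values while keeping the row and column labels.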
Answer 1 (score: 1)
Maybe you are looking for something like this:
import pandas as pd

val = {'A': ['arylpolyene', 'hserlactone', 'hserlactone', 'nrps', 'siderophore', 't1pks-nrps', 'terpene', 'thiopeptide', 'transatpks-nrps'],
       'B': ['hserlactone', 'hserlactone-arylpolyene', 'nrps', 'siderophore', 'thiopeptide'],
       'C': ['nrps', 'nrps', 'nrps', 'siderophore', 't1pks-nrps', 't1pks-nrps'],
       'D': ['nrps', 'siderophore', 't1pks-nrps']}

# Flatten the dictionary into (key, value) pairs, one row per list item.
all_val = []
for k in val:
    for v in val[k]:
        all_val.append((k, v))

df = pd.DataFrame(all_val, columns=['key', 'val'])
# Count each (key, val) pair. groupby/size/unstack is used here in place of
# pivot_table(aggfunc=len), which has no value column left to aggregate.
df_count = df.groupby(['key', 'val']).size().unstack()
print(df_count)
Output:
val  arylpolyene  hserlactone  hserlactone-arylpolyene  nrps  siderophore  \
key
A            1.0          2.0                      NaN   1.0          1.0
B            NaN          1.0                      1.0   1.0          1.0
C            NaN          NaN                      NaN   3.0          1.0
D            NaN          NaN                      NaN   1.0          1.0

val  t1pks-nrps  terpene  thiopeptide  transatpks-nrps
key
A           1.0      1.0          1.0              1.0
B           NaN      NaN          1.0              NaN
C           2.0      NaN          NaN              NaN
D           1.0      NaN          NaN              NaN
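The NaN cells and float counts in this output differ from the integer zeros the question asks for. One possible way to get integer counts directly is `pd.crosstab`, sketched below; this is an alternative to the answer's code rather than the answerer's own method, and the names `pairs`, `long_df` and `counts` are illustrative.

```python
import pandas as pd

val = {
    'A': ['arylpolyene', 'hserlactone', 'hserlactone', 'nrps', 'siderophore',
          't1pks-nrps', 'terpene', 'thiopeptide', 'transatpks-nrps'],
    'B': ['hserlactone', 'hserlactone-arylpolyene', 'nrps', 'siderophore', 'thiopeptide'],
    'C': ['nrps', 'nrps', 'nrps', 'siderophore', 't1pks-nrps', 't1pks-nrps'],
    'D': ['nrps', 'siderophore', 't1pks-nrps'],
}

# One (key, value) row per list item, as in the answer above.
pairs = [(k, v) for k, items in val.items() for v in items]
long_df = pd.DataFrame(pairs, columns=['key', 'val'])

# crosstab counts how often each key/val combination occurs;
# combinations that never occur are reported as 0 rather than NaN.
counts = pd.crosstab(long_df['key'], long_df['val'])
print(counts)
```

Calling `fillna(0).astype(int)` on the answer's `df_count` should give an equivalent table if you prefer to keep that code.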
Answer 2 (score: 0)
This should do the job (I am using Python 3):
my_dict = {
    'A': ['arylpolyene', 'hserlactone', 'hserlactone', 'nrps', 'siderophore', 't1pks-nrps', 'terpene', 'thiopeptide', 'transatpks-nrps'],
    'B': ['hserlactone', 'hserlactone-arylpolyene', 'nrps', 'siderophore', 'thiopeptide'],
    'C': ['nrps', 'nrps', 'nrps', 'siderophore', 't1pks-nrps', 't1pks-nrps'],
    'D': ['nrps', 'siderophore', 't1pks-nrps']
}
rows_list = list(my_dict.keys())
values = list(my_dict.values())
rows_size = len(rows_list)
columns_list = []
for sublist in values:
    for item in sublist:
        if item not in columns_list:
            columns_list.append(item)
columns_size = len(columns_list)
#initialize adjacent matrix
print('Initial adjacent matrix')
adjacent = [ [0]*columns_size for i in range(rows_size) ]
for row in adjacent:
    print(row)
for key, value in my_dict.items():
    for v in value:
        adjacent[rows_list.index(key)][columns_list.index(v)] += 1
print('-'*50)
print('Final adjacent matrix')
for row in adjacent:
    print(row)
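In the first loop, for sublist in values:, I build a list of the distinct values you want as columns, with no duplicates.
In adjacent = [ [0]*columns_size for i in range(rows_size) ], I create a list with as many elements as the dictionary has keys. Each element is itself a list with as many entries as there are column values.
I tried to keep this as simple as possible; let me know if anything is unclear :)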
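To get from the nested list to the labelled table the question asks for, the matrix can be handed to pandas together with the row and column labels. The following is a minimal sketch, assuming pandas is available. The variable names mirror the code above, but the compact construction with `dict.fromkeys` and `list.count` is an illustration rather than the answerer's own code.

```python
import pandas as pd

my_dict = {
    'A': ['arylpolyene', 'hserlactone', 'hserlactone', 'nrps', 'siderophore',
          't1pks-nrps', 'terpene', 'thiopeptide', 'transatpks-nrps'],
    'B': ['hserlactone', 'hserlactone-arylpolyene', 'nrps', 'siderophore', 'thiopeptide'],
    'C': ['nrps', 'nrps', 'nrps', 'siderophore', 't1pks-nrps', 't1pks-nrps'],
    'D': ['nrps', 'siderophore', 't1pks-nrps'],
}

rows_list = list(my_dict)
# dict.fromkeys drops duplicates while keeping first-seen order.
columns_list = list(dict.fromkeys(item for sub in my_dict.values() for item in sub))

# One row per key; each cell is how often that column value occurs in the key's list.
adjacent = [[my_dict[row].count(col) for col in columns_list] for row in rows_list]

# Attach the row and column labels to the plain nested list.
df = pd.DataFrame(adjacent, index=rows_list, columns=columns_list)
print(df)
```

Unlike the pandas-based answers, this keeps the columns in the order in which the values are first encountered instead of sorting them.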