I have two pandas DataFrames, as shown below:
occurrences
doc 0 1 2 ... 1809(=n)
0 0 0 1 ... 1
1 0 0 1 ... 0
2 0 0 1 ... 0
.. .. .. .. ... .
m ......................... 0
dictionary
id term
0 foo
1 bar
2 lorem
.. ..
n ipsum
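What I want to do is, for each row of occurrences, retrieve the terms whose cell value is 1 (looked up by id, i.e. the column header of the first DataFrame). In my example, for the first row of occurrences, I would get: ['lorem', 'ipsum']
Thanks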
Answer 0: (score: 1)
Here is an approach using np.where:
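import numpy as np
import pandas as pd
occurrences = pd.DataFrame([[0,0,1,1],[0,1,0,1], [1,0,1,0]])
dictionary = pd.DataFrame({'term':['foo','bar', 'lorem', 'ipsum']})
# row and column positions of the non-zero cells
idx = np.where(occurrences)
# look up the term for each column position, then group the terms by row
(pd.Series(dictionary.values[idx[1]].ravel())
   .groupby(idx[0]).agg(list)
)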
Output:
0 [lorem, ipsum]
1 [bar, ipsum]
2 [foo, lorem]
dtype: object
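To make the intermediate steps of the answer above easier to follow, here is a small sketch that unpacks the result of np.where into separate row and column index arrays. The names rows, cols, terms and result are only illustrative, and it assumes the order of the rows in dictionary matches the order of the columns in occurrences.

```python
import numpy as np
import pandas as pd

occurrences = pd.DataFrame([[0, 0, 1, 1], [0, 1, 0, 1], [1, 0, 1, 0]])
dictionary = pd.DataFrame({'term': ['foo', 'bar', 'lorem', 'ipsum']})

# With a single 2-D argument, np.where returns the row indices and the
# column indices of all non-zero cells, in row-major order.
rows, cols = np.where(occurrences.to_numpy())
# rows -> [0 0 1 1 2 2], cols -> [2 3 1 3 0 2]

# Map every column index to its term (assumes dictionary row order
# matches the column order of occurrences).
terms = dictionary['term'].to_numpy()[cols]

# Group the terms by the row they came from and collect them into lists.
result = pd.Series(terms).groupby(rows).agg(list)
print(result)
```

Note that a row containing no 1 at all would simply be missing from result, because np.where returns no index pair for it.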
Answer 1: (score: 0)
After a few attempts, I got it working this way (maybe not so cool...)
def get_vocabulary(occurences, dictionary):
    vocabulary = []  # one list of terms per document
    for index, row in occurences.iterrows():
        # iterate on each row == each document
        doc = row.values.tolist()  # convert row to list
        ngrams = []
        for i in range(len(doc)):  # for each element
            if doc[i] != 0:
                # match the term by positional index (column 1 of dictionary holds the term)
                ngrams.append(dictionary.iloc[i, 1])
        vocabulary.append(ngrams)
    return vocabulary
The final output (the terms for one document) looks like:
['scheduling', 'distributed', 'deadline', .... , 'rate monotonic scheduling algorithm']
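For comparison, the same row-by-row lookup can be sketched without explicit index loops by using boolean masks. This is only an illustration on the small sample data from the other answer, not the data that produced the output above. The id column and the names terms, mask and vocabulary are assumptions, and the rows of dictionary are assumed to be in the same order as the columns of occurrences.

```python
import pandas as pd

# Small sample data (hypothetical, mirroring the example in the other answer)
occurrences = pd.DataFrame([[0, 0, 1, 1], [0, 1, 0, 1], [1, 0, 1, 0]])
dictionary = pd.DataFrame({'id': [0, 1, 2, 3],
                           'term': ['foo', 'bar', 'lorem', 'ipsum']})

# Terms as an array, so they can be selected with a boolean mask
terms = dictionary['term'].to_numpy()

# True wherever a document contains the term
mask = occurrences.to_numpy() != 0

# One list of terms per document (row)
vocabulary = [terms[row_mask].tolist() for row_mask in mask]
print(vocabulary)
```

Unlike the groupby approach, this keeps an empty list for a document that contains none of the terms.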