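I have a long list of glossary phrases and want to check whether each paragraph contains them, marking 1 for yes and 0 for no. A simplified example: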
>>> import pandas as pd
>>> glossary = ['phrase 1', 'phrase 2', 'phrase 3']
>>> glossary
['phrase 1', 'phrase 2', 'phrase 3']
>>> df = pd.DataFrame(['This is a phrase 1 and phrase 2', 'phrase 1',
...                    'phrase 3', 'phrase 1 & phrase 2. phrase 3 as well'],
...                   columns=['text'])
>>> df
                                    text
0        This is a phrase 1 and phrase 2
1                               phrase 1
2                               phrase 3
3  phrase 1 & phrase 2. phrase 3 as well
After adding one column per glossary phrase, the DataFrame looks like this:
                                    text  phrase 1  phrase 2  phrase 3
0        This is a phrase 1 and phrase 2       NaN       NaN       NaN
1                               phrase 1       NaN       NaN       NaN
2                               phrase 3       NaN       NaN       NaN
3  phrase 1 & phrase 2. phrase 3 as well       NaN       NaN       NaN
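For each glossary column, I want to compare it against the text column and set it to 1 if the phrase appears in the text and 0 if it does not. In this case the result would be: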
                                    text  phrase 1  phrase 2  phrase 3
0        This is a phrase 1 and phrase 2         1         1         0
1                               phrase 1         1         0         0
2                               phrase 3         0         0         1
3  phrase 1 & phrase 2. phrase 3 as well         1         1         1
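Can you tell me how to achieve this? My real DataFrame has about 3000 glossary columns, so I would also like the logic to be general: use each column label as the key to compare against the text in every row.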
Answer 0 (score: 2)
You can use a list comprehension with str.contains, combine the results with concat, and cast to int to get a DataFrame of 0s and 1s:
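# One boolean Series per glossary phrase (str.contains treats the phrase as a regex by default)
L = [df['text'].str.contains(x) for x in glossary]
# Combine side by side, label the columns with the phrases, and cast True/False to 1/0
df1 = pd.concat(L, axis=1, keys=glossary).astype(int)
print(df1)
   phrase 1  phrase 2  phrase 3
0         1         1         0
1         1         0         0
2         0         0         1
3         1         1         1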
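One caveat with real glossaries: str.contains interprets its pattern as a regular expression by default, so a phrase containing characters such as . or + may match more than intended. A minimal sketch, assuming the phrases should be matched literally; the sample data below is made up for illustration and is not from the question:

```python
import pandas as pd

# Hypothetical data: the glossary phrase contains a regex metacharacter ('.')
df = pd.DataFrame({'text': ['price is 3.5 USD', 'price is 345 USD']})
phrase = '3.5'

# Default (regex=True): '.' matches any character, so '345' is also a hit
as_regex = df['text'].str.contains(phrase).astype(int)

# regex=False: the phrase is compared as a plain substring
as_literal = df['text'].str.contains(phrase, regex=False).astype(int)

print(as_regex.tolist())    # [1, 1]
print(as_literal.tolist())  # [1, 0]
```

If you do need regex features for some phrases, escaping the rest with re.escape is another option.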
Then join it back to the original DataFrame:
df = df.join(df1)
print(df)
                                    text  phrase 1  phrase 2  phrase 3
0        This is a phrase 1 and phrase 2         1         1         0
1                               phrase 1         1         0         0
2                               phrase 3         0         0         1
3  phrase 1 & phrase 2. phrase 3 as well         1         1         1
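For a glossary of around 3000 phrases, the two steps above could be wrapped in a small helper. This is only a sketch under some assumptions: flag_glossary is a hypothetical name, phrases are matched as literal substrings (regex=False), and rows with missing text are treated as "no match" (na=False), which the question does not specify.

```python
import pandas as pd

def flag_glossary(df, glossary, text_col='text'):
    """Hypothetical helper: return df plus one 0/1 column per glossary phrase."""
    text = df[text_col]
    flags = {
        # regex=False matches the phrase literally; na=False maps missing text to False
        phrase: text.str.contains(phrase, regex=False, na=False).astype(int)
        for phrase in glossary
    }
    # Build all indicator columns in one DataFrame, then join on the index
    return df.join(pd.DataFrame(flags, index=df.index))

df = pd.DataFrame({'text': ['This is a phrase 1 and phrase 2',
                            'phrase 1', 'phrase 3', None]})
result = flag_glossary(df, ['phrase 1', 'phrase 2', 'phrase 3'])
print(result)
```

Building the indicator columns in one go and joining once avoids inserting thousands of columns into the original DataFrame one at a time.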