Pandas: find all words in a DataFrame that match a list

Time: 2020-11-04 06:26:53

Tags: python regex pandas

I have a list of associated words for each emotion (anger, fear, anticipation, trust, etc.).

The anticipation list:

{'anticipation': ['abundance',
          'opera',
          'star',
          'start',
          'achievement',
          'acquiring',...]

I also have a DataFrame with one sentence per row. I want to find the emotion-related words in each sentence.

| text                          |
|---------------------------    |
| operation start yesterday     |
| operation start now           |
| operation halt                |

Expected output:

| text                          | result        |
|---------------------------    |-------------  |
| operation start yesterday     | start         |
| operation start now           | start         |
| operation achievement         | achievement   |

I tried:

df['result']=df["text"].str.findall(r"\b"+"|".join(anticipationlist) +r"\b").apply(", ".join)

My result is:

| text                          | result                |
|---------------------------    |--------------------   |
| operation start yesterday     | opera, star           |
| operation start now           | opera, star           |
| operation achievement         | opera, achievement    |

How can I improve the code to get the desired result?

2 answers:

Answer 0 (score: 1)

You can add word boundaries to each value separately:

pat = '|'.join(r"\b{}\b".format(x) for x in anticipationlist)
df['result']=df["text"].str.findall(pat).apply(", ".join)

print (df)
                        text       result
0  operation start yesterday        start
1        operation start now        start
2      operation achievement  achievement
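
The attempt in the question fails because alternation (|) binds more loosely than \b, so in r"\b" + "|".join(...) + r"\b" the leading boundary applies only to the first word and the trailing boundary only to the last one. A minimal sketch of an equivalent fix is to wrap all alternatives in a single non-capturing group so that one pair of boundaries covers every word. The use of re.escape here is an added precaution (an assumption, not part of the answer above) for words that might contain regex metacharacters.

```python
import re

import pandas as pd

anticipationlist = ['abundance', 'opera', 'star', 'start',
                    'achievement', 'acquiring']

df = pd.DataFrame({'text': ['operation start yesterday',
                            'operation start now',
                            'operation achievement']})

# One pair of \b anchors around a non-capturing group applies to every
# alternative, so 'opera' can no longer match inside 'operation'.
pat = r"\b(?:" + "|".join(re.escape(w) for w in anticipationlist) + r")\b"

# findall returns the full matches because the group is non-capturing.
df['result'] = df['text'].str.findall(pat).apply(", ".join)
print(df)
```

The group must be non-capturing, (?:...), because with a capturing group findall would return the group contents rather than the whole match; here the two happen to coincide, but the non-capturing form keeps the intent clear.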

Answer 1 (score: 0)

Here is an approach that does not use regular expressions. I also changed your anticipationlist from a dict to a list.

import pandas as pd

anticipationlist= ['abundance',
                    'opera',
                    'star',
                    'start',
                    'achievement',
                    'acquiring',
                    ]

values = [
    'operation start yesterday',
    'operation start now',
    'operation achievement',
    ]
df = pd.DataFrame(data=values, columns=['text'])

def find_values(x):
    results = []
    for value in anticipationlist:
        for word in x.split():
            if word == value:
                results.append(word)
    return ' '.join(results)
df['result'] = df['text'].apply(find_values)

print(df.head())
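
The nested loop above compares every word in the list with every token and returns matches in list order. As a hedged variant (not the answerer's code), the same idea can be sketched with a set for constant-time membership tests, which also keeps matches in the order they appear in the text. The name anticipation_set is introduced here for illustration only.

```python
import pandas as pd

anticipationlist = ['abundance', 'opera', 'star', 'start',
                    'achievement', 'acquiring']

# Hypothetical helper name: a set makes each membership test O(1) on average.
anticipation_set = set(anticipationlist)

df = pd.DataFrame({'text': ['operation start yesterday',
                            'operation start now',
                            'operation achievement']})

def find_values(text):
    # Keep each whitespace-separated token found in the set, in text order.
    return ", ".join(word for word in text.split() if word in anticipation_set)

df['result'] = df['text'].apply(find_values)
print(df)
```

Note that str.split only separates on whitespace, so a token such as "start," with trailing punctuation would not match; the regex approach with word boundaries handles that case.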