Question

Suppose I have a string stored in a variable text. I want to compare this string with the lists of strings stored in a dataframe and check whether text contains words such as car, plane, and so on. For each keyword found, I want to add 1 to the score of the corresponding topic.

| topic      | keywords                                  |
|------------|-------------------------------------------|
| Vehicles   | [car, plane, motorcycle, bus]             |
| Electronic | [television, radio, computer, smartphone] |
| Fruits     | [apple, orange, grape]                    |

I wrote the code below, but I don't really like it, and it doesn't work as expected.

def foo(text, df_lex):

    keyword = []
    score = []
    for lex_list in df_lex['keyword']:
        print(lex_list)
        val = 0
        for lex in lex_list:

            if lex in text:
                val =+ 1
        keyword.append(key)
        score.append(val)
    score_list = pd.DataFrame({
    'keyword':keyword,
    'score':score
    })

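Is there a way to do this efficiently? I don't like having too many loops in my program, because they don't seem very efficient. I will elaborate if needed. Thank you.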

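EDIT: For example, my text looks like this. I kept it simple so that it is easy to understand.

I went to the showroom riding a motorcycle to buy a car today. Unluckily, when I checked my smartphone, I got a message to go home.

So my expected output would look like this.

| topic      | score |
|------------|-------|
| Vehicles   | 2     |
| Electronic | 1     |
| Fruits     | 0     |

EDIT2: I finally found my own solution with the help of @jezrael.

# keywords are stored as strings such as '[car, plane]'; turn them into lists
df['keywords'] = df['keywords'].str.strip('[]').str.split(', ')

text = 'I went to the showroom riding a motorcycle to buy a car today. Unluckily, when I checked my smartphone, I got a message to go home.'

score_list = []
for lex in df['keywords']:
    val = 0
    for w in lex:
        # substring test: counts each keyword at most once per topic
        if w in text:
            val += 1
    score_list.append(val)
df['score'] = score_list
print(df)

It prints exactly what I need.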

Answer 1

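Use re.findall to extract the words, convert them to lowercase, and then to a set s. Finally, get the length of the intersection of s with each keyword list in a list comprehension: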

import pandas as pd

df = pd.DataFrame({
    'topic': ['Vehicles', 'Electronic', 'Fruits'],
    'keywords': [['car', 'plane', 'motorcycle', 'bus'],
                 ['television', 'radio', 'computer', 'smartphone'],
                 ['apple', 'orange', 'grape']],
})

text = 'I went to the showroom riding a motorcycle to buy a car today. Unluckily, when I checked my smartphone, I got a message to go home.'

import re
s = set(x.lower() for x in re.findall(r'\b\w+\b', text))
print (s)
{'go', 'motorcycle', 'a', 'car', 'my', 'the', 'got', 
 'message', 'to', 'home', 'went', 'riding', 'checked', 
 'i', 'showroom', 'when', 'buy', 'smartphone', 'today', 'unluckily'}

df['score'] = [len(s & set(x)) for x in df['keywords']]
print (df)
        topic                                   keywords  score
0    Vehicles              [car, plane, motorcycle, bus]      2
1  Electronic  [television, radio, computer, smartphone]      1
2      Fruits                     [apple, orange, grape]      0

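Another solution is to count only the True values in a list comprehension. Written with text.split(), punctuation stays attached to the tokens (for example 'smartphone,'), so the version below tests membership in the set s built above instead:

df['score'] = [sum(z in s for z in x) for x in df['keywords']]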
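To illustrate how the choice of tokenizer affects the score, here is a minimal sketch comparing plain whitespace splitting with regex word extraction on the example text. The names split_tokens and regex_tokens are illustrative only and do not come from the answer.

```python
import re

text = ('I went to the showroom riding a motorcycle to buy a car today. '
        'Unluckily, when I checked my smartphone, I got a message to go home.')
keywords = ['television', 'radio', 'computer', 'smartphone']

# Whitespace splitting keeps punctuation attached, e.g. 'smartphone,'.
split_tokens = set(text.lower().split())
# The regex keeps only runs of word characters, so punctuation is dropped.
regex_tokens = set(re.findall(r'\w+', text.lower()))

# Summing booleans counts how many keywords are present in each token set.
split_score = sum(k in split_tokens for k in keywords)
regex_score = sum(k in regex_tokens for k in keywords)
print(split_score, regex_score)  # 0 1
```

With whitespace splitting the trailing comma hides smartphone, so the Electronic topic scores 0 instead of the expected 1.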

Answer 2

Here are two alternatives that use only vanilla Python. First, the data of interest.

kwcsv = """topic, keywords
Vehicles, car, plane, motorcycle, bus
Electronic, television, radio, computer, smartphone
Fruits, apple, orange, grape
"""

test = 'I went to the showroom riding a motorcycle to buy a car today. Unluckily, when I checked my smartphone, I got a message to go home.'
testr = test
from io import StringIO

StringIO is only used to make the example runnable; it stands in for reading a file. Next, build a dictionary kwords to use for counting.

import csv

kwords = dict()
#with open('your_file.csv') as mcsv:
mcsv = StringIO(kwcsv)
reader = csv.reader(mcsv, skipinitialspace=True)
next(reader, None) # skip header
for row in reader:
    kwords[row[0]] = tuple(row[1:])

Now the dictionary holds what we want to count. The first alternative is to count directly in the text string.

for r in list('.,'): # remove punctuation that could interfere with counting
    testr = testr.replace(r, '')

result = {k: sum((testr.count(w) for w in v)) for k, v in kwords.items()}
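One thing to keep in mind with str.count is that it counts substrings, not whole words. The following minimal sketch uses a made-up sentence (sample is hypothetical, not from the question) to show the difference from a word-boundary match.

```python
import re

# Hypothetical sentence: 'scarf' contains the substring 'car'.
sample = 'I packed a scarf in the car'

# str.count counts every non-overlapping occurrence of the substring.
substring_hits = sample.count('car')
# \b anchors the match to word boundaries, so only the word 'car' counts.
word_hits = len(re.findall(r'\bcar\b', sample))

print(substring_hits, word_hits)  # 2 1
```

Whether substring matching is acceptable depends on the keywords; for short keywords it can inflate the score.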

Or another version that uses a regular expression to split the string and a Counter.

import re
from collections import Counter

words = re.findall(r'\w+', StringIO(test).read().lower())
count = Counter(words)

result2 = {k: sum((count[w] for w in v)) for k, v in kwords.items()}

I am not saying that either of these is better; they are just alternatives that use only vanilla Python. Personally, I would use the re/Counter version.
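For reference, the re/Counter approach can be wrapped into a small self-contained function. This is only a sketch: the function name score_topics is an assumption made for illustration, and the keyword dictionary is written out literally rather than read from CSV.

```python
import re
from collections import Counter

def score_topics(text, kwords):
    """Return {topic: number of whole-word keyword occurrences in text}."""
    # Lowercase the text and count every word once up front.
    count = Counter(re.findall(r'\w+', text.lower()))
    # Counter returns 0 for missing words, so no membership check is needed.
    return {topic: sum(count[w] for w in words)
            for topic, words in kwords.items()}

kwords = {
    'Vehicles': ('car', 'plane', 'motorcycle', 'bus'),
    'Electronic': ('television', 'radio', 'computer', 'smartphone'),
    'Fruits': ('apple', 'orange', 'grape'),
}
text = ('I went to the showroom riding a motorcycle to buy a car today. '
        'Unluckily, when I checked my smartphone, I got a message to go home.')

result = score_topics(text, kwords)
print(result)  # {'Vehicles': 2, 'Electronic': 1, 'Fruits': 0}
```

Unlike the set-based approach, this counts repeated occurrences, so a keyword that appears twice in the text adds 2 to its topic.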

How can I compare a string with strings in a dataframe using pandas?
