Python: search a text column and return the matching keywords from a list of terms

Time: 2019-05-21 13:27:17

Tags: python pandas

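I have a dataframe with two columns, message_id and msg_lower. I also have a list of keywords called terms. My goal is to search the msg_lower field for every entry in the terms list. Whenever one matches, I want to return a tuple containing the message_id and the keyword.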

The data looks like this:

|message_id|msg_lower                      |
|1116193453|text here that means something |
|9023746237|more text there meaning nothing|
terms = ['text', 'nothing', 'there meaning']

A term can also be longer than one word.

For the given example, I want to return:

[(1116193453, 'text'), (9023746237, 'text'), (9023746237, 'nothing'), (9023746237, 'there meaning')]

Ideally, I would like to do this as efficiently as possible.

3 answers:

Answer 0: (score: 1)

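You can zip the two columns together, loop over the resulting pairs and over the terms in a list comprehension, and test whether each term is a member of the split message: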

terms = ['text', 'nothing']
a = [(x,i) for x, y in zip(df['message_id'],df['msg_lower']) for i in terms if i in y.split()]
print (a)
[(1116193453, 'text'), (9023746237, 'text'), (9023746237, 'nothing')]

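Edit: for terms that span several words, test whether the term is a substring of the message instead (note that this also matches inside longer words, e.g. 'text' in 'context'):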

terms = ['text', 'nothing', 'there meaning']

a = [(x, i) for x, y in zip(df['message_id'],df['msg_lower']) for i in terms if i in y]
print (a)
[(1116193453, 'text'), (9023746237, 'text'), 
 (9023746237, 'nothing'), (9023746237, 'there meaning')]
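
For anyone who wants to run the two comprehensions above end to end, here is a minimal sketch. The DataFrame is rebuilt from the sample rows in the question. The third row (with its made-up message_id) is an assumption added only to show where the two membership tests differ.

```python
import pandas as pd

# Sample rows from the question, plus one hypothetical row ("context only")
# that is NOT in the original data; it only illustrates the difference
# between the two membership tests.
df = pd.DataFrame({
    "message_id": [1116193453, 9023746237, 5550000001],
    "msg_lower": [
        "text here that means something",
        "more text there meaning nothing",
        "context only",
    ],
})
terms = ["text", "nothing", "there meaning"]

pairs = list(zip(df["message_id"], df["msg_lower"]))

# Whole-token test: a multi-word term can never equal a single split token.
by_word = [(mid, t) for mid, msg in pairs for t in terms if t in msg.split()]

# Substring test: finds multi-word terms, but also "text" inside "context".
by_substring = [(mid, t) for mid, msg in pairs for t in terms if t in msg]

print(by_word)
print(by_substring)
```

The split version misses 'there meaning', while the substring version reports 'text' for the "context only" row. Which trade-off is acceptable depends on the data.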

Another idea is to use findall with word boundaries to extract the values:

import re

a = [(x, i) for x, y in zip(df['message_id'], df['msg_lower'])
            for i in terms if re.findall(r"\b{}\b".format(re.escape(i)), y)]
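
A self-contained sketch of the word-boundary approach, under a few assumptions: the terms are meant to be matched literally (so each one goes through re.escape), each pattern is compiled once up front, and a plain list of rows stands in for the two DataFrame columns. The third row is hypothetical and only shows that "context" no longer counts as a match for "text".

```python
import re

rows = [
    (1116193453, "text here that means something"),
    (9023746237, "more text there meaning nothing"),
    (5550000001, "context only"),  # hypothetical row, not in the question
]
terms = ["text", "nothing", "there meaning"]

# One compiled pattern per term; re.escape keeps characters such as "." or
# "+" inside a term from being interpreted as regex syntax.
patterns = {t: re.compile(r"\b{}\b".format(re.escape(t))) for t in terms}

matches = [
    (message_id, term)
    for message_id, message in rows
    for term, pattern in patterns.items()
    if pattern.search(message)
]
print(matches)
# [(1116193453, 'text'), (9023746237, 'text'), (9023746237, 'nothing'), (9023746237, 'there meaning')]
```

Using search rather than findall here is a small design choice: it stops at the first hit, which is enough when only the presence of a term matters.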

Answer 1: (score: 0)

list(df.apply(lambda x: [(i, x['message_id']) for i in re.findall('|'.join(terms),x['msg_lower'])], axis=1).apply(pd.Series).stack())

Output:

[('text', 1116193453), ('text', 9023746237), ('nothing', 9023746237)]
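
A possible vectorized variant of the same regex-alternation idea, offered as a sketch rather than a drop-in replacement. It assumes a pandas version that provides DataFrame.explode (0.25 or later), escapes the terms, adds word boundaries, and returns tuples in (message_id, term) order as the question asks. The column name "term" is introduced here for illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "message_id": [1116193453, 9023746237],
    "msg_lower": [
        "text here that means something",
        "more text there meaning nothing",
    ],
})
terms = ["text", "nothing", "there meaning"]

# Put longer terms first so that, where two terms start at the same position,
# the alternation prefers the longer one.
ordered = sorted(terms, key=len, reverse=True)
pattern = r"\b(?:{})\b".format("|".join(re.escape(t) for t in ordered))

# str.findall yields one list of matches per row; explode turns each list
# element into its own row (rows with no match become NaN and are dropped).
found = (
    df.assign(term=df["msg_lower"].str.findall(pattern))
      .explode("term")
      .dropna(subset=["term"])
)
result = list(zip(found["message_id"], found["term"]))
print(result)
# [(1116193453, 'text'), (9023746237, 'text'), (9023746237, 'there meaning'), (9023746237, 'nothing')]
```

Matches come back in the order they appear in each message, not in the order of the terms list. Because findall returns non-overlapping matches, two terms that overlap in the text (say 'there' and 'there meaning') would yield only one of them.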

Answer 2: (score: 0)

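If your keywords are single words (no spaces), you can use sets. I don't know how your data is stored, but with a two-dimensional list it could work like this: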

data = [["1116193453", "text here that means something"],
        ["9023746237", "more text there meaning nothing"]]
terms = {"text", "nothing"}

matches = []
for row in data:
    for word in set(row[1].split()) & terms:
        matches.append((row[0], word))

print(matches)
# [('1116193453', 'text'), ('9023746237', 'text'), ('9023746237', 'nothing')]
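
If multi-word terms also need to be covered, one way to keep the set intersection is to compare against word n-grams instead of single words. This is a sketch of that idea, not part of the answer above. It assumes that terms and messages are separated by whitespace in the same way, and the helper ngrams is a hypothetical name introduced here.

```python
data = [["1116193453", "text here that means something"],
        ["9023746237", "more text there meaning nothing"]]
terms = {"text", "nothing", "there meaning"}

# The longest term (counted in words) bounds the n-gram size we must generate.
max_words = max(len(t.split()) for t in terms)


def ngrams(words, max_n):
    """Return every run of 1..max_n consecutive words, joined by single spaces."""
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }


matches = []
for message_id, message in data:
    # sorted() is only there to make the output order deterministic.
    for term in sorted(ngrams(message.split(), max_words) & terms):
        matches.append((message_id, term))

print(matches)
# [('1116193453', 'text'), ('9023746237', 'nothing'), ('9023746237', 'text'), ('9023746237', 'there meaning')]
```

The cost per message grows with the message length times max_words, so this stays cheap as long as the terms are short phrases.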