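I have a dataframe with two columns, message_id and msg_lower. I also have a list of keywords called terms. My goal is to search the msg_lower field for every entry in the terms list. Whenever one matches, I want to return a tuple containing the message_id and the keyword.
The data looks like this: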
|message_id|msg_lower |
|1116193453|text here that means something |
|9023746237|more text there meaning nothing|
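terms = ['text', 'nothing', 'there meaning']
Terms can also be longer than one word.
For the given example, I would like to return:
[(1116193453, 'text'), (9023746237, 'text'), (9023746237, 'nothing'), (9023746237, 'there meaning')]
Ideally, I would like to do this as efficiently as possible.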
Answer 0: (score: 1)
You can zip the two columns into pairs, loop over the terms, and test whether each term is a member of the split message:
terms = ['text', 'nothing']
a = [(x,i) for x, y in zip(df['message_id'],df['msg_lower']) for i in terms if i in y.split()]
print (a)
[(1116193453, 'text'), (9023746237, 'text'), (9023746237, 'nothing')]
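EDIT: to also match multi-word terms, test each term against the whole string instead of the split list: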
terms = ['text', 'nothing', 'there meaning']
a = [(x, i) for x, y in zip(df['message_id'],df['msg_lower']) for i in terms if i in y]
print (a)
[(1116193453, 'text'), (9023746237, 'text'),
(9023746237, 'nothing'), (9023746237, 'there meaning')]
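One caveat with the plain `in` test on the whole string: it matches substrings, not whole words. The following is a minimal sketch of that effect. It uses plain lists in place of the dataframe columns, and the third message is a made-up row added only to show the pitfall:

```python
# Plain lists stand in for df['message_id'] and df['msg_lower'].
# The third row is hypothetical and exists only to trigger a false positive.
message_ids = [1116193453, 9023746237, 5550000001]
messages = [
    "text here that means something",
    "more text there meaning nothing",
    "some context without the keyword",
]
terms = ['text', 'nothing', 'there meaning']

# Same comprehension as in the edit above: substring containment per term.
substring_hits = [(x, i) for x, y in zip(message_ids, messages)
                  for i in terms if i in y]

# 'text' is also reported for the third row, because "context" contains it.
print(substring_hits)
```

If that kind of partial match is not wanted, a word-boundary check is the safer choice.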
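Another idea is to use re.findall with word boundaries to extract the values (re.escape guards against terms that contain regex metacharacters):
import re
a = [(x, i) for x, y in zip(df['message_id'],df['msg_lower'])
     for i in terms if re.findall(r"\b{}\b".format(re.escape(i)), y)]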
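Since the question asks for efficiency, one possible refinement is to build each word-boundary pattern once, before the loop, instead of formatting it again for every row. This is only a sketch: plain lists stand in for the dataframe columns, and the names `patterns` and `matches` are illustrative, not part of the original answer.

```python
import re

# Plain lists stand in for df['message_id'] and df['msg_lower'].
message_ids = [1116193453, 9023746237]
messages = ["text here that means something",
            "more text there meaning nothing"]
terms = ['text', 'nothing', 'there meaning']

# Compile one word-boundary pattern per term, once, outside the row loop.
# re.escape keeps any regex metacharacters in a term literal.
patterns = [(t, re.compile(r"\b{}\b".format(re.escape(t)))) for t in terms]

# search() stops at the first hit, which is enough for a yes/no test.
matches = [(mid, t) for mid, msg in zip(message_ids, messages)
           for t, p in patterns if p.search(msg)]
print(matches)
```

Note that `\b` only behaves as expected when a term starts and ends with a word character. Terms that begin or end with punctuation would need a different boundary test.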
Answer 1: (score: 0)
list(df.apply(lambda x: [(i, x['message_id']) for i in re.findall('|'.join(terms),x['msg_lower'])], axis=1).apply(pd.Series).stack())
Output (with terms = ['text', 'nothing']; the snippet assumes import re and import pandas as pd):
[('text', 1116193453), ('text', 9023746237), ('nothing', 9023746237)]
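Answer 2: (score: 0)
If your keywords are single words (no spaces), you can use sets. I don't know how your data is stored, but with a two-dimensional array it could work like this: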
data = [["1116193453", "text here that means something"],
        ["9023746237", "more text there meaning nothing"]]
terms = {"text", "nothing"}
matches = []
for row in data:
    for word in set(row[1].split()) & terms:
        matches.append((row[0], word))
print(matches)
# [('1116193453', 'text'), ('9023746237', 'text'), ('9023746237', 'nothing')]
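The set approach can be stretched to cover multi-word terms by intersecting the terms with word n-grams of each message rather than with single words. This is a sketch of a different technique from the answer above, not part of it. It assumes the terms and messages are separated by whitespace in the same way, and the helper `ngrams` is a hypothetical name introduced here:

```python
data = [["1116193453", "text here that means something"],
        ["9023746237", "more text there meaning nothing"]]
terms = {"text", "nothing", "there meaning"}

# The longest term (counted in words) bounds the n-gram size we need.
max_words = max(len(t.split()) for t in terms)

def ngrams(words, max_n):
    """Return the set of all runs of 1..max_n consecutive words, space-joined."""
    return {" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)}

matches = []
for message_id, text in data:
    # sorted() is only there to make the output order deterministic.
    for term in sorted(ngrams(text.split(), max_words) & terms):
        matches.append((message_id, term))
print(matches)
```

Each message is scanned once regardless of how many terms there are, which may help when the term list is large. The cost is building up to `max_words` n-grams per word position.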