I have two DataFrames, as shown below:
import pandas as pd

df1 = pd.DataFrame({"id": ["01", "02", "03", "04", "05", "06"],
                    "string": ["This is a cat",
                               "That is a dog",
                               "Those are birds",
                               "These are bats",
                               "I drink coffee",
                               "I bought tea"]})
df2 = pd.DataFrame({"category": [1, 1, 2, 2, 3, 3],
                    "keywords": ["cat", "dog", "birds", "bats", "coffee", "tea"]})
My DataFrames look like this:
df1:
id string
01 This is a cat
02 That is a dog
03 Those are birds
04 These are bats
05 I drink coffee
06 I bought tea
df2:
category keywords
1 cat
1 dog
2 birds
2 bats
3 coffee
3 tea
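I want to add an output column to df1 that holds the category when at least one keyword from df2 is detected in that row's string, and None otherwise. The expected output is: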
id string category
01 This is a cat 1
02 That is a dog 1
03 Those are birds 2
04 These are bats 2
05 I drink coffee 3
06 I bought tea 3
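I can think of looping over the keywords one by one and, for each keyword, over the strings one by one, but that will not be efficient enough as the data grows. Do you have any suggestions for improving this? Thanks.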
Answer 0 (score: 4)
# Modified your data a bit.
df1 = pd.DataFrame({"id":["01", "02", "03", "04", "05", "06", "07"],
"string":["This is a cat",
"That is a dog",
"Those are birds",
"These are bats",
"I drink coffee",
"I bought tea",
"This won't match squat"]})
You can use a list comprehension with next and a default argument.
df1['category'] = [
next((c for c, k in df2.values if k in s), None) for s in df1['string']]
df1
id string category
0 01 This is a cat 1.0
1 02 That is a dog 1.0
2 03 Those are birds 2.0
3 04 These are bats 2.0
4 05 I drink coffee 3.0
5 06 I bought tea 3.0
6 07 This won't match squat NaN
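The categories print as 1.0 rather than 1 because the unmatched row is stored as NaN, which forces the column to a float dtype. If integer categories matter, one option is pandas' nullable integer dtype. This is a minimal sketch on a stand-in Series (the name `categories` is only illustrative), not part of the original answer:

```python
import pandas as pd

# Stand-in for the computed column: integers plus one missing value.
categories = pd.Series([1, 1, 2, None])
print(categories.dtype)  # float64, because None is stored as NaN

# The nullable "Int64" dtype keeps integers and shows the gap as <NA>.
nullable = categories.astype("Int64")
print(nullable)
```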
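You cannot avoid the O(N²) complexity, but this should be reasonably efficient: next stops at the first matching keyword, so the inner loop does not always run over every keyword (except in the worst case, when nothing matches).
Note that this currently supports only substring matching (not regex-based matching, although that is possible with some modification).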
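As one possible form of that modification, the keywords can be joined into a single regex alternation and matched with `Series.str.extract`. This is a sketch under assumptions rather than the answerer's own code: the sample rows, the `\b` word boundaries, and the names `keyword_to_category`, `pattern`, and `matched` are all illustrative choices.

```python
import re
import pandas as pd

# Small illustrative data; "concatenate" contains "cat" only as a substring.
df1 = pd.DataFrame({"id": ["01", "02", "03"],
                    "string": ["This is a cat", "I drink coffee", "concatenate lists"]})
df2 = pd.DataFrame({"category": [1, 1, 3],
                    "keywords": ["cat", "dog", "coffee"]})

# Map each keyword to its category.
keyword_to_category = dict(zip(df2["keywords"], df2["category"]))

# One alternation pattern; \b restricts matches to whole words (an assumption).
pattern = r"\b(" + "|".join(map(re.escape, df2["keywords"])) + r")\b"

# str.extract returns the first match in each string, or NaN if there is none.
matched = df1["string"].str.extract(pattern, expand=False)
df1["category"] = matched.map(keyword_to_category)
print(df1)
```

Unlike the next-based version, which picks the first keyword in df2 order, this picks the keyword that appears earliest in each string, so the two can disagree when a string contains keywords from different categories.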
Answer 1 (score: 3)
Use a list comprehension with split, matching each word against a dictionary built from df2:
d = dict(zip(df2['keywords'], df2['category']))
df1['cat'] = [next((d[y] for y in x.split() if y in d), None) for x in df1['string']]
print (df1)
id string cat
0 01 This is a cat 1.0
1 02 That is a dog 1.0
2 03 Those are birds 2.0
3 04 These are bats 2.0
4 05 I drink coffee 3.0
5 06 I bought thea NaN
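Because split breaks only on whitespace, this approach matches whole tokens: a keyword followed by punctuation ("cat.") or written in a different case ("Cat") would not be found in the dictionary. The sketch below shows one way to normalize tokens first. The helper name `categorize_tokens` and the choice to lowercase and strip punctuation are assumptions, not part of the answer above.

```python
import string

# Hypothetical keyword -> category mapping, shaped like the dict built from df2.
keyword_to_category = {"cat": 1, "dog": 1, "tea": 3}

def categorize_tokens(text, mapping):
    """Return the category of the first normalized token found in mapping, else None."""
    for token in text.split():
        # Strip surrounding punctuation and lowercase before the dictionary lookup.
        cleaned = token.strip(string.punctuation).lower()
        if cleaned in mapping:
            return mapping[cleaned]
    return None

print(categorize_tokens("This is a Cat.", keyword_to_category))
print(categorize_tokens("concatenate lists", keyword_to_category))
```

Each lookup is a constant-time dictionary access, so the cost per string grows with the number of words in it rather than with the number of keywords.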
Answer 2 (score: 2)
Another easy-to-understand solution is to map a function over df1['string']:
# create a dictionary with keyword -> category pairs
cats = dict(zip(df2.keywords, df2.category))

def categorize(s):
    # return the category of the first keyword that occurs as a substring of s
    for keyword, category in cats.items():
        if keyword in s:
            return category
    # return 0 in case nothing is found
    return 0

df1['category'] = df1['string'].map(categorize)
print(df1)
id string category
0 01 This is a cat 1
1 02 That is a dog 1
2 03 Those are birds 2
3 04 These are bats 2
4 05 I drink coffee 3
5 06 I bought tea 3