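I have a list of fruits, and I want to check whether any of them appear in a DataFrame and, if they do (regardless of which column), indicate which ones were found.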
import pandas as pd
Fruits = ["Avocado", "Blackberry", "Black Sapote", "Fingered Citron", "Crab Apples", "Custard Apple", "Chico Fruit", "Coconut", "Damson", "Elderberry", "Goji Berry", "Grape", "Guava", "Huckleberry"]
data = {'ID': ["488", "14805", "23591", "470995", "56251", "85964", "5268", "322624", "342225", "380689", "480562", "5623"],
'Content' : ["Kalo Beruin", "this is Blackberry", "Khara Beruin", "guava and coconut", "Lapha", "Loha Sura", "Matichak", "Miniket Rice", "Mou Beruin", "Moulata", "oh Goji Berry", "purple Grape"],
'Content_1' : ["Jook-sing noodles", "grape", "Lai fun", "Damson", "Liangpi", "Custard Apple and Crab apples", "Misua", "nana Coconut Berry", "Damson", "Paomo", "Ramen", "Rice vermicelli"]}
df = pd.DataFrame(data)
df = df[['ID', 'Content', 'Content_1']]
s = pd.Series(data['Content'])
s_1 = pd.Series(data['Content_1'])
df["found_content"] = s[s.str.contains('|'.join(Fruits))]
df["found_content_1"] = s_1[s_1.str.contains('|'.join(Fruits))]
# Write the result directly; ExcelWriter.save() was removed in pandas 2.0
df.to_excel('C:\\TEM\\22522.xlsx', sheet_name='Sheet1', index=False)
The problem with the code is:
How can I achieve this? Thanks.
Here is a screenshot of the current output and the desired output.
Answer 0 (score: 3)
Use str.findall with re.I to ignore case, then join the resulting lists with str.join:
import re
# \b marks a word boundary; the non-capturing group makes it apply to every fruit name,
# not just the first and last alternatives
pat = r'\b(?:{})\b'.format('|'.join(Fruits))
df["found_content"] = df['Content'].str.findall(pat, flags=re.I).str.join(';')
df["found_content_1"] = df['Content_1'].str.findall(pat, flags=re.I).str.join(';')
print(df)
        ID             Content                      Content_1  found_content  \
0      488         Kalo Beruin              Jook-sing noodles
1    14805  this is Blackberry                          grape     Blackberry
2    23591        Khara Beruin                        Lai fun
3   470995   guava and coconut                         Damson  guava;coconut
4    56251               Lapha                        Liangpi
5    85964           Loha Sura  Custard Apple and Crab apples
6     5268            Matichak                          Misua
7   322624        Miniket Rice             nana Coconut Berry
8   342225          Mou Beruin                         Damson
9   380689             Moulata                          Paomo
10  480562       oh Goji Berry                          Ramen     Goji Berry
11    5623        purple Grape                Rice vermicelli          Grape

              found_content_1
0
1                       grape
2
3                      Damson
4
5   Custard Apple;Crab apples
6
7                     Coconut
8                      Damson
9
10
11
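When building an alternation pattern with word boundaries, it matters where the `\b` anchors sit relative to the group. The following is a minimal sketch using only the standard `re` module; the shortened fruit list and the names `loose` and `strict` are illustrative assumptions, not part of the original answer. It also escapes the names and sorts them longest first, which is a defensive choice rather than something the data above requires.

```python
import re

# Shortened, illustrative fruit list (assumption: not the full list from the question)
fruits = ["Grape", "Coconut", "Blackberry"]

# \b inside the group binds only to the first and last alternatives:
# (\bGrape|Coconut|Blackberry\b)
loose = re.compile(r"(\b{}\b)".format("|".join(fruits)), re.I)

# Escape each name and try longer names first, then wrap the whole
# alternation in a non-capturing group so \b applies to every name
alternatives = "|".join(re.escape(f) for f in sorted(fruits, key=len, reverse=True))
strict = re.compile(r"\b(?:{})\b".format(alternatives), re.I)

text = "Grapefruit and coconuts"
loose_hits = loose.findall(text)    # partial-word hits slip through
strict_hits = strict.findall(text)  # whole words only

print(loose_hits)
print(strict_hits)
print(strict.findall("purple grape"))
```

With the `loose` form, "Grapefruit" and "coconuts" both produce partial matches, while the `strict` form only reports whole-word matches.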
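Another solution is to title-case the text with str.title instead of using re.I; the matches are then reported in title case: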
# Same pattern as before: the non-capturing group lets \b apply to every fruit name
pat = r'\b(?:{})\b'.format('|'.join(Fruits))
df["found_content"] = df['Content'].str.title().str.findall(pat).str.join(';')
df["found_content_1"] = df['Content_1'].str.title().str.findall(pat).str.join(';')
print(df)
        ID             Content                      Content_1  found_content  \
0      488         Kalo Beruin              Jook-sing noodles
1    14805  this is Blackberry                          grape     Blackberry
2    23591        Khara Beruin                        Lai fun
3   470995   guava and coconut                         Damson  Guava;Coconut
4    56251               Lapha                        Liangpi
5    85964           Loha Sura  Custard Apple and Crab apples
6     5268            Matichak                          Misua
7   322624        Miniket Rice             nana Coconut Berry
8   342225          Mou Beruin                         Damson
9   380689             Moulata                          Paomo
10  480562       oh Goji Berry                          Ramen     Goji Berry
11    5623        purple Grape                Rice vermicelli          Grape

              found_content_1
0
1                       Grape
2
3                      Damson
4
5   Custard Apple;Crab Apples
6
7                     Coconut
8                      Damson
9
10
11
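Since the question asks for matches "regardless of which column", the same findall approach can be looped over several text columns and then merged. This is only a sketch on a reduced sample: the `text_cols` list, the `found_any` column, and the trimmed data are assumptions made for illustration.

```python
import re
import pandas as pd

# Reduced sample of the question's data (assumption: three rows are enough to illustrate)
fruits = ["Blackberry", "Coconut", "Guava", "Grape", "Damson"]
df = pd.DataFrame({
    "ID": ["14805", "470995", "56251"],
    "Content": ["this is Blackberry", "guava and coconut", "Lapha"],
    "Content_1": ["grape", "Damson", "Liangpi"],
})

# Whole-word, case-insensitive alternation over the escaped fruit names
pat = r"\b(?:{})\b".format("|".join(map(re.escape, fruits)))

# Hypothetical list of columns to search; adjust to the real frame
text_cols = ["Content", "Content_1"]
for col in text_cols:
    df["found_" + col.lower()] = df[col].str.findall(pat, flags=re.I).str.join(";")

# Merge the per-column results into one column, skipping empty strings
found_cols = ["found_" + c.lower() for c in text_cols]
df["found_any"] = df[found_cols].apply(
    lambda row: ";".join(v for v in row if v), axis=1
)

print(df)
```

Keeping the per-column results and the merged column side by side makes it easy to see both where a fruit was found and whether a row contains any fruit at all.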
Answer 1 (score: 0)
Maybe something like this:
import pandas as pd
Fruits = ["Avocado", "Blackberry", "Black Sapote", "Fingered Citron", "Crab Apples", "Custard Apple", "Chico Fruit", "Coconut", "Damson", "Elderberry", "Goji Berry", "Grape", "Guava", "Huckleberry"]
data = {'ID': ["488", "14805", "23591", "470995", "56251", "85964", "5268", "322624", "342225", "380689", "480562", "5623"],
'Content' : ["Kalo Beruin", "this is Blackberry", "Khara Beruin", "guava and coconut", "Lapha", "Loha Sura", "Matichak", "Miniket Rice", "Mou Beruin", "Moulata", "oh Goji Berry", "purple Grape"],
'Content_1' : ["Jook-sing noodles", "grape", "Lai fun", "Damson", "Liangpi", "Custard Apple and Crab apples", "Misua", "nana Coconut Berry", "Damson", "Paomo", "Ramen", "Rice vermicelli"]}
df = pd.DataFrame(data)
# str.extract keeps only the first match per row and, without flags, is case-sensitive
df["found_content"] = df['Content'].str.extract('(?P<Fruits>{})'.format("|".join(Fruits)), expand=False).fillna('')
df["found_content_1"] = df['Content_1'].str.extract('(?P<Fruits>{})'.format("|".join(Fruits)), expand=False).fillna('')
# ExcelWriter.save() was removed in pandas 2.0; to_excel can write to a path directly
df.to_excel('filename.xlsx', sheet_name='Sheet1', index=False)
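Note that str.extract returns only the first match in each row, whereas str.findall collects every match. A small comparison, assuming a three-row sample Series and a shortened pattern chosen purely for illustration:

```python
import re
import pandas as pd

# Illustrative sample (assumption: taken from the Content_1 values in the question)
s = pd.Series(["Custard Apple and Crab apples", "grape", "Ramen"])
pat = r"\b(?:Custard Apple|Crab Apples|Grape)\b"

# extract needs a capture group and keeps only the first hit per row
first_only = s.str.extract("(" + pat + ")", flags=re.I, expand=False).fillna("")

# findall returns a list of all hits per row, joined here with ';'
all_hits = s.str.findall(pat, flags=re.I).str.join(";")

print(first_only.tolist())
print(all_hits.tolist())
```

For rows that mention more than one fruit, only the findall version reports all of them.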