I have a pandas DataFrame with two columns.
df= pd.DataFrame({"C": ['this is orange','this is apple','this is pear','this is plum','this is orange'], "D": [0,0,0,0,0]})
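I want to read column C and return the name of the fruit in column D. My thinking was to use df.C.str.contains to determine whether a given string appears in each row of C, and then update D accordingly. The elements of C may be very long strings, e.g. "this is a red apple", but I only care whether the word apple appears in the cell. I should note that I am not set on using str.contains; it just seemed the most obvious route. I'm just not sure how I would apply it.
The final DataFrame would look like this: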
df= pd.DataFrame({"C": ['this is orange','this is apple','this is pear','this is plum','this is orange'], "D": ['orange','apple','pear','plum','grapefruit']})
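Answer 0 (score: 1)
Since you did not specify how the fruit should be extracted, I assume it is always preceded by "this is"; in that case the following should go a long way: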
import pandas as pd
d = {'C': ['this is orange',
'this is apple',
'this is pear',
'this is plum',
'this is orange'],
'D': [0, 0, 0, 0, 0]}
dff = pd.DataFrame(d)
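# regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching
dff['D'] = dff.C.str.replace(r'(this is) ([A-Za-z]+)', r'\2', regex=True)
# or just strip the literal prefix
dff['D'] = dff.C.str.replace('this is ', '', regex=False)
print(dff)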
#                 C       D
# 0  this is orange  orange
# 1   this is apple   apple
# 2    this is pear    pear
# 3    this is plum    plum
# 4  this is orange  orange
This uses .str.replace to replace "this is " with an empty string.
I hope this helps.
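One caveat worth illustrating: .str.replace only rewrites the part of the string that the pattern matches. Below is a minimal sketch on made-up data (the Series s is hypothetical, not from the question) showing that trailing words survive and that rows without the "this is" prefix are left untouched.

```python
import pandas as pd

# Hypothetical data: one row has trailing words, one lacks the "this is" prefix.
s = pd.Series(["this is orange", "this is apple which is red", "a ripe pear"])

# \2 keeps only the second capture group (the word right after "this is").
replaced = s.str.replace(r"(this is) ([A-Za-z]+)", r"\2", regex=True)

print(replaced.tolist())
# ['orange', 'apple which is red', 'a ripe pear']
```

If the real data contains rows like the last two, an extraction-based approach is probably a better fit than a replacement.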
Answer 1 (score: 1)
Consider this DataFrame:
df= pd.DataFrame({"C": ['this is orange','this is apple which is red','this is pear','this is plum','this is orange'], "D": [0,0,0,0,0]})
                            C  D
0              this is orange  0
1  this is apple which is red  0
2                this is pear  0
3                this is plum  0
4              this is orange  0
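You can extract the fruit name with the following code, assuming the fruit name is the word that immediately follows "this is":
df['D'] = df.C.str.extract(r'this is ([A-Za-z]+)', expand=False)
You get: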
                            C       D
0              this is orange  orange
1  this is apple which is red   apple
2                this is pear    pear
3                this is plum    plum
4              this is orange  orange
For the sample dataset you posted, a simple split on spaces, taking the last element, is enough:
df['D'] = df.C.str.split(' ').str[-1]
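If the fruit can appear anywhere in a long string, as the question suggests, a positional approach will not be enough. Here is a rough sketch that assumes you know the set of fruit names in advance; the list fruits, the sample rows, and the column E are made up for illustration. It shows both a single str.extract call and the str.contains loop the question had in mind.

```python
import re
import pandas as pd

# Hypothetical rows: the fruit is not always the third or last word.
df = pd.DataFrame({"C": ["this is orange", "this is a red apple",
                         "pear, this is", "this is plum", "nothing here"]})

# Assumed list of fruit names to search for.
fruits = ["orange", "apple", "pear", "plum"]

# \b ... \b matches whole words only; re.escape guards against regex metacharacters.
pattern = r"\b(" + "|".join(re.escape(f) for f in fruits) + r")\b"

# Option 1: pull out the first matching fruit; rows without a match become NaN.
df["D"] = df["C"].str.extract(pattern, expand=False)

# Option 2: the str.contains route, one boolean mask per fruit.
df["E"] = None
for fruit in fruits:
    mask = df["C"].str.contains(rf"\b{re.escape(fruit)}\b", regex=True)
    df.loc[mask, "E"] = fruit

print(df)
```

The two options agree on this data. If a row mentions several fruits, str.extract keeps the first one in the string, while the loop keeps whichever comes last in fruits.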
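Answer 2 (score: 1)
If the sentence always starts with "this is" followed by the fruit name, i.e. the third word is always the fruit name, you can also use apply together with split() to split the string in each row and use the third element as the value of column D: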
df['D'] = df['C'].apply(lambda val: val.split()[2])
Or, as mentioned in the other answer, just use the vectorized str.split:
df['D'] = df['C'].str.split().str[2]
Output:
                C       D
0  this is orange  orange
1   this is apple   apple
2    this is pear    pear
3    this is plum    plum
4  this is orange  orange
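The two variants behave differently when a row has fewer than three words, which may matter on real data. A small sketch on made-up input (the Series s is hypothetical):

```python
import pandas as pd

# Hypothetical data: the second row has only two words.
s = pd.Series(["this is orange", "this is", "this is apple which is red"])

# Vectorized version: a missing third word simply becomes NaN.
third_word = s.str.split().str[2]
print(third_word.tolist())

# apply version: plain list indexing raises IndexError on the short row.
raised = False
try:
    s.apply(lambda val: val.split()[2])
except IndexError:
    raised = True
print(raised)
```

So the str.split().str[2] form is the more forgiving of the two, while the apply form will need a length check if short rows are possible.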