Using df.column.str.contains to update a pandas DataFrame column

Asked: 2017-06-16 16:28:21

Tags: python regex pandas

I have a pandas DataFrame with two columns.

df= pd.DataFrame({"C": ['this is orange','this is apple','this is pear','this is plum','this is orange'], "D": [0,0,0,0,0]})

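I want to read column C and return the name of the fruit in column D. My thinking was to use df.C.str.contains to determine whether a given string appears in each row of C, and then update D accordingly. The elements of C may be very long strings, e.g. "this is a red apple", but I only care whether the word apple appears in the cell. I should note that I am not tied to str.contains; it just seemed like the most obvious route. I am just not sure how I would apply it.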

The final DataFrame would look like this:

df= pd.DataFrame({"C": ['this is orange','this is apple','this is pear','this is plum','this is orange'], "D": ['orange','apple','pear','plum','orange']})

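3 answers:

Answer 0 (score: 1)

Since you did not specify how the fruit should be extracted, I am assuming it is always preceded by "this is"; in that case, the following should go a long way: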

import pandas as pd

d = {'C': ['this is orange',
  'this is apple',
  'this is pear',
  'this is plum',
  'this is orange'],
 'D': [0, 0, 0, 0, 0]}

dff = pd.DataFrame(d)

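# Capture the word after "this is" and keep only that word.
# regex=True is explicit because pandas 2.0+ treats the pattern as a literal by default.
dff['D'] = dff.C.str.replace(r'(this is) ([A-Za-z]+)', r'\2', regex=True)
# or just strip the literal prefix
dff['D'] = dff.C.str.replace('this is ', '', regex=False)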


#                 C       D
# 0  this is orange  orange
# 1   this is apple   apple
# 2    this is pear    pear
# 3    this is plum    plum
# 4  this is orange  orange

This uses .str.replace to replace "this is " with an empty string, which leaves only the fruit name.
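One caveat worth checking: .str.replace leaves rows that do not match the pattern untouched, so any sentence that does not start with "this is" would be copied into D in full. A minimal sketch, assuming a made-up second row ("that was plum") that is not in the question's data:

```python
import pandas as pd

# The second row is a hypothetical example that does not start with "this is".
s = pd.Series(["this is orange", "that was plum"])

# Anchor the prefix at the start of the string. regex=True is passed
# explicitly because newer pandas versions default to a literal match.
out = s.str.replace(r"^this is\s+", "", regex=True)

print(out.tolist())  # ['orange', 'that was plum']
```

If non-matching rows are possible in the real data, an extract-based approach that yields NaN for them may be easier to audit.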

I hope this helps.

Answer 1 (score: 1)

Consider this DataFrame

df= pd.DataFrame({"C": ['this is orange','this is apple which is red','this is pear','this is plum','this is orange'], "D": [0,0,0,0,0]})

    C                           D
0   this is orange              0
1   this is apple which is red  0
2   this is pear                0
3   this is plum                0
4   this is orange              0

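You can extract the fruit name with the code below, on the assumption that the fruit name is the word immediately following 'this is':

df['D'] = df.C.str.extract(r'this is ([A-Za-z]+)\s?.*?', expand=False)

You get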

    C                           D
0   this is orange              orange
1   this is apple which is red  apple
2   this is pear                pear
3   this is plum                plum
4   this is orange              orange
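If the fruit does not always directly follow "this is", one possible variant is to extract against a list of known fruit names instead. This is only a sketch: the `fruits` list and the extra sample rows are assumptions, not part of the question.

```python
import re

import pandas as pd

df = pd.DataFrame({"C": ["this is orange",
                         "this is apple which is red",
                         "here is a ripe pear",
                         "this is plum",
                         "nothing to see"]})

# Hypothetical list of known fruit names (an assumption for illustration).
fruits = ["orange", "apple", "pear", "plum"]

# One alternation with word boundaries: \b(orange|apple|pear|plum)\b
pattern = r"\b(" + "|".join(map(re.escape, fruits)) + r")\b"

# expand=False returns a Series; rows without a match become NaN.
df["D"] = df["C"].str.extract(pattern, expand=False)
print(df)
```

The word boundaries keep "apple" from matching inside a longer word such as "pineapple", and the last row ends up as NaN because it contains none of the listed names.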

For the sample data set you posted, a simple split on spaces that takes the last element is enough:

df['D'] = df.C.str.split(' ').str[-1]
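Keep in mind that taking the last token only works when the fruit is the final word. A quick check on the longer sentence from the DataFrame above shows the difference between the last word and the third word:

```python
import pandas as pd

df = pd.DataFrame({"C": ["this is orange", "this is apple which is red"]})

last_word = df["C"].str.split(" ").str[-1]   # last token of each row
third_word = df["C"].str.split(" ").str[2]   # token right after "this is"

print(last_word.tolist())   # ['orange', 'red']
print(third_word.tolist())  # ['orange', 'apple']
```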

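Answer 2 (score: 1)

If the sentence always starts with this is followed by the fruit name, i.e. the third word is always the fruit name, you can also use apply with split(), so that each row's string is split and the third element replaces the value in column D: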

df['D'] = df['C'].apply(lambda val: val.split()[2])

Or, as mentioned in the other answer, just use the vectorized split:

df['D'] = df['C'].str.split().str[2]

Output:

                C       D
0  this is orange  orange
1   this is apple   apple
2    this is pear    pear
3    this is plum    plum
4  this is orange  orange
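For completeness, the str.contains route mentioned in the question can be sketched roughly as follows. It assumes a known list of fruit names (the `fruits` list is hypothetical), and it starts D as an empty object column rather than zeros so that integers and strings are not mixed in one column.

```python
import re

import pandas as pd

df = pd.DataFrame({"C": ["this is orange",
                         "this is apple which is red",
                         "this is pear",
                         "this is plum",
                         "this is orange"]})

# Start with an empty object column (all NaN) instead of integer zeros.
df["D"] = pd.Series(index=df.index, dtype="object")

# Hypothetical list of fruit names to look for (an assumption).
fruits = ["orange", "apple", "pear", "plum"]

for fruit in fruits:
    # Boolean mask of the rows whose text contains the fruit as a whole word.
    mask = df["C"].str.contains(r"\b" + re.escape(fruit) + r"\b", regex=True)
    # Update D only on the matching rows.
    df.loc[mask, "D"] = fruit

print(df)
```

If a row mentions more than one listed fruit, the last match in `fruits` wins with this loop, so the order of the list matters.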