Question

My DataFrame is:

     name     type
0    apple    red fruit with red peel that is edible
1    orange   thick peel that is bitter and used dried sometimes

I want to extract all the text after "peel" from each row and put it in a separate column:

     name     type                                                  peel
0    apple    red fruit with red peel that is edible                that is edible
1    orange   thick peel that is bitter and used dried sometimes    that is bitter and used dried sometimes

I am trying:

def get_peel(desc):
    text = desc.split(' ')
    for i,t in enumerate(text):
        if t.lower() == 'peel':
            return text[i:]
    return 'not found'

df['peel'] = df['type'].apply(get_peel)

But the result I get is:

0         not found
1         not found

What am I doing wrong?

Answer 1

Use str.extract with a regular expression.

For example:

import pandas as pd

df = pd.DataFrame({"name": ['apple', 'orange'], 'type': ['red fruit with red peel that is edible', 'thick peel that is bitter and used dried sometimes']})
df['peel'] = df['type'].str.extract(r"(?<=\bpeel\b)(.*)$")
print(df['peel'])

Output:

0                              that is edible
1     that is bitter and used dried sometimes
Name: peel, dtype: object
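
Note that a lookbehind pattern like the one above keeps the space that follows "peel" in the captured text. The following is a minimal sketch of a variant, assuming you want the result without that leading space and a missing value where "peel" does not occur. The third row ("banana") is a hypothetical addition used only to show the no-match case.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["apple", "orange", "banana"],
    "type": [
        "red fruit with red peel that is edible",
        "thick peel that is bitter and used dried sometimes",
        "soft yellow fruit",  # hypothetical row that does not contain "peel"
    ],
})

# \bpeel\b matches "peel" as a whole word; \s* consumes the whitespace after it,
# so the capture group starts at the next word. expand=False returns a Series.
df["peel"] = df["type"].str.extract(r"\bpeel\b\s*(.*)$", expand=False)

print(df["peel"])
```

Rows that do not contain "peel" come back as NaN, so they can be filled afterwards with fillna if a placeholder such as 'not found' is preferred.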

Answer 2

Please try the following.

Creating the df:

df = pd.DataFrame({'name':['apple','orange'],
                   'type':['red fruit with red peel that is edible','thick peel that is bitter and used dried sometimes']})

Code to add the new column:

df['peel']=df['type'].replace(regex=True,to_replace=r'.*peel(.*)',value=r'\1')
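
Two things are worth knowing about the replace approach: the greedy `.*` matches up to the last "peel" in the string, and rows without "peel" are left unchanged rather than marked as missing. If a regular expression is not needed, a regex-free alternative is str.partition. This is only a sketch of a different technique, not the method used in this answer, and the third row is a hypothetical addition to show the no-match case.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["apple", "orange", "banana"],
    "type": [
        "red fruit with red peel that is edible",
        "thick peel that is bitter and used dried sometimes",
        "soft yellow fruit",  # hypothetical row that does not contain "peel"
    ],
})

# str.partition splits each string at the first occurrence of the separator
# and returns three columns: text before (0), the separator (1), text after (2).
parts = df["type"].str.partition("peel")

# Column 2 is an empty string when the separator is not found.
df["peel"] = parts[2].str.strip()

print(df["peel"].tolist())
```

Unlike the regular-expression versions, this matches "peel" as a plain substring, so a word such as "peeled" would also trigger the split.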

How do I extract all the text after a matched pattern in pandas?