Question

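I have a column of ingredients in a pandas DataFrame. I need to remove everything except the ingredient name (for example: 1/3 cup cashews > cashews).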

Input

    recipe_name                                ingredient
0   Truvani Chocolate Turmeric Caramel Cups    ⅓ cup cashews
1   Truvani Chocolate Turmeric Caramel Cups    4 dates
2   Truvani Chocolate Turmeric Caramel Cups    1 tablespoon almond butter
3   Truvani Chocolate Turmeric Caramel Cups    3 tablespoons coconut milk
4   Truvani Chocolate Turmeric Caramel Cups    ½ teaspoon vanilla extract

Expected output

    recipe_name                                ingredient
0   Truvani Chocolate Turmeric Caramel Cups    cashews
1   Truvani Chocolate Turmeric Caramel Cups    dates
2   Truvani Chocolate Turmeric Caramel Cups    almond butter
3   Truvani Chocolate Turmeric Caramel Cups    coconut milk
4   Truvani Chocolate Turmeric Caramel Cups    vanilla extract

I tried using a dictionary that maps common words to empty strings, like this:

remove_list ={'\d+': '', 'ounces': '', 'ounce': '', 'tablespoons': '', 'tablespoon': '', 'teaspoons': '', 'teaspoon': '', 'cup': '', 'cups': ''}
column = df['ingredient']
column.apply(lambda column: [remove_list[y] if y in remove_list else y for y in column])

This did not change the data at all.

I also tried using a regular expression:

df['ingredients'] = re.sub(r'|'.join(map(re.escape, remove_list)), '', df['ingredients'])

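But this just raises an error: "TypeError: expected string or buffer."

I'm still quite new to Python. I assume this is possible with a regular expression, but I'm not sure how to do it.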

Answer 1

Since you are replacing everything with the same string, just put the patterns in a list.

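# Raw strings keep the regex escapes intact; 'cups' is listed before 'cup'
# so the longer word is tried first and no stray 's' is left behind.
l = [r'\d+', r'[^\x00-\x80]+', 'ounces', 'ounce', 'tablespoons',
     'tablespoon', 'teaspoons', 'teaspoon', 'cups', 'cup']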

Then use a single replace on all of them, joined with '|'.

df.ingredient.str.replace('|'.join(l), '', regex=True).str.strip()
# Safer to replace only stand-alone words (each pattern followed by whitespace); strip is then not needed
#df.ingredient.str.replace('|'.join([x + r'\s' for x in l]), '', regex=True)

Output:

0            cashews
1              dates
2      almond butter
3       coconut milk
4    vanilla extract
Name: ingredient, dtype: object

I added '[^\x00-\x80]+' to the list to remove the fraction characters, and .str.strip removes the leading or extra whitespace left over after the replacement.
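
The "stand-alone words" idea from the comment above can also be sketched with word boundaries in plain `re`. This is only an illustration, not part of the original answer: the names `UNITS`, `PATTERN` and `strip_measurements` are made up here, and the unit list is assumed to be the same one used in the answer.

```python
import re

# Assumed unit list, taken from the answer above.
UNITS = ["ounces", "ounce", "tablespoons", "tablespoon",
         "teaspoons", "teaspoon", "cups", "cup"]

# Digits, any non-ASCII run (e.g. the fraction glyphs), or a unit as a whole word.
# \b ... \b keeps "cup" from matching inside a longer word such as "cupcake".
PATTERN = re.compile(
    r"\d+|[^\x00-\x7F]+|\b(?:" + "|".join(map(re.escape, UNITS)) + r")\b"
)

def strip_measurements(text):
    """Remove quantities and unit words, then collapse leftover whitespace."""
    cleaned = PATTERN.sub("", text)
    return " ".join(cleaned.split())

print(strip_measurements("⅓ cup cashews"))
print(strip_measurements("2 cups cupcake mix"))
```

The function works on one string at a time, so on a DataFrame it could be applied element-wise, for example with df['ingredient'].map(strip_measurements).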

Answer 2

pandas has a set of built-in string functions for this.

Something like this should work:

df['ingredient'] = df['ingredient'].str.replace(r'\d+', '', regex=True)

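I don't know whether you can pass a dictionary directly; you may have to loop over the dictionary to apply all the replacements you want.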

for ptn, rpl in remove_list.items():
    df['ingredient'] = df['ingredient'].str.replace(ptn, rpl, regex=True)
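
On the dictionary question: as far as I know, Series.replace (as opposed to Series.str.replace) accepts a dict of pattern-to-replacement pairs when regex=True. The sketch below assumes that behaviour. The `remove_map` name and the sample frame are made up for illustration, and the unit words are folded into one whole-word pattern so that 'cup' cannot leave a stray 's' from 'cups'.

```python
import pandas as pd

df = pd.DataFrame({
    "ingredient": [
        "⅓ cup cashews",
        "4 dates",
        "1 tablespoon almond butter",
        "3 tablespoons coconut milk",
        "½ teaspoon vanilla extract",
    ]
})

# Hypothetical mapping: each key is a regular expression, each value its replacement.
remove_map = {
    r"\d+": "",                                           # plain digits
    r"[^\x00-\x7F]+": "",                                 # non-ASCII, e.g. fraction glyphs
    r"\b(?:ounces?|tablespoons?|teaspoons?|cups?)\b": "",  # unit words, whole words only
}

# Series.replace with regex=True substitutes matching substrings;
# str.strip then removes the whitespace left at the edges.
cleaned = df["ingredient"].replace(remove_map, regex=True).str.strip()
print(cleaned.tolist())
```

The three patterns do not overlap, so the result should not depend on the order in which they are applied.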

Answer 3

You can use a loop and the .split() method:

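for i, row in df['ingredient'].items():
    # Split off the leading quantity and keep the rest of the string
    item = row.split(sep=' ', maxsplit=1)
    # Assign through df.loc to avoid chained assignment
    df.loc[i, 'ingredient'] = item[1]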

The output will be:

    recipe_name                                ingredient
0   Truvani Chocolate Turmeric Caramel Cups    cup cashews
1   Truvani Chocolate Turmeric Caramel Cups    dates
2   Truvani Chocolate Turmeric Caramel Cups    tablespoon almond butter
3   Truvani Chocolate Turmeric Caramel Cups    tablespoons coconut milk
4   Truvani Chocolate Turmeric Caramel Cups    teaspoon vanilla extract

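If you want to keep the measurements, you can create a duplicate column, keeping the values in one column and the ingredients in the other.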
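
One possible way to keep the measurement rather than discard it is to capture it with named groups. This is a rough sketch in plain `re`; the names `PATTERN` and `split_ingredient`, the unit list, and the assumption that the ingredient name starts with an ASCII letter are all mine, not from the answer.

```python
import re

# Assumed unit words; extend as needed.
UNITS = r"(?:ounces?|tablespoons?|teaspoons?|cups?)"

# amount: leading non-letter characters (digits, fraction glyphs, slashes)
# unit:   optional whole-word unit
# name:   the rest, starting at the first ASCII letter
PATTERN = re.compile(
    rf"^(?P<amount>[^A-Za-z]*?)\s*(?P<unit>{UNITS}\b)?\s*(?P<name>[A-Za-z].*)$"
)

def split_ingredient(text):
    """Return (measurement, name); measurement is '' when none is found."""
    text = text.strip()
    match = PATTERN.match(text)
    if match is None:
        return "", text
    parts = (match["amount"].strip(), match["unit"])
    measurement = " ".join(part for part in parts if part)
    return measurement, match["name"]

print(split_ingredient("⅓ cup cashews"))
print(split_ingredient("4 dates"))
```

Each returned pair could then be written to two columns, for example a new measurement column alongside ingredient.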

Remove multiple substrings from a pandas DataFrame column
