Question

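I'm writing a function to iterate over a column of unstructured recipe ingredients in my dataframe and clean it up by removing special characters and formatting each cell as a list of ingredients (right now each cell is formatted as one big string).

For example, one of the strings looks like this: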

'2½ pounds mixed heirloom tomatoes, cored, sliced ¼-inch thick', '3 tablespoons olive oil', '¾ teaspoon kosher salt, divided, plus more'

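Normally I would just .split(', '), but for some of these strings I need to make sure that things like cored and sliced 1/4-inch thick don't become their own list elements and instead stay attached to the actual ingredient. In this case, for example, I'd want the final list element to be 2 1/2 pounds mixed heirloom tomatoes cored sliced 1/4-inch thick.

To do this, I created a function that makes two passes over each string. The first pass removes special characters and produces a first version of the list; the second pass evaluates whether each list item should be its own element or be appended to the previous element in the list.

Here's the code: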

def ingredient_cleanup(cell):
    # creates working list with special characters removed and splitting list elements on commas 
    first_pass = cell.replace("'",'').replace('[','').replace(']','').replace('¼','.25').replace('½','.5').replace('⅓','.33').replace('¾','.75').replace('⅔','.67').lower().strip().split(', ')
    # empty list for final ingredient list
    final_pass = []
    for i in first_pass:
        # if the first element of the string is a number, add to the final ingredient list as-is 
        # note that this will not pick up formatted fractions like ½
        if i[0].isalpha() == False:
            final_pass.append(i)
        # if the first element of the string is a letter, add the string to the last string in the final list 
        else:
        final_pass[-1] = final_pass[-1] + ' ' + i
    return final_pass

I then tried to run it using apply:

df_rec['ingredients'] = df_rec['ingredients'].apply(ingredient_cleanup)

When I run it, though, I get IndexError: list index out of range at the final_pass.append(i) part. I'm not sure how I could be indexing too far into an empty list.

Answer 1

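I think there's a typo in your question: you can't get an IndexError from calling a list's append method. Usually you get an IndexError when you try to index a list outside its range. Only one line in the ingredient_cleanup function indexes a list: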

final_pass[-1] = final_pass[-1] + ' ' + i

It raises the error when final_pass is empty. Here's a fix for the for loop in your function:

for i in first_pass:
    # if the first element of the string is a number, add to the final ingredient list as-is 
    # note that this will not pick up formatted fractions like ½
    if i[0].isalpha() == False:
        final_pass.append(i)
    # if the first element of the string is a letter, add the string to the last string in the final list 
    elif final_pass:
        final_pass[-1] = final_pass[-1] + ' ' + i
    else:
        final_pass.append(i)
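For reference, here is a minimal sketch of the whole function with that guard folded in. The FRACTIONS dictionary is a hypothetical name that just collects the replacements from the question's replace() chain. The check for empty pieces is an added assumption: an empty piece would make item[0] fail with "string index out of range".

```python
# Hypothetical helper table: the same glyph replacements the question chains with replace().
FRACTIONS = {'¼': '.25', '½': '.5', '⅓': '.33', '¾': '.75', '⅔': '.67'}


def ingredient_cleanup(cell):
    # First pass: strip quotes and brackets, convert fraction glyphs, split on commas.
    for ch in "'[]":
        cell = cell.replace(ch, '')
    for glyph, decimal in FRACTIONS.items():
        cell = cell.replace(glyph, decimal)
    first_pass = cell.lower().strip().split(', ')

    # Second pass: attach descriptive fragments to the preceding ingredient.
    final_pass = []
    for item in first_pass:
        if not item:
            # Assumption: skip empty pieces, since item[0] would raise IndexError.
            continue
        if not item[0].isalpha() or not final_pass:
            # Starts with a digit or '.', or there is nothing to attach to yet.
            final_pass.append(item)
        else:
            final_pass[-1] = final_pass[-1] + ' ' + item
    return final_pass


cell = "'2½ pounds mixed heirloom tomatoes, cored, sliced ¼-inch thick', '3 tablespoons olive oil', '¾ teaspoon kosher salt, divided, plus more'"
result = ingredient_cleanup(cell)
# result == ['2.5 pounds mixed heirloom tomatoes cored sliced .25-inch thick',
#            '3 tablespoons olive oil',
#            '.75 teaspoon kosher salt divided plus more']
```

A cell whose first piece starts with a letter (for example "salt, pepper") now starts a new element instead of raising.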

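That said, I think this function is a bit overkill; you can accomplish the task of splitting the string with a regular expression like this: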

import re

s = "'2½ pounds mixed heirloom tomatoes, cored, sliced ¼-inch thick', '3 tablespoons olive oil', '¾ teaspoon kosher salt, divided, plus more'"
re.findall(r'\'[^\']*\'', s)
# ["'2½ pounds mixed heirloom tomatoes, cored, sliced ¼-inch thick'", "'3 tablespoons olive oil'", "'¾ teaspoon kosher salt, divided, plus more'"]
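If the surrounding quotes are not wanted, a capturing group returns only the text between them. This is a rough sketch that assumes no ingredient contains an apostrophe; dropping the commas afterwards is one possible way to merge the fragments into a single ingredient string, as the question asks.

```python
import re

s = "'2½ pounds mixed heirloom tomatoes, cored, sliced ¼-inch thick', '3 tablespoons olive oil', '¾ teaspoon kosher salt, divided, plus more'"

# The group ([^']*) captures the text between each pair of single quotes.
items = re.findall(r"'([^']*)'", s)

# Remove the commas so "cored" and "sliced ..." stay part of the same ingredient.
cleaned = [item.replace(',', '') for item in items]
# cleaned == ['2½ pounds mixed heirloom tomatoes cored sliced ¼-inch thick',
#             '3 tablespoons olive oil',
#             '¾ teaspoon kosher salt divided plus more']
```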

Answer 2

You shouldn't access an empty list at index [-1] (or any other index):

Python 2.7.14 (default, Mar 14 2018, 13:36:31) 
[GCC 7.3.1 20180303 (Red Hat 7.3.1-5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> l=[]
>>> l[-1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list index out of range
>>>
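Two common ways to avoid this are checking that the list is non-empty before indexing, or catching the exception. A small sketch (the variable names are only illustrative):

```python
items = []

# An empty list is falsy, so this only indexes when there is something to read.
last = items[-1] if items else None

# Alternatively, attempt the access and handle the failure explicitly.
try:
    items[-1]
    raised = False
except IndexError:
    raised = True
```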

Answer 3

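Stepping through your code (using the provided input '2½ pounds mixed heirloom tomatoes, cored, sliced ¼-inch thick', '3 tablespoons olive oil', '¾ teaspoon kosher salt, divided, plus more'), I don't get an error.

The only difference is that I have the correct indentation after the else: statement - I assume that's just an error in the example.

My output for final_pass is ['2.5 pounds mixed heirloom tomatoes cored sliced .25-inch thick', '3 tablespoons olive oil', '.75 teaspoon kosher salt divided plus more']

I'm guessing there's more to your code, or you should try printing the actual row that causes the error. I suspect your else statement may be trying to access the [-1] index before there is anything in final_pass.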
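One way to track down the offending rows is to run the function over the cells one at a time and record which ones raise. The sketch below uses a plain list in place of the dataframe column; find_failing_cells, original_cleanup (a trimmed copy of the question's function with the indentation fixed), and the sample cells are all hypothetical.

```python
def original_cleanup(cell):
    # Trimmed copy of the question's function, without the fraction replacements.
    first_pass = cell.replace("'", '').replace('[', '').replace(']', '').lower().strip().split(', ')
    final_pass = []
    for i in first_pass:
        if not i[0].isalpha():
            final_pass.append(i)
        else:
            final_pass[-1] = final_pass[-1] + ' ' + i
    return final_pass


def find_failing_cells(cells, func):
    # Collect (position, cell, message) for every cell that raises IndexError.
    failures = []
    for position, cell in enumerate(cells):
        try:
            func(cell)
        except IndexError as exc:
            failures.append((position, cell, str(exc)))
    return failures


# Made-up sample cells: one that works, one starting with a letter, one empty.
cells = ["['3 tablespoons olive oil', 'divided']", "['kosher salt', 'pepper']", "[]"]
failures = find_failing_cells(cells, original_cleanup)
for position, cell, message in failures:
    print(position, repr(cell), message)
```

With these sample cells, the second one fails on final_pass[-1] because its first piece starts with a letter, and the empty third one fails on i[0], which is where a "string index out of range" message would come from.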

Getting "string index out of range" error when appending to list (Python)

3 answers: