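I've run into a problem I can't seem to make sense of. I wrote a function that takes a DataFrame as input and performs a number of cleaning steps on it. When I run the function, I get the error message KeyError: ('amount', 'occurred at index date'). This makes no sense to me, because amount is a column in my DataFrame.
Here is some code, including a portion of the data: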
import string
import pandas as pd

data = pd.DataFrame.from_dict({"date": ["10/31/2019","10/27/2019"], "amount": [-13.3, -6421.25], "vendor": ["publix","verizon"]})

# create cleaning function for dataframe
def cleaning_func(x):
    # convert the amounts to positive numbers
    x['amount'] = x['amount'] * -1
    # convert dates to datetime for subsetting purposes
    x['date'] = pd.to_datetime(x['date'])
    # begin removing certain strings
    x['vendor'] = x['vendor'].str.replace("PURCHASE AUTHORIZED ON ","")
    x['vendor'] = x['vendor'].str.replace("[0-9]","")
    x['vendor'] = x['vendor'].str.replace("PURCHASE WITH CASH BACK $ . AUTHORIZED ON /","")
    # build table of punctuation and remove from vendor strings
    table = str.maketrans(dict.fromkeys(string.punctuation)) # OR {key: None for key in string.punctuation}
    x['vendor'] = x['vendor'].str.translate(table)
    return x

clean_data = data.apply(cleaning_func)  # this call raises the KeyError
If anyone can figure out why this error occurs, I'd greatly appreciate it.
Answer 0 (score: 1)
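Please don't use apply here. It is slow and essentially loops over your DataFrame one column at a time: with the default axis=0, each column is passed to the function as a Series, so x['amount'] looks for an index label named amount instead of a column, and the lookup fails on the first column processed (date). Instead, pass the DataFrame itself to the function and have it return the cleaned DataFrame, so that every step uses vectorized methods on whole columns.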
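To see what apply actually hands to the function, here is a minimal sketch using the sample data from the question. The inspect helper and the seen list are hypothetical names introduced only for this illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "date": ["10/31/2019", "10/27/2019"],
    "amount": [-13.3, -6421.25],
    "vendor": ["publix", "verizon"],
})

seen = []

def inspect(x):
    # Record what apply passes in: the object type, its name, and its index labels.
    seen.append((type(x).__name__, x.name, list(x.index)))
    return x

data.apply(inspect)
for entry in seen:
    print(entry)

# Looking up "amount" inside a single column fails, because the column's
# index holds the row labels 0 and 1, not the column names.
raised = False
try:
    data.apply(lambda col: col["amount"])
except KeyError:
    raised = True
print("KeyError raised:", raised)
```

Each call receives one column as a Series, so inside the function there is no amount label to look up. That is why the error is reported while the date column is being processed.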
import string
import pandas as pd

def cleaning_func(df):
    # convert the amounts to positive numbers
    df['amount'] = df['amount'] * -1
    # convert dates to datetime for subsetting purposes
    df['date'] = pd.to_datetime(df['date'])
    # begin removing certain strings; regex is set explicitly because
    # the default of str.replace differs between pandas versions
    df['vendor'] = df['vendor'].str.replace("PURCHASE AUTHORIZED ON ", "", regex=False)
    df['vendor'] = df['vendor'].str.replace("[0-9]", "", regex=True)
    df['vendor'] = df['vendor'].str.replace("PURCHASE WITH CASH BACK $ . AUTHORIZED ON /", "", regex=False)
    # build table of punctuation and remove from vendor strings
    table = str.maketrans(dict.fromkeys(string.punctuation)) # OR {key: None for key in string.punctuation}
    df['vendor'] = df['vendor'].str.translate(table)
    return df

clean_df = cleaning_func(data)
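The sample data in the question is already clean, so it does not exercise the vendor steps. The sketch below runs the same string operations on two hypothetical raw vendor strings (assumed formats, not taken from the original data) to show what each step contributes. The final str.strip() call is an addition for illustration and is not part of the function above.

```python
import string

import pandas as pd

# Hypothetical raw values; the real bank export may be formatted differently.
vendor = pd.Series([
    "PURCHASE AUTHORIZED ON 10/31 PUBLIX #1234",
    "PURCHASE WITH CASH BACK $ 20.00 AUTHORIZED ON 10/27 KROGER",
])

# Step 1: drop the literal authorization prefix.
cleaned = vendor.str.replace("PURCHASE AUTHORIZED ON ", "", regex=False)
# Step 2: drop every digit (this pattern has to be treated as a regex).
cleaned = cleaned.str.replace("[0-9]", "", regex=True)
# Step 3: drop what remains of the cash-back prefix once the digits are gone.
cleaned = cleaned.str.replace(
    "PURCHASE WITH CASH BACK $ . AUTHORIZED ON /", "", regex=False
)
# Step 4: delete all punctuation characters.
table = str.maketrans(dict.fromkeys(string.punctuation))
cleaned = cleaned.str.translate(table)

print(cleaned.tolist())   # [' PUBLIX ', ' KROGER']

# Optional extra step (not in the function above): trim surrounding whitespace.
stripped = cleaned.str.strip()
print(stripped.tolist())  # ['PUBLIX', 'KROGER']
```

The order of the steps matters: the third replacement only matches after the digits have been removed, because the pattern is written as the prefix looks without its numbers.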