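I am working with the Online Retail dataset.
It has a column named InvoiceNo that holds the invoice code. If this code starts with the letter 'C', the invoice is a cancellation.
I want to select the rows whose InvoiceNo contains 'C'.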
import pandas as pd
import numpy as np
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx'
retail_df = pd.read_excel(url)
temp_df = retail_df[retail_df['InvoiceNo'].str.contains('c')]
I get an error:
ValueError Traceback (most recent call last)
<ipython-input-29-e1f6cb12695b> in <module>()
----> 1 temp_df = retail_df[retail_df['InvoiceNo'].str.contains('c')]
~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
1956 if isinstance(key, (Series, np.ndarray, Index, list)):
1957 # either boolean or fancy integer index
-> 1958 return self._getitem_array(key)
1959 elif isinstance(key, DataFrame):
1960 return self._getitem_frame(key)
~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_array(self, key)
1983 def _getitem_array(self, key):
1984 # also raises Exception if object array with NA values
-> 1985 if com.is_bool_indexer(key):
1986 # warning here just in case -- previously __setitem__ was
1987 # reindexing but __getitem__ was not; it seems more reasonable to
~/anaconda3/lib/python3.6/site-packages/pandas/core/common.py in is_bool_indexer(key)
187 if not lib.is_bool_array(key):
188 if isnull(key).any():
--> 189 raise ValueError('cannot index with vector containing '
190 'NA / NaN values')
191 return False
ValueError: cannot index with vector containing NA / NaN values
However, the InvoiceNo column does not contain any NA values:
retail_df['InvoiceNo'].isnull().sum()
Output: 0
So I don't understand why it doesn't work.
I also tried:
retail_df['order_canceled'] = retail_df['InvoiceNo'].apply(lambda x:int('C' in x))
and got this error:
TypeError Traceback (most recent call last)
<ipython-input-28-e82a12535b70> in <module>()
----> 1 retail_df['order_canceled'] = retail_df['InvoiceNo'].apply(lambda x:int('C' in x))
~/anaconda3/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
2353 else:
2354 values = self.asobject
-> 2355 mapped = lib.map_infer(values, f, convert=convert_dtype)
2356
2357 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-28-e82a12535b70> in <lambda>(x)
----> 1 retail_df['order_canceled'] = retail_df['InvoiceNo'].apply(lambda x:int('C' in x))
TypeError: argument of type 'int' is not iterable
How can I do this?
Answer 0 (score: 2)
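The InvoiceNo column contains both numbers and strings. For the non-string values, .str.contains() returns NaN, which is why the resulting mask cannot be used for boolean indexing even though the column itself has no missing values. Convert the column to strings first: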
In [22]: retail_df[retail_df['InvoiceNo'].astype(str).str.contains('C')]
Out[22]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID \
141 C536379 D Discount -1 2010-12-01 09:41:00 27.50 14527.0
154 C536383 35004C SET OF 3 COLOURED FLYING DUCKS -1 2010-12-01 09:49:00 4.65 15311.0
235 C536391 22556 PLASTERS IN TIN CIRCUS PARADE -12 2010-12-01 10:24:00 1.65 17548.0
236 C536391 21984 PACK OF 12 PINK PAISLEY TISSUES -24 2010-12-01 10:24:00 0.29 17548.0
237 C536391 21983 PACK OF 12 BLUE PAISLEY TISSUES -24 2010-12-01 10:24:00 0.29 17548.0
238 C536391 21980 PACK OF 12 RED RETROSPOT TISSUES -24 2010-12-01 10:24:00 0.29 17548.0
239 C536391 21484 CHICK GREY HOT WATER BOTTLE -12 2010-12-01 10:24:00 3.45 17548.0
240 C536391 22557 PLASTERS IN TIN VINTAGE PAISLEY -12 2010-12-01 10:24:00 1.65 17548.0
241 C536391 22553 PLASTERS IN TIN SKULLS -24 2010-12-01 10:24:00 1.65 17548.0
939 C536506 22960 JAM MAKING SET WITH JARS -6 2010-12-01 12:38:00 4.25 17897.0
...
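
To see the cause in isolation, here is a minimal sketch on a small made-up frame rather than the real dataset. The name toy_df and its values are assumptions for illustration only; the idea is that regular invoices were read as integers and cancellations as strings. It shows the NaN entries in the raw mask, then three ways to get a clean boolean mask, and finally the 0/1 flag column the question was trying to build with apply.

```python
import pandas as pd

# Hypothetical miniature of the InvoiceNo column: regular invoices were
# read as integers, cancellations as strings starting with 'C'.
toy_df = pd.DataFrame({
    'InvoiceNo': [536365, 'C536379', 536381, 'C536383'],
    'Quantity': [6, -1, 2, -1],
})

# .str.contains yields NaN for non-string elements, so this mask is not
# purely boolean even though the column has no missing values.
raw_mask = toy_df['InvoiceNo'].str.contains('C')
n_missing = int(raw_mask.isna().sum())

# Option 1: cast every value to str before matching (the approach above).
mask_cast = toy_df['InvoiceNo'].astype(str).str.contains('C')

# Option 2: keep the column as is and treat non-matches/NaN as False.
mask_na = toy_df['InvoiceNo'].str.contains('C', na=False)

# Option 3: match only a leading 'C', which is stricter than contains.
mask_prefix = toy_df['InvoiceNo'].astype(str).str.startswith('C')

# Select the cancelled rows and build the 0/1 flag from the question.
canceled = toy_df[mask_cast]
toy_df['order_canceled'] = mask_cast.astype(int)
```

The same flag can also be produced with the original apply call by converting each value first, i.e. lambda x: int('C' in str(x)). Whether contains or startswith is the better test depends on whether a 'C' can legitimately appear elsewhere in an invoice code, which the question does not say.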