Question

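I am working with the Online Retail dataset.

There is a column named InvoiceNo that holds the invoice code. If this code starts with the letter 'c', it indicates a cancellation.

I want to select the rows whose InvoiceNo contains 'C'.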

import pandas as pd
import numpy as np
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx'
retail_df = pd.read_excel(url)
temp_df = retail_df[retail_df['InvoiceNo'].str.contains('c')]

I get an error:

ValueError                                Traceback (most recent call last)
<ipython-input-29-e1f6cb12695b> in <module>()
----> 1 temp_df = retail_df[retail_df['InvoiceNo'].str.contains('c')]

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1956         if isinstance(key, (Series, np.ndarray, Index, list)):
   1957             # either boolean or fancy integer index
-> 1958             return self._getitem_array(key)
   1959         elif isinstance(key, DataFrame):
   1960             return self._getitem_frame(key)

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_array(self, key)
   1983     def _getitem_array(self, key):
   1984         # also raises Exception if object array with NA values
-> 1985         if com.is_bool_indexer(key):
   1986             # warning here just in case -- previously __setitem__ was
   1987             # reindexing but __getitem__ was not; it seems more reasonable to

~/anaconda3/lib/python3.6/site-packages/pandas/core/common.py in is_bool_indexer(key)
    187             if not lib.is_bool_array(key):
    188                 if isnull(key).any():
--> 189                     raise ValueError('cannot index with vector containing '
    190                                      'NA / NaN values')
    191                 return False

ValueError: cannot index with vector containing NA / NaN values

However, the InvoiceNo column does not contain any NA values:

retail_df['InvoiceNo'].isnull().sum()

Output: 0

So I don't understand why it doesn't work.

I also tried:

retail_df['order_canceled'] = retail_df['InvoiceNo'].apply(lambda x:int('C' in x))

and got this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-28-e82a12535b70> in <module>()
----> 1 retail_df['order_canceled'] = retail_df['InvoiceNo'].apply(lambda x:int('C' in x))

~/anaconda3/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   2353             else:
   2354                 values = self.asobject
-> 2355                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2356 
   2357         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()

<ipython-input-28-e82a12535b70> in <lambda>(x)
----> 1 retail_df['order_canceled'] = retail_df['InvoiceNo'].apply(lambda x:int('C' in x))

TypeError: argument of type 'int' is not iterable

How can I do this?

Answer 1

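The InvoiceNo column contains both numbers and strings, so .str.contains returns NaN for the numeric entries and 'C' in x fails on the integers. Convert the column to strings first: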

In [22]: retail_df[retail_df['InvoiceNo'].astype(str).str.contains('C')]
Out[22]:
       InvoiceNo StockCode                          Description  Quantity         InvoiceDate  UnitPrice  CustomerID  \
141      C536379         D                             Discount        -1 2010-12-01 09:41:00      27.50     14527.0
154      C536383    35004C      SET OF 3 COLOURED  FLYING DUCKS        -1 2010-12-01 09:49:00       4.65     15311.0
235      C536391     22556       PLASTERS IN TIN CIRCUS PARADE        -12 2010-12-01 10:24:00       1.65     17548.0
236      C536391     21984     PACK OF 12 PINK PAISLEY TISSUES        -24 2010-12-01 10:24:00       0.29     17548.0
237      C536391     21983     PACK OF 12 BLUE PAISLEY TISSUES        -24 2010-12-01 10:24:00       0.29     17548.0
238      C536391     21980    PACK OF 12 RED RETROSPOT TISSUES        -24 2010-12-01 10:24:00       0.29     17548.0
239      C536391     21484          CHICK GREY HOT WATER BOTTLE       -12 2010-12-01 10:24:00       3.45     17548.0
240      C536391     22557     PLASTERS IN TIN VINTAGE PAISLEY        -12 2010-12-01 10:24:00       1.65     17548.0
241      C536391     22553               PLASTERS IN TIN SKULLS       -24 2010-12-01 10:24:00       1.65     17548.0
939      C536506     22960             JAM MAKING SET WITH JARS        -6 2010-12-01 12:38:00       4.25     17897.0

...
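
To see why both attempts in the question fail, here is a minimal sketch on a small, made-up frame (the four rows below are hypothetical stand-ins for the real data, assuming Excel's numeric cells are loaded as int and the cancelled codes as str). It also shows two possible fixes: passing na=False to .str.contains, or converting to str first as above. Since the dataset description says the code starts with the letter, startswith may be a tighter match than contains, but that is a judgment call.

```python
import pandas as pd

# Hypothetical miniature of the dataset: plain invoices are ints,
# cancelled invoices are strings starting with "C".
retail_df = pd.DataFrame({
    "InvoiceNo": [536365, "C536379", 536366, "C536383"],
    "Quantity": [6, -1, 8, -1],
})

# The column itself has no missing values ...
n_missing_in_column = retail_df["InvoiceNo"].isnull().sum()

# ... but .str methods return NaN for every non-string element,
# so the *mask* contains NaN and cannot be used for boolean indexing.
raw_mask = retail_df["InvoiceNo"].str.contains("C")
n_missing_in_mask = raw_mask.isnull().sum()

# Fix 1: tell .str.contains to treat those NaN results as False.
cancelled_a = retail_df[retail_df["InvoiceNo"].str.contains("C", na=False)]

# Fix 2: convert everything to str first, then test the prefix.
is_cancelled = retail_df["InvoiceNo"].astype(str).str.startswith("C")
cancelled_b = retail_df[is_cancelled]

# The 0/1 flag from the question, without apply: cast the boolean mask to int.
retail_df["order_canceled"] = is_cancelled.astype(int)

print(n_missing_in_column, n_missing_in_mask)
print(cancelled_b)
print(retail_df)
```

If you prefer to keep the apply version, wrapping the value as str(x) inside the lambda should avoid the TypeError for the same reason.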
