Question

I'm trying to explore this dataset using pandas 0.20.3 with Python 3.6.2.

%pylab inline
import pandas as pd
df = pd.read_csv('OnlineNewsPopularity.csv')
df['n_tokens_content'][:9]

The last line raises an error:

KeyError                                  Traceback (most recent call last)
~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2441             try:
-> 2442                 return self._engine.get_loc(key)
   2443             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 'n_tokens_content'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
 in ()
----> 1 df['n_tokens_content'][:9]

~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1962             return self._getitem_multilevel(key)
   1963         else:
-> 1964             return self._getitem_column(key)
   1965
   1966     def _getitem_column(self, key):

~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   1969         # get column
   1970         if self.columns.is_unique:
-> 1971             return self._get_item_cache(key)
   1972
   1973         # duplicate columns & possible reduce dimensionality

~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1643         res = cache.get(item)
   1644         if res is None:
-> 1645             values = self._data.get(item)
   1646             res = self._box_item_values(item, values)
   1647             cache[item] = res

~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3588
   3589             if not isnull(item):
-> 3590                 loc = self.items.get_loc(item)
   3591             else:
   3592                 indexer = np.arange(len(self.items))[isnull(self.items)]

~/anaconda3/envs/tf11/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2442                 return self._engine.get_loc(key)
   2443             except KeyError:
-> 2444                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2445
   2446         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 'n_tokens_content'

I think this is caused by certain rows in the CSV file, since the same code works fine on other CSVs.

If so, how can I efficiently locate the bad rows?

Answer 1

If you print the columns with df.columns, you'll see that ' n_tokens_content' has a leading space.

Input: df.columns

Output:

Index(['url', ' timedelta', ' n_tokens_title', ' n_tokens_content',
   ' n_unique_tokens', ' n_non_stop_words', ' n_non_stop_unique_tokens',
   ' num_hrefs', ' num_self_hrefs', ' num_imgs', ' num_videos',
   ' average_token_length', ' num_keywords', ' data_channel_is_lifestyle',
   ' data_channel_is_entertainment', ' data_channel_is_bus',
   ' data_channel_is_socmed', ' data_channel_is_tech',
   ' data_channel_is_world', ' kw_min_min', ' kw_max_min', ' kw_avg_min',
   ' kw_min_max', ' kw_max_max', ' kw_avg_max', ' kw_min_avg',
   ' kw_max_avg', ' kw_avg_avg', ' self_reference_min_shares',
   ' self_reference_max_shares', ' self_reference_avg_sharess',
   ' weekday_is_monday', ' weekday_is_tuesday', ' weekday_is_wednesday',
   ' weekday_is_thursday', ' weekday_is_friday', ' weekday_is_saturday',
   ' weekday_is_sunday', ' is_weekend', ' LDA_00', ' LDA_01', ' LDA_02',
   ' LDA_03', ' LDA_04', ' global_subjectivity',
   ' global_sentiment_polarity', ' global_rate_positive_words',
   ' global_rate_negative_words', ' rate_positive_words',
   ' rate_negative_words', ' avg_positive_polarity',
   ' min_positive_polarity', ' max_positive_polarity',
   ' avg_negative_polarity', ' min_negative_polarity',
   ' max_negative_polarity', ' title_subjectivity',
   ' title_sentiment_polarity', ' abs_title_subjectivity',
   ' abs_title_sentiment_polarity', ' shares'],
  dtype='object')

So the input should be: df[' n_tokens_content'][:9]

Output:

0     219
1     255
2     211
3     531
4    1072
5     370
6     960
7     989
8      97
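Instead of typing the leading space every time, the column names can be cleaned once after loading. The following is a minimal sketch, assuming the only problem is whitespace around the header names. The two-row sample is made up for illustration and only imitates the header style of the real file.

```python
import io
import pandas as pd

# Hypothetical sample that imitates the file's header style
# (a space after each comma); the rows are not taken from the dataset.
sample = "url, timedelta, n_tokens_content\na, 731.0, 219\nb, 731.0, 255\n"

df = pd.read_csv(io.StringIO(sample))

# Right after loading, the column name still carries the leading space.
has_space_before = " n_tokens_content" in df.columns

# Strip surrounding whitespace from every column name.
df.columns = df.columns.str.strip()

# The original lookup from the question now works unchanged.
first_rows = df["n_tokens_content"][:9]
print(first_rows)
```

With the real file, the same df.columns.str.strip() line would go right after the pd.read_csv('OnlineNewsPopularity.csv') call.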

Answer 2

I ran into the same problem and solved it.

Input: df.columns

Output:

 Index(['url', ' timedelta', ' n_tokens_title', ' n_tokens_content',
       ' n_unique_tokens', ' n_non_stop_words', ' n_non_stop_unique_tokens',
       ' num_hrefs', ' num_self_hrefs', ' num_imgs', ' num_videos',
       ' average_token_length', ' num_keywords', ' data_channel_is_lifestyle',
       ' data_channel_is_entertainment', ' data_channel_is_bus',
       ' data_channel_is_socmed', ' data_channel_is_tech',
       ' data_channel_is_world', ' kw_min_min', ' kw_max_min', ' kw_avg_min',
       ' kw_min_max', ' kw_max_max', ' kw_avg_max', ' kw_min_avg',
       ' kw_max_avg', ' kw_avg_avg', ' self_reference_min_shares',
       ' self_reference_max_shares', ' self_reference_avg_sharess',
       ' weekday_is_monday', ' weekday_is_tuesday', ' weekday_is_wednesday',
       ' weekday_is_thursday', ' weekday_is_friday', ' weekday_is_saturday',
       ' weekday_is_sunday', ' is_weekend', ' LDA_00', ' LDA_01', ' LDA_02',
       ' LDA_03', ' LDA_04', ' global_subjectivity',
       ' global_sentiment_polarity', ' global_rate_positive_words',
       ' global_rate_negative_words', ' rate_positive_words',
       ' rate_negative_words', ' avg_positive_polarity',
       ' min_positive_polarity', ' max_positive_polarity',
       ' avg_negative_polarity', ' min_negative_polarity',
       ' max_negative_polarity', ' title_subjectivity',
       ' title_sentiment_polarity', ' abs_title_subjectivity',
       ' abs_title_sentiment_polarity', ' shares'],
      dtype='object')

You'll find that the column "n_tokens_title" is actually named " n_tokens_title". Note the space in front of n_tokens_title, and include that space in your code.
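Since the KeyError comes from the header rather than from any data row, one way to confirm this is to list the column names that differ from their stripped form. The sketch below does that, then re-reads the data with read_csv's skipinitialspace=True option so the space after each delimiter is dropped at load time. The sample text is hypothetical and stands in for the real CSV.

```python
import io
import pandas as pd

# Hypothetical sample imitating the header style of the real file.
sample = "url, timedelta, n_tokens_title\na, 731.0, 12\nb, 731.0, 9\n"

df = pd.read_csv(io.StringIO(sample))

# Column names with leading or trailing whitespace are the likely culprits.
padded = [name for name in df.columns if name != name.strip()]
print(padded)

# Re-read, asking the parser to skip spaces that follow the delimiter.
clean = pd.read_csv(io.StringIO(sample), skipinitialspace=True)
print(list(clean.columns))
```

If the first list is non-empty, the lookup failed because of the header names, not because of bad rows.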

Slicing a pandas DataFrame raises KeyError: 'n_tokens_content'. How can I efficiently locate the bad rows?
