Getting IndexError: string index out of range when using apply

Time: 2016-07-17 11:56:35

Tags: python nltk

I want to select the most frequent nouns from my data frame by:

  1. Separating out the nouns in each row of my data.
  2. Storing them in a new column named train['token'].

To do this, I pass my function to apply, but I get this error:

    IndexError: string index out of range

Here is my code:

    import pandas as pd
    import numpy as np
    import nltk
    
    train= pd.read_csv(r'C:\Users\JKC\Downloads\classification_train.csv',names=['product_title','brand_id','category_id'])
    
    train['product_title'] = train['product_title'].apply(lambda x: x.lower())
    
    def preprocessing(x):
        tokens = nltk.pos_tag(x.split(" "))
        list=[]
        for y,x in tokens:
            if(x=="NN" or x=="NNS" or x=="NNP" or x=="NNPS"):
                list.append(y)
        return(' '.join(list))
    # My function works fine if I use preprocessing(train['product_title'][1])    
    
    
    
    train['token'] = train['product_title'].apply(preprocessing,1)
    

Traceback:

    IndexError                                Traceback (most recent call last)
    <ipython-input-53-f9f247eec617> in <module>()
         10 
         11 
    ---> 12 train['token'] = train['product_title'].apply(preprocessing,1)
         13 
    
    C:\Users\JKC\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
       2235             values = lib.map_infer(values, boxer)
       2236 
    -> 2237         mapped = lib.map_infer(values, f, convert=convert_dtype)
       2238         if len(mapped) and isinstance(mapped[0], Series):
       2239             from pandas.core.frame import DataFrame
    
    pandas\src\inference.pyx in pandas.lib.map_infer (pandas\lib.c:63043)()
    
    <ipython-input-53-f9f247eec617> in preprocessing(x)
          1 def preprocessing(x):
    ----> 2         tokens = nltk.pos_tag(x.split(" "))
          3         list=[]
          4         for y,x in tokens:
          5                 if(x=="NN" or x=="NNS" or x=="NNP" or x=="NNPS"):
    
    C:\Users\JKC\Anaconda3\lib\site-packages\nltk\tag\__init__.py in pos_tag(tokens, tagset)
        109     """
        110     tagger = PerceptronTagger()
    --> 111     return _pos_tag(tokens, tagset, tagger)
        112 
        113 
    
    C:\Users\JKC\Anaconda3\lib\site-packages\nltk\tag\__init__.py in _pos_tag(tokens, tagset, tagger)
         80 
         81 def _pos_tag(tokens, tagset, tagger):
    ---> 82     tagged_tokens = tagger.tag(tokens)
         83     if tagset:
         84         tagged_tokens = [(token, map_tag('en-ptb', tagset, tag)) for (token, tag) in tagged_tokens]
    
    C:\Users\JKC\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in tag(self, tokens)
        150         output = []
        151 
    --> 152         context = self.START + [self.normalize(w) for w in tokens] + self.END
        153         for i, word in enumerate(tokens):
        154             tag = self.tagdict.get(word)
    
    C:\Users\JKC\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in <listcomp>(.0)
        150         output = []
        151 
    --> 152         context = self.START + [self.normalize(w) for w in tokens] + self.END
        153         for i, word in enumerate(tokens):
        154             tag = self.tagdict.get(word)
    
    C:\Users\JKC\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in normalize(self, word)
        224         elif word.isdigit() and len(word) == 4:
        225             return '!YEAR'
    --> 226         elif word[0].isdigit():
        227             return '!DIGITS'
        228         else:
    
    IndexError: string index out of range
    
    Data:
                                               product_title brand_id category_id
        0  120gb hard disk drive with 3 years warranty fo...     3950           8
        1  toshiba satellite l305-s5919 laptop lcd screen...    35099         324
        2  hobby-ace pixhawk px4 rgb external led indicat...    21822         510
        3                                  pelicans mousepad    44629         260
        4    p4648-60029 hewlett-packard tc2100 system board    42835          68
    

There are no null rows in my data:

    train.isnull().sum()
    Out[12]: 
    product_title    0
    brand_id         0
    category_id      0
    dtype: int64
    

1 answer:

Answer 0 (score: 8)

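Your input contains two or more consecutive spaces in some places. When you split it with x.split(" "), you get zero-length "words" between the adjacent spaces. The tagger's normalize step then evaluates word[0] on one of those empty strings, which is where the IndexError in your traceback comes from.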
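The effect can be reproduced without NLTK. The following is a minimal sketch: the title string is made up for illustration, and first_char_is_digit is a hypothetical stand-in for the word[0].isdigit() check shown in the traceback, not NLTK's actual code.

```python
# Illustrative title containing two consecutive spaces (made up for this sketch).
title = "pelicans  mousepad"

# Splitting on a single space leaves an empty string between the adjacent spaces.
with_empty = title.split(" ")
print(with_empty)  # ['pelicans', '', 'mousepad']


def first_char_is_digit(word):
    """Hypothetical stand-in for the word[0].isdigit() check in the traceback."""
    return word[0].isdigit()


# Indexing the empty token raises the same error as in the question.
try:
    [first_char_is_digit(w) for w in with_empty]
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print("IndexError:", error_message)  # IndexError: string index out of range
```

This also explains why isnull() reports nothing: the rows are not null, they just contain doubled spaces.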

Fix it by splitting with x.split(), which treats any run of consecutive whitespace characters as a single token separator.
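A corrected version of the function might look like the sketch below. The names extract_nouns, NOUN_TAGS and toy_tagger are assumptions introduced here, and the tagger is passed in as a parameter only so that the example runs without NLTK's model data; toy_tagger is a crude stand-in, not a real part-of-speech tagger.

```python
# Noun tags checked in the question's code.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def extract_nouns(title, tagger):
    """Return the nouns in title, joined by single spaces.

    tagger is any callable that maps a list of tokens to (token, tag) pairs,
    for example nltk.pos_tag.
    """
    # split() with no argument never produces empty tokens.
    tokens = title.split()
    return " ".join(word for word, tag in tagger(tokens) if tag in NOUN_TAGS)


def toy_tagger(tokens):
    """Hypothetical stand-in for nltk.pos_tag, used only for this demo.

    It indexes tok[0], so it would fail on empty tokens just as the real
    tagger does in the question.
    """
    return [(tok, "CD" if tok[0].isdigit() else "NN") for tok in tokens]


# The doubled space no longer causes an error.
result = extract_nouns("120gb  hard disk", toy_tagger)
print(result)  # hard disk
```

With NLTK installed, the same function could presumably be applied to the column as train['product_title'].apply(extract_nouns, tagger=nltk.pos_tag), since Series.apply forwards extra keyword arguments to the function.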