How do I replace the values of a pandas.Series with stemmed sentences?

Time: 2018-05-05 12:51:01

Tags: python pandas nlp

Here I have a pandas.Series named 'traindata':

    0       Published: 4:53AM Friday August 29, 2014 Sourc...
    1       8  Have your say\n\n\nPlaying low-level club c...
    2       Rohit Shetty has now turned producer. But the ...
    3       A TV reporter in Serbia almost lost her job be...
    4       THE HAGUE -- Tony de Brum was 9 years old in 1...
    5       Australian TV cameraman Harry Burton was kille...
    6       President Barack Obama sharply rebuked protest...
    7       The car displaying the DIE FOR SYRIA! sticker....
    8       \nIf you've ever been, you know that seeing th...
    9       \nThe former executive director of JBWere has ...
    10      Waterloo Road actor Joe Slater has revealed hi...
                        ... 
    Name: traindata, Length: 2284, dtype: object

What I want to do is replace series.values with the stemmed sentences.

My idea is to build a new Series and put the stemmed sentences into it. My code is below:

    from nltk.stem.porter import PorterStemmer

    stem_word_data = np.zeros([2284,1])
    ps = PorterStemmer()
    for i in range(0,len(traindata)):
        tst = word_tokenize(traindata[i])
        for word in tst:
            word = ps.stem(word)
            stem_word_data[i] = word

Running this code raises an error.

Does anyone know how to fix this error, or have a better idea of how to replace series.values with stemmed sentences? Thanks.

1 answer:

Answer 0: (score: 0)

You can use apply on the Series and avoid writing a loop.

    import pandas as pd
    from nltk import word_tokenize
    from nltk.stem import PorterStemmer

    ## initialise stemmer class
    pst = PorterStemmer()

    ## sample data frame
    df = pd.DataFrame({'senten': ['I am not dancing','You are playing']})

    ## apply here: tokenize each sentence, then stem every token and rejoin
    df['senten'] = df['senten'].apply(word_tokenize)
    df['senten'] = df['senten'].apply(lambda x: ' '.join([pst.stem(y) for y in x]))

    print(df)

              senten
    0  I am not danc
    1   you are play
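
The same idea can be applied directly to a Series such as `traindata`, without going through a DataFrame. The sketch below is only an illustration under a few assumptions: the two-row `traindata` is a made-up stand-in for the real data, `stem_sentence` is a hypothetical helper name, and `str.split()` replaces `word_tokenize` so that no tokenizer data has to be downloaded. It also shows one likely reason the loop in the question fails: `np.zeros` creates a float array, which cannot hold a string, and the inner loop would in any case overwrite row `i` with each word in turn instead of building a sentence.

```python
import numpy as np
import pandas as pd
from nltk.stem import PorterStemmer

# Hypothetical stand-in for the asker's `traindata` Series.
traindata = pd.Series(
    ["he was dancing all night", "they kept playing outside"],
    name="traindata",
)

# A float array cannot store a stemmed word, so this assignment fails.
float_buffer = np.zeros([2, 1])
try:
    float_buffer[0] = "danc"
    assignment_failed = False
except ValueError:
    assignment_failed = True

ps = PorterStemmer()


def stem_sentence(text):
    """Stem every whitespace-separated token and rejoin them into one string."""
    # str.split() is a simplification; nltk.word_tokenize can be used instead
    # if the tokenizer data is installed.
    return " ".join(ps.stem(token) for token in text.split())


# apply() returns a new Series with the same index as the original.
stemmed = traindata.apply(stem_sentence)
print(stemmed)
```

Assigning the result back with `traindata = traindata.apply(stem_sentence)` would replace the original values rather than keep a separate copy.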