I have a pandas Series named 'traindata'.
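What I want to do is replace the series values with the stemmed sentences.
My idea is to build a new series and put the stemmed sentences into it. The data looks like this: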
0     Published: 4:53AM Friday August 29, 2014 Sourc...
1     8 Have your say\n\n\nPlaying low-level club c...
2     Rohit Shetty has now turned producer. But the ...
3     A TV reporter in Serbia almost lost her job be...
4     THE HAGUE -- Tony de Brum was 9 years old in 1...
5     Australian TV cameraman Harry Burton was kille...
6     President Barack Obama sharply rebuked protest...
7     The car displaying the DIE FOR SYRIA! sticker....
8     \nIf you've ever been, you know that seeing th...
9     \nThe former executive director of JBWere has ...
10    Waterloo Road actor Joe Slater has revealed hi...
...
Name: traindata, Length: 2284, dtype: object
Running the code below then raises an error:
import numpy as np
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer

stem_word_data = np.zeros([2284,1])
ps = PorterStemmer()
for i in range(0,len(traindata)):
    tst = word_tokenize(traindata[i])
    for word in tst:
        word = ps.stem(word)
        stem_word_data[i] = word
Does anyone know how to fix this error, or have a better idea of how to replace the series values with the stemmed sentences? Thanks.
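A note on the likely cause, offered as a hedged sketch rather than a diagnosis of the exact traceback: `np.zeros` allocates a float64 array, so writing a string into it raises a ValueError. The snippet below reproduces that in isolation. The array size, the token 'publish', and the small token lists are illustrative stand-ins, not values taken from the question's run.

```python
import numpy as np

# np.zeros allocates a float64 array, so its cells cannot hold strings
stem_word_data = np.zeros([3, 1])

error_message = None
try:
    # same kind of assignment as in the question's inner loop
    stem_word_data[0] = "publish"
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", error_message)

# A plain Python list can hold strings, one joined sentence per row.
# The token lists here are hypothetical, already-stemmed examples.
stemmed_rows = []
for tokens in [["publish", "friday"], ["have", "your", "say"]]:
    stemmed_rows.append(" ".join(tokens))
print(stemmed_rows)
```

Even with a container that accepts strings, the loop in the question assigns to row i once per word, so each row would end up holding only the last stemmed word. Collecting the stems per sentence and joining them avoids that.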
Answer 0: (score: 0)
You can use apply on the series and avoid writing a loop.
import pandas as pd
from nltk import word_tokenize
from nltk.stem import PorterStemmer

## initialise the stemmer
pst = PorterStemmer()

## sample data frame
df = pd.DataFrame({'senten': ['I am not dancing','You are playing']})

## tokenize each sentence, then stem every token and join them back into one string
df['senten'] = df['senten'].apply(word_tokenize)
df['senten'] = df['senten'].apply(lambda x: ' '.join([pst.stem(y) for y in x]))
print(df)
          senten
0  I am not danc
1   you are play
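Since the question's data is a Series rather than a DataFrame column, the same idea can be applied to the Series directly. The following is a minimal sketch under a few assumptions: `stem_sentence` is a hypothetical helper name, the two-row `traindata` is a stand-in for the real 2284-row Series, and `wordpunct_tokenize` is swapped in for `word_tokenize` only because it is regex-based and needs no downloaded tokenizer data.

```python
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize  # regex-based, no model download needed

stemmer = PorterStemmer()

def stem_sentence(text):
    """Tokenize the text, stem each token, and join the stems with spaces."""
    return " ".join(stemmer.stem(token) for token in wordpunct_tokenize(text))

# Hypothetical stand-in for the question's 'traindata' Series
traindata = pd.Series(["I am not dancing", "You are playing"], name="traindata")

# apply returns a new Series with the same index and name
stemmed = traindata.apply(stem_sentence)
print(stemmed)
```

To overwrite the original values instead of keeping both, assign the result back with `traindata = traindata.apply(stem_sentence)`. Depending on the NLTK version, the stemmer may also lowercase short tokens such as 'I', so the exact casing of the output can differ from the printout above.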