This is the code I'm using:
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# replace URLs with the token 'URL', strip the '#' from hashtags, drop quotes
ho = ho.replace(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', regex=True)
ho = ho.replace(r'#([^\s]+)', r'\1', regex=True)
ho = ho.replace(r'[\'"]', '', regex=True)

lem = WordNetLemmatizer()
stem = PorterStemmer()
eng_stopwords = stopwords.words('english')

# flatten the whole frame into one string, then stem and tokenize it
ho = ho.to_frame(name=None)
a = ho.to_string(buf=None, columns=None, col_space=None, header=True,
                 index=True, na_rep='NaN', formatters=None, float_format=None,
                 sparsify=False, index_names=True, justify=None, line_width=None,
                 max_rows=None, max_cols=None, show_dimensions=False)
fg = stem.stem(a)
wordList = word_tokenize(fg)
wordList = [word for word in wordList if word not in eng_stopwords]
print(wordList)
When I print(a) I get the output below. I can't get the word tokenization to work correctly.
tweet
0 1495596971.6034188automotive auto ebc greenstu...
1 1495596972.330948new free stock photo of city ...
2 1495596972.775966ebay 1974 volkswagen beetle -...
3 1495596975.6460807cars fly off a hidden speed ...
4 1495596978.12868rt @jiikae guys i think mario ...
These are the first 5 rows of the CSV file:
"1495596971.6034188::automotive auto ebc greenstuff 6000 series supreme
truck and suv brake pads dp61603 https:\/\/t.co\/jpylzjyd5o cars\u2026
https:\/\/t.co\/gfsbz6pkj7""display_text_range:[0140]source:""\u003ca
href=\""https:\/\/dlvrit.com\/\""
rel=\""nofollow\""\u003edlvr.it\u003c\/a\u003e"""
"1495596972.330948::new free stock photo of city cars road
https:\/\/t.co\/qbkgvkfgpp""display_text_range:[0"
"1495596972.775966::ebay: 1974 volkswagen beetle - classic 1952 custom
conversion extremely rare 1974 vw beetle\u2026\u2026
https:\/\/t.co\/wdsnf2pmo7""display_text_range:[0140]source:""\u003ca
href=\""https:\/\/dlvrit.com\/\""
rel=\""nofollow\""\u003edlvr.it\u003c\/a\u003e"""
"1495596975.6460807::cars fly off a hidden speed bump
https:\/\/t.co\/fliiqwt1rk https:\/\/t.co\/klx7kfooro""display_text_range:
[056]source:""\u003ca href=\""https:\/\/dlvrit.com\/\""
rel=\""nofollow\""\u003edlvr.it\u003c\/a\u003e"""
1495596978.12868::rt @jiikae: guys i think mario is going through a mid-life
crisis. buying expensive cars using guns hanging out with proport\u2026
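For reference, ho is presumably a single-column pandas object read from this file. A minimal loading sketch, assuming a file name of tweets.csv and a column named tweet (both names are hypothetical, not from the original post):

import pandas as pd

# hypothetical file and column names, for illustration only
df = pd.read_csv('tweets.csv', names=['tweet'])
ho = df['tweet']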
Answer 0 (score: 2)
I think you need str.split to get the list of all the words (it splits on any whitespace), applied to the selected column ho['tweet']; selecting the tweet column is also necessary:
# output is a string per row
ho1 = ho['tweet'].str.split().apply(
    lambda x: ' '.join([word for word in x if word not in eng_stopwords]))
Or:
# output is a list per row
ho1 = ho['tweet'].str.split().apply(
    lambda x: [word for word in x if word not in eng_stopwords])
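A quick usage sketch showing the difference between the two variants on a made-up two-row frame (the sample texts are invented for illustration, and the printed output shown in comments is approximate):

import pandas as pd
from nltk.corpus import stopwords

eng_stopwords = stopwords.words('english')
ho = pd.DataFrame({'tweet': ['this is a fast car',
                             'cars fly off a hidden speed bump']})

# string variant: one cleaned string per row
print(ho['tweet'].str.split().apply(
    lambda x: ' '.join([w for w in x if w not in eng_stopwords])))
# 0                      fast car
# 1    cars fly hidden speed bump

# list variant: one token list per row
print(ho['tweet'].str.split().apply(
    lambda x: [w for w in x if w not in eng_stopwords])))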
Instead of:
ho = ho.to_frame(name=None)
a = ho.to_string(buf=None, columns=None, col_space=None, header=True,
                 index=True, na_rep='NaN', formatters=None, float_format=None,
                 sparsify=False, index_names=True, justify=None, line_width=None,
                 max_rows=None, max_cols=None, show_dimensions=False)
fg = stem.stem(a)
wordList = word_tokenize(fg)
wordList = [word for word in wordList if word not in eng_stopwords]
print(wordList)
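The reason the original approach fails: to_string flattens the entire frame, index and timestamps included, into one big string, so word_tokenize mixes row numbers and timestamps in with the words (visible in the output above). Working per row keeps tokens aligned with their tweets. If stemming is also wanted, it can be folded into the per-row version; a sketch combining PorterStemmer with the list variant (this pairing is my suggestion, not part of the original answer):

from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

stem = PorterStemmer()
eng_stopwords = stopwords.words('english')

# stem each kept word, row by row, so tokens never cross row boundaries
ho1 = ho['tweet'].str.split().apply(
    lambda x: [stem.stem(word) for word in x if word not in eng_stopwords])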