This is the code I'm using:
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# replace URLs with the token 'URL', strip the '#' from hashtags, drop quotes
ho = ho.replace(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', regex=True)
ho = ho.replace(r'#([^\s]+)', r'\1', regex=True)
ho = ho.replace(r'[\'"]', '', regex=True)

lem = WordNetLemmatizer()
stem = PorterStemmer()
eng_stopwords = stopwords.words('english')

# flatten the whole frame into one string, then stem and tokenize it
ho = ho.to_frame(name=None)
a = ho.to_string(buf=None, columns=None, col_space=None, header=True,
                 index=True, na_rep='NaN', formatters=None, float_format=None,
                 sparsify=False, index_names=True, justify=None, line_width=None,
                 max_rows=None, max_cols=None, show_dimensions=False)
fg = stem.stem(a)
wordList = word_tokenize(fg)
wordList = [word for word in wordList if word not in eng_stopwords]
print(wordList)
When I print(a) I get the output below. I can't get the word tokenization to work correctly.
tweet
0 1495596971.6034188automotive auto ebc greenstu...
1 1495596972.330948new free stock photo of city ...
2 1495596972.775966ebay 1974 volkswagen beetle -...
3 1495596975.6460807cars fly off a hidden speed ...
4 1495596978.12868rt @jiikae guys i think mario ...
These are the first 5 rows of the CSV file:
"1495596971.6034188::automotive auto ebc greenstuff 6000 series supreme
truck and suv brake pads dp61603 https:\/\/t.co\/jpylzjyd5o cars\u2026
https:\/\/t.co\/gfsbz6pkj7""display_text_range:[0140]source:""\u003ca
href=\""https:\/\/dlvrit.com\/\""
rel=\""nofollow\""\u003edlvr.it\u003c\/a\u003e"""
"1495596972.330948::new free stock photo of city cars road
https:\/\/t.co\/qbkgvkfgpp""display_text_range:[0"
"1495596972.775966::ebay: 1974 volkswagen beetle - classic 1952 custom
conversion extremely rare 1974 vw beetle\u2026\u2026
https:\/\/t.co\/wdsnf2pmo7""display_text_range:[0140]source:""\u003ca
href=\""https:\/\/dlvrit.com\/\""
rel=\""nofollow\""\u003edlvr.it\u003c\/a\u003e"""
"1495596975.6460807::cars fly off a hidden speed bump
https:\/\/t.co\/fliiqwt1rk https:\/\/t.co\/klx7kfooro""display_text_range:
[056]source:""\u003ca href=\""https:\/\/dlvrit.com\/\""
rel=\""nofollow\""\u003edlvr.it\u003c\/a\u003e"""
1495596978.12868::rt @jiikae: guys i think mario is going through a mid-life
crisis. buying expensive cars using guns hanging out with proport\u2026
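For reference, ho is presumably a single-column pandas object read from this file. A minimal loading sketch, assuming a file name of tweets.csv and a column named tweet (both names are hypothetical, not from the original post):

import pandas as pd

# hypothetical file and column names, for illustration only
df = pd.read_csv('tweets.csv', names=['tweet'])
ho = df['tweet']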
Answer 0 (score: 2)
I think you need str.split to get the list of all the words (it splits on any whitespace), applied to the selected column ho['tweet']; selecting the tweet column is also necessary:
# output is a string per row
ho1 = ho['tweet'].str.split().apply(
    lambda x: ' '.join([word for word in x if word not in eng_stopwords]))
Or:
# output is a list per row
ho1 = ho['tweet'].str.split().apply(
    lambda x: [word for word in x if word not in eng_stopwords])
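A quick usage sketch showing the difference between the two variants on a made-up two-row frame (the sample texts are invented for illustration, and the printed output shown in comments is approximate):

import pandas as pd
from nltk.corpus import stopwords

eng_stopwords = stopwords.words('english')
ho = pd.DataFrame({'tweet': ['this is a fast car',
                             'cars fly off a hidden speed bump']})

# string variant: one cleaned string per row
print(ho['tweet'].str.split().apply(
    lambda x: ' '.join([w for w in x if w not in eng_stopwords])))
# 0                      fast car
# 1    cars fly hidden speed bump

# list variant: one token list per row
print(ho['tweet'].str.split().apply(
    lambda x: [w for w in x if w not in eng_stopwords])))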
Instead of:
ho = ho.to_frame(name=None)
a = ho.to_string(buf=None, columns=None, col_space=None, header=True,
                 index=True, na_rep='NaN', formatters=None, float_format=None,
                 sparsify=False, index_names=True, justify=None, line_width=None,
                 max_rows=None, max_cols=None, show_dimensions=False)
fg = stem.stem(a)
wordList = word_tokenize(fg)
wordList = [word for word in wordList if word not in eng_stopwords]
print(wordList)
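The reason the original approach fails: to_string flattens the entire frame, index and timestamps included, into one big string, so word_tokenize mixes row numbers and timestamps in with the words (visible in the output above). Working per row keeps tokens aligned with their tweets. If stemming is also wanted, it can be folded into the per-row version; a sketch combining PorterStemmer with the list variant (this pairing is my suggestion, not part of the original answer):

from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

stem = PorterStemmer()
eng_stopwords = stopwords.words('english')

# stem each kept word, row by row, so tokens never cross row boundaries
ho1 = ho['tweet'].str.split().apply(
    lambda x: [stem.stem(word) for word in x if word not in eng_stopwords])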