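I'm using NLTK to remove stopwords from a file. The file is a series of tweets separated by newline characters. The stopword removal works, but it also strips the newlines, so the output is no longer one tweet per line. Here is my code: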
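import codecs

# Read the tweets, one per line (each element keeps its trailing newline)
stuff = codecs.open("/Users/user/Desktop/ngrms/Nonsrcstic.txt", "r", encoding="utf-8")
word_list = stuff.readlines()
[x.encode('utf-8') for x in word_list]
# Load the stopword list as a single string
f = open('english')
stops = f.read()
for line in word_list:
    # Split the line into words; this also discards the trailing newline
    for w in line.split():
        if w.lower() not in stops:
            with open("nostops_Nonsrcstic.txt", "a") as tweetsNoStops:
                tweetsNoStops.write(w.encode('utf-8') + " ")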
The input file looks like this:
Baby boomers are now at the age where "work or retire" is frequently considered choice.
There's a few people I miss but the truth of the matter is, my name probably hasn't crossed their minds or they don't give a shit about me
What you must remember is, I do yarn shows with the help of a Fiat Panda and Tatiana, the trailer, which is small #itfitsbehindaPanda
@BetBright The AP boost won't work lads says try again later is there a problem with the site?
The output looks like this:
Baby boomers age "work retire" frequently considered choice. There's people miss truth matter is, name probably hasn't crossed minds don't give shit must remember is, yarn shows help Fiat Panda Tatiana, trailer, small #itfitsbehindaPanda @BetBright AP boost won't work lads says try later problem site?
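
For comparison, here is a minimal sketch of the per-line behavior being asked for: filter the words of each tweet, rejoin them with spaces, and emit one cleaned line per input line. It is written for Python 3 rather than the Python 2 style of the code above. The function name `remove_stopwords` and the tiny `stops` list are hypothetical stand-ins for illustration, not the NLTK stopword file. The sketch also tests membership against a set of words, because `in` on a single string (such as the result of `f.read()`) matches substrings rather than whole words.

```python
def remove_stopwords(lines, stopwords):
    """Return one cleaned string per input line, with stopwords removed."""
    # A set of lowercase words gives exact, whole-word matching.
    stop_set = {s.lower() for s in stopwords}
    cleaned = []
    for line in lines:
        # split() drops the trailing newline along with other whitespace.
        kept = [w for w in line.split() if w.lower() not in stop_set]
        # Rejoin the surviving words; the line boundary is restored on output.
        cleaned.append(" ".join(kept))
    return cleaned


# Hypothetical, deliberately tiny stopword list (assumption, not NLTK's list).
stops = ["are", "now", "at", "the", "is", "a", "or", "where"]

tweets = [
    'Baby boomers are now at the age where "work or retire" is frequently considered choice.\n',
    "@BetBright The AP boost won't work lads says try again later is there a problem with the site?\n",
]

result = remove_stopwords(tweets, stops)
# Joining with "\n" keeps one tweet per line.
print("\n".join(result))
```

To write the result to a file instead, the same joined string plus a final newline could be written once with `open("nostops_Nonsrcstic.txt", "w", encoding="utf-8")`, which avoids reopening the file for every word.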