Question

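This is my first text mining project.

After tokenizing the tweets and counting the most frequent words, I get many results like the ones below.

Format: ("string", frequency)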
```
('\xe2\x80\x98', 3476)
('\xed\xa0\xbd', 2268)
```
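Do sequences such as \xed, \xe0, or a leading \ in general have a special meaning? I searched on Google but could not find anything.
Is there a clean way in Python, for example a regular expression, to treat any tweet term that starts with \ as an unwanted word (that is, to add it to the "stop" list in the script below)?

Here is the script I use to build the stop-word list and remove unwanted terms: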
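For reference, a minimal sketch of what the two strings above appear to be. It assumes Python 3 (the question uses Python 2.7, where the `surrogatepass` error handler is not available), and the variable names are only illustrative. The backslashes are not characters in the tweets: they are how Python prints non-ASCII bytes, so `\xe2\x80\x98` is three UTF-8 bytes, not a word starting with `\`.

```python
import unicodedata

# The two most frequent "words" from the question, as raw UTF-8 bytes.
left_quote_bytes = b'\xe2\x80\x98'
half_emoji_bytes = b'\xed\xa0\xbd'

# These three bytes decode to a single character: a typographic opening quote.
left_quote = left_quote_bytes.decode('utf-8')
quote_name = unicodedata.name(left_quote)

# These bytes encode a lone UTF-16 high surrogate, which strict UTF-8 decoding
# rejects; 'surrogatepass' lets us inspect the code point anyway.
half_emoji = half_emoji_bytes.decode('utf-8', errors='surrogatepass')
code_point = ord(half_emoji)
is_high_surrogate = 0xD800 <= code_point <= 0xDBFF

print(quote_name)
print(hex(code_point), is_high_surrogate)
```

A lone high surrogate like this is typically the first half of a character outside the Basic Multilingual Plane (emoji are a common case in tweets) that was split in two during tokenization, which would be consistent with its high frequency here.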

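```
# python 2.7, nltk 3.1
import json
import string

from nltk.corpus import stopwords

punctuation = list(string.punctuation)

# Extra terms to remove: rt, RT (retweet), via (retweet)
stop = stopwords.words('english') + punctuation + ['rt', 'via']  # Unwanted word list

# tweets_file and preprocess() are defined elsewhere in the script (not shown)
for line in tweets_file:
    try:
        tweet = json.loads(line)
        # Drop stop words, then encode the remaining terms as UTF-8 byte strings
        terms_stops = [term for term in preprocess(tweet['text']) if term not in stop]
        terms_stops_utf = [x.encode('utf-8') for x in terms_stops]
    except (ValueError, KeyError):
        # Assumed handler: the original snippet is cut off before its except clause
        continue
```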
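One possible way to drop such terms, sketched under assumptions: the filter runs on the Unicode terms before they are encoded to UTF-8, and `NON_ASCII`, `drop_non_ascii`, and `sample_terms` are hypothetical names that do not appear in the original script. Because the `\x..` escapes are only a printed representation of non-ASCII bytes, the regular expression looks for non-ASCII characters and does not look for a literal backslash.

```python
import re

# Matches any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(u'[^\x00-\x7f]')


def drop_non_ascii(terms):
    """Return only the terms made up entirely of ASCII characters."""
    return [term for term in terms if not NON_ASCII.search(term)]


# Illustrative tokens: a curly quote, an accented word, and a lone surrogate.
sample_terms = [u'hello', u'\u2018', u'caf\u00e9', u'world', u'\ud83d']
kept = drop_non_ascii(sample_terms)
print(kept)
```

In the loop above, this would amount to adding `and not NON_ASCII.search(term)` to the condition of the `terms_stops` list comprehension. Note the trade-off: the filter also discards legitimate non-English words such as "café", so a narrower character class may be preferable if those matter.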

Python: removing tweet text such as '\xe2\x80\x98'

0 Answers: