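I am reading rows from a CSV file and applying the LDA algorithm to find the most common topics. After processing the data, every word in doc_processed is prefixed with u. Why does this happen, and how can I remove the u from the processed documents? My Python 2.7 code is: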
import re
import string
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

with open("/home/dnandini/test/h.csv", 'r') as f:
    data = [line.strip() for line in f]
stop = set(stopwords.words('english'))  # stop words
exclude = set(string.punctuation)  # punctuation characters to remove
lemma = WordNetLemmatizer()  # reduces each word to its lemma (base form)

def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    shortword = re.compile(r'\W*\b\w{1,2}\b')  # matches words of one or two characters
    output = shortword.sub('', normalized)
    return output

doc_processed = [clean(doc) for doc in data]
The output of doc_processed is:
[u'amount', u'ze69heneicosadien11one', u'trap', u'containing', u'little', u'microg', u'zz69ket', u'attracted', u'male', u'low', u'population', u'level', u'indicating', u'potent', u'sex', u'attractant', u'trap', u'baited', u'z6ket', u'attracted', u'male', u'windtunnel', u'bioassay', u'least', u'100fold', u'attractive', u'male', u'zz69ket', u'improvement', u'trap', u'catch', u'occurred', u'addition', u'z6ket', u'various', u'binary', u'mixture', u'zz69ket', u'including', u'female', u'ratio', u'ternary', u'mixture', u'zz69ket']
Answer 0 (score: 0)
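The u'some string' format means the value is a unicode string. The prefix is only part of how Python 2 displays the string (its repr, which is what you see when you print a list); it is not part of the text itself. See this question for more details on unicode strings themselves, but the simplest fix is probably to call encode() on the result before returning it from clean.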
def clean(doc):
    # all as before until
    output = shortword.sub('', normalized).encode()
    return output
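To make the effect of the encode step concrete, here is a minimal sketch. The token list is a hard-coded stand-in for a few entries of doc_processed, and encode_tokens is a hypothetical helper, not part of the original code. Passing the encoding explicitly is also an assumption on my part, so the result does not depend on the interpreter's default encoding.

```python
from __future__ import print_function

# Hard-coded stand-in for a few entries of doc_processed (illustrative only).
tokens = [u'amount', u'trap', u'containing']

def encode_tokens(words, encoding="utf-8"):
    """Hypothetical helper: encode each unicode string to a byte string."""
    return [w.encode(encoding) for w in words]

# Printing a list shows each element's repr; Python 2.7 marks unicode with u.
print(tokens)

# Joining or printing the elements shows the text itself, with no prefix.
print(u" ".join(tokens))

# After encoding, Python 2.7 shows plain 'amount'; Python 3 shows b'amount'.
encoded = encode_tokens(tokens)
print(encoded)
```

If the strings are only being passed on to further processing rather than displayed, the prefix is harmless and the encode step may not be needed at all.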
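Note that trying to encode Unicode code points that cannot be converted directly to the default encoding (which appears to be ASCII; see sys.getdefaultencoding() on your system) will raise a UnicodeEncodeError here. You can handle such errors in various ways by passing the errors keyword argument to encode.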
s.encode(errors="ignore")  # omit the code point that fails to encode
s.encode(errors="replace")  # replace the failing code point with '?'
s.encode(errors="xmlcharrefreplace")  # replace the failing code point with an XML character reference such as '&#233;'
# Note that U+FFFD, the Unicode Replacement Character, is what decode(errors="replace") inserts; encode uses '?'.
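The handlers can be compared side by side with a short sketch. The sample string and the explicit "ascii" codec are assumptions chosen so that the failure is reproducible regardless of the system default; they are not taken from the question's data.

```python
from __future__ import print_function

# Hypothetical sample: 'cafe' with an accented final letter (U+00E9),
# which has no ASCII representation.
s = u"caf\xe9"

# The default handler, errors="strict", raises on the first failing code point.
try:
    s.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

ignored = s.encode("ascii", errors="ignore")              # b'caf'
replaced = s.encode("ascii", errors="replace")            # b'caf?'
xml_ref = s.encode("ascii", errors="xmlcharrefreplace")   # b'caf&#233;'

# For comparison, decoding undecodable bytes with errors="replace"
# inserts U+FFFD, the Unicode Replacement Character.
decoded = b"caf\xe9".decode("ascii", errors="replace")    # u'caf\ufffd'

print(strict_failed, ignored, replaced, xml_ref, repr(decoded))
```

Which handler is appropriate depends on whether losing or altering the offending characters is acceptable for the later topic modelling step.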