标记化后,我的句子中包含许多奇怪的字符。如何删除它们? 这是我的代码:
def summary(filename, method):
list_names = glob.glob(filename)
orginal_data = []
topic_data = []
print(list_names)
for file_name in list_names:
article = []
article_temp = io.open(file_name,"r", encoding = "utf-8-sig").readlines()
for line in article_temp:
print(line)
if (line.strip()):
tokenizer =nltk.data.load('tokenizers/punkt/english.pickle')
sentences = tokenizer.tokenize(line)
print(sentences)
article = article + sentences
orginal_data.append(article)
topic_data.append(preprocess_data(article))
if (method == "orig"):
summary = generate_summary_origin(topic_data, 100, orginal_data)
elif (method == "best-avg"):
summary = generate_summary_best_avg(topic_data, 100, orginal_data)
else:
summary = generate_summary_simplified(topic_data, 100, orginal_data)
return summary
print(line)
打印一行txt。 print(sentences)
在行中打印标记化的句子。
但是在nltk处理之后,有时句子中包含奇怪的字符。
Assaly, who is a fan of both Pusha T and Drake, said he and his friends
wondered if people in the crowd might boo Pusha T during the show, but
said he never imagined actual violence would take place.
[u'Assaly, who is a fan of both Pusha T and Drake, said he and his
friends wondered if people in\xa0the crowd might boo Pusha\xa0T during
the show, but said he never imagined actual violence would take
place.']
像上面的示例一样,\xa0
和\xa0T
来自哪里?
答案 0 :(得分:2)
x = u'Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in\xa0the crowd might boo Pusha\xa0T during the show, but said he never imagined actual violence would take place.'
# method 1
x.replace('\xa0', ' ')
# method 2
import unicodedata
unicodedata.normalize('NFKD', x)
print(x)
输出:
Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in the crowd might boo Pusha T during the show, but said he never imagined actual violence would take place.