Question

After tokenization, my sentences contain a lot of strange characters. How can I remove them? Here is my code:

import glob
import io

import nltk


def summary(filename, method):
    # preprocess_data and the generate_summary_* functions are defined elsewhere.
    list_names = glob.glob(filename)
    orginal_data = []
    topic_data = []
    print(list_names)
    # Load the Punkt sentence tokenizer once instead of once per line.
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    for file_name in list_names:
        article = []
        # Read all lines and close the file when done.
        with io.open(file_name, "r", encoding="utf-8-sig") as f:
            article_temp = f.readlines()
        for line in article_temp:
            print(line)
            if line.strip():
                sentences = tokenizer.tokenize(line)
                print(sentences)
                article = article + sentences
        orginal_data.append(article)
        topic_data.append(preprocess_data(article))
    if method == "orig":
        summary = generate_summary_origin(topic_data, 100, orginal_data)
    elif method == "best-avg":
        summary = generate_summary_best_avg(topic_data, 100, orginal_data)
    else:
        summary = generate_summary_simplified(topic_data, 100, orginal_data)
    return summary

print(line) prints one line of the text file. print(sentences) prints the tokenized sentences from that line.

But after NLTK processes the text, the sentences sometimes contain strange characters.

Assaly, who is a fan of both Pusha T and Drake, said he and his friends 
wondered if people in the crowd might boo Pusha T during the show, but 
said he never imagined actual violence would take place.

[u'Assaly, who is a fan of both Pusha T and Drake, said he and his 
friends wondered if people in\xa0the crowd might boo Pusha\xa0T during 
the show, but said he never imagined actual violence would take 
place.']

As in the example above, where do \xa0 and \xa0T come from?

Answer 1

x = u'Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in\xa0the crowd might boo Pusha\xa0T during the show, but said he never imagined actual violence would take place.'

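# method 1: replace the no-break space (U+00A0) with a regular space
# (str.replace returns a new string, so assign the result)
x = x.replace(u'\xa0', u' ')

# method 2: compatibility normalization maps U+00A0 to a regular space
import unicodedata
x = unicodedata.normalize('NFKD', x)

print(x)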

Output:

Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in the crowd might boo Pusha T during the show, but said he never imagined actual violence would take place.
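
As for where the characters come from: \xa0 is the escaped form of U+00A0 (NO-BREAK SPACE). It is most likely already present in the input file (it is common in text copied from web pages) rather than added by NLTK. print(line) renders it as what looks like an ordinary space, while printing a list shows the repr of each element, where it appears escaped. A minimal sketch, using a shortened sample string made up for illustration:

```python
import unicodedata

# Hypothetical shortened sample; the no-break space is already in the input.
line = u'people in\xa0the crowd might boo Pusha\xa0T'

# Printing the string renders U+00A0 as what looks like a normal space.
print(line)

# Printing a list shows the repr of each element, so U+00A0 appears escaped.
print([line])

# Confirm which character it is and how often it occurs.
nbsp_name = unicodedata.name(u'\xa0')
nbsp_count = line.count(u'\xa0')
print(nbsp_name, nbsp_count)
```

So the "\xa0T" in the output is just a no-break space followed by the letter T.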

Reference: unicodedata.normalize()
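
One caveat worth keeping in mind: NFKD normalization changes more than the no-break space. It also decomposes accented letters and replaces compatibility characters such as ligatures. If only the whitespace should change, a targeted replacement may be the safer choice. The sketch below compares the two on a made-up sample; clean_line is a hypothetical helper name (not part of NLTK) that could be applied to each line before it is passed to the tokenizer.

```python
import unicodedata


def clean_line(line):
    """Replace no-break spaces with regular spaces; leave other characters alone.

    clean_line is a hypothetical helper name, not part of NLTK.
    """
    return line.replace(u'\xa0', u' ')


# Made-up sample: 'caf' + e-acute, a no-break space, the 'fi' ligature, 'le'.
sample = u'caf\xe9\xa0\ufb01le'

targeted = clean_line(sample)
normalized = unicodedata.normalize('NFKD', sample)

# The targeted replacement keeps the accent and the ligature intact.
print(repr(targeted))
# NFKD splits the accent into a combining mark and expands the ligature.
print(repr(normalized))
```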
