Question

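While processing paragraphs from a database, I tried to use sent_tokenize, but it leaves some strange characters in strings that contain apostrophes, as shown below:

import re
from nltk.corpus import wordnet as wn
from nltk import sent_tokenize, word_tokenize, tokenize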

def foo():
    words = [location]  # 'location' holds the paragraph read from the SQLite DB
    corpus = ''.join(words)
    sentences1 = sent_tokenize(corpus)

    print(sentences1)

foo()

'words' is the paragraph from the SQLite DB, and I get this:

[u'The exact cause is unknown and is likely to involve multiple factors.', u'AAA formation and rupture may result from elastin and collagen degradation by proteases such as plasmin, matrix metalloproteinases (MMPs), and cathepsin S and K.\r\nInflammatory conditions such as arteritis.', u'Infective conditions such as syphilis and Salmonella bacterial infections.', u'The most common association with AAA is atherosclerosis.', u'Tobacco use accounts for >90% of people who develop an AAA have smoked at some point in their life.', u'There are high familial prevalence rates especially among the males.', u'The nature of the genetic disorder is unclear but may be linked to alpha-1-antitrypsin deficiency or X-linked mutation.', u'Connective tissue disorders, such as Marfan\u2019s syndrome and Ehlers-Danlos syndrome, have also been strongly associated with AAA.']

Instead of Marfan's, I get Marfan\u2019s.

Using this code:

sentences1 = [x for x in corpus if x.isalnum()]

I get:

[u'T', u'h', u'e', u'e', u'x', u'a', u'c', u't', u'c', u'a', u'u', u's', u'e', u'i', u's', u'u', u'n', u'k', u'n', u'o', u'w', u'n', u'a', u'n', u'd', u'i', u's', u'l', u'i', u'k', u'e', u'l', u'y', u't', u'o', u'i', u'n', u'v', u'o', u'l', u'v', u'e', u'm', u'u', u'l', u't', u'i', u'p', u'l', u'e', u'f', u'a', u'c', u't', u'o', u'r', u's', u'A', u'A', u'A', u'f', u'o', u'r', u'm', u'a', u't', u'i', u'o', u'n', u'a', u'n', u'd', u'r', u'u', u'p', u't', u'u', u'r', u'e', u'm', u'a', u'y', u'r', u'e', u's', u'u', u'l', u't', u'f', u'r', u'o', u'm', u'e', u'l', u'a', u's', u't', u'i', u'n', u'a', u'n', u'd', u'c', u'o', u'l', u'l', u'a', u'g', u'e', u'n', u'd', u'e', u'g', u'r', u'a', u'd', u'a', u't', u'i', u'o', u'n', u'b', u'y', u'p', u'r', u'o', u't', u'e', u'a', u's', u'e', u's', u's', u'u', u'c', u'h', u'a', u's', u'p', u'l', u'a', u's', u'm', u'i', u'n', u'm', u'a', u't', u'r', u'i', u'x', u'm', u'e', u't', u'a', u'l', u'l', u'o', u'p', u'r', u'o', u't', u'e', u'i', u'n', u'a', u's', u'e', u's', u'M', u'M', u'P', u's', u'a', u'n', u'd', u'c', u'a', u't', u'h', u'e', u'p', u's', u'i', u'n', u'S', u'a', u'n', u'd', u'K', u'I', u'n', u'f', u'l', u'a', u'm', u'm', u'a', u't', u'o', u'r', u'y', u'c', u'o', u'n', u'd', u'i', u't', u'i', u'o', u'n', u's', u's', u'u', u'c', u'h', u'a', u's', u'a', u'r', u't', u'e', u'r', u'i', u't', u'i', u's', u'I', u'n', u'f', u'e', u'c', u't', u'i', u'v', u'e', u'c', u'o', u'n', u'd', u'i', u't', u'i', u'o', u'n', u's', u's', u'u', u'c', u'h', u'a', u's', u's', u'y', u'p', u'h', u'i', u'l', u'i', u's', u'a', u'n', u'd', u'S', u'a', u'l', u'm', u'o', u'n', u'e', u'l', u'l', u'a', u'b', u'a', u'c', u't', u'e', u'r', u'i', u'a', u'l', u'i', u'n', u'f', u'e', u'c', u't', u'i', u'o', u'n', u's', u'T', u'h', u'e', u'm', u'o', u's', u't', u'c', u'o', u'm', u'm', u'o', u'n', u'a', u's', u's', u'o', u'c', u'i', u'a', u't', u'i', u'o', u'n', u'w', u'i', u't', u'h', u'A', u'A', u'A', u'i', u's', u'a', u't', u'h', u'e', u'r', u'o', u's', u'c', 
u'l', u'e', u'r', u'o', u's', u'i', u's', u'T', u'o', u'b', u'a', u'c', u'c', u'o', u'u', u's', u'e', u'a', u'c', u'c', u'o', u'u', u'n', u't', u's', u'f', u'o', u'r', u'9', u'0', u'o', u'f', u'p', u'e', u'o', u'p', u'l', u'e', u'w', u'h', u'o', u'd', u'e', u'v', u'e', u'l', u'o', u'p', u'a', u'n', u'A', u'A', u'A', u'h', u'a', u'v', u'e', u's', u'm', u'o', u'k', u'e', u'd', u'a', u't', u's', u'o', u'm', u'e', u'p', u'o', u'i', u'n', u't', u'i', u'n', u't', u'h', u'e', u'i', u'r', u'l', u'i', u'f', u'e', u'T', u'h', u'e', u'r', u'e', u'a', u'r', u'e', u'h', u'i', u'g', u'h', u'f', u'a', u'm', u'i', u'l', u'i', u'a', u'l', u'p', u'r', u'e', u'v', u'a', u'l', u'e', u'n', u'c', u'e', u'r', u'a', u't', u'e', u's', u'e', u's', u'p', u'e', u'c', u'i', u'a', u'l', u'l', u'y', u'a', u'm', u'o', u'n', u'g', u't', u'h', u'e', u'm', u'a', u'l', u'e', u's', u'T', u'h', u'e', u'n', u'a', u't', u'u', u'r', u'e', u'o', u'f', u't', u'h', u'e', u'g', u'e', u'n', u'e', u't', u'i', u'c', u'd', u'i', u's', u'o', u'r', u'd', u'e', u'r', u'i', u's', u'u', u'n', u'c', u'l', u'e', u'a', u'r', u'b', u'u', u't', u'm', u'a', u'y', u'b', u'e', u'l', u'i', u'n', u'k', u'e', u'd', u't', u'o', u'a', u'l', u'p', u'h', u'a', u'1', u'a', u'n', u't', u'i', u't', u'r', u'y', u'p', u's', u'i', u'n', u'd', u'e', u'f', u'i', u'c', u'i', u'e', u'n', u'c', u'y', u'o', u'r', u'X', u'l', u'i', u'n', u'k', u'e', u'd', u'm', u'u', u't', u'a', u't', u'i', u'o', u'n', u'C', u'o', u'n', u'n', u'e', u'c', u't', u'i', u'v', u'e', u't', u'i', u's', u's', u'u', u'e', u'd', u'i', u's', u'o', u'r', u'd', u'e', u'r', u's', u's', u'u', u'c', u'h', u'a', u's', u'M', u'a', u'r', u'f', u'a', u'n', u's', u's', u'y', u'n', u'd', u'r', u'o', u'm', u'e', u'a', u'n', u'd', u'E', u'h', u'l', u'e', u'r', u's', u'D', u'a', u'n', u'l', u'o', u's', u's', u'y', u'n', u'd', u'r', u'o', u'm', u'e', u'h', u'a', u'v', u'e', u'a', u'l', u's', u'o', u'b', u'e', u'e', u'n', u's', u't', u'r', u'o', u'n', u'g', u'l', u'y', u'a', u's', u's', 
u'o', u'c', u'i', u'a', u't', u'e', u'd', u'w', u'i', u't', u'h', u'A', u'A', u'A']

Using this other code:

sentences1 = sent_tokenize(''.join(corpus.encode('utf8').decode('ascii','ignore')))

I get:

[u'The exact cause is unknown and is likely to involve multiple factors.', u'AAA formation and rupture may result from elastin and collagen degradation by proteases such as plasmin, matrix metalloproteinases (MMPs), and cathepsin S and K.\r\nInflammatory conditions such as arteritis.', u'Infective conditions such as syphilis and Salmonella bacterial infections.', u'The most common association with AAA is atherosclerosis.', u'Tobacco use accounts for >90% of people who develop an AAA have smoked at some point in their life.', u'There are high familial prevalence rates especially among the males.', u'The nature of the genetic disorder is unclear but may be linked to alpha-1-antitrypsin deficiency or X-linked mutation.', u'Connective tissue disorders, such as Marfans syndrome and Ehlers-Danlos syndrome, have also been strongly associated with AAA.']

But Marfan's was converted to Marfans. It should stay as Marfan's.

How can I correct this?

Answer 1

I finally found that this works well:

from unidecode import unidecode

corpus = "".join(words)
sent = []
# unidecode transliterates non-ASCII characters (e.g. \u2019) to their closest ASCII equivalents
sent.append(unidecode(corpus))
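
For context, the \u2019 in the list output is how Python 2 displays U+2019 (RIGHT SINGLE QUOTATION MARK) inside a list's repr; the string itself holds the real character. If only typographic quotes need to become ASCII, a small translation table may be enough without an extra dependency. The following is a minimal sketch under that assumption, not a replacement for unidecode in general; the names QUOTE_MAP and normalize_quotes are hypothetical, and in Python 2 the input must be a unicode object.

```python
# Hypothetical helper: map common typographic quotes to ASCII equivalents.
# Keys are Unicode code points, as expected by str.translate.
QUOTE_MAP = {
    0x2018: u"'",   # left single quotation mark
    0x2019: u"'",   # right single quotation mark (the one in "Marfan\u2019s")
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
}


def normalize_quotes(text):
    """Return text with typographic quotes replaced by ASCII quotes."""
    return text.translate(QUOTE_MAP)


sample = u"such as Marfan\u2019s syndrome and Ehlers-Danlos syndrome"
cleaned = normalize_quotes(sample)
print(cleaned)
```

The cleaned string can then be passed to sent_tokenize as before. Any other non-ASCII characters are left untouched, which may or may not be what you want.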

How to remove non-ASCII characters from a string in Python
