Can't replace digits with a space during text preprocessing

Asked: 2018-12-16 06:21:56

Tags: python nlp

I am trying to preprocess text as part of an NLP task. I am a beginner, and I don't understand why I can't replace the digits:

para = """support leaders around the world who do not speak for the big 
polluters, but who speak for all of humanity, for the indigenous people of 
the world, for the first 100 people.In 90's it seems true."""

import re
import nltk

sentences = nltk.sent_tokenize(para)

for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [re.sub(r'\d','',words)]
    sentences[i] = ' '.join(words)

When I run this, I get the following error:


TypeError                                 Traceback (most recent call last)
<ipython-input-28-000671b45ee1> in <module>()
      2 for i in range(len(sentences)):
      3     words = nltk.word_tokenize(sentences[i])
----> 4     words = [re.sub(r'\d','',words)].encode('utf8')
      5     sentences[i] = ' '.join(words)

~\Anaconda3\lib\re.py in sub(pattern, repl, string, count, flags)
    189     a callable, it's passed the match object and must return
    190     a replacement string to be used."""
--> 191     return _compile(pattern, flags).sub(repl, string, count)
    192 
    193 def subn(pattern, repl, string, count=0, flags=0):

TypeError: expected string or bytes-like object

How do I convert this to a bytes-like object? I am confused because I am new to this.

3 Answers:

Answer 0 (score: 1)

The error is trying to tell you that you are calling re.sub with something that is not a string (ignore the "or bytes-like" part: you can just use a real string). The culprit is words: the function nltk.word_tokenize() returns a list, and you cannot pass the whole list to re.sub. You need another for loop, or a comprehension. Here is a comprehension that applies re.sub to each element w of words:

sentences = nltk.sent_tokenize(para)
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [re.sub(r'\d','',w) for w in words]
    sentences[i] = ' '.join(words)

While you are at it, I recommend replacing the index-based loop with a loop over the list elements directly. That is cleaner style, but you then have to collect the results in a new list:

sentences = nltk.sent_tokenize(para)
clean = []
for sent in sentences:
    words = nltk.word_tokenize(sent)
    words = [re.sub(r'\d','',w) for w in words]
    clean.append(' '.join(words))

PS. You could simplify your code by applying the substitution to whole sentences, or even to the entire paragraph before you split it. But that is beside your question...
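A minimal sketch of that simplification (using a shortened stand-in for the original paragraph): a single re.sub over the whole text strips every digit before any tokenization happens.

```python
import re

# Shortened stand-in for the original paragraph.
para = "for the first 100 people. In the 90's it seems true."

# One substitution over the entire paragraph, before any
# sentence or word tokenization.
clean_para = re.sub(r'\d', '', para)
print(clean_para)  # for the first  people. In the 's it seems true.
```

Note that removing digits leaves the surrounding whitespace behind ("first  people" keeps a double space), which the tokenize-then-join approach in the loops above cleans up as a side effect.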

Answer 1 (score: 0)

Is this what you are trying to do, or have I missed the point?

import re

para = """support leaders around the world who do not speak for the big 
polluters, but who speak for all of humanity, for the indigenous people of 
the world, for the first 100 people.In 90's it seems true."""

tokenized = para.split(' ')
new_para = []
for w in tokenized:
    w = re.sub('[0-9]', '', w)
    new_para.append(w)
print(' '.join(new_para))

Answer 2 (score: 0)

To replace all digits in a string, you can use the re module to match and substitute a regular-expression pattern. Following the previous example:

import re

processed_words = [re.sub(r'\d', ' ', word) for word in tokenized]