Question

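I'm currently working on a project that involves preparing test and training data for sentiment analysis. I've run into a problem with re.sub() that I can't figure out how to solve. My code is as follows: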

import re

def preprocessor(text):
    text = re.sub(r"<[^>]*>", "", text)  # remove all HTML markup
    emoticons = re.findall(r'(?::|;|= )(?:-)?(?:\)|\(|D|P)', text)
    # remove all non-word characters and convert the text to lowercase
    text = (re.sub(r'[\W]+', '', text.lower()) + ''.join(emoticons).replace('-', ''))
    return text

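As you can see, the function runs without raising an exception. However, when I print the text to check whether it gives the result I want, I get the following output: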

preprocessor(df.loc[0, 'review'][-50:])

'isseventitlebrazilnotavailable'

My desired output is:

'is seven title brazil not available'

I suspect that my re.sub() is removing all the whitespace, but I don't know how to fix it.

Any answers would be appreciated.

N.B.: I want to clean the string like this. For example, from

'is seven.<br /><br />Title (Brazil): Not Available'

to

'is seven title brazil not available'

Thanks.

Answer 1

You can try the following:

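import re

text = 'is seven.<br /><br />Title (Brazil): Not  Available'
# replace HTML tags with a space so neighbouring words stay separated
text = re.sub(r"<.*?>", " ", text)
# drop everything except letters, digits, and whitespace
text = re.sub(r'[^a-zA-Z0-9\s]+', '', text)
# collapse runs of whitespace into a single space
text = re.sub(r'\s+', ' ', text).strip()
print(text)

Output:

is seven Title Brazil Not Available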

Answer 2

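When you use \W in a regular expression, it also matches whitespace characters. In your case, those get replaced with the empty string as well. To demonstrate, here is a piece of code: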

import re

text = "This is my Text"
text1 = re.sub(r'[\W]+', '', text.lower())
text2 = re.sub(r'[^a-zA-Z0-9_\s]+', '', text.lower())

print(text1)  # thisismytext
print(text2)  # this is my text

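If you check the docs, \W is equivalent to [^a-zA-Z0-9_] when the re.ASCII flag is set; for ordinary str patterns it matches any character that is not a Unicode word character. Either way, whitespace is included. If you don't want whitespace to be replaced with the empty string, add the whitespace symbol \s to the negated character class, as in text2 above.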
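Applied to the function from the question, the idea could look roughly like the sketch below. It is only one possible variant, not the only fix: replacing tags and runs of non-word characters with a single space (instead of the empty string) and joining the emoticons with spaces are my own assumptions about what you want. The emoticon pattern is copied unchanged from the question.

```python
import re

def preprocessor(text):
    # Replace HTML tags with a space so adjacent words do not fuse together
    # (assumption: the question's version replaced them with '').
    text = re.sub(r"<[^>]*>", " ", text)
    # Collect emoticons before punctuation is stripped (pattern from the question).
    emoticons = re.findall(r"(?::|;|= )(?:-)?(?:\)|\(|D|P)", text)
    # Replace each run of non-word characters with ONE space, then trim the ends.
    text = re.sub(r"\W+", " ", text.lower()).strip()
    # Append any emoticons, with the '-' nose removed, separated by spaces.
    if emoticons:
        text = text + " " + " ".join(emoticons).replace("-", "")
    return text

print(preprocessor("is seven.<br /><br />Title (Brazil): Not Available"))
# is seven title brazil not available
```

Because \W+ matches the whole run of punctuation and whitespace at once, sequences such as "): " collapse to a single space, so no separate whitespace cleanup step is needed.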
