How do I prepare text for a machine learning pipeline using compiled regular expressions and/or list comprehensions?

Time: 2018-12-11 10:30:59

Tags: python regex nlp list-comprehension

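I am trying to prepare text for a machine learning pipeline in a Python function, but I cannot get the correct output. I want to lowercase all words, replace certain symbols with spaces, delete the remaining unwanted symbols, and remove the NLTK stopwords. I have tried various approaches, from list comprehensions to regex pattern matching, but I still cannot get it to work. Please help! Here are the necessary imports and the basic function: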

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

import re

Here is the function:

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def text_prepare(text):
    """
    text: a string

    return: modified initial string
    """

    lower = text.lower() # lowercase text
    space_replace = REPLACE_BY_SPACE_RE.sub(" ",lower) # replace REPLACE_BY_SPACE_RE symbols by space in text
    nosymb = BAD_SYMBOLS_RE.sub("",space_replace) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = [word for word in nosymb if word not in STOPWORDS] # delete stopwords from text

    return text

Here is a test function:

def test_text_prepare():
    examples = ["SQL Server - any equivalent of Excel's CHOOSE function?",
                "How to free c++ memory vector<int> * arr?"]
    answers = ["sql server equivalent excels choose function",
               "free c++ memory vectorint arr"]
    for ex, ans in zip(examples, answers):
        if text_prepare(ex) != ans:
            return "Wrong answer for the case: '%s'" % ex
    return 'Basic tests are passed.'

Here is my test result:

print(test_text_prepare())
Wrong answer for the case: 'SQL Server - any equivalent of Excel's CHOOSE function?'

3 Answers:

Answer 0 (score: 0)

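In [word for word in nosymb if word not in STOPWORDS], nosymb is a string, so the comprehension iterates over individual characters rather than words. In addition, you do not strip leading/trailing whitespace, nor do you "shrink" the redundant whitespace left behind by the earlier cleanup steps.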
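To make the difference concrete, here is a minimal sketch comparing iteration over the string with iteration over nosymb.split(). The two-word STOPWORDS set is a stand-in assumed for illustration only; the question uses NLTK's English list.

```python
# Stand-in stopword set (an assumption for this sketch); the question
# builds STOPWORDS from nltk's stopwords.words('english').
STOPWORDS = {"any", "of"}

nosymb = "sql server  any equivalent of excels choose function"

# Iterating over a string yields one character at a time, so no item can
# ever equal a multi-letter stopword; the result is a list of characters.
by_char = [word for word in nosymb if word not in STOPWORDS]

# split() with no arguments splits on runs of whitespace and yields words.
by_word = [word for word in nosymb.split() if word not in STOPWORDS]

print(by_char[:5])
print(by_word)
```

Note that the character-based version also returns a list instead of a string, which on its own is enough to make the comparison in the test fail.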

Here is an updated approach:

def text_prepare(text):
    """
    text: a string
    return: modified initial string
    """
    lower = text.lower() # lowercase text
    space_replace = REPLACE_BY_SPACE_RE.sub(" ",lower) #replace REPLACE_BY_SPACE_RE symbols by space in text
    nosymb = BAD_SYMBOLS_RE.sub("",space_replace) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = re.sub(r"\s*\b(?:{})\b".format("|".join(STOPWORDS)), "", nosymb) # delete STOPWORDS
    return re.sub(r" {2,}", " ", text.strip())

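The re.sub(r"\s*\b(?:{})\b".format("|".join(STOPWORDS)), "", nosymb) part removes every stopword that matches as a whole word (\b is a word boundary), together with any whitespace directly before it.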

The re.sub(r" {2,}", " ", text.strip()) part trims the string and collapses every run of two or more spaces inside it into a single space.
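As a possible refinement, the stopword pattern could be compiled once at module level rather than rebuilt on every call, with each word passed through re.escape. This is only a sketch: STOPWORDS_RE, MULTI_SPACE_RE, remove_stopwords, and the small stopword set are names and data assumed here, not part of the original code.

```python
import re

# Stand-in stopword set (an assumption); the original code uses
# set(stopwords.words('english')) from NLTK.
STOPWORDS = {"how", "to", "any", "of", "don't"}

# Escape each word so that any regex metacharacter would be matched
# literally, and compile the alternation a single time.
STOPWORDS_RE = re.compile(
    r"\s*\b(?:{})\b".format("|".join(re.escape(w) for w in sorted(STOPWORDS)))
)
MULTI_SPACE_RE = re.compile(r" {2,}")

def remove_stopwords(text):
    """Delete whole-word stopwords, then trim and collapse runs of spaces."""
    without = STOPWORDS_RE.sub("", text)
    return MULTI_SPACE_RE.sub(" ", without.strip())

print(remove_stopwords("how to free c++ memory vectorint  arr"))
```

The behavior should match the two-step version above; the main difference is that the pattern is built once instead of on each call.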

Answer 1 (score: 0)

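text = [word for word in nosymb if word not in STOPWORDS] iterates over nosymb character by character, so every character is treated as a separate item. Try this: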

text = ' '.join([word for word in nosymb.split() if word not in STOPWORDS])

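It first splits the text into a list of words, filters out the stopwords, and then joins the remaining words back into a single string.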

Here is the function:

def text_prepare(text):

    lower = text.lower() # lowercase text
    space_replaced = REPLACE_BY_SPACE_RE.sub(" ", lower) # replace REPLACE_BY_SPACE_RE symbols by space in text
    nosymb = BAD_SYMBOLS_RE.sub("", space_replaced) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join([word for word in nosymb.split() if word not in STOPWORDS]) # delete stopwords from text
    return text
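To check the split-and-join approach end to end without downloading the NLTK corpus, here is a self-contained sketch run against the two examples from the question. The four-word STOPWORDS set is a stand-in assumed for this check; swap in set(stopwords.words('english')) for real use.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Stand-in for set(stopwords.words('english')) so the sketch runs without NLTK.
STOPWORDS = {"how", "to", "any", "of"}

def text_prepare(text):
    """Lowercase, replace separators with spaces, drop bad symbols and stopwords."""
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(" ", text)
    text = BAD_SYMBOLS_RE.sub("", text)
    # split() discards the extra whitespace, so no separate trimming is needed.
    return " ".join(word for word in text.split() if word not in STOPWORDS)

examples = ["SQL Server - any equivalent of Excel's CHOOSE function?",
            "How to free c++ memory vector<int> * arr?"]
answers = ["sql server equivalent excels choose function",
           "free c++ memory vectorint arr"]

for ex, ans in zip(examples, answers):
    assert text_prepare(ex) == ans, "Wrong answer for the case: '%s'" % ex
print("Basic tests are passed.")
```

Because split() with no arguments already ignores leading, trailing, and repeated whitespace, this variant needs no extra strip or space-collapsing step.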

Answer 2 (score: -2)

def text_prepare(text):
    """
    text: a string

    return: modified initial string
    """
    text = text.lower() # lowercase text
    text_first = re.sub(REPLACE_BY_SPACE_RE, ' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text_second = re.sub(BAD_SYMBOLS_RE, '', text_first) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join([w for w in text_second.split() if w not in STOPWORDS]) # delete stopwords from text
    return text