Question

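I have about 6 million documents, and each one contains a large number of stop words that I need to remove.

The technique I learned is to remove them with a compiled re pattern. But now I am running into an OverflowError.

I build the pattern from my stop words like this: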

states_string = r'\b(' + '|'.join(states) + r')\b'
states_pattern = re.compile(states_string)

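states is obviously a list of strings, e.g. ['NY', 'CA', ...] <- I cannot paste the full list because it would exceed the post size limit!

The error I get is: OverflowError: regular expression code size limit exceeded.

Apparently the string I am compiling into a pattern is too long.

Does anyone have suggestions on how to handle this, or on another approach?

The only alternative I know is: [word for word in words if not word in stopwords], but this iterates over every word, so it is not ideal.

Note that the stop word list has 2500 entries.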

Answer 1

This appears to be a hard limit in the implementation of Python's regular expression engine:

~/py27 $ ack -C3 'regular expression code size'
Modules/_sre.c
2756-        if (value == (unsigned long)-1 && PyErr_Occurred()) {
2757-            if (PyErr_ExceptionMatches(PyExc_OverflowError)) {
2758-                PyErr_SetString(PyExc_OverflowError,
2759:                                "regular expression code size limit exceeded");
2760-            }
2761-            break;
2762-        }
2763-        self->code[i] = (SRE_CODE) value;
2764-        if ((unsigned long) self->code[i] != value) {
2765-            PyErr_SetString(PyExc_OverflowError,
2766:                            "regular expression code size limit exceeded");
2767-            break;
2768-        }
2769-    }

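To get around the limit, you may need an alternative engine. I suggest using Python to generate a sed script. Here is a rough idea to get you started: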

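stopwords = '''
the an of by
for but is why'''.split()

# Emit one sed command per stop word. Note that `/word/ d` deletes every
# line containing the word; it does not strip just the word itself.
print('#!/bin/sed -f')
for word in stopwords:
    print('/%s/ d' % word)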

Answer 2

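As far as I know, you have 3 options: split the pattern into smaller regexes, use something like a Python set, or shell out (to sed or awk). Let's assume you have a document full of words and a list of stop words, and you want a document containing the words minus the stop words.

Regex: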

import re

stopwords_regex_list = []
chunk_size = 100  # can tweak depending on size
for i in range(0, len(stopwords), chunk_size):
    stopwords_slice = stopwords[i:i + chunk_size]
    # Raw strings are required so that \b means "word boundary", not backspace.
    stopwords_regex_list.append(
        re.compile(r'\b(' + '|'.join(map(re.escape, stopwords_slice)) + r')\b'))

# Read and rewrite the document once, after all chunk patterns are compiled.
with open('document') as doc:
    words = doc.read()  # can read only a certain size if the files are massive
with open('regex_document', 'w') as regex_doc:
    for regex in stopwords_regex_list:
        words = regex.sub('', words)
    regex_doc.write(words)
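
For reuse across many documents, the chunking idea can be wrapped in two small functions so the patterns are compiled only once. This is a minimal sketch, not a tuned implementation: the function names and the whitespace cleanup at the end are assumptions added for illustration.

```python
import re


def compile_stopword_chunks(stopwords, chunk_size=100):
    """Compile one word-boundary pattern per chunk of stop words."""
    patterns = []
    for i in range(0, len(stopwords), chunk_size):
        chunk = stopwords[i:i + chunk_size]
        # re.escape guards against stop words containing regex metacharacters.
        alternation = '|'.join(re.escape(word) for word in chunk)
        patterns.append(re.compile(r'\b(?:' + alternation + r')\b'))
    return patterns


def remove_stopwords(text, patterns):
    """Apply every chunk pattern in turn, then collapse leftover whitespace."""
    for pattern in patterns:
        text = pattern.sub('', text)
    return ' '.join(text.split())


# A tiny chunk size is used here only to show that several patterns are built.
patterns = compile_stopword_chunks(['the', 'an', 'of', 'by', 'for'], chunk_size=2)
cleaned = remove_stopwords('the cat sat by the door of an old house', patterns)
print(cleaned)
```

Each chunk adds one full pass over the text, so a larger `chunk_size` means fewer passes; how large it can go before hitting the code size limit depends on the Python build.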

Set:

stopwords_set = set(stopwords)  # set membership tests are O(1) on average
with open('document') as doc:
    words = doc.read()
with open('set_document', 'w') as set_doc:
    for word in words.split(' '):
        if word not in stopwords_set:
            set_doc.write(word + ' ')

Sed:

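import subprocess

with open('document') as doc:
    with open('sed_script', 'w') as sed_script:
        # \< and \> are word-boundary anchors (a GNU sed extension);
        # the backslashes are doubled so Python passes them through literally.
        sed_script.writelines(['s/\\<{}\\>//g\n'.format(word) for word in stopwords])
    with open('sed_document', 'w') as sed_doc:
        subprocess.call(['sed', '-f', 'sed_script'], stdout=sed_doc, stdin=doc)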

I am no sed expert, so there may be a better way to do this. You may want to code up each approach and see which one works best for you.
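
One way to compare the in-Python approaches is a small `timeit` harness. The sketch below runs on synthetic data: the `w0` .. `w4999` vocabulary, the text length, and the function names are all made up for illustration, so the timings say nothing definitive about real documents.

```python
import random
import re
import timeit

random.seed(0)  # deterministic synthetic data
# Hypothetical vocabulary: 'w0' .. 'w4999'; the first 2500 act as stop words.
vocabulary = ['w%d' % i for i in range(5000)]
stopwords = vocabulary[:2500]
text = ' '.join(random.choice(vocabulary) for _ in range(2000))

stopword_set = set(stopwords)
chunk_patterns = [
    re.compile(r'\b(?:' + '|'.join(map(re.escape, stopwords[i:i + 100])) + r')\b')
    for i in range(0, len(stopwords), 100)
]


def filter_with_set(doc):
    """Keep only whitespace-separated tokens that are not stop words."""
    return ' '.join(w for w in doc.split() if w not in stopword_set)


def filter_with_regex(doc):
    """Blank out stop words chunk by chunk, then collapse whitespace."""
    for pattern in chunk_patterns:
        doc = pattern.sub('', doc)
    return ' '.join(doc.split())


for func in (filter_with_set, filter_with_regex):
    seconds = timeit.timeit(lambda: func(text), number=3)
    print('%-18s %.3f s for 3 runs' % (func.__name__, seconds))
```

Both functions return the same result on this input, so the comparison is like for like. On real text the two differ around punctuation: the set version only matches whole whitespace-separated tokens, while the regex version matches at word boundaries.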

Answer 3

I ran the following, and it worked fine:

>>> states = ['AL', 'AK', 'AS', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'DC', 'FM', 'FL', 'GA', 'GU', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MH', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'MP', 'OH', 'OK', 'OR', 'PW', 'PA', 'PR', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VI', 'VA', 'WA', 'WV', 'WI', 'WY', 'AE', 'AA', 'AP']
>>> states_string = r'\b(' + '|'.join(states) + r')\b'
>>> states_pattern = re.compile(states_string)
>>> states_pattern
<_sre.SRE_Pattern object at 0x00000000034D3C40>

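This is the best I can do with the information you provided. Please post the whole array in your question; otherwise we cannot know whether you used anything other than this array of 50 state codes to generate the list.

PS: Credit where credit is due: the array I use here is largely based on this gist comment.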

Removing stop words with Python - fast and efficient
