Question

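I'm trying to loop through a bunch of documents, and I need to put each word into a list for that document. This is how I'm doing it. stoplist is just a list of words that I want to ignore by default.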

texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

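What I get back is a list of documents, and each of those is a list of words. Some of the words still contain trailing punctuation or other anomalies. I thought I could do this, but it doesn't seem to work: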

texts = [[word.rstrip() for word in document.lower().split() if word not in stoplist]
         for document in documents]

or

texts = [[word.rstrip('.,:!?:') for word in document.lower().split() if word not in stoplist]
         for document in documents]

My other problem is this. I may see words like the following, where I want to keep the word but drop the trailing numbers/special characters.

agency[15]
assignment[72],
you&#8217;ll
america&#8217;s

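So, to clean up most of the other noise, I think I should keep removing characters from the end of a string until the last character is a-zA-Z, or, if the string has more special characters than alpha characters, toss it. You can see, though, that in my last two examples the string ends with an alpha character. So in those cases I should just ignore the word because of the number of special characters (more than alpha characters). I think I should only search the end of the strings, because I would like to keep hyphenated words intact where possible.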

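Basically, I want to remove all trailing punctuation from each word, and possibly add a subroutine that handles the cases I just described. I'm not sure how to do that, or whether it's the best approach.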

Answer 1

>>> a = ['agency[15]','assignment72,','you&#8217;11','america&#8217;s']
>>> import re
>>> b = re.compile(r'\w+')
>>> for item in a:
...     print(b.search(item).group(0))
...
agency
assignment72
you
america
>>> b = re.compile('[a-z]+')
>>> for item in a:
...     print(b.search(item).group(0))
...
agency
assignment
you
america
>>>
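
One caveat with the loops above: search returns None when an item contains no match at all (for example a token that is only brackets and digits), so calling .group(0) on it raises AttributeError. A minimal sketch with a guard, where first_word is a hypothetical helper name and not part of the original answer:

```python
import re

letters = re.compile(r'[a-z]+')

def first_word(item):
    """Return the first run of lowercase letters in item, or None if there is none."""
    match = letters.search(item)
    return match.group(0) if match else None

results = [first_word(x) for x in ['agency[15]', '[72],', 'america&#8217;s']]
print(results)
```

Items that come back as None can then be filtered out before building the word list.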

Update

>>> a = "I-have-hyphens-yo!"
>>> re.findall('[a-z]+',a)
['have', 'hyphens', 'yo']
>>> re.findall('[a-z-]+',a)
['-have-hyphens-yo']
>>> re.findall('[a-zA-Z-]+',a)
['I-have-hyphens-yo']
>>> re.findall(r'\w+',a)
['I', 'have', 'hyphens', 'yo']
>>>
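
As for why the rstrip attempts in the question fall short: rstrip() with no argument only strips whitespace, and rstrip('.,:!?:') only strips the characters listed, so the brackets and digits in agency[15] are left alone. If the goal is specifically to drop everything after the last letter while keeping interior hyphens, an anchored re.sub is one option. This is only a sketch; TRAILING_JUNK and strip_trailing are hypothetical names, and it assumes ASCII letters only:

```python
import re

# Assumed rule: anything that is not an ASCII letter, at the very end of the word.
TRAILING_JUNK = re.compile(r'[^a-zA-Z]+$')

def strip_trailing(word):
    """Remove trailing non-letter characters; interior characters are untouched."""
    return TRAILING_JUNK.sub('', word)

words = ['agency[15]', 'assignment[72],', 'well-known!', 'you&#8217;ll']
cleaned = [strip_trailing(w) for w in words]
print(cleaned)
```

Note that a word ending in a letter, such as you&#8217;ll, passes through unchanged, and a token with no letters at all becomes an empty string, so you would still want to filter those out.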

Answer 2

Maybe try re.findall instead, with a pattern like [a-z]+:

import re
word_re = re.compile(r'[a-z]+')
texts = [[match.group(0) for match in word_re.finditer(document.lower()) if match.group(0) not in stoplist]
          for document in documents]

texts = [[word for word in word_re.findall(document.lower()) if word not in stoplist]
          for document in documents]

Then you can easily tweak the regular expression to get the words you want. An alternative version uses re.split:

import re
word_re = re.compile(r'[^a-z]+')
texts = [[word for word in word_re.split(document.lower()) if word and word not in stoplist]
          for document in documents]
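
For the entity-encoded words in the question (you&#8217;ll, america&#8217;s), the [a-z]+ patterns above split them into pieces. One option, not part of the original answer, is to decode the HTML entities before tokenizing. A rough sketch, assuming Python 3.4+ (for html.unescape) and an assumed pattern that allows a hyphen or apostrophe between runs of letters; tokenize is a hypothetical helper name:

```python
import html
import re

# Assumed pattern: runs of lowercase letters, optionally joined by a single
# hyphen, straight apostrophe, or right single quotation mark (U+2019).
word_re = re.compile("[a-z]+(?:[-'\u2019][a-z]+)*")

def tokenize(document, stoplist=()):
    """Decode HTML entities, lowercase, and return the words not in stoplist."""
    text = html.unescape(document).lower()
    return [word for word in word_re.findall(text) if word not in stoplist]

sample = "America&#8217;s agency[15] is well-known, you&#8217;ll see."
tokens = tokenize(sample, stoplist={'is'})
print(tokens)
```

Because the pattern only matches letters at the start and end of each word, trailing digits, brackets, and punctuation are dropped without any rstrip call.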

How can I remove trailing characters with rstrip?
