
Question

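I'm very new to programming and regular expressions, so I apologize if this has been asked before (I couldn't find it, though).

I want to use Python to tally word frequencies in a body of text. Let's assume the text is formatted like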

Chapter 1
blah blah blah

Chapter 2
blah blah blah
....

Now I read the text in as a string, and I want to use re.findall to get every word in it, so my code is

wordlist = re.findall(r'\b\w+\b', text)

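The problem is that it also matches the word Chapter (and the number) in every chapter heading, which I don't want to include in my stats. So I want to ignore whatever matches Chapter\s*\d+. How can I do that?

Thanks in advance, guys.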

Answer 1

Solution

You can first remove every occurrence of Chapter + whitespace + digits:

wordlist = re.findall(r'\b\w+\b', re.sub(r'Chapter\s*\d+\s*','',text))
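As a rough sketch of how this two-step approach could feed the frequency count the question asks for, the resulting list can be handed to collections.Counter. The sample text and the names body and frequencies are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical sample text in the format described in the question.
text = "Chapter 1\nblah blah foo\n\nChapter 2\nfoo bar\n"

# Step 1: strip the chapter headings. Step 2: collect the remaining words.
body = re.sub(r'Chapter\s*\d+\s*', '', text)
wordlist = re.findall(r'\b\w+\b', body)

# Tally how often each word occurs.
frequencies = Counter(wordlist)
print(frequencies.most_common(2))
```

Counter is only one convenient way to do the tally; a plain dict would work just as well.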

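If you want to use just one search, you can use a negative lookahead to find any word that does not begin a "Chapter X" heading and does not start with a digit: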

wordlist = re.findall(r'\b(?!Chapter\s+\d+)[A-Za-z]\w*\b',text)
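To see what the lookahead does and does not exclude, here is a small sketch on a made-up sample. Note that, unlike the two-step version, this pattern also drops standalone numbers, while the word Chapter is kept whenever it is not followed by digits.

```python
import re

# Hypothetical sample: a heading, then a sentence that uses "Chapter"
# as an ordinary word and also contains a bare number.
text = "Chapter 12\nThe last Chapter was 3 pages long\n"

pattern = re.compile(r'\b(?!Chapter\s+\d+)[A-Za-z]\w*\b')
wordlist = pattern.findall(text)
print(wordlist)
```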

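If performance is a concern, loading a huge string and parsing it with a regex isn't the right approach anyway. Just read the file line by line, discard any line that matches r'^Chapter\s*\d+', and parse each remaining line separately with r'\b\w+\b':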

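import re

# Read the file into a list of lines (this still loads everything into memory).
with open("huge_file.txt", "r") as f:
    lines = f.readlines()

wordlist = []
chapter = re.compile(r'^Chapter\s*\d+')
words = re.compile(r'\b\w+\b')
for line in lines:
    # Skip chapter headings; collect the words of every other line.
    if not chapter.match(line):
        wordlist.extend(words.findall(line))

print(len(wordlist))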

Performance

I wrote a small Ruby script to generate a huge file:

all_dicts = Dir["/usr/share/dict/*"].map{|dict|
  File.readlines(dict)
}.flatten

File.open('huge_file.txt','w+') do |txt|
  newline=true
  txt.puts "Chapter #{rand(1000)}"
  50_000_000.times do
    if rand<0.05
      txt.puts
      txt.puts
      txt.puts "Chapter #{rand(1000)}"
      newline = true
    end
    txt.write " " unless newline
    newline = false
    txt.write all_dicts.sample.chomp
    if rand<0.10
      txt.puts
      newline = true
    end
  end
end

The resulting file contains more than 50 million words and is about 483MB:

Chapter 154
schoolyard trashcan's holly's continuations

Chapter 814
assure sect's Trippe's bisexuality inexperience
Dumbledore's cafeteria's rubdown hamlet Xi'an guillotine tract concave afflicts amenity hurriedly whistled
Carranza
loudest cloudburst's

Chapter 142
spender's
vests
Ladoga

Chapter 896
petition's Vijayawada Lila faucets
addendum Monticello swiftness's plunder's outrage Lenny tractor figure astrakhan etiology's
coffeehouse erroneously Max platinum's catbird succumbed nonetheless Nissan Yankees solicitor turmeric's regenerate foulness firefight
spyglass
disembarkation athletics drumsticks Dewey's clematises tightness tepid kaleidoscope Sadducee Cheerios's

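The two-step process took 12.2 seconds on average to extract the word list, the lookahead method took 13.5 seconds, and Wiktor's answer also took 13.5 seconds. The lookahead method I first wrote used re.IGNORECASE, and it took about 18 seconds.

When the whole file is read at once, there is basically no difference in performance between the regex methods.

What surprised me is that the readlines script took about 20.5 seconds and didn't use much less memory than the other scripts. If you have any idea how to improve the script, please comment!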
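One possible direction, offered as a sketch that has not been benchmarked here: readlines() builds a list of every line before the loop starts, so the whole file still ends up in memory. Iterating over the file object yields one line at a time, and tallying into a collections.Counter avoids keeping a list with tens of millions of words. The function name count_words and the tiny temporary file standing in for huge_file.txt are assumptions for illustration.

```python
import os
import re
import tempfile
from collections import Counter

CHAPTER = re.compile(r'^Chapter\s*\d+')
WORDS = re.compile(r'\b\w+\b')

def count_words(path):
    """Tally word frequencies line by line, skipping chapter headings."""
    counts = Counter()
    with open(path, "r") as handle:
        for line in handle:  # the file object yields lines lazily
            if not CHAPTER.match(line):
                counts.update(WORDS.findall(line))
    return counts

# Small stand-in for huge_file.txt so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Chapter 1\nblah blah foo\n\nChapter 2\nfoo bar\n")
    demo_path = tmp.name

counts = count_words(demo_path)
os.remove(demo_path)
print(sum(counts.values()), "words,", len(counts), "distinct")
```

Whether this is actually faster than the whole-file regex versions would need to be measured on the real file; the main expected gain is lower memory use.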

Answer 2

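Match what you do not need and capture what you need, and use this technique with re.findall, which returns only the captured values:

re.findall(r'\bChapter\s*\d+\b|\b(\w+)\b', s)

Details:

\bChapter\s*\d+\b - Chapter as a whole word, followed by 0+ whitespace characters and 1+ digits
| - or
\b(\w+)\b - match and capture into Group 1 one or more word characters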

To avoid getting empty values in the resulting list, filter them out.
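A minimal sketch of that filtering step, assuming a list comprehension is acceptable (the answer's own filtering snippet is not shown here, and the sample string s is made up): every chapter heading matches the first alternative, so group 1 is empty for it and re.findall returns an empty string in its place.

```python
import re

# Hypothetical sample text in the format described in the question.
s = "Chapter 1\nblah blah blah\n\nChapter 2\nfoo bar\n"

matches = re.findall(r'\bChapter\s*\d+\b|\b(\w+)\b', s)
# matches holds '' wherever a chapter heading was matched instead of a word.
wordlist = [word for word in matches if word]
print(wordlist)
```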

How to make an exception for certain words in a regular expression
