Question

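I have been playing around with a simple program that reads text and identifies key words whose first letter is capitalized. The problem is that the program does not strip punctuation from the words: "Frodo", "Frodo." and "Frodo," come up as different entries rather than the same one. I tried importing string and using string.punctuation, but it did not work.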

Below is my code. The text I am using is from http://www.angelfire.com/rings/theroaddownloads/fotr.pdf (copied into a text file named novel.txt). Thanks again.

by_word = {}
with open ('novel.txt') as f:
  for line in f:
    for word in line.strip().split():
      if word[0].isupper():
        if word in by_word:
          by_word[word] += 1
        else:
          by_word[word] = 1

by_count = []
for word in by_word:
  by_count.append((by_word[word], word))

by_count.sort()
by_count.reverse()

for count, word in by_count[:100]:
  print(count, word)

Answer 1

Hope the following helps:

import string
exclude = set(string.punctuation)

by_word = {}
with open ('novel.txt') as f:
  for line in f:
    for word in line.strip().split():
      if word[0].isupper():
        word = ''.join(char for char in word if char not in exclude)
        if word in by_word:
          by_word[word] += 1
        else:
          by_word[word] = 1

by_count = []
for word in by_word:
  by_count.append((by_word[word], word))

by_count.sort()
by_count.reverse()

for count, word in by_count[:100]:
  print(count, word)

It removes every character in string.punctuation from word, that is, all of:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
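
A more compact variant of the same idea, offered as a sketch rather than a drop-in replacement: it strips punctuation before the capital-letter test, so a token with a leading quote such as "Frodo is counted too, and it lets collections.Counter do the tallying. The function name count_capitalized and the sample lines are assumptions made for illustration; they are not part of the original code.

```python
import string
from collections import Counter

# Translation table that deletes every character in string.punctuation.
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def count_capitalized(lines):
    """Count words that start with an uppercase letter, ignoring punctuation."""
    counts = Counter()
    for line in lines:
        for raw in line.split():
            word = raw.translate(_STRIP_PUNCT)
            # A token such as "--" becomes empty, so guard before indexing.
            if word and word[0].isupper():
                counts[word] += 1
    return counts

# Hypothetical sample input; for the real text, pass the open file object instead.
sample = ['Frodo looked up. "Frodo," said Sam.', 'Where is Frodo?']
for word, count in count_capitalized(sample).most_common(100):
    print(count, word)
```

With the real file this would be called as count_capitalized(f) inside the with open('novel.txt') as f: block. One trade-off to be aware of: because apostrophes are in string.punctuation, a possessive such as Frodo's is counted as Frodos.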

Answer 2

Your code is fine. To remove the punctuation, split with a regular expression:

for word in line.strip().split():

can be changed to

for word in re.split('[,.;]',line.strip()):

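Here the [] in the first argument lists the punctuation characters to split on. This uses re.split from the re module (https://docs.python.org/2/library/re.html#re.split), so the script also needs import re at the top.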
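
A caveat and a possible workaround, as a sketch only: splitting on [,.;] alone leaves the spaces inside each piece, and it can produce empty strings (for example after a trailing period), on which word[0] raises IndexError. One way around both, assuming a word is a run of ASCII letters optionally joined by apostrophes or hyphens, is to match the words with re.findall instead of splitting on the separators. The pattern WORD_RE and the function capitalized_words are illustrative names, not part of the original answer.

```python
import re
from collections import Counter

# Assumed definition of a word: ASCII letters, optionally joined by ' or -.
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

def capitalized_words(line):
    """Return the words in line that start with an uppercase letter."""
    return [w for w in WORD_RE.findall(line) if w[0].isupper()]

counts = Counter()
# Hypothetical sample lines standing in for the lines of novel.txt.
for line in ['Frodo, Frodo. Frodo; said Sam.', "Bilbo's ring"]:
    counts.update(capitalized_words(line))

for word, count in counts.most_common(100):
    print(count, word)
```

Because the pattern describes what a word is rather than what separates words, punctuation that was never listed (quotes, question marks, parentheses) is skipped as well, and possessives such as Bilbo's stay intact.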

Python - keyword reading program, unable to remove punctuation
