Question

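I am trying to write code that takes the text of a novel and turns it into a dictionary in which each key is a unique word and each value is the number of times that word occurs in the text.
For example, it might look like: {'girl': 59, ... etc.}

I have been trying to put the text into a list and then use Counter to build a dictionary of all the words: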

    source = open('novel.html', 'r', encoding = "UTF-8")
    soup = BeautifulSoup(source, 'html.parser')
    #make a list of all the words in file, get rid of words that aren't content
    mylist = []
    mylist.append(soup.find_all('p'))
    newlist = filter(None, mylist)
    cnt = collections.Counter()
    for line in newlist:
         try:
           if line is not None:        
               words = line.split(" ")
               for word in line:
                cnt[word] += 1
         except:
           pass
    print(cnt)

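This code does not work: it either fails with a "NoneType" error or prints only an empty list. Is there a simpler way to do what I am attempting, or a way to fix this code so that the error does not occur?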

Answer 1

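For the Counter, just do:

    from collections import Counter
    # mylist must be a flat list of words (strings), not a list holding another list
    cnt = Counter(mylist)

Are you sure your list ever receives any items in the first place? After which step do you end up with an empty list?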
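As a minimal sketch of what Counter expects (the sample sentence below is made up for illustration and is not from the novel): it tallies hashable items from an iterable, so it should be handed a flat list of word strings.

```python
from collections import Counter

# Hypothetical sample text standing in for the novel's contents.
text = "the girl saw the dog and the dog saw the girl"

words = text.split()        # flat list of word strings
cnt = Counter(words)        # maps each unique word to its count

print(dict(cnt))            # a plain dict, if that is what is needed
print(cnt.most_common(2))   # the two most frequent words with their counts
```

A Counter is a dict subclass, so it can usually be used directly wherever a dictionary of counts is wanted.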

Answer 2

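    import collections
    from bs4 import BeautifulSoup

    # Open the file with a context manager so it is closed automatically
    with open('novel.html', 'r', encoding='UTF-8') as source:
        soup = BeautifulSoup(source, 'html.parser')

    cnt = collections.Counter()
    for tag in soup.find_all('p'):
        # get_text() returns the paragraph's text even when it contains nested tags
        for word in tag.get_text().split():
            # Lowercase the word and keep only letters and digits
            word = ''.join(ch for ch in word.lower() if ch.isalnum())
            if word != '':
                cnt[word] += 1

    print(cnt)

The with statement is simply a safer way to open the file, because the file is closed automatically when the block ends.

soup.find_all returns a list of Tag objects.

tag.get_text().split() gets all the words (separated by whitespace) in the Tag. It is used here rather than tag.string, because tag.string is None whenever a <p> contains more than one child (for example nested <i> tags), and calling .split() on None raises an error.

word = ''.join(ch for ch in word.lower() if ch.isalnum()) removes punctuation and converts to lowercase, so that 'Hello' and 'hello!' count as the same word.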
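To see the normalization step in isolation, here is a small sketch that runs it on plain strings. The helper name normalize and the sample paragraphs are assumptions made for illustration; the strings stand in for the text of each <p> tag.

```python
from collections import Counter

def normalize(word):
    """Lowercase a word and drop every non-alphanumeric character."""
    return ''.join(ch for ch in word.lower() if ch.isalnum())

# Hypothetical paragraph texts, standing in for the text of each <p> tag.
paragraphs = ["Hello, world!", "hello again... World?"]

cnt = Counter()
for text in paragraphs:
    for raw in text.split():
        word = normalize(raw)
        if word:                # skip tokens that were only punctuation
            cnt[word] += 1

print(cnt)
```

One side effect worth knowing about: apostrophes are dropped too, so "don't" is counted as "dont". Whether that matters depends on what the counts are used for.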

Answer 3

Once you have converted the page into a list, try the following:

    # create dictionary and fake list
    d = {}
    x = ["hi", "hi", "hello", "hey", "hi", "hello", "hey", "hi"]

    # count the times a unique word occurs and add that pair to your dictionary
    for word in set(x):
        count = x.count(word)
        d[word] = count

Output:

{'hello': 2, 'hey': 2, 'hi': 4}
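For comparison, a short sketch that puts the loop above next to collections.Counter on the same sample list. Both produce the same mapping. The difference is that x.count(word) rescans the whole list once per unique word, which may become slow for a full novel, whereas Counter tallies in a single pass.

```python
from collections import Counter

x = ["hi", "hi", "hello", "hey", "hi", "hello", "hey", "hi"]

# Approach from the answer: one full scan of the list per unique word.
d = {}
for word in set(x):
    d[word] = x.count(word)

# Alternative: Counter tallies every word in a single pass over the list.
c = dict(Counter(x))

print(d == c)   # both approaches yield the same word counts
```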
