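I'm trying to write code that takes the text of a novel and turns it into a dictionary, where each key is a unique word and each value is the number of times that word occurs in the text.
For example, it might look like: {'girl': 59, ...}
I have been trying to put the text into a list and then use Counter to build a dictionary of all the words: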
source = open('novel.html', 'r', encoding = "UTF-8")
soup = BeautifulSoup(source, 'html.parser')
#make a list of all the words in file, get rid of words that aren't content
mylist = []
mylist.append(soup.find_all('p'))
newlist = filter(None, mylist)
cnt = collections.Counter()
for line in newlist:
    try:
        if line is not None:
            words = line.split(" ")
            for word in line:
                cnt[word] += 1
    except:
        pass
print(cnt)
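This code doesn't work: it either fails with a "NoneType" error or just prints an empty list. Is there a simpler way to do what I'm trying to do, or a way to fix this code so that it no longer raises the error?
Answer 0 (score: 1)
For the Counter, just do: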
from collections import Counter
cnt = Counter(mylist)
Are you sure your list is receiving any items to begin with? After which step do you get an empty list?
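Counter counts the individual items of whatever iterable it is given, so it needs a flat list of word strings. In the question's code, mylist holds a single item (the whole result of find_all) rather than separate words. A minimal sketch of the flattening step, assuming the paragraph text has already been extracted as plain strings (the sample paragraphs list is made up for illustration):

```python
from collections import Counter

# Hypothetical sample text standing in for the novel's paragraphs.
paragraphs = ["the girl saw the dog", "the dog ran"]

# Flatten the paragraphs into one list of word strings.
words = [word for paragraph in paragraphs for word in paragraph.split()]

# Counter tallies each distinct word in a single pass.
cnt = Counter(words)
print(cnt.most_common(2))  # [('the', 3), ('dog', 2)]
```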
Answer 1 (score: 1)
import collections
from bs4 import BeautifulSoup
with open('novel.html', 'r', encoding='UTF-8') as source:
    soup = BeautifulSoup(source, 'html.parser')
cnt = collections.Counter()
for tag in soup.find_all('p'):
    for word in tag.string.split():
        word = ''.join(ch for ch in word.lower() if ch.isalnum())
        if word != '':
            cnt[word] += 1
print(cnt)
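The with statement is just a safer way to open the file.
soup.find_all returns a list of Tag objects.
tag.string.split() takes the text of the Tag and splits it into words on whitespace.
word = ''.join(ch for ch in word.lower() if ch.isalnum()) removes punctuation and converts to lowercase, so that 'Hello' and 'hello!' count as the same word.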
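One caveat: tag.string is None whenever a p element contains more than one child (for example, text mixed with an i or b tag), which would reproduce the "NoneType" error from the question. A possible variant, sketched here with tag.get_text() in place of tag.string; the helper names normalize, count_words, and count_words_in_html are made up for illustration:

```python
import collections


def normalize(word):
    """Lowercase a word and drop every non-alphanumeric character."""
    return ''.join(ch for ch in word.lower() if ch.isalnum())


def count_words(texts):
    """Count normalized words across an iterable of plain-text strings."""
    cnt = collections.Counter()
    for text in texts:
        for word in text.split():
            word = normalize(word)
            if word:  # skip tokens that were pure punctuation
                cnt[word] += 1
    return cnt


def count_words_in_html(path):
    """Count words in every <p> of an HTML file (hypothetical helper)."""
    # Imported here so count_words() stays usable without bs4 installed.
    from bs4 import BeautifulSoup

    with open(path, 'r', encoding='UTF-8') as source:
        soup = BeautifulSoup(source, 'html.parser')
    # get_text() returns a str even when the <p> has nested tags,
    # whereas tag.string is None in that case.
    return count_words(tag.get_text() for tag in soup.find_all('p'))
```

With the file from the question, this would be called as count_words_in_html('novel.html').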
Answer 2 (score: 0)
Once you have converted the page into a list, try the following:
#create dictionary and fake list
d = {}
x = ["hi", "hi", "hello", "hey", "hi", "hello", "hey", "hi"]
#count the times a unique word occurs and add that pair to your dictionary
for word in set(x):
    count = x.count(word)
    d[word] = count
Output:
{'hello': 2, 'hey': 2, 'hi': 4}
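The loop above calls x.count once per unique word, so the whole list is rescanned each time. That is fine for a short list but may be slow for a full novel. A single-pass alternative, sketched with the same sample list and a plain dict:

```python
x = ["hi", "hi", "hello", "hey", "hi", "hello", "hey", "hi"]

d = {}
for word in x:
    # dict.get returns 0 the first time a word is seen.
    d[word] = d.get(word, 0) + 1

print(d)  # {'hi': 4, 'hello': 2, 'hey': 2}
```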