I need to extract a list of words with no duplicates, so that I can count how many times each individual word occurs.
import nltk
import lxml
import bs4
import requests
from nltk.tokenize import word_tokenize, sent_tokenize
wSite="https://www.marxists.org/subject/art/literature/children/texts/orwell/animal-farm/ch01.htm"
page=requests.get(wSite).content
soup = bs4.BeautifulSoup(page, "lxml")
z=soup.find_all("p")
container=""
for i in z:
    txt=i.text
    if (txt[1]=='"'):
        container=container+txt
y=container
a=[]
a=y.split()
b=str(a)
Answer 0 (score: 0)
I used spaCy to tokenize the text.
First install spaCy and the spaCy model we will use:
pip install spacy
python -m spacy download en_core_web_sm
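This is fairly straightforward. We fetch the web page, concatenate all the text inside the <p> elements (skipping the header and footer), let spaCy do its thing, remove the non-word tokens, and finally hand the result to a Counter.
The word counts end up in the counts variable. The print calls show the different ways to read from it.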
import requests
import bs4
import spacy
from collections import Counter
url = "https://www.marxists.org/subject/art/literature/children/texts/orwell/animal-farm/ch01.htm"
page_content = requests.get(url).content
soup = bs4.BeautifulSoup(page_content, "lxml")
text = ""
for paragraph in soup.find_all("p"):
    # We probably don't want text within the header and footer paragraphs
    if paragraph.attrs.get("class", (None,))[0] in ("title", "footer"):
        continue
    text += paragraph.get_text().lower()  # It's best to keep everything in one case
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
# Not all tokens are words, so we exclude some
words = tuple(token.text for token in doc if not (token.is_punct or token.is_space or
                                                  token.is_quote or token.is_bracket))
counts = Counter(words)
print("Word count:", len(words)) # Or sum(counts.values())
print("Unique word count:", len(counts))
print("15 most common words:")
for i, (word, count) in enumerate(counts.most_common(15), start=1):
    print(f"{i: >2}. {count: >3} - {word}")
print("The word 'animal' occurs:", counts["animal"])
print("The word 'python' occurs:", counts["python"])
print("All words and their count:")
for word, count in counts.items():
    print(f"{count}, {word}")
Output:
Word count: 2704
Unique word count: 849
15 most common words:
1. 169 - the
2. 98 - and
3. 93 - of
4. 59 - to
5. 51 - a
6. 44 - in
7. 44 - that
8. 42 - it
9. 34 - i
10. 34 - is
11. 33 - was
12. 31 - had
13. 31 - he
14. 27 - you
15. 24 - all
The word 'animal' occurs: 11
The word 'python' occurs: 0
All words and their count:
4, mr
8, jones
93, of
[...]
1, birds
1, jumped
1, perches
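If you only want the deduplicated word list and the per-word counts, and don't need spaCy's tokenizer, the same Counter idea works with the standard library alone. This is a minimal sketch, not the method used above: the regular expression, the count_words helper and the sample sentence are all made up for illustration. Because a regex splits text differently from spaCy (contractions, for example), the numbers will not match the output shown above.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, optionally followed by an
# apostrophe and more letters (e.g. "don't"). spaCy tokenizes differently.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_words(text):
    """Hypothetical helper: map each lowercased word to its occurrence count."""
    return Counter(WORD_RE.findall(text.lower()))


# Made-up sample text standing in for the scraped page content.
sample = "All animals are equal. All men are enemies; all animals are comrades."

counts = count_words(sample)

# The keys of a Counter are already unique and keep first-seen order
# (Python 3.7+), so this is the list of words without duplicates.
unique_words = list(counts)

print(unique_words)
print("'all' occurs:", counts["all"])
print("'python' occurs:", counts["python"])  # missing words count as 0
```

To run it on the real page, pass the text variable built from the <p> elements to count_words instead of sample.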