Question

I need to extract a list of words with no duplicates, so that I can count how many times each individual word occurs.

import nltk
import lxml
import bs4
import requests
from nltk.tokenize import word_tokenize, sent_tokenize
wSite="https://www.marxists.org/subject/art/literature/children/texts/orwell/animal-farm/ch01.htm"
page=requests.get(wSite).content
soup = bs4.BeautifulSoup(page, "lxml")
z=soup.find_all("p")

container=""
for i in z:
    txt=i.text

    if (txt[1]=='"'):
        container=container+txt
y=container
a=[]
a=y.split()
b=str(a)

Answer 1

I used spaCy to tokenize the text.

First, install spaCy and the spaCy model we'll be using:

pip install spacy
python -m spacy download en_core_web_sm

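It's pretty simple. We fetch the web page, join all the text inside the <p> elements (skipping the header and footer), let spaCy do its thing, drop the tokens that aren't words, and finally feed what's left to a Counter.

The word counts end up in counts. Look at the print calls to see the different ways to access counts.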

import requests
import bs4
import spacy
from collections import Counter

url = "https://www.marxists.org/subject/art/literature/children/texts/orwell/animal-farm/ch01.htm"

page_content = requests.get(url).content
soup = bs4.BeautifulSoup(page_content, "lxml")
text = ""
for paragraph in soup.find_all("p"):
    # We probably don't want text within the header and footer paragraphs
    if paragraph.attrs.get("class", (None,))[0] in ("title", "footer"):
        continue
    text += paragraph.get_text().lower() # It's best to keep things in one case

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
# Not all tokens are words, so we exclude some
words = tuple(token.text for token in doc if not (token.is_punct or token.is_space or
                                                 token.is_quote or token.is_bracket))
counts = Counter(words)

print("Word count:", len(words)) # Or sum(counts.values())
print("Unique word count:", len(counts))
print("15 most common words:")
for i, (word, count) in enumerate(counts.most_common(15), start=1):
    print(f"{i: >2}. {count: >3} - {word}")

print("The word 'animal' occurs:", counts["animal"])
print("The word 'python' occurs:", counts["python"])
print("All words and their count:")
for word, count in counts.items():
    print(f"{count}, {word}")

Output:

Word count: 2704
Unique word count: 849
15 most common words:
 1. 169 - the
 2.  98 - and
 3.  93 - of
 4.  59 - to
 5.  51 - a
 6.  44 - in
 7.  44 - that
 8.  42 - it
 9.  34 - i
10.  34 - is
11.  33 - was
12.  31 - had
13.  31 - he
14.  27 - you
15.  24 - all
The word 'animal' occurs: 11
The word 'python' occurs: 0
All words and their count:
4, mr
8, jones
93, of
[...]
1, birds
1, jumped
1, perches
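
If you only want the list of unique words and don't want to install spaCy, here is a rough standard-library sketch of the same idea. The sample text is made up for illustration and stands in for the scraped page text. The regex is a crude stand-in for a real tokenizer, so expect it to split differently from spaCy on contractions, hyphens and numbers.

```python
import re
from collections import Counter

# Made-up sample text standing in for the scraped page content.
text = "The farm was quiet. The animals slept, and the farm dog kept watch."

# Lowercase first so "The" and "the" count as the same word, then grab
# runs of letters and apostrophes as a rough approximation of words.
words = re.findall(r"[a-z']+", text.lower())

counts = Counter(words)

# Counter is a dict subclass, so its keys are the unique words,
# in order of first appearance.
unique_words = list(counts)

print("Unique words:", unique_words)
print("'the' occurs:", counts["the"])
print("'farm' occurs:", counts["farm"])
```

Use sorted(counts) instead of list(counts) if you'd rather have the unique words alphabetically. set(words) also removes duplicates, but it loses both the order and the counts.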
