I am building a web crawler.
More precisely, it indexes a main page (i.e. collects all of its words), finds all the links on that page, and then looks for the main page's words in the extracted links.
My problem is in the function indexer(): while fetching each secondary page, I check that a word is not in the stop list (pronouns, articles, etc.) and that it appears on that secondary page, so that for every word of the main page I can record which secondary pages contain it.
Here is what I have so far:
import re
import requests
from bs4 import BeautifulSoup

def extract(links):
    page = requests.get(links).text
    soup = BeautifulSoup(page)
    for link in soup.find_all('a', href=True):
        print(link['href'])

def clean_html(page):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', page)
    return cleantext

def indexer(dex, words, url):
    for x in url:
        x = requests.get(x).text
        x = clean_html(x)
        x = x.lower()
        x = x.split()
        for word in words:
            word = word.lower()
            word = clean(word)
            if word not in stoplist:
                if word in x:
                    # print(x)  THE PROBLEM: I'm trying to retrieve the secondary pages, but I get only the last link (36 times)
                    add(dex, word, url)

def add(dex, word, url):
    try:
        dex[word].append(url)
    except KeyError:
        dex[word] = [url]

def main(url, idx):
    list_urls = extract(url)
    main_page = requests.get(url).text
    main_page = clean_html(main_page)
    main_page = main_page.split()
    idx = {}
    indexer(idx, main_page, list_urls)
    prd(idx)

def prd(d):
    for c in sorted(d):
        print('\t', c, ':', d[c])
stoplist = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
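One thing to note: extract() only prints the links and returns nothing, so list_urls in main() ends up as None. A minimal sketch of a variant that returns the hrefs instead (the name extract_links and the explicit 'html.parser' argument are my own choices, not from the original code):

```python
from bs4 import BeautifulSoup

def extract_links(html):
    # Hypothetical variant of extract(): collect the hrefs in a list
    # and return them so the caller can iterate over them.
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)]

print(extract_links('<a href="url1">x</a> <a href="url2">y</a>'))
# → ['url1', 'url2']
```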
I want to find the words in the secondary pages, and I would like the output to look like this:
Word1 : [url1, url2]
Word2 : [url1, url3, ...]
...
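For reference, a minimal sketch of how such per-word URL lists could be built. It assumes the symptom comes from indexer(): the loop variable x is overwritten by the page text, so the original link is lost, and add(dex, word, url) appends the whole url list instead of the current link. The hypothetical index_pages below keeps the link in its own variable and takes a {url: html} dict so it can be tested without network access:

```python
import re

STOPLIST = {'i', 'me', 'my', 'myself', 'we', 'our', 'ours',
            'ourselves', 'you', "you're"}

def clean_html(page):
    # Strip tags the same way as the original clean_html().
    return re.sub(re.compile('<.*?>'), '', page)

def index_pages(words, pages):
    """Map each main-page word to the URLs of the pages containing it.

    `pages` is a dict {url: raw_html}; keeping the URL in `link`
    (never reassigned) preserves it for the index.
    """
    dex = {}
    for link, html in pages.items():
        tokens = set(clean_html(html).lower().split())
        for word in words:
            w = word.lower()
            if w not in STOPLIST and w in tokens:
                dex.setdefault(w, []).append(link)
    return dex

# Canned pages stand in for requests.get(x).text
pages = {
    'url1': '<p>python crawler</p>',
    'url2': '<p>python only</p>',
    'url3': '<p>crawler only</p>',
}
print(index_pages(['Python', 'crawler', 'me'], pages))
# → {'python': ['url1', 'url2'], 'crawler': ['url1', 'url3']}
```

In the real crawler, the dict would be filled by fetching each link first, so the fetch result never clobbers the link itself.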