Web crawling: searching for words

Asked: 2019-05-23 14:32:56

Tags: python-3.x web web-crawler

I am building a web crawler.

More precisely, it indexes a main page (i.e., collects all of its words), finds all the links on that page, and then looks for those (main-page) words in each of the extracted links.

My problem is in the indexer() function: while trying to fetch all the secondary pages, I check that each word is not in the stop list (pronouns, articles, etc.) and that the word appears on the secondary page, so that I can search for every word (from the main page) in the secondary pages.
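(To illustrate the symptom: a for-loop variable only keeps the value assigned in its last iteration once the loop has finished, so any check placed after the loop sees only the last page:)

pages = ['first page text', 'second page text']
for x in pages:
    x = x.split()
# after the loop, x holds only the result of the LAST iteration
print(x)   # ['second', 'page', 'text']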

Here is what I have so far:

import re
import requests
from bs4 import BeautifulSoup

def extract(links):
    # fetch the page and return every href found in an <a> tag
    # (the original printed the links, but main() clearly expects a list back)
    page = requests.get(links).text
    soup = BeautifulSoup(page, 'html.parser')
    return [link['href'] for link in soup.find_all('a', href=True)]

def clean_html(page):
    # strip HTML tags with a simple regex and return plain text
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', page)
    return cleantext

def indexer(dex, words, url):
    for x in url:
        x = requests.get(x).text
        x = clean_html(x)
        x = x.lower()
        x = x.split()

    for word in words:
        word = word.lower()
        word = word.strip('.,;:!?')  # originally clean(word), an undefined helper; I assume it strips punctuation

        if word not in stoplist:
            if word in x:
                # print(x)  THE PROBLEM: I'm trying to check the secondary pages,
                # but I only ever see the last link (36 times)
                add(dex, word, url)

def add(dex, word, url):
    # append url to the word's list, creating the list on first use
    try:
        dex[word].append(url)
    except KeyError:
        dex[word] = [url]


def main(url):
    list_urls = extract(url)            # all links found on the main page
    main_page = requests.get(url).text
    main_page = clean_html(main_page)
    main_page = main_page.split()

    idx = {}
    indexer(idx, main_page, list_urls)  # was indexe(), a typo for indexer()
    prd(idx)

def prd(d):
    # print the index, sorted by word
    for c in sorted(d):
        print('\t', c, ':', d[c])


stoplist = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
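
For reference, the add() helper above is the usual "append-or-create" pattern; a collections.defaultdict would do the same job. A minimal sketch (not part of my crawler):

from collections import defaultdict

dex = defaultdict(list)        # missing keys start out as an empty list
dex['word1'].append('url1')
dex['word1'].append('url2')
print(dict(dex))               # {'word1': ['url1', 'url2']}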

I want to find the words in the secondary pages; the output I am hoping for looks like this:

Word1 : [url1, url2]
Word2 : [url1, url3, ...]
... 
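
What I think I need is to move the word check inside the loop over the pages, and to record the individual page URL instead of the whole list. A minimal sketch of that restructuring (indexer_fixed is a hypothetical name; it assumes extract() returns absolute, fetchable URLs):

def indexer_fixed(dex, words, urls):
    for page_url in urls:
        # check the words against EACH page, inside the loop
        text = clean_html(requests.get(page_url).text)
        page_words = text.lower().split()
        for word in words:
            word = word.lower().strip('.,;:!?')
            if word not in stoplist and word in page_words:
                add(dex, word, page_url)  # record this page's URL, not the whole list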

0 Answers