挣扎着爬行'一个特定术语的几个网页

时间:2017-08-15 09:22:15

标签: python html web web-crawler webpage

我在python中创建了一个程序,要求搜索一个搜索词和网页。我认为添加功能可以“爬行”是很好的。很多网页。我是通过创建一个类来完成的,因为这是我知道你可以从某些东西继承的唯一方法。我的问题出现了。如果我把它放入,那么类函数会被正确调用。但是对于我需要它做的事情,我不知道在哪里放置对类函数的调用,因为我需要的东西只能来自某些函数,但是它们不是。在其他功能中可用。

import urllib.request
from html.parser import HTMLParser

class trawler(HTMLParser):
    def __init__(self):
        pass

    def handle_starttag(self,webpage,term):
        for tag in webpage:
            if tag == 'a':
                if attrs == 'href':
                    for value in attrs:
                        search_Webpage_2(term, value)

trawl = trawler()

def main():
    print('Please enter your search term:')
    searchTerm = input('> ')

    print('Please enter the WebSite to be searched:')
    searchWebsite = input('> ')

    search_Webpage(searchTerm, searchWebsite)

def open_Webpage(url):
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    webPage = response.read()
    return webPage.decode()

def open_Webpage_2(url):
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    webPage = response.read()
    return webPage.decode()

def search_Webpage(term, url):
    if term in open_Webpage(url):
        print ('Word:',term,'was found in WebPage:',url)
    else:
        print ('No matches in WebSite:',url)

def search_Webpage_2(term, url):
    if term in open_Webpage_2(url):
        print ('Word:',term,'was found in WebPage:',url)
    else:
        print ('No matches in WebSite:',url)
main()

trawl.handle_starttag获取网页,这是网页的原始html和搜索字词。

感谢任何帮助

0 个答案:

没有答案