Web crawler page iteration

Date: 2017-01-19 18:36:37

Tags: python web-scraping web-crawler

I have written this code to go to WebMD, and so far it extracts all the links for every subcategory on the message boards. What I want to do next is have the program iterate through all the pages of each subcategory link. I have tried many things, but I keep running into problems.

import bs4 as bs
import urllib.request
import pandas as pd

# Fetch the message board landing page
source = urllib.request.urlopen('https://messageboards.webmd.com/').read()
soup = bs.BeautifulSoup(source, 'lxml')

# Collect the subcategory links into a DataFrame
df = pd.DataFrame(columns=['link'],
                  data=[url.a.get('href') for url in soup.find_all('div', class_="link")])

lists = []
for i in range(0, 33):
    link = df.link.iloc[i]
    source1 = urllib.request.urlopen(link).read()
    soup1 = bs.BeautifulSoup(source1, 'lxml')
    # This is where I am stuck: I want to loop over every page of each subcategory here

1 Answer:

Answer 0 (score: 0)

I have done similar tasks with Python and Wget in the past. See the Wget documentation here. You can look at its source code to understand how it works.

Basically, you can do the following. See the pseudocode below:

alreadyDownloadedUrls = []
currentPageUrls = []

def pageDownloader(url):
    download the given URL
    append the URL to the 'alreadyDownloadedUrls' list
    return the downloaded page

def urlFinder(inputPage):
    find and return all the URLs on the input page as a list

def urlFilter(listOfUrls):
    check whether each URL in the input list is already in the 'alreadyDownloadedUrls' list;
    if not, append it to a local list variable, and return that list

def controlFunction(firstPage):
    download the first page
    firstPageDownload = pageDownloader(firstPage)
    foundUrls = urlFinder(firstPageDownload)
    validUrls = urlFilter(foundUrls)
    currentlyWorkingList = []
    for (each URL in validUrls):
        downloadedPage = pageDownloader(that URL)
        append it to currentlyWorkingList
    for (each page in currentlyWorkingList):
        call controlFunction() recursively

However, the recursive calls will end up downloading the entire internet. Therefore, you have to validate each URL and check whether it belongs to the parent domain or a subdomain. You can do this in the urlFilter function.
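For illustration, here is a minimal runnable sketch of that pseudocode in Python, using the same urllib and BeautifulSoup modules as the question. The names mirror the pseudocode above; the allowedDomain check and the depth argument are my own assumptions about how you might scope the crawl, not part of the original pseudocode.

import urllib.request
from urllib.parse import urljoin, urlparse
import bs4 as bs

alreadyDownloadedUrls = set()

def pageDownloader(url):
    # Download the given URL and remember that it has been fetched
    html = urllib.request.urlopen(url).read()
    alreadyDownloadedUrls.add(url)
    return html

def urlFinder(html, baseUrl):
    # Find and return all absolute URLs on the downloaded page
    soup = bs.BeautifulSoup(html, 'lxml')
    return [urljoin(baseUrl, a.get('href')) for a in soup.find_all('a', href=True)]

def urlFilter(urls, allowedDomain):
    # Keep only URLs on the allowed domain that have not been downloaded yet
    return [u for u in urls
            if urlparse(u).netloc.endswith(allowedDomain)
            and u not in alreadyDownloadedUrls]

def controlFunction(page, allowedDomain, depth=2):
    # Stop when the depth limit is reached or the page was already fetched
    if depth == 0 or page in alreadyDownloadedUrls:
        return
    html = pageDownloader(page)
    foundUrls = urlFinder(html, page)
    for url in urlFilter(foundUrls, allowedDomain):
        controlFunction(url, allowedDomain, depth - 1)

controlFunction('https://messageboards.webmd.com/', 'webmd.com')

Each call downloads one page, collects its links, filters out off-domain and already-seen URLs, and recurses one level deeper.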

You will also need to add some validation to check whether you are downloading the same link with a hash tag at the end of the URL. Otherwise your program will treat these URLs (this and this) as pointing to different web pages.
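One way to handle that (an assumption on my part, not something the answer spells out) is to strip the fragment with urllib.parse.urldefrag before storing or comparing URLs; the example.com URLs below are just placeholders.

from urllib.parse import urldefrag

# Both of these normalize to the same page URL once the fragment is removed
url1, _ = urldefrag('https://example.com/page#comments')
url2, _ = urldefrag('https://example.com/page#top')
print(url1 == url2)  # True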

You can also introduce a depth limit in Wget.
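For reference, if you were to use Wget itself rather than Python, the recursion depth is capped with the --level flag, for example:

wget --recursive --level=2 https://messageboards.webmd.com/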

Hope this gives you the idea.