I have written this code to go to WebMD, and so far it pulls all the links from every subcategory on the message board. What I want to do next is have the program loop through all the pages of each subcategory link. I have tried a lot of things, but I always run into the same problem.
import bs4 as bs
import urllib.request
import pandas as pd
source = urllib.request.urlopen('https://messageboards.webmd.com/').read()
soup = bs.BeautifulSoup(source,'lxml')
df = pd.DataFrame(columns = ['link'],data=[url.a.get('href') for url in soup.find_all('div',class_="link")])
lists=[]
for i in range(0, 33):
    link = df.link.iloc[i]
    source1 = urllib.request.urlopen(link).read()
    soup1 = bs.BeautifulSoup(source1, 'lxml')
    # this is where I am stuck: how do I walk through every page of this subcategory?
Answer (score: 0)
I have used Python and Wget for similar tasks in the past. See the Wget documentation here; you can look at its source to understand how it works.
Basically you can do something like the following rough sketch:
import bs4 as bs
import urllib.request

# URLs that have already been downloaded
alreadyDownloadedUrls = []

def pageDownloader(url):
    # download the given URL and remember it as visited
    page = urllib.request.urlopen(url).read()
    alreadyDownloadedUrls.append(url)
    return page

def urlFinder(inputPage):
    # find and return all the URLs on the input page as a list
    # (note: relative links would need urljoin before they can be opened)
    soup = bs.BeautifulSoup(inputPage, 'lxml')
    return [a.get('href') for a in soup.find_all('a', href=True)]

def urlFilter(inputUrls):
    # drop any URL that is already in 'alreadyDownloadedUrls'
    return [u for u in inputUrls if u not in alreadyDownloadedUrls]

def controlFunction(firstPage):
    # download the first page, collect its links, then recurse into the new ones
    firstPageDownload = pageDownloader(firstPage)
    foundUrls = urlFinder(firstPageDownload)
    validUrls = urlFilter(foundUrls)
    for aUrl in validUrls:
        controlFunction(aUrl)
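For example, kicking the crawl off from the board's front page in your question would then just be:

controlFunction('https://messageboards.webmd.com/')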
However, the recursive calls will end up downloading the whole internet, so you have to validate each URL and check whether it comes from the parent domain or one of its subdomains. You can do that inside the urlFilter function, as sketched below.
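As a sketch of that check (assuming the crawl should stay on messageboards.webmd.com; the helper name isSameSite is just for illustration), urlFilter could also drop off-site links by keeping only the URLs for which this returns true:

from urllib.parse import urlparse

ALLOWED_DOMAIN = 'messageboards.webmd.com'  # assumed crawl domain

def isSameSite(url):
    # keep only URLs whose host is the allowed domain or one of its subdomains
    host = urlparse(url).netloc
    return host == ALLOWED_DOMAIN or host.endswith('.' + ALLOWED_DOMAIN)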
In addition, you will need some validation to check whether you are downloading the same link twice just because one copy has a hash fragment at the end; otherwise your program will treat this and this URL as different pages. One way to handle that is sketched below.
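A simple way to do that is to strip the fragment with urllib.parse.urldefrag before comparing or storing URLs (stripFragment is just an illustrative helper name):

from urllib.parse import urldefrag

def stripFragment(url):
    # 'https://example.com/page#replies' and 'https://example.com/page' map to the same key
    return urldefrag(url)[0]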
You can also introduce a depth limit in Wget (its --level option).
Hope this gives you the idea.