Hi, I want to build a mini scraper, but without using Scrapy.
I made something like this:
import requests
from bs4 import BeautifulSoup

response = requests.get(url)  # url is the homepage to start from
homepage_link_list = []
soup = BeautifulSoup(response.content, 'lxml')
for link in soup.findAll("a"):
    if link.get("href"):
        homepage_link_list.append(link.get("href"))
link_list = []
for item in homepage_link_list:
    response = requests.get(item)
    soup = BeautifulSoup(response.content, 'lxml')
    for link in soup.findAll("a"):
        if link.get("href"):
            link_list.append(link.get("href"))
The problem I'm running into is that this only fetches links from the pages the homepage links to. How can I make it fetch the links from every link on the whole site?
Answer 0 (score: 4)
You need a recursive crawl. I wrote class-oriented code below (although I had trouble formatting it on StackOverflow). The main points are:

- For a URL such as http://example.com#item1, ignore the item1 fragment; it only points within the same page.
- If https://example.com has been scraped, ignore http://example.com; the http and https forms are the same page.
- If http://example.com has been scraped, ignore http://example.com/; a trailing slash does not make it a different page.
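Taken together, these rules collapse every variant of a page's address onto one canonical form. Before the full code, here is a minimal standalone sketch of just that normalization step, using only urllib.parse (simplified from the class below, and assuming for illustration that the https form is always preferred):

import re
from urllib.parse import urlsplit, SplitResult

def normalize(url):
    fields = urlsplit(url)._asdict()
    fields['fragment'] = ''                             # rule 1: drop #item1
    fields['scheme'] = 'https'                          # rule 2: unify http/https (assumption: prefer https)
    fields['path'] = re.sub(r'/$', '', fields['path'])  # rule 3: drop the trailing /
    return SplitResult(**fields).geturl()

# All four variants collapse to the same key, 'https://example.com':
for u in ('http://example.com#item1', 'https://example.com',
          'http://example.com', 'http://example.com/'):
    print(normalize(u))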
''' Scraper.
'''

import re
from urllib.parse import urljoin, urlsplit, SplitResult

import requests
from bs4 import BeautifulSoup


class RecursiveScraper:
    ''' Scrape URLs in a recursive manner.
    '''
    def __init__(self, url):
        ''' Constructor to initialize domain name and main URL.
        '''
        self.domain = urlsplit(url).netloc
        self.mainurl = url
        self.urls = set()

    def preprocess_url(self, referrer, url):
        ''' Clean and filter URLs before scraping.
        '''
        if not url:
            return None

        fields = urlsplit(urljoin(referrer, url))._asdict()  # convert to absolute URLs and split
        fields['path'] = re.sub(r'/$', '', fields['path'])   # remove trailing /
        fields['fragment'] = ''                              # remove targets within a page
        fields = SplitResult(**fields)

        if fields.netloc == self.domain:
            # Scrape pages of current domain only
            if fields.scheme == 'http':
                httpurl = cleanurl = fields.geturl()
                httpsurl = httpurl.replace('http:', 'https:', 1)
            else:
                httpsurl = cleanurl = fields.geturl()
                httpurl = httpsurl.replace('https:', 'http:', 1)
            if httpurl not in self.urls and httpsurl not in self.urls:
                # Return URL only if it's not already in list
                return cleanurl

        return None

    def scrape(self, url=None):
        ''' Scrape the URL and its outward links in a depth-first order.
            If URL argument is None, start from main page.
        '''
        if url is None:
            url = self.mainurl

        print("Scraping {:s} ...".format(url))
        self.urls.add(url)
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')
        for link in soup.findAll("a"):
            childurl = self.preprocess_url(url, link.get("href"))
            if childurl:
                self.scrape(childurl)


if __name__ == '__main__':
    rscraper = RecursiveScraper("http://bbc.com")
    rscraper.scrape()
    print(rscraper.urls)
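One caveat about this depth-first design: scrape() calls itself once per new link, so on a large site such as bbc.com it can hit Python's default recursion limit (about 1000 frames). A minimal sketch of the same traversal with an explicit stack instead of recursion (the method name scrape_iterative is my addition, not part of the original answer):

    # Could be added to RecursiveScraper as a non-recursive alternative:
    def scrape_iterative(self):
        ''' Same depth-first traversal as scrape(), but with an explicit
            stack so deep sites cannot exhaust the call stack.
        '''
        stack = [self.mainurl]
        while stack:
            url = stack.pop()
            if url in self.urls:
                continue
            print("Scraping {:s} ...".format(url))
            self.urls.add(url)
            response = requests.get(url)
            soup = BeautifulSoup(response.content, 'lxml')
            for link in soup.findAll("a"):
                childurl = self.preprocess_url(url, link.get("href"))
                if childurl:
                    stack.append(childurl)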
Answer 1 (score: 0)
It may be that the links you want to scrape are not actually links; they may be images. Sorry for writing this as an answer, I don't have enough reputation to comment.
Answer 2 (score: 0)
Your code does not fetch all of the site's links because it is not recursive. You fetch the homepage's links and traverse the links found in their content, but you never traverse the links inside the pages you just visited. My suggestion is to look at tree traversal algorithms and build your crawl (recursion) around one of them: the tree's nodes represent links, and the root node is the link you pass in at the start. A sketch of that idea follows.
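For a concrete starting point, here is a minimal sketch of a breadth-first version of that traversal, reusing the question's requests/BeautifulSoup calls (the function name crawl_bfs and the max_pages cap are my assumptions, added only to keep the example bounded):

from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl_bfs(root, max_pages=100):
    ''' Breadth-first traversal of a site's link tree: the starting
        link is the root node and each page's links are its children.
    '''
    seen = {root}
    queue = deque([root])
    while queue:
        url = queue.popleft()
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')
        for link in soup.findAll("a"):
            href = link.get("href")
            if not href:
                continue
            childurl = urljoin(url, href)  # resolve relative links
            if childurl not in seen and len(seen) < max_pages:
                seen.add(childurl)
                queue.append(childurl)
    return seen

print(crawl_bfs("http://bbc.com"))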