I'm using the following to pull all of the external JavaScript references from a web page. How can I modify the code to search not just the one URL, but all pages of the website?
import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('https://stackoverflow.com')

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('script')):
    if link.has_key('src'):
        if 'http' in link['src']:
            print link['src']
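(As an aside, not from the original post: the same extraction in modern Python 3, using requests and the maintained bs4 package instead of httplib2 and the long-deprecated BeautifulSoup 3. A minimal sketch, assuming requests and beautifulsoup4 are installed.)

import requests
from bs4 import BeautifulSoup

response = requests.get('https://stackoverflow.com')
soup = BeautifulSoup(response.content, 'html.parser')
# find_all(..., src=True) returns only the <script> tags that carry a src attribute.
for script in soup.find_all('script', src=True):
    if 'http' in script['src']:  # same external-link heuristic as the snippet above
        print(script['src'])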
First attempt at getting it to scrape two pages deep. Any suggestions on how to get it to return only unique URLs? As is, most of them are duplicates (a set-based dedupe sketch follows the code below). (Note that all of the internal links contain the word "index" on the site I need to run this on.)
import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

site = 'http://www.stackoverflow.com/'
http = httplib2.Http()
status, response = http.request(site)

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        if 'index' in link['href']:
            page = site + link['href']
            status, response = http.request(page)
            for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('script')):
                if link.has_key('src'):
                    if 'http' in link['src']:
                        print "script" + " " + link['src']
            for iframe in BeautifulSoup(response, parseOnlyThese=SoupStrainer('iframe')):
                print "iframe" + " " + iframe['src']
            for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
                if link.has_key('href'):
                    if 'index' in link['href']:
                        page = site + link['href']
                        status, response = http.request(page)
                        for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('script')):
                            if link.has_key('src'):
                                if 'http' in link['src']:
                                    print "script" + " " + link['src']
                        for iframe in BeautifulSoup(response, parseOnlyThese=SoupStrainer('iframe')):
                            print "iframe" + " " + iframe['src']
Answer 0 (score: 0)
Crawling websites is a broad topic. You have to decide how to index content and how deep to go into a site. Content parsing, which is what your basic crawler or spider is doing, is only one part of it. Writing an excellent bot along the lines of Google Bot is definitely not easy, and professional crawling bots do a great deal of additional work beyond that.
If you only need to crawl a specific site such as Stack Overflow, I've modified your code to crawl recursively. Converting this code further into a multithreaded form would be trivial. It uses a Bloom filter to make sure it doesn't crawl the same page twice. Let me warn you in advance that there will still be unexpected pitfalls when crawling; mature crawling software such as Scrapy, Nutch, or Heritrix will do this much better.
import requests
from bs4 import BeautifulSoup as Soup, SoupStrainer
from bs4.element import Tag
from bloom_filter import BloomFilter
from queue import Queue
from urllib.parse import urljoin, urlparse

# Probabilistic record of visited URLs; cheap on memory, with rare false positives.
visited = BloomFilter(max_elements=100000, error_rate=0.1)
visitlist = Queue()


def isurlabsolute(url):
    # An absolute URL has a network location (scheme://host/...).
    return bool(urlparse(url).netloc)


def visit(url):
    print("Visiting %s" % url)
    visited.add(url)
    return requests.get(url)


def parsehref(response):
    if response.status_code == 200:
        for link in Soup(response.content, 'lxml', parse_only=SoupStrainer('a')):
            if isinstance(link, Tag) and link.has_attr('href'):
                href = link['href']
                if not isurlabsolute(href):
                    # Resolve relative links against the page they came from.
                    href = urljoin(response.url, href)
                href = str(href)
                if href not in visited:
                    visitlist.put_nowait(href)
                else:
                    print("Already visited %s" % href)
    else:
        print("Got issues mate")


if __name__ == '__main__':
    visitlist.put_nowait('http://www.stackoverflow.com/')
    while not visitlist.empty():
        url = visitlist.get()
        resp = visit(url)
        parsehref(resp)
        visitlist.task_done()
    visitlist.join()
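Two caveats about the sketch above (my observations, not the original answerer's): as written, the crawler follows every absolute link it finds, so it will quickly wander off stackoverflow.com, and it has no depth limit, even though the question asked for two pages deep. Below is a hedged sketch of a bounded, same-host crawl that reuses visit, Soup, SoupStrainer, Tag, urljoin, and the visited filter from the code above; the max_depth parameter and the host check are my hypothetical additions:

def crawl(start_url, max_depth=2):
    """Breadth-first crawl that stays on the start host and stops at max_depth."""
    start_host = urlparse(start_url).netloc
    frontier = [(start_url, 0)]              # (url, depth) pairs awaiting a visit
    while frontier:
        url, depth = frontier.pop(0)
        if url in visited or depth > max_depth:
            continue
        resp = visit(url)
        if resp.status_code != 200:
            continue
        for link in Soup(resp.content, 'lxml', parse_only=SoupStrainer('a')):
            if isinstance(link, Tag) and link.has_attr('href'):
                href = urljoin(resp.url, link['href'])
                if urlparse(href).netloc == start_host:
                    frontier.append((href, depth + 1))

These sketches assume pip install requests beautifulsoup4 lxml bloom-filter.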