I'm trying to crawl through all of a website's pages and pull out every instance of a certain tag/class. It seems to be extracting information from the same page over and over, but I'm not sure why, because len(urls) (the stack of URLs being scraped) rises and falls in a bell-curve shape, which makes me think I'm at least crawling through the links; I'm probably pulling out or printing the information incorrectly.

If I try it with just the base weedmaps.com URL, nothing gets printed. But if I start from a page that contains the type of data I'm looking for, e.g. url = "https://weedmaps.com/dispensaries/shakeandbake", then it does pull the information out, only it prints the same information over and over.
import urllib
import urlparse
import re
from bs4 import BeautifulSoup
url = "http://weedmaps.com"
如果我尝试仅使用基本的weedmaps.com网址,则不会打印任何内容,但如果我从一个页面开始,该页面包含我正在寻找... url = "https://weedmaps.com/dispensaries/shakeandbake"
的数据类型,那么它会拉动信息输出,但它会一遍又一遍地打印相同的信息。
urls = [url]     # Stack of urls to scrape
visited = [url]  # Record of scraped urls
htmltext = urllib.urlopen(urls[0]).read()

# While stack of urls is greater than 0, keep scraping for links
while len(urls) > 0:
    try:
        htmltext = urllib.urlopen(urls[0]).read()
    # Except for visited urls
    except:
        print urls[0]

    # Get and Print Information
    soup = BeautifulSoup(htmltext)
    urls.pop(0)
    info = soup.findAll("div", {"class": "story-heading"})
    print info

    # Number of URLs in stack
    print len(urls)

    # Append Incomplete Tags
    for tag in soup.findAll('a', href=True):
        tag['href'] = urlparse.urljoin(url, tag['href'])
        if url in tag['href'] and tag['href'] not in visited:
            urls.append(tag['href'])
            visited.append(tag['href'])
Answer 0 (score: 3)
The problem with your current code is that the URLs you are putting into the queue (urls) point to the same page, just under different anchors; in other words, URLs that differ only in the trailing #fragment. The tag['href'] not in visited condition does not filter out these different URLs that all point to the same page; it only filters out exact duplicates.
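One way to patch this in your existing approach would be to strip the fragment before checking visited, e.g. with urlparse.urldefrag from the Python 2 standard library. A minimal sketch (the normalize helper is an illustrative name, not part of your code):

import urlparse

def normalize(href, base):
    # Resolve relative links against the base URL, then drop the
    # #fragment so page#menu and page#reviews dedupe to one entry.
    absolute = urlparse.urljoin(base, href)
    page, _fragment = urlparse.urldefrag(absolute)
    return page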
From what I can see, you are reinventing a web-scraping framework here. But there is already one that will save you time, keep your scraping code organized and clean, and make it significantly faster than your current solution: Scrapy.

You need a CrawlSpider with rules configured to follow links, for example:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class MachineSpider(CrawlSpider):
    name = 'weedmaps'
    allowed_domains = ['weedmaps.com']
    start_urls = ['https://weedmaps.com/dispensaries/shakeandbake']

    rules = [
        Rule(LinkExtractor(allow=r'/dispensaries/'), callback='parse_hours')
    ]

    def parse_hours(self, response):
        print response.url
        for hours in response.css('span[itemid="#store"] div.row.hours-row div.col-md-9'):
            print hours.xpath('text()').extract()
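Assuming the spider is saved in a file such as weedmaps_spider.py (the filename is just an example), you can run it without a full Scrapy project via the runspider command:

scrapy runspider weedmaps_spider.py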
Instead of printing, your callback should return or yield Item instances, which can later be saved to a file or a database, or processed further in a pipeline.
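As a rough sketch of what that could look like (DispensaryItem and its fields are illustrative names, not something defined above):

from scrapy.item import Item, Field

class DispensaryItem(Item):
    # Illustrative fields; declare whatever your pipeline needs.
    url = Field()
    hours = Field()

and then, inside the spider:

    def parse_hours(self, response):
        for hours in response.css('span[itemid="#store"] div.row.hours-row div.col-md-9'):
            item = DispensaryItem()
            item['url'] = response.url
            item['hours'] = hours.xpath('text()').extract()
            yield item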