I want to scrape this website:
However, it requires scrolling down to collect additional data, and I don't know how to scroll down using BeautifulSoup or Python. Does anyone here know how to do this?
The code is a bit messy, but here it is for now.
import datetime
import re

import scrapy
from bs4 import BeautifulSoup
from HTMLParser import HTMLParser  # Python 2; on Python 3 use html.parser
from selenium import webdriver

from testtest.items import TesttestItem


class MLStripper(HTMLParser):
    # Collects text nodes so strip_tags() can return tag-free text.
    def __init__(self):
        HTMLParser.__init__(self)
        self.reset()
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()


class MySpider(scrapy.Spider):
    name = "A1Locker"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ['a1lockerrental.com']
    start_urls = ['http://www.a1lockerrental.com/self-storage/mo/st-louis/'
                  '4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?'
                  'category=all']

    def parse(self, response):
        # The page builds its unit list with JavaScript, so render it with
        # Selenium and hand the finished HTML to BeautifulSoup.
        url = ('http://www.a1lockerrental.com/self-storage/mo/st-louis/'
               '4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?'
               'category=Small')
        driver = webdriver.Firefox()
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')

        url2 = ('http://www.a1lockerrental.com/self-storage/mo/st-louis/'
                '4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?'
                'category=Medium')
        driver2 = webdriver.Firefox()
        driver2.get(url2)
        # Bug fix: the second page must be read from driver2, not driver
        soup2 = BeautifulSoup(driver2.page_source, 'html.parser')

        inside = "Indoor"
        outside = "Outdoor"
        inside_units = ["5 x 5", "5 x 10"]
        outside_units = ["10 x 15", "5 x 15", "8 x 10", "10 x 10",
                         "10 x 20", "10 x 25", "10 x 30"]

        sizeTagz = soup.findAll('span', {"class": "sss-unit-size"})
        sizeTagz2 = soup2.findAll('span', {"class": "sss-unit-size"})
        rateTagz = soup.findAll('p', {"class": "unit-special-offer"})
        specialTagz = soup.findAll('span', {"class": "unit-special-offer"})
        typesTagz = soup.findAll('div', {"class": "unit-info"})
        rateTagz2 = soup2.findAll('p', {"class": "unit-special-offer"})
        specialTagz2 = soup2.findAll('span', {"class": "unit-special-offer"})
        typesTagz2 = soup2.findAll('div', {"class": "unit-info"})

        yield {'date': datetime.datetime.now().strftime("%m-%d-%y"),
               'name': "A1Locker"}

        size = []
        for n in range(len(sizeTagz)):
            print len(rateTagz)
            print len(typesTagz)
            if "Outside" in typesTagz[n].get_text():
                size.append(re.findall(r'\d+', sizeTagz[n].get_text()))
                size.append(re.findall(r'\d+', sizeTagz2[n].get_text()))
                print "logic hit"

        for i in range(len(size)):
            yield {
                'size': size[i],
                # "special": specialTagz[n].get_text(),
                # "rate": re.findall(r'\d+', rateTagz[n].get_text()),
            }

        driver.close()
        driver2.close()
The desired output of the code is to display the data collected from this web page: http://www.a1lockerrental.com/self-storage/mo/st-louis/4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?category=all
To do that, it needs to be able to scroll down to see the rest of the data. At least that is my thinking.
Thanks, DM123
Answer 0 (score: 2)
There is a webdriver feature that provides this functionality. BeautifulSoup does nothing beyond parsing the website's HTML.
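For instance, Selenium can drive the scrolling itself through execute_script. The helper below is a minimal sketch of that idea: it repeatedly scrolls to the bottom and stops once document.body.scrollHeight stops growing. The pause length and round limit are arbitrary assumptions, not values from the question.

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    # Scroll until the page height stops growing, i.e. no more
    # content is being lazy-loaded. Works with any Selenium webdriver.
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page's JavaScript time to load more rows
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
```

Call scroll_to_bottom(driver) after driver.get(url) and before reading driver.page_source, so the HTML you pass to BeautifulSoup already contains the lazily loaded units.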
Answer 1 (score: 1)
The website you are trying to scrape loads its content dynamically with JavaScript. Unfortunately, many web scrapers, such as BeautifulSoup, cannot execute JavaScript on their own. There are many options, however, a number of which take the form of headless browsers. A classic one is PhantomJS, but it is worth looking at this great list of options on GitHub, some of which, like Selenium, pair nicely with BeautifulSoup.
With Selenium in mind, the answers to this Stackoverflow question may also help.
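Putting the two pieces together, a common pattern is to let Selenium render the JavaScript and then hand the finished HTML to BeautifulSoup for parsing. The sketch below separates the parsing into a plain function; the CSS class names are taken from the question's code, while the headless flag assumes a reasonably recent Selenium and Firefox, and the fixed sleep is a crude stand-in for a proper wait.

```python
from bs4 import BeautifulSoup

def extract_units(html):
    # Pull unit sizes and offers out of already-rendered HTML
    # (selector class names come from the question's spider).
    soup = BeautifulSoup(html, "html.parser")
    sizes = [tag.get_text(strip=True)
             for tag in soup.find_all("span", {"class": "sss-unit-size"})]
    rates = [tag.get_text(strip=True)
             for tag in soup.find_all("p", {"class": "unit-special-offer"})]
    return list(zip(sizes, rates))

if __name__ == "__main__":
    # Hypothetical usage: render the page with headless Firefox first.
    import time
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get("http://www.a1lockerrental.com/self-storage/mo/st-louis/"
                   "4427-meramec-bottom-rd-facility/unit-sizes-prices#/units?"
                   "category=all")
        time.sleep(5)  # crude wait for the JavaScript; WebDriverWait is more robust
        for size, rate in extract_units(driver.page_source):
            print(size, rate)
    finally:
        driver.quit()
```

Because extract_units only sees a string of HTML, it works the same whether the source is Selenium, PhantomJS, or a saved test fixture.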