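I am using Scrapy to scrape data from http://www.bbb.org/greater-san-francisco/business-reviews/architects/klopf-architecture-in-san-francisco-ca-152805
I created some items to store the information, but each time I run the script I do not get all of the data. I usually get some empty items, so I have to run the script again and again until every item is populated.
This is the spider code: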
import scrapy
from tutorial.items import Product
from scrapy.loader import ItemLoader


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    # Domain only: a trailing slash would never match a request's host name.
    allowed_domains = ["bbb.org"]
    start_urls = [
        "http://www.bbb.org/greater-san-francisco/business-reviews/architects/klopf-architecture-in-san-francisco-ca-152805"
        #"http://www.bbb.org/greater-san-francisco/business-reviews/architects/a-d-architects-in-oakland-ca-133229"
        #"http://www.bbb.org/greater-san-francisco/business-reviews/architects/aecom-in-concord-ca-541360"
    ]

    def parse(self, response):
        # Save the raw page so it can be compared against the extracted fields.
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        producto = Product()
        #producto['name'] = response.xpath('//*[@id="business-detail"]/div/h1')
        producto = Product(
            Name=response.xpath('//*[@id="business-detail"]/div/h1/text()').extract(),
            Telephone=response.xpath('//*[@id="business-detail"]/div/p/span[1]/text()').extract(),
            Address=response.xpath('//*[@id="business-detail"]/div/p/span[2]/span[1]/text()').extract(),
            Description=response.xpath('//*[@id="business-description"]/p[2]/text()').extract(),
            BBBAccreditation=response.xpath('//*[@id="business-accreditation-content"]/p[1]/text()').extract(),
            Complaints=response.xpath('//*[@id="complaint-sort-container"]/text()').extract(),
            Reviews=response.xpath('//*[@id="complaint-sort-container"]/p/text()').extract(),
            WebPage=response.xpath('//*[@id="business-detail"]/div/p/span[3]/a/text()').extract(),
            # An <img> element has no text node, so this XPath always yields an empty list.
            Rating=response.xpath('//*[@id="accedited-rating"]/img/text()').extract(),
            ServiceArea=response.xpath('//*[@id="business-additional-info-text"]/span[4]/p/text()').extract(),
            ReasonForRating=response.xpath('//*[@id="reason-rating-content"]/ul/li[1]/text()').extract(),
            NumberofEmployees=response.xpath('//*[@id="business-additional-info-text"]/p[8]/text()').extract(),
            LicenceNumber=response.xpath('//*[@id="business-additional-info-text"]/p[6]/text()').extract(),
            # Note: the three fields below all use the same XPath.
            Contact=response.xpath('//*[@id="business-additional-info-text"]/span[3]/span/span[1]/text()').extract(),
            BBBFileOpened=response.xpath('//*[@id="business-additional-info-text"]/span[3]/span/span[1]/text()').extract(),
            BusinessStarted=response.xpath('//*[@id="business-additional-info-text"]/span[3]/span/span[1]/text()').extract(),
        )
        #producto.add_xpath('name', '//*[@id="business-detail"]/div/h1')
        #product.add_value('name', 'today') # you can also use literal values
        #product.load_item()
        return producto
This page requires a user agent to be set, so I have a file of user agents. Could some of them be wrong?
Answer 0 (score: 1)
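Yes, some of your user agents may be wrong (perhaps old or deprecated ones), and the site may respond to them differently. If using a single user agent is not a problem, you can set it in settings.py: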
USER_AGENT="someuseragent"
Remember to also ... from settings.py
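To find out whether particular user agents are behind the empty items, it may help to record which fields came back empty on each run. The following is a minimal sketch, not part of the original spider: `empty_fields` is a hypothetical helper, and the sample values are placeholders rather than real scraped data.

```python
def empty_fields(item):
    """Return the sorted names of fields whose extracted value is empty.

    `item` is assumed to be dict-like, as Scrapy items are; `.extract()`
    returns a list, so a field that matched nothing is an empty list.
    """
    return sorted(name for name, value in dict(item).items() if not value)


# Placeholder data standing in for a scraped item (not real output).
sample = {
    "Name": ["example business name"],
    "Telephone": [],
    "Address": ["example address"],
}

missing = empty_fields(sample)
print(missing)
```

Inside `parse`, the result could be logged next to `response.request.headers.get("User-Agent")`, so that the user agents which keep producing empty fields stand out and can be removed from the file.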