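I'm new to Python and Scrapy. I want to extract information from the website http://www.vodafone.com.au/about/legal/critical-information-summary/plans, including the document links, their names, and the valid links.
I tried the code below, but it doesn't work. I'd be grateful if someone could explain why and help me fix it.
This is the file vodafone.py: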
import scrapy
from scrapy.linkextractor import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from vodafone_scraper.items import VodafoneScraperItem

class VodafoneSpider(scrapy.Spider):
    name = 'vodafone'
    allowed_domains = ['vodafone.com.au']
    start_urls = ['http://www.vodafone.com.au/about/legal/critical-information-summary/plans']

    def parse(self, response):
        for sel in response.xpath('//tbody/tr/td[1]/a'):
            item = VodafoneScraperItem()
            item['link'] = sel.xpath('href').extract()
            item['name'] = sel.xpath('text()').extract_first()
            yield item
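Answer 0 (score: 0)
It doesn't work because the page content is generated dynamically by JavaScript. The elements you are trying to extract data from are not present in the HTML source that Scrapy receives (you can see this for yourself by viewing the page source in your browser).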
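One way to check this kind of problem is to count the elements you are after in the raw HTML, before any JavaScript has run. The sketch below uses only the standard library; both HTML strings are hypothetical stand-ins (not the real Vodafone markup), and `count_table_links` is an illustrative helper name.

```python
from html.parser import HTMLParser


class AnchorInTableCounter(HTMLParser):
    """Counts <a> tags that appear inside a <td> element."""

    def __init__(self):
        super().__init__()
        self.td_depth = 0
        self.anchor_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.td_depth += 1
        elif tag == "a" and self.td_depth > 0:
            self.anchor_count += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.td_depth > 0:
            self.td_depth -= 1


def count_table_links(html):
    """Return how many links sit inside table cells in the given HTML."""
    parser = AnchorInTableCounter()
    parser.feed(html)
    parser.close()
    return parser.anchor_count


# Hypothetical stand-in for what a plain HTTP client receives:
# an empty container plus the script that would fill it in.
raw_html = (
    '<html><body><div id="cis-table"></div>'
    '<script src="app.js"></script></body></html>'
)

# Hypothetical stand-in for the DOM after JavaScript has run in a browser.
rendered_html = (
    '<table><tbody><tr><td><a href="/doc.pdf">Plan A</a></td></tr></tbody></table>'
)

print(count_table_links(raw_html))       # 0
print(count_table_links(rendered_html))  # 1
```

With Scrapy itself, the same check can be done interactively: open the page with `scrapy shell <url>` and run the XPath against `response`. An empty result there means the data is not in the downloaded HTML.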
You have two options:
Answer 1 (score: 0)
Instead of requesting:
start_urls = ['http://www.vodafone.com.au/about/legal/critical-information-summary/plans']
you can set start_urls to the REST endpoint that the page itself loads its data from:
start_urls = ['http://www.vodafone.com.au/rest/CIS?field:planCategory:equals=Mobile%20Plans&field:planFromDate:lessthaneq=22/08/2017']
Parse response.body as JSON (this needs import json):
response_json = json.loads(response.body)
This gives you all the objects shown on the site. Now simply loop over them and pull out the data you need:
for item_json in response_json:
    item["link"] = item_json["document"]["file"]
    item["name"] = item_json["document"]["name"]
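The loop can be tried outside Scrapy on a small sample. The sketch below is standard-library only; the payload is a hypothetical sample shaped the way the loop implies (a list of objects, each with a `document` holding `file` and `name`), not real data from the endpoint. `extract_documents` is an illustrative helper that also skips records without a document rather than raising `KeyError`.

```python
import json

# Hypothetical sample body; the real response may contain other fields.
sample_body = (
    b'[{"document": {"file": "/doc/plan-a.pdf", "name": "Plan A"}},'
    b' {"document": {"file": "/doc/plan-b.pdf", "name": "Plan B"}},'
    b' {"planCategory": "Mobile Plans"}]'
)


def extract_documents(body):
    """Parse a JSON list and return link/name pairs for records that have a document."""
    records = json.loads(body)
    results = []
    for record in records:
        document = record.get("document") or {}
        link = document.get("file")
        if link is None:
            # No document attached to this record, so there is nothing to emit.
            continue
        results.append({"link": link, "name": document.get("name")})
    return results


for row in extract_documents(sample_body):
    print(row["name"], row["link"])
```

Whether the defensive `.get()` calls are needed depends on how uniform the real response is; if every record is guaranteed to carry a document, the plain indexing in the loop is enough.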
The complete snippet is here:
import scrapy
import json
from vodafone_scraper.items import VodafoneScraperItem

class VodafoneSpider(scrapy.Spider):
    name = 'vodafone'
    allowed_domains = ['vodafone.com.au']
    start_urls = [
        'http://www.vodafone.com.au/rest/CIS?field:planCategory:equals=Mobile%20Plans&field:planFromDate:lessthaneq=22/08/2017']

    def parse(self, response):
        # The endpoint returns a JSON list; each entry carries a "document" object.
        response_json = json.loads(response.body)
        for item_json in response_json:
            item = VodafoneScraperItem()
            item["link"] = item_json["document"]["file"]
            item["name"] = item_json["document"]["name"]
            yield item
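The URL above has the date 22/08/2017 hard-coded in its `planFromDate` filter. If you would rather build it at run time, something like the sketch below may work. It assumes the endpoint expects dates as dd/mm/yyyy (inferred from the example URL) and accepts other dates and categories, neither of which is confirmed here. `build_cis_url` is an illustrative helper name.

```python
from datetime import date
from urllib.parse import quote, urlencode

BASE_URL = "http://www.vodafone.com.au/rest/CIS"


def build_cis_url(category="Mobile Plans", as_of=None):
    """Build the query URL, leaving ':' and '/' unescaped as in the original URL."""
    as_of = as_of or date.today()
    params = {
        "field:planCategory:equals": category,
        # Assumption: dd/mm/yyyy, matching the 22/08/2017 value in the answer.
        "field:planFromDate:lessthaneq": as_of.strftime("%d/%m/%Y"),
    }
    # quote_via=quote encodes spaces as %20 (not '+'); safe keeps ':' and '/' literal.
    return BASE_URL + "?" + urlencode(params, safe=":/", quote_via=quote)


print(build_cis_url(as_of=date(2017, 8, 22)))
```

Called with 22 August 2017, this reproduces the URL used in start_urls; inside the spider, the result could be assigned as `start_urls = [build_cis_url()]`.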