I am trying to extract the address name from the following page: https://property.spatialest.com/nc/durham/#/property/100016

property_spider.py:

    from scrapy import Spider
    from scrapy.selector import Selector
    from property.items import PropertyItem


    class PropertySpider(Spider):
        name = "property"
        allowed_domains = ["property.spatialest.com"]
        start_urls = [
            "http://property.spatialest.com/nc/durham/#/property/100016"
        ]

        def parse(self, response):
            address = Selector(response).xpath("//html/body/main/div/div[2]/div/div[1]/div[2]/div/section/div/div[1]/div[2]/header/div/div/div[1]/div[2]/span")
            address_item = PropertyItem()
            address_item['address'] = address.xpath('span[@class="value "]/text()').extract()
            yield address_item

The spider returns {'address': []} every time. I suspect something is wrong with the way I am telling it to extract the data.

Update:

It appears that no data is extracted because the request URL is cut off at the '#':

    RESPONSE: <200 https://property.spatialest.com/nc/durham/>
    2019-03-16 13:59:12 [scrapy.core.scraper] DEBUG: Scraped from <200 https://property.spatialest.com/nc/durham/>
    {'address': []}
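
The truncation seen in the log is expected: everything after '#' is a URL fragment, which is handled on the client side and is not sent to the server. A minimal sketch using only the standard library shows how the page URL splits into the part that is actually requested and the fragment (the variable names here are illustrative, not from the original spider):

```python
from urllib.parse import urldefrag

page_url = "https://property.spatialest.com/nc/durham/#/property/100016"

# urldefrag separates the URL that goes over the wire from the
# client-side fragment that follows '#'.
requested_url, fragment = urldefrag(page_url)

print(requested_url)  # https://property.spatialest.com/nc/durham/
print(fragment)       # /property/100016
```

The first value matches the URL in the log above, which suggests the property id never reaches the server through this request.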

Answer 0 (score: 0)

The site uses a separate request to return the data you need.
If you open the browser's developer tools, you can see the request that returns it:
URL: https://property.spatialest.com/nc/durham/data/propertycard
Method: POST
Body: parcelid=100016&card=&year=&debug%5BcurrentURL%5D=https%3A%2F%2Fproperty.spatialest.com%2Fnc%2Fdurham%2F%23%2Fproperty%2F100016&debug%5BpreviousURL%22%5D=
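
The body above is URL-encoded form data, which is hard to read as-is. As a rough aid, the following standard-library sketch decodes it into named fields; `raw_body` and `form_fields` are names chosen here for illustration:

```python
from urllib.parse import parse_qsl

# The request body exactly as it appears in the developer tools.
raw_body = (
    "parcelid=100016&card=&year="
    "&debug%5BcurrentURL%5D=https%3A%2F%2Fproperty.spatialest.com"
    "%2Fnc%2Fdurham%2F%23%2Fproperty%2F100016"
    "&debug%5BpreviousURL%22%5D="
)

# keep_blank_values=True preserves the fields that were sent empty.
form_fields = dict(parse_qsl(raw_body, keep_blank_values=True))

for key, value in form_fields.items():
    print(f"{key!r}: {value!r}")
```

Decoded this way, `parcelid` is the only field that carries the property id; the `debug[...]` fields just echo the page URL.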
The response is JSON, and all the data can be found there.
So you should issue that request from within Scrapy to get the data.