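There are many Q&A pages on Stack Overflow whose conclusion seems to be that web pages that use JavaScript cannot be crawled and scraped with spiders, or at least that doing so is limited and slow in some cases. I would like to know whether what I want to do is possible with a spider and, if not, what alternatives there are for achieving my goal.
I have the following spider code, but I don't know the XPath (if there is one) that would let me extract the first page of junkyard URLs on this web page: {{3 }}
I would use the following code if I knew what to put in the XPath for the content to be extracted into "sites".
# -*- coding: utf-8 -*-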
from __future__ import unicode_literals
import scrapy
# Debugging helpers; not used by the spider itself.
from scrapy.shell import inspect_response
from scrapy.utils.response import open_in_browser

# Example search parameters (the values that appear in the log output of the answer below).
zipcode = '27517'
latitude = '35.8263369'
longitude = '-79.0419053'
lkqlist = 'http://www.lkqcorp.com/en-us/locationResults/tag/All/m/24902/fullcrit/%s?lat=%s&lng=%s' % (zipcode, latitude, longitude)


class JunkYardSites(scrapy.Item):
    Sites = scrapy.Field()


class LkqLocationList(scrapy.Spider):
    name = "lkqlist"
    allowed_domains = ["lkqcorp.com"]
    start_urls = (
        lkqlist,
    )

    def parse(self, response):
        sites = response.xpath('XPATH WOULD GO HERE').extract()
        # Yield one item per extracted result.
        for site in sites:
            item = JunkYardSites()
            item["Sites"] = site
            yield item
Thanks. I'm new to Python and Scrapy, so I appreciate any help or guidance I can get.
Answer 0 (score: 0)
I'm not sure which data you want to extract; "the first page of junkyard URLs" could mean only the URLs or the surrounding text as well.
'//td[@class="basicviewbold"]/script/text()'
will give you results such as:
2016-05-24 22:51:46 [scrapy] DEBUG: Scraped from <200 http://www.lkqcorp.com/en-us/locationResults/tag/All/m/24902/fullcrit/27517?lat=35.8263369&lng=-79.0419053>
{u'Sites': u'\r\n var URL = localizeStoreUrlToCulture("http://www.lkqpickyourpart.com/locations/LKQ_Pick_Your_Part_-_Durham-142/", local.culture);\r\n var name = "LKQ Pick Your Part - Durham";\r\n var anchor = \'<a href="\' + URL + \'" class="lkqro_locLink lkqro_nameLink" >\' + name + \'</a> - 14.54<span id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_lblDistanceType_0">mi</span> \';\r\n document.write(anchor);\r\n
If you only want the URL, apply a regular expression to the XPath result with .re() instead of calling extract():
sites = response.xpath("//td[@class='basicviewbold']/script/text()").re(r'(?:localizeStoreUrlToCulture\(\")(.*)(?:\", local.culture)')
which gives results like the following:
2016-05-25 00:37:54 [scrapy] DEBUG: Scraped from <200 http://www.lkqcorp.com/en-us/locationResults/tag/All/m/24902/fullcrit/27517?lat=35.8263369&lng=-79.0419053>
{u'Sites': u'http://www.lkqpickyourpart.com/locations/LKQ_Pick_Your_Part_-_Durham-142/'}
2016-05-25 00:37:54 [scrapy] DEBUG: Scraped from <200 http://www.lkqcorp.com/en-us/locationResults/tag/All/m/24902/fullcrit/27517?lat=35.8263369&lng=-79.0419053>
{u'Sites': u'http://www.lkqpickyourpart.com/locations/LKQ_Pick_Your_Part_-_Raleigh-168/'}
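To try the URL pattern outside of a crawl, here is a minimal sketch using only Python's standard `re` module. The sample text is abbreviated from the scraped output above; the function name `extract_store_urls` is made up for illustration, and the pattern is a slightly stricter variant of the one in the answer (non-greedy match, escaped dot), not the answer's exact expression.

```python
import re

# Script text abbreviated from the scraped output shown above.
SCRIPT_TEXT = (
    '\r\n    var URL = localizeStoreUrlToCulture('
    '"http://www.lkqpickyourpart.com/locations/LKQ_Pick_Your_Part_-_Durham-142/"'
    ', local.culture);\r\n'
    '    var name = "LKQ Pick Your Part - Durham";\r\n'
)

# Capture the first argument of localizeStoreUrlToCulture(...).
# Non-greedy so that each call yields its own URL.
URL_PATTERN = re.compile(r'localizeStoreUrlToCulture\("(.*?)",\s*local\.culture')


def extract_store_urls(script_text):
    """Return every URL passed to localizeStoreUrlToCulture() in the text."""
    return URL_PATTERN.findall(script_text)


print(extract_store_urls(SCRIPT_TEXT))
```

Inside the spider, the same pattern could be passed to `.re()` on the selector, as in the line above.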
Or
'//table[tr/td[@class="basicviewbold"]]'
will give you the whole table, from which you can also parse the address:
2016-05-24 23:06:24 [scrapy] DEBUG: Scraped from <200 http://www.lkqcorp.com/en-us/locationResults/tag/All/m/24902/fullcrit/27517?lat=35.8263369&lng=-79.0419053>
{u'Sites': u'<table>\r\n <tr id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_trBusinessName_0">\r\n\t\t\t\t<td class="basicviewbold">\r\n <img src="/desktopmodules/lkq_ROLocator/geosprawlimages/a.png" id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_imgLetterMarker_0" border="0" alt="Map Marker A" class="lkqro_letterImage">\r\n \r\n <script>\r\n var URL = localizeStoreUrlToCulture("http://www.lkqpickyourpart.com/locations/LKQ_Pick_Your_Part_-_Durham-142/", local.culture);\r\n var name = "LKQ Pick Your Part - Durham";\r\n var anchor = \'<a href="\' + URL + \'" class="lkqro_locLink lkqro_nameLink" >\' + name + \'</a> - 14.54<span id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_lblDistanceType_0">mi</span> \';\r\n document.write(anchor);\r\n </script>\r\n \r\n <input type="hidden" name="dnn$ctr952$ControlLoader$BizSearchResult$rptSearchResults$ctl00$hdnRptRowLat" id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_hdnRptRowLat_0" value="35.959784">\r\n <input type="hidden" name="dnn$ctr952$ControlLoader$BizSearchResult$rptSearchResults$ctl00$hdnRptRowLong" id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_hdnRptRowLong_0" value="-78.841397">\r\n <input type="hidden" class="lkqro_division" value="Self-Service">\r\n </td>\r\n\t\t\t</tr>\r\n\t\t\t\r\n <tr id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_trDistance_0" style="display:none;">\r\n\t\t\t</tr>\r\n\t\t\t\r\n \r\n \r\n \r\n \r\n \r\n <tr id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_trFullAddress_0">\r\n\t\t\t\t<td class="basicview">\r\n 1301 S. 
Miami Blvd\r\n \r\n <br>\r\n \r\n \r\n Durham\r\n NC\r\n 27703 \r\n </td>\r\n\t\t\t</tr>\r\n\t\t\t\r\n \r\n <tr id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_trPhone_0">\r\n\t\t\t\t<td class="basicview">\r\n <script>\r\n var phoneNumber = "(800) 962-2277";\r\n var internationalPhone = \'+1-\' + phoneNumber.replace(\' \', \'\');\r\n internationalPhone = internationalPhone.replace(\'(\', \'\');\r\n internationalPhone = internationalPhone.replace(\')\', \'-\');\r\n \r\n var clickToCall = \'<a class="lkq_clickToCall" href="tel:\' + internationalPhone + \'">\' + phoneNumber + \'</a>\';\r\n var phoneDesktop = \'<span class="lkq_phoneDesktop">\' + phoneNumber + \'</span>\';\r\n document.write(clickToCall + phoneDesktop);\r\n </script>\r\n \r\n </td>\r\n\t\t\t</tr>\r\n\t\t\t\r\n <tr id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_trWebsite_0">\r\n\t\t\t\t<td class="basicview">\r\n <script>\r\n var URL = localizeStoreUrlToCulture("http://www.lkqpickyourpart.com/locations/LKQ_Pick_Your_Part_-_Durham-142/", local.culture);\r\n var anchor = \'<a href="\' + URL + \'" class="lkqro_locLink lkqro_pageURL" >\' + local.locationPage + \'</a>\';\r\n document.write(anchor);\r\n </script>\r\n \xa0|\xa0\r\n <a id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_lnkFullMap_0" class="lkqro_locLink" href="http://maps.google.com/maps/dir//1301+S.+Miami+Blvd%2c+Durham%2c+27703/@35.959784,-78.841397,17z/?hl=en" target="_blank"><script>document.write(local.getDirections);</script></a>\r\n </td>\r\n\t\t\t</tr>\r\n\t\t\t\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n <tr>\r\n <td class="lkqro_itemSpacer">\r\n <hr>\r\n </td>\r\n </tr>\r\n \r\n </table>'}
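Parsing the address out of that table mostly comes down to stripping tags and collapsing the whitespace in the address cell. The following is a rough sketch with the standard library only; the sample cell is abbreviated from the output above, and `clean_cell_text` is a hypothetical helper name, not part of Scrapy.

```python
import re

# Address cell abbreviated from the table output shown above.
ADDRESS_CELL = (
    '<td class="basicview">\r\n    1301 S. Miami Blvd\r\n    \r\n    <br>\r\n'
    '    \r\n    Durham\r\n    NC\r\n    27703    \r\n  </td>'
)


def clean_cell_text(cell_html):
    """Strip HTML tags and collapse runs of whitespace into single spaces."""
    text = re.sub(r'<[^>]+>', ' ', cell_html)
    return ' '.join(text.split())


print(clean_cell_text(ADDRESS_CELL))
```

This simple tag-stripping approach is only suitable for small, flat cells like this one; cells that contain `<script>` blocks would need the script text removed or handled separately.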
Given the way the data is structured,
'//td[@class="basicview"]'
will give you the address, phone number, and so on:
2016-05-24 23:13:16 [scrapy] DEBUG: Scraped from <200 http://www.lkqcorp.com/en-us/locationResults/tag/All/m/24902/fullcrit/27517?lat=35.8263369&lng=-79.0419053>
{u'Sites': u'<td class="basicview">\r\n <script>\r\n var phoneNumber = "(800) 962-2277";\r\n var internationalPhone = \'+1-\' + phoneNumber.replace(\' \', \'\');\r\n internationalPhone = internationalPhone.replace(\'(\', \'\');\r\n internationalPhone = internationalPhone.replace(\')\', \'-\');\r\n \r\n var clickToCall = \'<a class="lkq_clickToCall" href="tel:\' + internationalPhone + \'">\' + phoneNumber + \'</a>\';\r\n var phoneDesktop = \'<span class="lkq_phoneDesktop">\' + phoneNumber + \'</span>\';\r\n document.write(clickToCall + phoneDesktop);\r\n </script>\r\n \r\n </td>'}
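Because the phone number is written by JavaScript, it only appears in the raw response as a string literal inside the `<script>` block. A small sketch of pulling it out with a regular expression follows; the sample is abbreviated from the output above, and `extract_phone` is an illustrative name rather than anything from the original answer.

```python
import re

# Phone cell abbreviated from the scraped output shown above.
PHONE_CELL = (
    '<td class="basicview">\r\n  <script>\r\n'
    '    var phoneNumber = "(800) 962-2277";\r\n'
    "    var internationalPhone = '+1-' + phoneNumber.replace(' ', '');\r\n"
    '  </script>\r\n</td>'
)

# Capture the string literal assigned to the phoneNumber variable.
PHONE_PATTERN = re.compile(r'var phoneNumber = "([^"]+)"')


def extract_phone(cell_html):
    """Return the phone number assigned in the script, or None if absent."""
    match = PHONE_PATTERN.search(cell_html)
    return match.group(1) if match else None


print(extract_phone(PHONE_CELL))
```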