Scraping generated content and extracting links

Time: 2016-05-24 20:01:54

Tags: javascript python xpath scrapy-spider



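Stack Overflow has many Q&A pages whose conclusion seems to be that web pages using JavaScript cannot be crawled and scraped with a spider, or at least that doing so is limited and slow in some situations. I would like to know whether what I want to do can be done with a spider and, if not, what alternatives exist for reaching my goal.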

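I have the spider code below, but I don't know what XPath (if any) would let me extract the first page of junkyard URLs from the search-results page it requests.

If I knew what to put in the XPath to select the content I want in "sites", I would use the following code.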
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import scrapy
# Debugging helpers; not used in the spider below.
from scrapy.shell import inspect_response
from scrapy.utils.response import open_in_browser

# Search parameters. These sample values are the ones visible in the
# scraped URL in the answer's log output.
zipcode = '27517'
latitude = '35.8263369'
longitude = '-79.0419053'

lkqlist = 'http://www.lkqcorp.com/en-us/locationResults/tag/All/m/24902/fullcrit/%s?lat=%s&lng=%s' % (zipcode, latitude, longitude)

class JunkYardSites(scrapy.Item):
    Sites = scrapy.Field()

class LkqLocationList(scrapy.Spider):
    name = "lkqlist"
    allowed_domains = ["lkqcorp.com"]
    start_urls = (
        lkqlist,
    )

    def parse(self, response):
        # Placeholder: the XPath that selects the junkyard URLs is what this question asks for.
        sites = response.xpath('XPATH WOULD GO HERE').extract()
        # Yield one item per extracted value.
        for site in sites:
            item = JunkYardSites()
            item["Sites"] = site
            yield item

Thanks. I'm new to Python and Scrapy, so I appreciate any help or guidance I can get.

1 answer:

Answer 0: (score: 0)

I'm not sure which data you want to extract by "the first page of junkyard URLs": only the URLs, or the surrounding text as well.

'//td[@class="basicviewbold"]/script/text()'

will give you results such as:
2016-05-24 22:51:46 [scrapy] DEBUG: Scraped from <200 http://www.lkqcorp.com/en-us/locationResults/tag/All/m/24902/fullcrit/27517?lat=35.8263369&lng=-79.0419053>
{u'Sites': u'\r\n                                            var URL = localizeStoreUrlToCulture("http://www.lkqpickyourpart.com/locations/LKQ_Pick_Your_Part_-_Durham-142/", local.culture);\r\n                                            var name = "LKQ Pick Your Part - Durham";\r\n                                            var anchor = \'<a href="\' + URL + \'" class="lkqro_locLink lkqro_nameLink" >\' + name + \'</a> - 14.54<span id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_lblDistanceType_0">mi</span> \';\r\n                                            document.write(anchor);\r\n 

If you only want the URLs, apply a regular expression to the XPath result with .re() instead of calling extract():

sites = response.xpath("//td[@class='basicviewbold']/script/text()").re(r'(?:localizeStoreUrlToCulture\(\")(.*)(?:\", local.culture)')

which gives results like the following:

2016-05-25 00:37:54 [scrapy] DEBUG: Scraped from <200 http://www.lkqcorp.com/en-us/locationResults/tag/All/m/24902/fullcrit/27517?lat=35.8263369&lng=-79.0419053>
{u'Sites': u'http://www.lkqpickyourpart.com/locations/LKQ_Pick_Your_Part_-_Durham-142/'}
2016-05-25 00:37:54 [scrapy] DEBUG: Scraped from <200 http://www.lkqcorp.com/en-us/locationResults/tag/All/m/24902/fullcrit/27517?lat=35.8263369&lng=-79.0419053>
{u'Sites': u'http://www.lkqpickyourpart.com/locations/LKQ_Pick_Your_Part_-_Raleigh-168/'}
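
To see what the capture group is doing without running the spider, here is a minimal standard-library sketch that applies a similar pattern to a shortened copy of the script text from the output above. It is an illustration rather than part of the original answer: the names SCRIPT_TEXT, URL_PATTERN and extract_store_urls are made up for this example, and the pattern is a slightly stricter variant (it stops at the closing quote and escapes the dot in local.culture).

```python
import re

# Shortened copy of the <script> text shown in the scraped output above.
SCRIPT_TEXT = (
    'var URL = localizeStoreUrlToCulture('
    '"http://www.lkqpickyourpart.com/locations/LKQ_Pick_Your_Part_-_Durham-142/", '
    'local.culture);\r\n'
    'var name = "LKQ Pick Your Part - Durham";\r\n'
)

# Capture the quoted first argument of localizeStoreUrlToCulture(...).
# [^"]+ stops at the closing quote, so nothing after the URL is swallowed.
URL_PATTERN = re.compile(r'localizeStoreUrlToCulture\("([^"]+)",\s*local\.culture')


def extract_store_urls(script_text):
    """Return every store URL found in one block of script text."""
    return URL_PATTERN.findall(script_text)


print(extract_store_urls(SCRIPT_TEXT))
```

Inside the spider, the same pattern string could be passed to .re() in place of the one above; the unanchored (.*) version also works here because each script keeps the call on a single line.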

'//table[tr/td[@class="basicviewbold"]]'

will give you the whole table, from which you can also parse the address:

2016-05-24 23:06:24 [scrapy] DEBUG: Scraped from <200 http://www.lkqcorp.com/en-us/locationResults/tag/All/m/24902/fullcrit/27517?lat=35.8263369&lng=-79.0419053>
{u'Sites': u'<table>\r\n                                 <tr id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_trBusinessName_0">\r\n\t\t\t\t<td class="basicviewbold">\r\n                                        <img src="/desktopmodules/lkq_ROLocator/geosprawlimages/a.png" id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_imgLetterMarker_0" border="0" alt="Map Marker A" class="lkqro_letterImage">\r\n \r\n                                        <script>\r\n                                            var URL = localizeStoreUrlToCulture("http://www.lkqpickyourpart.com/locations/LKQ_Pick_Your_Part_-_Durham-142/", local.culture);\r\n                                            var name = "LKQ Pick Your Part - Durham";\r\n                                            var anchor = \'<a href="\' + URL + \'" class="lkqro_locLink lkqro_nameLink" >\' + name + \'</a> - 14.54<span id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_lblDistanceType_0">mi</span> \';\r\n                                            document.write(anchor);\r\n                                        </script>\r\n                                                                   \r\n                                        <input type="hidden" name="dnn$ctr952$ControlLoader$BizSearchResult$rptSearchResults$ctl00$hdnRptRowLat" id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_hdnRptRowLat_0" value="35.959784">\r\n                                        <input type="hidden" name="dnn$ctr952$ControlLoader$BizSearchResult$rptSearchResults$ctl00$hdnRptRowLong" id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_hdnRptRowLong_0" value="-78.841397">\r\n                                        <input type="hidden" class="lkqro_division" value="Self-Service">\r\n                                    </td>\r\n\t\t\t</tr>\r\n\t\t\t\r\n                                <tr id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_trDistance_0" 
style="display:none;">\r\n\t\t\t</tr>\r\n\t\t\t\r\n                                \r\n                                \r\n                                \r\n                                \r\n                                \r\n                                <tr id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_trFullAddress_0">\r\n\t\t\t\t<td class="basicview">\r\n                                        1301 S. Miami Blvd\r\n                                        \r\n                                        <br>\r\n                                     \r\n                                     \r\n                                        Durham\r\n                                        NC\r\n                                        27703                                   \r\n                                   </td>\r\n\t\t\t</tr>\r\n\t\t\t\r\n                                \r\n                                <tr id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_trPhone_0">\r\n\t\t\t\t<td class="basicview">\r\n                                        <script>\r\n                                            var phoneNumber = "(800) 962-2277";\r\n                                            var internationalPhone = \'+1-\' + phoneNumber.replace(\' \', \'\');\r\n                                            internationalPhone = internationalPhone.replace(\'(\', \'\');\r\n                                            internationalPhone = internationalPhone.replace(\')\', \'-\');\r\n                                       \r\n                                            var clickToCall = \'<a class="lkq_clickToCall" href="tel:\' + internationalPhone + \'">\' + phoneNumber + \'</a>\';\r\n                                            var phoneDesktop = \'<span class="lkq_phoneDesktop">\' + phoneNumber + \'</span>\';\r\n                                            document.write(clickToCall + phoneDesktop);\r\n                                        
</script>\r\n                                         \r\n                                    </td>\r\n\t\t\t</tr>\r\n\t\t\t\r\n                                <tr id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_trWebsite_0">\r\n\t\t\t\t<td class="basicview">\r\n                                        <script>\r\n                                            var URL = localizeStoreUrlToCulture("http://www.lkqpickyourpart.com/locations/LKQ_Pick_Your_Part_-_Durham-142/", local.culture);\r\n                                            var anchor = \'<a href="\' + URL + \'" class="lkqro_locLink lkqro_pageURL" >\' + local.locationPage + \'</a>\';\r\n                                            document.write(anchor);\r\n                                        </script>\r\n                                        \xa0|\xa0\r\n                                        <a id="dnn_ctr952_ControlLoader_BizSearchResult_rptSearchResults_lnkFullMap_0" class="lkqro_locLink" href="http://maps.google.com/maps/dir//1301+S.+Miami+Blvd%2c+Durham%2c+27703/@35.959784,-78.841397,17z/?hl=en" target="_blank"><script>document.write(local.getDirections);</script></a>\r\n                                    </td>\r\n\t\t\t</tr>\r\n\t\t\t\r\n                                \r\n                                \r\n                                \r\n                                \r\n                                \r\n                                \r\n                                \r\n                                \r\n                                \r\n                                \r\n                                \r\n                                \r\n                                \r\n                                \r\n                                \r\n                                \r\n                                \r\n                                <tr>\r\n                                    <td class="lkqro_itemSpacer">\r\n                                       
<hr>\r\n                                    </td>\r\n                                </tr>\r\n                            \r\n                               </table>'}
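
If you go the whole-table route, the address cell comes back as text full of line breaks and padding. The following is a rough sketch of one way to tidy it, assuming you have already pulled the cell's text nodes out of the table. An XPath along the lines of //tr[contains(@id, "trFullAddress")]/td/text() looks plausible from the ids in the output, but that is a guess and has not been tested against the live page. The helper name normalize_address and the sample list are hypothetical.

```python
def normalize_address(text_nodes):
    """Collapse whitespace inside each text node and join the non-empty parts."""
    parts = [" ".join(node.split()) for node in text_nodes]
    return ", ".join(part for part in parts if part)


# Text nodes shaped like those in the trFullAddress cell above;
# the <br> in the cell splits the address into two nodes.
ADDRESS_NODES = [
    "\r\n        1301 S. Miami Blvd\r\n        \r\n        ",
    "\r\n        Durham\r\n        NC\r\n        27703        \r\n    ",
]

print(normalize_address(ADDRESS_NODES))
```

Splitting on any whitespace and rejoining with single spaces avoids depending on the exact amount of indentation the page happens to emit.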

Because of the way the data is structured,

'//td[@class="basicview"]'

will give you the address, phone number, and so on.

2016-05-24 23:13:16 [scrapy] DEBUG: Scraped from <200 http://www.lkqcorp.com/en-us/locationResults/tag/All/m/24902/fullcrit/27517?lat=35.8263369&lng=-79.0419053>
{u'Sites': u'<td class="basicview">\r\n                                        <script>\r\n                                            var phoneNumber = "(800) 962-2277";\r\n                                            var internationalPhone = \'+1-\' + phoneNumber.replace(\' \', \'\');\r\n                                            internationalPhone = internationalPhone.replace(\'(\', \'\');\r\n                                            internationalPhone = internationalPhone.replace(\')\', \'-\');\r\n                                       \r\n                                            var clickToCall = \'<a class="lkq_clickToCall" href="tel:\' + internationalPhone + \'">\' + phoneNumber + \'</a>\';\r\n                                            var phoneDesktop = \'<span class="lkq_phoneDesktop">\' + phoneNumber + \'</span>\';\r\n                                            document.write(clickToCall + phoneDesktop);\r\n                                        </script>\r\n                                         \r\n                                    </td>'}
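
The phone number is also only present as a JavaScript string, so it needs the same kind of regex treatment as the URL. Here is a small standard-library sketch, under the assumption that you have the script text from the cell above. The names PHONE_PATTERN, extract_phone and to_international are invented for this example. The second helper simply mirrors the replace() calls visible in the page's script, each of which only touches the first match.

```python
import re

# Matches the assignment seen in the page's script: var phoneNumber = "...";
PHONE_PATTERN = re.compile(r'var phoneNumber = "([^"]+)"')


def extract_phone(script_text):
    """Return the phone number string from the script text, or None if absent."""
    match = PHONE_PATTERN.search(script_text)
    return match.group(1) if match else None


def to_international(phone):
    """Mirror the page's JavaScript, where each replace() affects only the first match."""
    result = "+1-" + phone.replace(" ", "", 1)
    result = result.replace("(", "", 1)
    return result.replace(")", "-", 1)


# Shortened copy of the script text from the scraped cell above.
SCRIPT_TEXT = 'var phoneNumber = "(800) 962-2277";\r\n'

phone = extract_phone(SCRIPT_TEXT)
print(phone)
print(to_international(phone))
```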