Python Scrapy: joining sibling elements

Date: 2017-06-14 17:58:34

Tags: python scrapy

Sorry if the title is worded badly. Basically, I want to scrape this data with Scrapy:

<tr>
    <td colspan=2>
        <h4>Ottawa Macdonald-Cartier International Airport runways</h4>
    </td>
</tr>
</tr>
<tr class="odd">
    <td><a href="ottawa-macdonald-cartier-international-airport-runway-04-22-extended-info_R234949.html" title="Ottawa Macdonald-Cartier International Airport runway 04/22 extended info"><b>04/22</b></a></td>
    <td>3300x75 <small>ft.</small></td>
</tr>
<tr class="even">
    <td><a href="ottawa-macdonald-cartier-international-airport-runway-07-25-extended-info_R234950.html" title="Ottawa Macdonald-Cartier International Airport runway 07/25 extended info"><b>07/25</b></a></td>
    <td>8000x200 <small>ft.</small></td>
</tr>
<tr class="odd">
    <td><a href="ottawa-macdonald-cartier-international-airport-runway-14-32-extended-info_R234951.html" title="Ottawa Macdonald-Cartier International Airport runway 14/32 extended info"><b>14/32</b></a></td>
    <td>10000x200 <small>ft.</small></td>
</tr>
<tr class=""> different repeat each page ....

I want the output to be JSON inside a CSV row, so it looks like this:

{'05/23': '3281x250 ft.','18/36': '3252x250 ft.'}

But I always get a result like this:

{05/23,18/36,3281x250 ,ft.,3252x250 ,ft.}

Here is my code:

    def parse_details(self, response):
        runway1 = response.xpath(".//tr[contains(.,'runways')]/following-sibling::tr[@class]//td/a[contains(@title,'runway')]//text()").extract()
        runway2 = response.xpath(".//tr[contains(.,'runways')]/following-sibling::tr[@class]//td[contains(.,'ft.')]//text()").extract()
        runway = runway1 + runway2
        runways = ','.join(runway)

        yield {'runways':'{'+runways+'}'}

How can I make my code parse the data the way I want? I have searched all the tutorials on this site but I am still stuck. Thanks.

2 answers:

Answer 0: (score: 0)

key_1 = response.xpath('//tr[@class="odd"]//a/b/text()').extract_first()
value_1 = response.xpath('//tr[@class="odd"]//td[2]/text()').extract_first()

key_2 = response.xpath('//tr[@class="even"]//a/b/text()').extract_first()
value_2 = response.xpath('//tr[@class="even"]//td[2]/text()').extract_first()

yield {key_1: value_1, key_2: value_2}
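
Note that the snippet above reads only the first "odd" and the first "even" row, and that `td[2]/text()` returns just the direct text of the cell (`'3300x75 '`), without the `ft.` that sits inside `<small>`. One way to cover every row is to call `.extract()` instead of `.extract_first()` and pair the two lists with `zip`. The following is a minimal sketch under some assumptions: `pair_rows` is a hypothetical helper name, the unit is hard-coded as `ft.`, and the sample lists stand in for what the XPath calls in the comment would return on the HTML from the question. The `//tr[@class]` XPath in that comment is itself an assumption and would also match classed rows in other tables on the page.

```python
def pair_rows(keys, values, unit="ft."):
    """Pair runway names with sizes (hypothetical helper).

    `unit` is re-appended because td[2]/text() skips the text inside <small>.
    """
    return {k.strip(): "{} {}".format(v.strip(), unit) for k, v in zip(keys, values)}


# Sample lists standing in for the results of (assumed, not from the answer):
#   keys   = response.xpath('//tr[@class]//a/b/text()').extract()
#   values = response.xpath('//tr[@class]//td[2]/text()').extract()
keys = ["04/22", "07/25", "14/32"]
values = ["3300x75 ", "8000x200 ", "10000x200 "]

runways = pair_rows(keys, values)
print(runways)
```

Because `zip` pairs items purely by position, this only works if both XPaths return exactly one entry per row; looping over the rows themselves avoids that dependency.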

Answer 1: (score: 0)

You can loop over the sibling rows that follow the header `tr` and extract a key/value pair from each one:

In [1]: response = scrapy.Selector(text='''<tr>
   ...:     <td colspan=2>
   ...:         <h4>Ottawa Macdonald-Cartier International Airport runways</h4>
   ...:     </td>
   ...: </tr>
   ...: </tr>
   ...: <tr class="odd">
   ...:     <td><a href="ottawa-macdonald-cartier-international-airport-runway-04-22-extended-info_R234949.html" title="Ottawa Macdonald-Cartier International Airport runway 04/22 extended info"><b>04/22</b></a></td>
   ...:     <td>3300x75 <small>ft.</small></td>
   ...: </tr>
   ...: <tr class="even">
   ...:     <td><a href="ottawa-macdonald-cartier-international-airport-runway-07-25-extended-info_R234950.html" title="Ottawa Macdonald-Cartier International Airport runway 07/25 extended info"><b>07/25</b></a></td>
   ...:     <td>8000x200 <small>ft.</small></td>
   ...: </tr>
   ...: <tr class="odd">
   ...:     <td><a href="ottawa-macdonald-cartier-international-airport-runway-14-32-extended-info_R234951.html" title="Ottawa Macdonald-Cartier International Airport runway 14/32 extended info"><b>14/32</b></a></td>
   ...:     <td>10000x200 <small>ft.</small></td>
   ...: </tr>''')



In [2]: {tr.xpath('string(.//td/a[contains(@title,"runway")])').get():
   ...:      tr.xpath('string(.//td[contains(.,"ft.")])').get()
   ...:  for tr in response.xpath('.//tr[contains(., "runways")]/following-sibling::tr[@class]')    }
   ...:  
Out[2]: 
{u'04/22': u'3300x75 ft.',
 u'07/25': u'8000x200 ft.',
 u'14/32': u'10000x200 ft.'}

An example callback might look like this:

def parse_details(self, response):
    for tr in response.xpath('.//tr[contains(., "runways")]/following-sibling::tr[@class]'):
        yield {tr.xpath('string(.//td/a[contains(@title,"runway")])').get():
                    tr.xpath('string(.//td[contains(.,"ft.")])').get()}
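
The callback above yields one item per runway. The question asked for a single JSON object per page, stored in one CSV cell, so the pairs could instead be merged into one dict and serialized with `json.dumps`. This is a rough sketch, not a tested spider: `runways_item` is a hypothetical helper name, the `runways` field name is taken from the question's code, and the sample pairs stand in for values extracted with the XPaths from the callback above.

```python
import json


def runways_item(pairs):
    """Merge (name, size) pairs into one dict and return it as a JSON string
    under the 'runways' key (hypothetical helper)."""
    runways = dict(pairs)
    return {"runways": json.dumps(runways)}


# Inside a Scrapy callback, `pairs` could be built with the XPaths from the
# callback above (assumed usage):
#   rows = response.xpath('.//tr[contains(., "runways")]/following-sibling::tr[@class]')
#   pairs = ((tr.xpath('string(.//td/a[contains(@title,"runway")])').get(),
#             tr.xpath('string(.//td[contains(.,"ft.")])').get()) for tr in rows)
#   yield runways_item(pairs)
sample_pairs = [("04/22", "3300x75 ft."), ("07/25", "8000x200 ft.")]
item = runways_item(sample_pairs)
print(item["runways"])  # {"04/22": "3300x75 ft.", "07/25": "8000x200 ft."}
```

`json.dumps` produces double-quoted JSON, unlike the single-quoted dict shown in the question, which is a Python repr and not valid JSON.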