Sorry if the title is not quite right. Basically, I want to scrape this data with Scrapy:
<tr>
<td colspan=2>
<h4>Ottawa Macdonald-Cartier International Airport runways</h4>
</td>
</tr>
</tr>
<tr class="odd">
<td><a href="ottawa-macdonald-cartier-international-airport-runway-04-22-extended-info_R234949.html" title="Ottawa Macdonald-Cartier International Airport runway 04/22 extended info"><b>04/22</b></a></td>
<td>3300x75 <small>ft.</small></td>
</tr>
<tr class="even">
<td><a href="ottawa-macdonald-cartier-international-airport-runway-07-25-extended-info_R234950.html" title="Ottawa Macdonald-Cartier International Airport runway 07/25 extended info"><b>07/25</b></a></td>
<td>8000x200 <small>ft.</small></td>
</tr>
<tr class="odd">
<td><a href="ottawa-macdonald-cartier-international-airport-runway-14-32-extended-info_R234951.html" title="Ottawa Macdonald-Cartier International Airport runway 14/32 extended info"><b>14/32</b></a></td>
<td>10000x200 <small>ft.</small></td>
</tr>
<tr class=""> different repeat each page ....
I want the output to be a JSON-style object inside a single CSV row, so it should look like this:
{'05/23': '3281x250 ft.','18/36': '3252x250 ft.'}
But I always get a result like this instead:
{05/23,18/36,3281x250 ,ft.,3252x250 ,ft.}
Here is my code:
def parse_details(self, response):
    runway1 = response.xpath(".//tr[contains(.,'runways')]/following-sibling::tr[@class]//td/a[contains(@title,'runway')]//text()").extract()
    runway2 = response.xpath(".//tr[contains(.,'runways')]/following-sibling::tr[@class]//td[contains(.,'ft.')]//text()").extract()
    runway = runway1 + runway2
    runways = ','.join(runway)
    yield {'runways':'{'+runways+'}'}
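How can I make my code parse the data the way I want? I have gone through all the tutorials I could find on this site but I am still stuck. Thanks.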
Answer 0 (score: 0)
key_1 = response.xpath('//tr[@class="odd"]//a/b/text()').extract_first()
value_1 = response.xpath('//tr[@class="odd"]//td[2]/text()').extract_first()
key_2 = response.xpath('//tr[@class="even"]//a/b/text()').extract_first()
value_2 = response.xpath('//tr[@class="even"]//td[2]/text()').extract_first()
yield {key_1: value_1, key_2: value_2}
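Note that extract_first() only returns the first "odd" and the first "even" row, and td[2]/text() stops before the "ft." inside <small>. If you collect whole columns instead, the flat output in the question comes from concatenating the two lists; pairing them by position keeps each size with its runway. A minimal sketch, assuming both lists are in the same row order with no gaps (pair_runways and the sample values are hypothetical, not part of Scrapy):

```python
def pair_runways(names, sizes):
    """Pair the i-th runway name with the i-th size string."""
    # zip stops at the shorter list; this assumes both lists follow the
    # same row order and that no row is missing a value.
    return dict(zip(names, sizes))


# Hypothetical values, as if each list came from one XPath query per column.
names = ["04/22", "07/25", "14/32"]
sizes = ["3300x75 ft.", "8000x200 ft.", "10000x200 ft."]

# Concatenating and joining loses the pairing (the symptom in the question).
print(",".join(names + sizes))

# Pairing by position keeps each size next to its runway.
print(pair_runways(names, sizes))
```

If a row can be missing its size cell, positional pairing will misalign the values, so extracting both cells per row is the safer route.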
Answer 1 (score: 0)
You can loop over the tr siblings of the header row and extract a key/value pair from each one:
In [1]: response = scrapy.Selector(text='''<tr>
...: <td colspan=2>
...: <h4>Ottawa Macdonald-Cartier International Airport runways</h4>
...: </td>
...: </tr>
...: </tr>
...: <tr class="odd">
...: <td><a href="ottawa-macdonald-cartier-international-airport-runway-04-22-extended-info_R234949.html" title="Ottawa Macdonald-Cartier International Airport runway 04/22 extended info"><b>04/22</b></a></td>
...: <td>3300x75 <small>ft.</small></td>
...: </tr>
...: <tr class="even">
...: <td><a href="ottawa-macdonald-cartier-international-airport-runway-07-25-extended-info_R234950.html" title="Ottawa Macdonald-Cartier International Airport runway 07/25 extended info"><b>07/25</b></a></td>
...: <td>8000x200 <small>ft.</small></td>
...: </tr>
...: <tr class="odd">
...: <td><a href="ottawa-macdonald-cartier-international-airport-runway-14-32-extended-info_R234951.html" title="Ottawa Macdonald-Cartier International Airport runway 14/32 extended info"><b>14/32</b></a></td>
...: <td>10000x200 <small>ft.</small></td>
...: </tr>''')
In [2]: {tr.xpath('string(.//td/a[contains(@title,"runway")])').get():
...: tr.xpath('string(.//td[contains(.,"ft.")])').get()
...: for tr in response.xpath('.//tr[contains(., "runways")]/following-sibling::tr[@class]') }
...:
Out[2]:
{u'04/22': u'3300x75 ft.',
u'07/25': u'8000x200 ft.',
u'14/32': u'10000x200 ft.'}
An example callback could look like this:
def parse_details(self, response):
    for tr in response.xpath('.//tr[contains(., "runways")]/following-sibling::tr[@class]'):
        yield {tr.xpath('string(.//td/a[contains(@title,"runway")])').get():
               tr.xpath('string(.//td[contains(.,"ft.")])').get()}
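The callback above yields one item per row. If you want a single JSON string in one CSV cell, as the question asks, you could collect the per-row pairs first and serialize them once. A rough sketch using only the standard library; runways_to_json is a hypothetical helper, and the (name, size) tuples stand in for the two string(...) results from each tr:

```python
import json


def runways_to_json(pairs):
    """Merge (name, size) pairs into one dict and serialize it as JSON.

    `pairs` is a hypothetical iterable of (runway_name, dimensions) tuples,
    one per table row.
    """
    runways = {}
    for name, size in pairs:
        # Skip rows where either value is missing (None or empty string).
        if not name or not size:
            continue
        # Trim the key and collapse whitespace runs in the value.
        runways[name.strip()] = " ".join(size.split())
    # json.dumps produces valid JSON (double quotes), unlike manual joining.
    return json.dumps(runways)


# Hypothetical rows, matching the sample HTML in the question.
rows = [
    ("04/22", "3300x75 ft."),
    ("07/25", "8000x200 ft."),
    ("14/32", "10000x200 ft."),
]
print(runways_to_json(rows))
```

In the spider you would then yield a single item such as {'runways': runways_to_json(pairs)}, so the CSV exporter writes the whole mapping into one column.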