Question

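I am scraping this page to get the data for each ad: http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/

Each ad is in a class called content, so I wrote a for loop to get all the content classes and then extract the data of each "ad" separately, but every loop iteration returns the data for all of the ads. Here is my code in scrapy shell: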

scrapy shell "http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/"
for content in response.xpath('//*[@class="pitem"]/div[1]/div[2]/div[1]'):
          print content.xpath('//*[@class="detail"]/p/text()[2]').extract()

But the output is:

[u' 48 months', u' 48 months', u' 48 months', u' 36 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 36 months']
[u' 48 months', u' 48 months', u' 48 months', u' 36 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 36 months']
[u' 48 months', u' 48 months', u' 48 months', u' 36 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 36 months']
[u' 48 months', u' 48 months', u' 48 months', u' 36 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 36 months']
[u' 48 months', u' 48 months', u' 48 months', u' 36 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 36 months']
[u' 48 months', u' 48 months', u' 48 months', u' 36 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 36 months']
[u' 48 months', u' 48 months', u' 48 months', u' 36 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 36 months']
[u' 48 months', u' 48 months', u' 48 months', u' 36 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 36 months']
[u' 48 months', u' 48 months', u' 48 months', u' 36 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 36 months']
[u' 48 months', u' 48 months', u' 48 months', u' 36 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 48 months', u' 36 months']

This means it fetches the data of all the tags on every iteration! The output I need is:

48 months
48 months
48 months
36 months
48 months
48 months
48 months
48 months
48 months
36 months

Answer 1

To get the data for each ad, you can use the following code:

def parse(self, response):
    for detail in response.xpath('//div[@class="detail"]/p'):
        item = dict()
        item['term'] = detail.xpath('text()[2]').extract_first()
        item['mileage'] = detail.xpath('text()[4]').extract_first()
        item['payment'] = detail.xpath('text()[6]').extract_first()
        item['fee'] = detail.xpath('text()[8]').extract_first()
        yield item
# {'term': ' 48 months', 'mileage': ' 10,000', 'payment': ' £2,227.86 + VAT', 'fee': ' &pound249.00 + VAT'}

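Note that extract_first() is used because extract() returns a list, whereas extract_first() returns only the first match as a string. The inner XPath expressions are also relative (they do not start with //), so each one is evaluated against the current detail node rather than against the whole document.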

Answer 2

You can select the elements with class="detail" directly with XPath. Change your code as follows:

In [5]: for content in response.xpath('//*[@class="detail"]/p/text()[2]').extract():
   ...:     print content
   ...:
 48 months
 48 months
 48 months
 36 months
 48 months
 48 months
 48 months
 48 months
 48 months
 36 months

How do I get the data of each tag?
