Question

I'm new to Python. I usually use PHP to scrape data, and I'm trying to switch to Python. I'm learning from this tutorial:

http://doc.scrapy.org/en/latest/intro/tutorial.html

I want to scrape the countries and capitals from this Wikipedia page: https://en.wikipedia.org/wiki/List_of_national_capitals_in_alphabetical_order

My spider is:

import scrapy

class CountrySpider(scrapy.Spider):
    name = "countryCapitals"
    allowed_domains = ["wikipedia.org"]
    start_urls = [
                    "https://en.wikipedia.org/wiki/List_of_national_capitals_in_alphabetical_order"
                    ]

    def parse(self, response):
            for sel in response.xpath('//*[@id="mw-content-text"]/table[2]/tbody/tr'):
                    country = sel.xpath('//td[1]').extract()
                    capital = sel.xpath('td[2]/b/span.text()').extract()
                    print country , capital

It doesn't print any data as I expected. Any help with this is appreciated.

Answer 1

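The HTML shown in the browser console appears to differ slightly from the raw page source. For example, as @furas pointed out, the tbody tag is part of the problem: it shows up in the browser's view of the table but is not in the source that Scrapy downloads, so an XPath containing /tbody/ matches nothing. The XPath for extracting the capital text is also incorrect: span.text() is not valid XPath, and text nodes are selected with /text().

I tested with the parse method below and it works fine for me. I also changed the country XPath so that it extracts the country text.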
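The tbody mismatch can be reproduced without Scrapy. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors (it supports only a subset of XPath). The RAW_HTML snippet is a made-up, simplified imitation of the page source, not the real Wikipedia markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the raw page source: note that the
# tables contain <tr> directly, with no <tbody> wrapper.
RAW_HTML = """
<div id="mw-content-text">
  <table><tr><td>first table</td></tr></table>
  <table>
    <tr><td><a>Abu Dhabi</a></td><td><a>United Arab Emirates</a></td></tr>
    <tr><td><a>Abuja</a></td><td><a>Nigeria</a></td></tr>
  </table>
</div>
""".strip()

root = ET.fromstring(RAW_HTML)

# Path copied from a browser-style view, including the tbody step.
with_tbody = root.findall("./table[2]/tbody/tr")

# Same path with the tbody step removed, matching the raw source.
without_tbody = root.findall("./table[2]/tr")

print(len(with_tbody))     # no rows matched
print(len(without_tbody))  # both rows matched

# Relative paths, evaluated per row, pick out the link text of each cell.
cells = [(row.findtext("td[1]/a"), row.findtext("td[2]/a")) for row in without_tbody]
print(cells)
```

In this sketch the path with tbody matches no rows, while the path without it matches both, which is the same effect described for the spider.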

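def parse(self, response):
        # The raw HTML has no <tbody>, so the rows are direct children of the table.
        for sel in response.xpath('//*[@id="mw-content-text"]/table[2]/tr'):
                # Note: as the sample output shows, the first column on this page
                # is the capital city and the second column is the country.
                country = sel.xpath('td[1]/a/text()').extract()
                capital = sel.xpath('td[2]//a/text()').extract()
                print country , capital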

Sample of the output:

[u'Abu Dhabi'] [u'United Arab Emirates']
[u'Abuja'] [u'Nigeria']
[u'Accra'] [u'Ghana']
[u'Adamstown'] [u'Pitcairn Islands']
[u'Addis Ababa'] [u'Ethiopia']
[u'Algiers'] [u'Algeria']
[u'Alofi'] [u'Niue']
[u'Amman'] [u'Jordan']
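Each printed value above is a list, because extract() returns every match. As a rough sketch of how those lists could be turned into clean (capital, country) pairs, the hypothetical helper below (rows_to_pairs is not part of Scrapy) skips rows where either list is empty, which is what a row without matching td cells, such as a header row, would produce. The sample data is copied from the output above plus one invented empty row.

```python
def rows_to_pairs(rows):
    """Turn (first_column_texts, second_column_texts) lists into string pairs.

    Hypothetical helper for illustration; rows with no match in either
    column are skipped.
    """
    pairs = []
    for first, second in rows:
        if not first or not second:
            continue  # e.g. a header row, where the td-based XPath matched nothing
        pairs.append((first[0].strip(), second[0].strip()))
    return pairs


sample = [
    ([], []),  # assumed shape of a row without td cells
    ([u'Abu Dhabi'], [u'United Arab Emirates']),
    ([u'Abuja'], [u'Nigeria']),
]

pairs = rows_to_pairs(sample)
print(pairs)
```

Inside a spider, Scrapy's extract_first() on the selector result is another way to get just the first match instead of a list.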

Answer 2

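I tested your code. I think the problem is your XPath. I assume you used Chrome's feature for copying the XPath. I'm not very good at XPath myself, so I tried printing the values with the .css() method instead. I used: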

print response.css('div.mw-content-ltr > table').extract()

It works fine. To get the second table, just add that table's class or id to the selector in the line above. I believe it should work.

Scrapy spider does not extract XPath data
