Question

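I am trying to use Scrapy to scrape information from geonames.org. More specifically, I want to retrieve the 10 largest cities of each country. My start URL is http://www.geonames.org/countries/. On that page, I want to follow every URL that matches the regular expression:

/countries/\w{2}/.\.html

Then, on each followed page (i.e. a country page), I want to follow the URL with the structure http://www.geonames.org/XX/largest-cities-in-YYYY.html, where XX is the two-letter country code and YYYY is the actual name of the country, which can obviously vary in length. The code below does not work. I suspect the problem is the regular expression in the second rule, but maybe not!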

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
import re
import os


class MySpider(CrawlSpider):
    name = 'geocodeSpider'
    allowed_domains = ['www.geonames.org']
    start_urls = ['http://www.geonames.org/countries/']
    fileName = "largest_cities.txt"

    try:
        os.remove(os.path.join('geocode/output', fileName))
    except OSError:
        pass

    rules = (
        Rule(LinkExtractor(allow=(r'/countries/\w{2}/.\.html', )),),
        Rule(LinkExtractor(allow=(r'/\w{2}/largest-cities-in-.\.html', )), callback='parse_item'),
    )

    def parse_item(self, response):
        ...
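One way to test the suspicion about the regular expressions is to try them on sample URLs with the standard `re` module, outside of Scrapy. The snippet below is only a sketch under several assumptions: the two sample URLs are guesses at what a country page and a largest-cities page look like, the `*_variant` patterns and the `matches` helper are hypothetical names introduced here, and it assumes that `LinkExtractor` applies `allow` patterns with search semantics (a match anywhere in the URL).

```python
import re

# Patterns exactly as they appear in the spider's rules.
country_as_written = r'/countries/\w{2}/.\.html'
cities_as_written = r'/\w{2}/largest-cities-in-.\.html'

# Hypothetical variants: '.*' accepts any number of characters,
# whereas a lone '.' accepts exactly one character.
country_variant = r'/countries/\w{2}/.*\.html'
cities_variant = r'/\w{2}/largest-cities-in-.*\.html'

# Assumed sample URLs, used only for illustration.
sample_country_url = 'http://www.geonames.org/countries/AD/andorra.html'
sample_cities_url = 'http://www.geonames.org/AD/largest-cities-in-andorra.html'


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


if __name__ == '__main__':
    checks = [
        ('country, as written', country_as_written, sample_country_url),
        ('country, variant', country_variant, sample_country_url),
        ('cities, as written', cities_as_written, sample_cities_url),
        ('cities, variant', cities_variant, sample_cities_url),
    ]
    for label, pattern, url in checks:
        print(label, '->', matches(pattern, url))
```

If the as-written patterns fail on realistic URLs while the variants succeed, that would suggest both rules are affected, since a single `.` before `\.html` only allows a one-character file name.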

Scrapy rules and regular expressions

0 answers: