OK, so here's the problem. I'm a beginner just starting to dig into Scrapy/Python.
I use the code below to scrape a website and save the results to a CSV. When I look in the command prompt, it turns words like Officiële into Offici\xeble. In the CSV file it changes it to OfficiÃ«le. I suppose this is because it is saving in Unicode instead of UTF-8? I however have zero clue how to change my code, and I have been trying for a while now.
Can someone help me out here? I especially want to make sure item["publicatietype"] works properly. How do I encode/decode this? What do I need to write? I tried using replace('Ã«', 'ë'), but that gives me an error (non-ASCII character, but no encoding declared).
import scrapy
from scrapy import Spider
from scrapy.exceptions import DropItem

# ThingsToGather is the project's Item class (defined in items.py, not shown here)

class pagespider(Spider):
    name = "OBSpider"

    # max_pages is put here to prevent endless loops; make it as large as you need.
    # It will try and go up to that page even if there's nothing there. A number too
    # high will just take way too much time and yield no results.
    max_pages = 1

    def start_requests(self):
        for i in range(self.max_pages):
            yield scrapy.Request("https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=%d&sorttype=1&sortorder=4" % (i + 1), callback=self.parse)

    def parse(self, response):
        for sel in response.xpath('//div[@class = "lijst"]/ul/li'):
            item = ThingsToGather()
            item["titel"] = ' '.join(sel.xpath('a/text()').extract())
            deeplink = ''.join(["https://zoek.officielebekendmakingen.nl/", ' '.join(sel.xpath('a/@href').extract())])
            request = scrapy.Request(deeplink, callback=self.get_page_info)
            request.meta['item'] = item
            yield request

    def get_page_info(self, response):
        for sel in response.xpath('//*[@id="Inhoud"]'):
            item = response.meta['item']
            # The header holds some general info. If this string is shorter than 5
            # characters, the link is probably faulty (e.g. an error 404) and the
            # item is dropped; otherwise processing continues.
            if len(' '.join(sel.xpath('//div[contains(@class, "logo-nummer")]/div[contains(@class, "nummer")]/text()').extract())) < 5:
                raise DropItem()
            else:
                item["filename"] = ' '.join(sel.xpath('//*[@id="downloadPdfHyperLink"]/@href').extract())
                item['publicatiedatum'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()
                item["publicatietype"] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/type")]/text()').extract()
                item = self.__normalise_item(item, response.url)
                # If the date string is shorter than 5 characters, the required data
                # is not on this page and has to be retrieved from the technical
                # information link; otherwise the item is complete and is yielded.
                if len(item['publicatiedatum']) < 5:
                    tech_inf_link = ''.join(["https://zoek.officielebekendmakingen.nl/", ' '.join(sel.xpath('//*[@id="technischeInfoHyperlink"]/@href').extract())])
                    request = scrapy.Request(tech_inf_link, callback=self.get_date_info)
                    request.meta['item'] = item
                    yield request
                else:
                    yield item

    def get_date_info(self, response):
        for sel in response.xpath('//*[@id="Inhoud"]'):
            item = response.meta['item']
            item['publicatiedatum'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()
            item['publicatietype'] = sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/type")]/text()').extract()
            item["filename"] = ' '.join(sel.xpath('//*[@id="downloadPdfHyperLink"]/@href').extract())
            item = self.__normalise_item(item, response.url)
            return item

    # The methods below clean up strings: every field is sent through
    # __normalise_item to strip unwanted characters and collapse double spaces.
    def __normalise_item(self, item, base_url):
        for key, value in vars(item).values()[0].iteritems():
            item[key] = self.__normalise(item[key])
        item['titel'] = item['titel'].replace(';', '& ')
        return item

    def __normalise(self, value):
        value = value if type(value) is not list else ' '.join(value)
        value = value.strip()
        value = " ".join(value.split())
        return value
Solution:
See paul trmbrth's answer below. The problem was not Scrapy, it was Excel.
For anyone else who runs into this, the tl;dr is: import the data in Excel (via the Data menu in the ribbon) and switch the encoding from Windows (ANSI), or whatever it is set to, to Unicode (UTF-8).
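As for the "Non-ASCII character ... but no encoding declared" error my replace('Ã«', 'ë') attempt produced: a Python 2 source file containing non-ASCII characters needs an encoding declaration at the top (PEP 263). A minimal sketch of that fix (note that decoding the input properly, as the accepted answer explains, is better than patching mojibake after the fact):

# -*- coding: utf-8 -*-
# The declaration above lets Python 2 accept non-ASCII literals in this file.
# The u'' prefixes matter: the replacement operates on unicode strings.
broken = u'OfficiÃ«le'
print broken.replace(u'Ã«', u'ë')  # prints: Officiële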
Answer 0 (score: 1)
Officiële will be represented as u'Offici\xeble' in Python 2, as the Python shell session below shows (no need to worry about the \xXX characters; it's just how Python represents non-ASCII Unicode characters):
$ python
Python 2.7.9 (default, Apr 2 2015, 15:33:21)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u'Officiële'
u'Offici\xeble'
>>> u'Offici\u00EBle'
u'Offici\xeble'
>>>
"I think this is because it's saving in unicode instead of UTF-8"
UTF-8 is an encoding; Unicode is not.
ë, a.k.a. U+00EB, a.k.a. LATIN SMALL LETTER E WITH DIAERESIS, is encoded in UTF-8 as two bytes, \xc3 and \xab:
>>> u'Officiële'.encode('UTF-8')
'Offici\xc3\xable'
>>>
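To make the direction concrete: encode() turns a unicode string into UTF-8 bytes, and decode() turns those bytes back into the same unicode string. A quick round trip in the same Python 2 shell:

>>> 'Offici\xc3\xable'.decode('utf-8')
u'Offici\xeble'
>>> 'Offici\xc3\xable'.decode('utf-8') == u'Officiële'
True
>>>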
"In the csv file it changes it to OfficiÃ«le."
If you see this, you probably need to set the input encoding to UTF-8 when opening the CSV file in your program.
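That OfficiÃ«le rendering is exactly what appears when UTF-8 bytes are decoded as Latin-1 (or Windows-1252), which is typically what Excel does by default. An illustrative Python 2 snippet:

>>> print u'Officiële'.encode('utf-8').decode('latin-1')
OfficiÃ«le
>>>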
The Scrapy CSV exporter writes Python Unicode strings as UTF-8 encoded strings in the output file.
Scrapy selectors output Unicode strings:
$ scrapy shell "https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4"
2016-03-15 10:44:51 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
(...)
2016-03-15 10:44:52 [scrapy] DEBUG: Crawled (200) <GET https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4> (referer: None)
(...)
In [1]: response.css('div.menu-bmslink > ul > li > a::text').extract()
Out[1]:
[u'Offici\xeble bekendmakingen vandaag',
u'Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011',
u'Uitleg nieuwe\r\n nummering Staatscourant vanaf 1 juli 2009']
In [2]: for t in response.css('div.menu-bmslink > ul > li > a::text').extract():
   ...:     print t
   ...:
Officiële bekendmakingen vandaag
Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011
Uitleg nieuwe
nummering Staatscourant vanaf 1 juli 2009
Let's see what a spider extracting these strings into items gets you as CSV:
$ cat testspider.py
import scrapy

class TestSpider(scrapy.Spider):
    name = 'testspider'
    start_urls = ['https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4']

    def parse(self, response):
        for t in response.css('div.menu-bmslink > ul > li > a::text').extract():
            yield {"link": t}
Run the spider and ask for CSV output:
$ scrapy runspider testspider.py -o test.csv
2016-03-15 11:00:13 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
2016-03-15 11:00:13 [scrapy] INFO: Optional features available: ssl, http11
2016-03-15 11:00:13 [scrapy] INFO: Overridden settings: {'FEED_FORMAT': 'csv', 'FEED_URI': 'test.csv'}
2016-03-15 11:00:14 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-03-15 11:00:14 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-03-15 11:00:14 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-03-15 11:00:14 [scrapy] INFO: Enabled item pipelines:
2016-03-15 11:00:14 [scrapy] INFO: Spider opened
2016-03-15 11:00:14 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-03-15 11:00:14 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-03-15 11:00:14 [scrapy] DEBUG: Crawled (200) <GET https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4> (referer: None)
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Offici\xeble bekendmakingen vandaag'}
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011'}
2016-03-15 11:00:14 [scrapy] DEBUG: Scraped from <200 https://zoek.officielebekendmakingen.nl/zoeken/resultaat/?zkt=Uitgebreid&pst=Tractatenblad|Staatsblad|Staatscourant|BladGemeenschappelijkeRegeling|ParlementaireDocumenten&vrt=Cybersecurity&zkd=InDeGeheleText&dpr=Alle&sdt=general_informationPublicatie&ap=&pnr=18&rpp=10&_page=1&sorttype=1&sortorder=4>
{'link': u'Uitleg nieuwe\r\n nummering Staatscourant vanaf 1 juli 2009'}
2016-03-15 11:00:14 [scrapy] INFO: Closing spider (finished)
2016-03-15 11:00:14 [scrapy] INFO: Stored csv feed (3 items) in: test.csv
2016-03-15 11:00:14 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 488,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 12018,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 3, 15, 10, 0, 14, 991735),
'item_scraped_count': 3,
'log_count/DEBUG': 5,
'log_count/INFO': 8,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 3, 15, 10, 0, 14, 59471)}
2016-03-15 11:00:14 [scrapy] INFO: Spider closed (finished)
Checking the content of the CSV file:
$ cat test.csv
link
Officiële bekendmakingen vandaag
Uitleg nieuwe nummering Handelingen vanaf 1 januari 2011
"Uitleg nieuwe
nummering Staatscourant vanaf 1 juli 2009"
$ hexdump -C test.csv
00000000 6c 69 6e 6b 0d 0a 4f 66 66 69 63 69 c3 ab 6c 65 |link..Offici..le|
00000010 20 62 65 6b 65 6e 64 6d 61 6b 69 6e 67 65 6e 20 | bekendmakingen |
00000020 76 61 6e 64 61 61 67 0d 0a 55 69 74 6c 65 67 20 |vandaag..Uitleg |
00000030 6e 69 65 75 77 65 20 6e 75 6d 6d 65 72 69 6e 67 |nieuwe nummering|
00000040 20 48 61 6e 64 65 6c 69 6e 67 65 6e 20 76 61 6e | Handelingen van|
00000050 61 66 20 31 20 6a 61 6e 75 61 72 69 20 32 30 31 |af 1 januari 201|
00000060 31 0d 0a 22 55 69 74 6c 65 67 20 6e 69 65 75 77 |1.."Uitleg nieuw|
00000070 65 0d 0a 20 20 20 20 20 20 20 20 20 20 20 20 6e |e.. n|
00000080 75 6d 6d 65 72 69 6e 67 20 53 74 61 61 74 73 63 |ummering Staatsc|
00000090 6f 75 72 61 6e 74 20 76 61 6e 61 66 20 31 20 6a |ourant vanaf 1 j|
000000a0 75 6c 69 20 32 30 30 39 22 0d 0a |uli 2009"..|
000000ab
You can verify that ë is correctly encoded as c3 ab.
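You can run the same check from Python: reading the file back and decoding each cell as UTF-8 recovers the original strings. A minimal sketch in Python 2, assuming the test.csv produced above:

import csv

with open('test.csv', 'rb') as f:  # Python 2's csv module works on bytes
    for row in csv.reader(f):
        # decode the UTF-8 bytes of each cell back to unicode strings
        print [cell.decode('utf-8') for cell in row]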
I can see the file data correctly when opening it in LibreOffice (note "Character set: Unicode (UTF-8)" in the import dialog; screenshot omitted here).
You are probably using Latin-1 instead. Here is what you get when Latin-1 is used as the input encoding rather than UTF-8 (again in LibreOffice; screenshot omitted here).
Answer 1 (score: 0)
To encode a string you can use encode("utf-8") directly, like this:
item['publicatiedatum'] = ''.join(sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/available")]/text()').extract()).encode("utf-8")
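If you go this route, the same pattern applies to every text field. A small illustrative helper (the name encode_utf8 is hypothetical, not from the original code); note that with Scrapy's own CSV exporter this manual step is normally unnecessary, as the other answer shows:

def encode_utf8(extracted):
    # Join a Scrapy .extract() list into one string and encode the
    # unicode result as UTF-8 bytes (Python 2). Illustrative only.
    return ' '.join(extracted).strip().encode('utf-8')

# usage inside a callback, e.g.:
# item['publicatietype'] = encode_utf8(sel.xpath('//span[contains(@property, "http://purl.org/dc/terms/type")]/text()').extract())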