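I have built a scraper that reads artist names from a csv file and collects artist data for those artists through the Songkick API. However, after my code has been running for a while, I receive the following error: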
File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 64-65: invalid continuation byte
Sample data can be downloaded here:
I am fairly new to coding and would like to know how to solve this error. You can find my code below.
import urllib2
import requests
import json
import csv
from tinydb import TinyDB, Query

db = TinyDB('spotify_artists.json')

#read csv
def wait_for_internet():
    while True:
        try:
            resp = urllib2.urlopen('http://google.com', timeout=1)
            return
        except:
            pass

def load_artists():
    f = open('artistnames.csv', 'r').readlines();
    for a in f:
        artist = a.strip()
        print(artist)
        url = 'http://api.songkick.com/api/3.0/search/artists.json?query='+artist+'&apikey='
        # wait_for_internet()
        r = requests.get(url)
        resp = r.json()
        # print(resp)
        try :
            if(resp['resultsPage']['totalEntries']):
                # print(json.dumps(resp['resultsPage']['results']['artist'], indent=4, sort_keys=True))
                results = resp['resultsPage']['results']['artist'];
                for x in results:
                    # print('rxx')
                    # print(json.dumps(x, indent=4, sort_keys=True))
                    if(x['displayName'] == artist):
                        print(x)
                        db.insert(x)
        except:
            print('cannot fetch url',url);

load_artists()
db.close()
Traceback (most recent call last):
  File "C:\Users\rmlj\Dropbox\songkick\scrapers\Data\Scraper.py", line 45, in <module>
    load_artists()
  File "C:\Users\rmlj\Dropbox\songkick\scrapers\Data\Scraper.py", line 25, in load_artists
    r = requests.get(url)
  File "C:\Python27\lib\site-packages\requests\api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 474, in request
    prep = self.prepare_request(req)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 407, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "C:\Python27\lib\site-packages\requests\models.py", line 302, in prepare
    self.prepare_url(url, params)
  File "C:\Python27\lib\site-packages\requests\models.py", line 358, in prepare_url
    url = url.decode('utf8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 64-65: invalid continuation byte
Answer 0 (score: 0)
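The problem is that you build the URL yourself and pass the query string as bytes (a regular str on Python 2.x) containing non-UTF-8 encoded characters to the requests module, which in turn tries to decode it into a UTF-8 unicode string and fails.
First, you should let the requests module form your query string and handle the creation of the final URL:
url = "http://api.songkick.com/api/3.0/search/artists.json"
r = requests.get(url, params={"query": artist, "apikey": ""})
# etc.
But second, you should not mix encodings, or you are in for a world of hurt. Unfortunately, the built-in csv module does not work with Unicode on Python 2, which is probably why you end up with invalid characters. To remedy this, install unicodecsv and use it as a drop-in replacement (just replace "import csv" with "import unicodecsv as csv").
UPDATE: Wait, on a second look you are not even using csv. You are reading the file line by line and trying to pass each line as the query. Is that your intended behavior? If so, the same idea applies: keep a single, consistent encoding from the input file through to the request.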
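A minimal sketch of keeping one encoding from the file onward, assuming the file holds one artist name per line. The helper name read_artist_names and the default encoding are hypothetical and not from the original post; the encoding has to match however artistnames.csv was actually saved.

```python
import io


def read_artist_names(path, encoding="utf-8"):
    """Return the non-empty lines of `path` as unicode strings.

    `encoding` is an assumption: it must match how the file was really
    saved (for example "utf-8", "cp1252" or "latin-1").
    """
    # io.open decodes while reading, so every line is already a unicode
    # string (unicode on Python 2, str on Python 3) rather than raw bytes.
    with io.open(path, "r", encoding=encoding) as f:
        stripped = (line.strip() for line in f)
        return [name for name in stripped if name]
```

Each returned name can then be used as the "query" value in the params dictionary passed to requests.get, so the bytes from the file are never concatenated into the URL by hand.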
Answer 1 (score: 0)
You should use unicode wherever possible. Requests should convert any non-ASCII characters in the URL to the proper encoding.
>>> import requests
>>> requests.get(u'http://Motörhead.com/?q=Motörhead').url
u'http://xn--motrhead-p4a.com/?q=Mot%C3%B6rhead'
As you can see, the domain name is encoded as punycode and the query string uses percent-encoding.
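The two encodings can also be reproduced with the standard library alone, which may help show what happens to the artist name. This is only an illustration of the encodings themselves, not of how requests implements them internally; the variable names are made up for the example.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

artist = u"Mot\u00f6rhead"  # "Motörhead" as a unicode string

# Percent-encoding: encode the text to UTF-8 bytes first, then escape
# every byte that is not allowed to appear literally in a URL.
query_value = quote(artist.encode("utf-8"))

# Punycode (IDNA) applies only to the host name part of a URL.
host = (artist.lower() + u".com").encode("idna").decode("ascii")

print(query_value)
print(host)
```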
As long as artist is a valid unicode string, this should work:
url = u'http://api.songkick.com/api/3.0/search/artists.json?query='+artist
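If artist is a byte string, you have to decode it to unicode using the correct encoding, which depends on how the original input file was encoded. For example: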
artist = artist.decode('SHIFT-JIS')
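If the file's encoding is not known for certain, one possible approach is to try a short list of candidates in order. The sketch below is a heuristic under stated assumptions: the function name to_unicode and the candidate list are hypothetical, and a wrong guess can still decode without an error and silently produce the wrong characters.

```python
def to_unicode(raw, encodings=("utf-8", "cp1252")):
    """Decode `raw` bytes with the first candidate encoding that succeeds.

    The candidate list is an assumption; put the strictest encoding
    (UTF-8) first, because single-byte encodings accept almost any input.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


# Bytes as they might come from a file that was saved as cp1252, not UTF-8.
raw_name = u"Mot\u00f6rhead".encode("cp1252")
artist = to_unicode(raw_name)
```

When the encoding is known, a single explicit artist.decode(...) call as shown above remains the simpler and safer choice.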