Question

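I have built a scraper that reads artist names from a CSV file and collects data about those artists from the Songkick API. However, after my code has been running for a while, I get the following error: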

  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 64-65: invalid continuation byte

Sample data can be downloaded here:

I am fairly new to coding and would like to know how to resolve this error. You can find my code below.

            import urllib2
            import requests
            import json
            import csv

            from tinydb import TinyDB, Query
            db = TinyDB('spotify_artists.json')

            #read csv
            def wait_for_internet():
                while True:
                  try:
                    resp = urllib2.urlopen('http://google.com', timeout=1)
                    return
                  except:
                    pass

            def load_artists():
                    f = open('artistnames.csv', 'r').readlines();
                    for a in f:
                        artist = a.strip()
                        print(artist)
                        url = 'http://api.songkick.com/api/3.0/search/artists.json?query='+artist+'&apikey='
                        # wait_for_internet()
                        r = requests.get(url)
                        resp = r.json()
                        # print(resp)
                        try :
                          if(resp['resultsPage']['totalEntries']):
                            # print(json.dumps(resp['resultsPage']['results']['artist'], indent=4, sort_keys=True))
                            results = resp['resultsPage']['results']['artist'];
                            for x in results:
                            #   print('rxx')
                            #   print(json.dumps(x, indent=4, sort_keys=True))

                              if(x['displayName'] == artist):
                                print(x)
                                db.insert(x)

                        except:
                          print('cannot fetch url',url);



            load_artists()
            db.close()

Traceback (most recent call last):
  File "C:\Users\rmlj\Dropbox\songkick\scrapers\Data\Scraper.py", line 45, in <module>
    load_artists()
  File "C:\Users\rmlj\Dropbox\songkick\scrapers\Data\Scraper.py", line 25, in load_artists
    r = requests.get(url)
  File "C:\Python27\lib\site-packages\requests\api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 474, in request
    prep = self.prepare_request(req)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 407, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "C:\Python27\lib\site-packages\requests\models.py", line 302, in prepare
    self.prepare_url(url, params)
  File "C:\Python27\lib\site-packages\requests\models.py", line 358, in prepare_url
    url = url.decode('utf8')
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 64-65: invalid continuation byte

Answer 1

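The problem is that you build the URL yourself and pass the query string to the requests module as bytes (a regular str on Python 2.x) containing characters that are not UTF-8 encoded. requests in turn tries to decode it as UTF-8 into a Unicode string, and that is where it fails.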
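The failure can be reproduced without any network access. This is a minimal sketch, assuming the CSV holds a name saved in a single-byte encoding such as Latin-1; the artist name and the encoding are illustrative and not taken from the question's data.

```python
# Hypothetical artist name stored as Latin-1 bytes (0xE9 is "e acute" in Latin-1).
raw = b'Caf\xe9 Tacuba'

try:
    # The traceback shows requests calling url.decode('utf8') on the byte-string URL.
    raw.decode('utf-8')
    failure = None
except UnicodeDecodeError as exc:
    failure = exc.reason

print(failure)

# Decoding with the encoding the bytes were actually written in succeeds.
name = raw.decode('latin-1')
print(name == u'Caf\xe9 Tacuba')
```

The byte 0xE9 looks like the start of a multi-byte UTF-8 sequence, but the space that follows it is not a continuation byte, hence the error.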

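First, you should let the requests module build the query string and handle creating the final URL:

url = "http://api.songkick.com/api/3.0/search/artists.json"
r = requests.get(url, params={"query": artist, "apikey": ""})
# etc.

Second, you shouldn't mix encodings unless you want to end up in a world of hurt. Unfortunately, the built-in csv module on Python 2 doesn't work with Unicode, which is probably why you end up with invalid characters. To fix this, install unicodecsv and use it as a drop-in replacement (just replace import csv with import unicodecsv as csv).

Update: wait, on a second look, you aren't even using csv. You are reading the file line by line and passing each line as the query. Is that the behaviour you intended? If so, the same idea applies: keep one encoding throughout.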
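One way to keep a single encoding throughout is to decode the file while reading it, so that every artist name is already a Unicode string by the time it reaches requests. This is a minimal sketch, not the answer's original code: the helper name `read_artist_names` is hypothetical, and `'utf-8'` is only a placeholder default, since the real encoding of artistnames.csv is not stated in the question.

```python
import io
import os
import tempfile


def read_artist_names(path, encoding='utf-8'):
    """Return the non-empty lines of the file as Unicode strings."""
    # io.open decodes while reading on both Python 2.7 and Python 3.
    with io.open(path, 'r', encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]


# Demo with a throwaway file holding two Latin-1 encoded names.
fd, demo_path = tempfile.mkstemp(suffix='.csv')
os.write(fd, b'Caf\xe9 Tacuba\n\nMot\xf6rhead\n')
os.close(fd)

names = read_artist_names(demo_path, encoding='latin-1')
os.remove(demo_path)
print(len(names))
```

Each name returned this way can be passed directly as the `query` value in the `params` dictionary given to `requests.get`.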

Answer 2

You should use unicode wherever possible. Requests should convert any non-ASCII characters in the URL to the correct encoding.

>>> import requests  

>>> requests.get(u'http://Motörhead.com/?q=Motörhead').url  
u'http://xn--motrhead-p4a.com/?q=Mot%C3%B6rhead'

As you can see, the domain name is encoded as punycode, and the query string uses percent-encoding.
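Both encodings can be reproduced with the standard library alone, which may help when checking what requests is expected to send. This sketch assumes Python 3; on Python 2.7 the quoting function lives in `urllib` rather than `urllib.parse` and expects UTF-8 encoded bytes.

```python
from urllib.parse import quote

# Host names are converted with IDNA, which applies punycode to each label.
host = u'mot\xf6rhead.com'.encode('idna').decode('ascii')

# Query values are encoded as UTF-8 first, then each non-ASCII byte is percent-escaped.
query = quote(u'Mot\xf6rhead')

print(host)
print(query)
```

The two printed values should match the host and query parts of the URL that requests produced in the session above.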

As long as artist is a valid unicode string, this should work.

url = u'http://api.songkick.com/api/3.0/search/artists.json?query='+artist

If artist is a byte string, you have to decode it to unicode using the correct encoding, which depends on how the original input file was encoded.

artist = artist.decode('SHIFT-JIS')
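When the file's encoding is not known in advance, one common workaround is to try a short list of candidates in order. This is a sketch only: the helper name `to_unicode` and the candidate list are assumptions, and a wrong guess can decode without raising an error while still producing the wrong characters, so finding out the file's real encoding remains the reliable fix.

```python
def to_unicode(raw, encodings=('utf-8', 'cp1252')):
    """Decode a byte string with the first candidate encoding that fits."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings matched')


decoded_utf8 = to_unicode(b'Mot\xc3\xb6rhead')  # valid UTF-8, first candidate fits
decoded_cp1252 = to_unicode(b'Mot\xf6rhead')    # not valid UTF-8, falls back to cp1252
print(decoded_utf8 == decoded_cp1252)
```

Putting UTF-8 first is deliberate: it rejects most text written in other encodings, whereas a single-byte encoding accepts almost any input and would mask the problem if tried first.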
