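Python 'ascii' codec can't encode character error while writing to CSV

Time: 2015-10-05 00:25:05

Tags: python csv url utf-8 beautifulsoup

I'm not entirely sure what I need to do about this error. I assume it has to do with needing to add .encode('utf-8'), but I'm not entirely sure that's what I need to do, nor where I should apply it.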

The error, raised when I run my Python script, is:

    line 40, in <module>
    writer.writerows(list_of_rows)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 17: ordinal not in range(128)

3 answers:

Answer 0: (score: 20)

The Python 2.x csv library is broken for Unicode text. You have three options, in order of complexity:

  1. Edit: see below. Use the fixed library https://github.com/jdunck/python-unicodecsv (pip install unicodecsv). Use it as a drop-in replacement. Example:

    import unicodecsv

    with open("myfile.csv", 'rb') as my_file:
        r = unicodecsv.DictReader(my_file, encoding='utf-8')
    
  2. Read the CSV manual regarding Unicode: https://docs.python.org/2/library/csv.html (see the examples at the bottom).

  3. Manually encode each item as UTF-8:

    for cell in row.findAll('td'):
        text = cell.text.replace('[','').replace(']','')
        list_of_cells.append(text.encode("utf-8"))

Edit: I found that python-unicodecsv is also broken when reading UTF-16. It complains about any 0x00 bytes.

Instead, use https://github.com/ryanhiebert/backports.csv, which more closely resembles the Python 3 implementation and uses the io module.

Install:

    pip install backports.csv

Usage:

    from backports import csv
    import io

    with io.open(filename, encoding='utf-8') as f:
        r = csv.reader(f)
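
Option 3 (manually encoding each item) can also be applied to a whole row at once. The following is only a sketch: the helper name encode_row and the sample list_of_rows are made up for illustration and are not taken from the question's script. On Python 2 the encoded rows would then be handed to writer.writerows; on Python 3 the csv module expects text, so this step should not be needed there.

```python
def encode_row(row, encoding="utf-8"):
    """Return a copy of `row` with every text cell encoded to bytes."""
    return [cell.encode(encoding) for cell in row]


# Hypothetical scraped rows; the first cell contains an en dash (U+2013),
# the character named in the traceback.
list_of_rows = [[u"2010\u20132015", u"total"], [u"caf\u00e9", u"42"]]

# Encode every cell up front instead of relying on an implicit ASCII encode.
encoded_rows = [encode_row(row) for row in list_of_rows]
```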
      

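Answer 1: (score: 0)

In addition to Alastair's excellent suggestions, I found that the easiest option is to use Python 3 instead of Python 2. All I needed to change in my script was the wb in the open statement to simply w, in accordance with Python 3's syntax.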
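
A minimal sketch of what that change might look like on Python 3. The file path and sample rows are placeholders, and passing encoding="utf-8" and newline="" explicitly is an assumption added here (the answer only mentions switching wb to w); without an explicit encoding, open falls back to a platform-dependent default.

```python
import csv
import os
import tempfile

# Placeholder rows; the first cell contains an en dash (U+2013).
list_of_rows = [["2010\u20132015", "total"], ["caf\u00e9", "42"]]

# Placeholder output location in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "output.csv")

# Text mode ("w") with an explicit encoding; newline="" lets the csv module
# control line endings itself.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(list_of_rows)

# Read the file back to check that the non-ASCII characters survived.
with open(path, newline="", encoding="utf-8") as f:
    rows_read_back = list(csv.reader(f))
```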

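Answer 2: (score: 0)

The problem is with the csv library in Python 2. From the unicodecsv project page:

Python 2's csv module doesn't easily deal with unicode strings, leading to the dreaded "'ascii' codec can't encode character in position ..." exception.

If you can, just install unicodecsv: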

pip install unicodecsv
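
The exception itself can be reproduced without the csv module, which may help show why encoding to UTF-8 avoids it. This is only an illustrative sketch with made-up sample text: as far as I understand, Python 2's csv writer converts unicode cells with the default ASCII codec, which is roughly what the explicit encode("ascii") call below imitates.

```python
text = u"2010\u20132015"  # made-up sample containing an en dash, U+2013

try:
    # ASCII has no representation for U+2013, so this raises.
    text.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# UTF-8 can represent the character, so this succeeds.
utf8_bytes = text.encode("utf-8")
```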