Unicode error on Windows when writing to Microsoft Excel

Date: 2017-01-23 22:52:07

Tags: python excel unicode web-scraping lxml

I've written a scraper in Python that pulls player data from the futbin.com website and writes it to a .csv file. I'm getting the following error, which occurs on page 214, www.futbin.com/17/player/214. Full traceback:

 Traceback (most recent call last):
  File "C:/Users/jona_/PycharmProjects/untitled2/futbin_scraper_2.py", line 94, in <module>
    writer.writerows([prices_attributes])
  File "C:\Program Files\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u015f' in position 145: character maps to <undefined>

I suspect the cause is this piece of data on the page: 'BeşiktaşJK' (and others like it). I'm guessing the unusual 's' character is unreadable for the Windows console. I've tried changing my console encoding. It's currently set to utf-8, which I checked with:

>>> import sys
>>> print(sys.stdin.encoding)
utf-8
>>> print(sys.stdout.encoding)
cp437
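
Encoding the suspect string by hand reproduces the exact same error (a minimal check, independent of the console settings):

>>> 'Beşiktaş'.encode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'charmap' codec can't encode character '\u015f' in position 2: character maps to <undefined>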

I also tried setting it to utf-16 with the set PYTHONIOENCODING=utf-16 command, and I've installed the win-unicode-console package, but neither solved the problem. For completeness, I'll post the whole script below.
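
For reference, those attempts were made in the console session before launching the script (a reconstruction; assuming cmd.exe and the script name from the traceback):

set PYTHONIOENCODING=utf-16
pip install win-unicode-console
python futbin_scraper_2.py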

The problem started when I added the line league = html_tree.xpath('//td/a[@href]/text()', smart_strings=False). This scrapes data from the "Information" table on the left side of the page.
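
A quick way to see which of those scraped strings will trip the cp1252 writer is to try encoding them one by one (a diagnostic sketch, reusing the league list from the script below):

# Diagnostic sketch: flag strings that cp1252 cannot represent.
# ascii() escapes non-ASCII characters, so this print cannot itself
# fail on a cp437 console.
for s in league:
    try:
        s.encode("cp1252")
    except UnicodeEncodeError as exc:
        print(ascii(s), "->", exc)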

There are a number of other questions here about unicode errors, and I've honestly tried every solution I was capable of understanding.

I'm using the JetBrains PyCharm Community Edition IDE with Python 3.5 on Windows 10.

Any help would be greatly appreciated.

#
# This programme fetches price data and player attributes from the FIFA 17 Ultimate Team Market
# And writes them into a .csv file.

import csv
import requests
from lxml import html
import time
import os.path
import sys

#
# This creates a .csv file in a pre-specified directory to write the player data into
# Change: save_path and name_of_file
save_path = 'D:/Msc Finance/Thesis/Futbin Data/'
name_of_file = "futbin_data"
completeName = os.path.join(save_path, name_of_file + ".csv")
outfile = open(completeName, "w", newline='')

#
# This generates a list of futbin.com URLs to feed into the script
# Change: amount_of_players to specify how many futbin.com player pages to parse
amount_of_players = 16300
list_of_urls = []
for i in range(amount_of_players):
    id = i + 1
    url = "https://www.futbin.com/17/player/{0}".format(id)
    list_of_urls.append(url)

#
# This loop finds all the player data from each url in list_of_urls and stores them into a list

for url in list_of_urls:
    responses = requests.get(url)
    html_tree = html.fromstring(responses.content)
    name = html_tree.xpath('//span[@class = "header_name"]/text()', smart_strings=False)
    prices = html_tree.xpath('//span[@class ="bin_text"]/text()', smart_strings=False)
    attributes = html_tree.xpath('//td[@class ="table-row-text"]/text()', smart_strings=False)
    league = html_tree.xpath('//td/a[@href]/text()', smart_strings=False)
    position = html_tree.xpath('//div[@class ="pcdisplay-pos"]/text()', smart_strings=False)
    rating = html_tree.xpath('//div[@class ="pcdisplay-rat"]/text()', smart_strings=False)
    pace = html_tree.xpath('//div[@class ="pcdisplay-ovr1"]/text()', smart_strings=False)
    shot = html_tree.xpath('//div[@class ="pcdisplay-ovr2"]/text()', smart_strings=False)
    passing = html_tree.xpath('//div[@class ="pcdisplay-ovr3"]/text()', smart_strings=False)
    dribble = html_tree.xpath('//div[@class ="pcdisplay-ovr4"]/text()', smart_strings=False)
    defense = html_tree.xpath('//div[@class ="pcdisplay-ovr5"]/text()', smart_strings=False)
    physique = html_tree.xpath('//div[@class ="pcdisplay-ovr6"]/text()', smart_strings=False)

    # This merges all the player data together into one big list
    prices_attributes = prices + attributes + league + position + rating + pace + shot + passing + dribble + defense + \
                        physique + name

    # This removes all instances of \n from the big list
    prices_attributes = [i.replace('\n', '') for i in prices_attributes]

    # This removes all blank spaces from the big list
    prices_attributes = [i.replace(' ', '') for i in prices_attributes]

    # In some instances the '//td[@class ="table-row-text"]/text()' Xpath from attributes returns an extra empty element
    # This 'if' statement removes the extra element to ensure all the columns in the .csv file still align properly
    if len(prices_attributes) > 40:
        prices_attributes.pop(25)
        prices_attributes.pop(30)

    #
    # This removes all the remaining empty elements from the big list. Not(12,13,14,24,25,26) because:
    # Index numbers shift dynamically as the script removes elements from the list
    if prices_attributes:
        prices_attributes.pop(11)
        prices_attributes.pop(11)
        prices_attributes.pop(11)
        prices_attributes.pop(20)
        prices_attributes.pop(20)
        prices_attributes.pop(20)

    # Some URLs from list_of_urls no longer exist. These URLs yield empty lists: []
    # The 'if' statement below makes sure only non-empty lists are written to the Excel file
    if prices_attributes:
        writer = csv.writer(outfile)
        writer.writerows([prices_attributes])

    # This fixes the delay between queries to 0.1 seconds
    time.sleep(0.1)

    # This prints the loop's % progress into the Python Console
    sys.stdout.write("\r%d%%" % ((100/amount_of_players)*(list_of_urls.index(url)+1)))
    sys.stdout.flush()

1 Answer:

Answer 0 (score: 0):

From the traceback and the code, you're using Python 3 and opening the output file with the default encoding. locale.getpreferredencoding(False) is what gets used by default, and in your case that's cp1252. Use utf-8-sig instead (not plain utf8): Excel likewise assumes that a file without a byte order mark (BOM) signature is in the default encoding.
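
You can confirm the default interactively (on your machine this should show cp1252):

>>> import locale
>>> locale.getpreferredencoding(False)
'cp1252'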

In your code, use:

outfile = open(completeName, 'w', newline='', encoding='utf-8-sig')
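
Applied to the rest of the script, the write path could look like this (a sketch; completeName is as defined in the question, and the sample row is only illustrative):

import csv

# utf-8-sig writes a BOM first, so Excel detects UTF-8 instead of
# falling back to the locale default (cp1252 here)
with open(completeName, 'w', newline='', encoding='utf-8-sig') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['Beşiktaş JK'])  # '\u015f' now encodes without error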