I wrote a scraper in Python that scrapes player data from futbin.com and writes it to a .csv file. I am getting the following error, which occurs at page 214, www.futbin.com/17/player/214. Full traceback:
Traceback (most recent call last):
  File "C:/Users/jona_/PycharmProjects/untitled2/futbin_scraper_2.py", line 94, in <module>
    writer.writerows([prices_attributes])
  File "C:\Program Files\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u015f' in position 145: character maps to <undefined>
I suspect it is caused by this piece of data on the page: 'BeşiktaşJK' (and others like it). I guess the odd 's' character is unreadable for the Windows console. I have tried changing my console encoding; it is currently set to utf-8, which I checked with:
>>> import sys
>>> print(sys.stdin.encoding)
utf-8
>>> print(sys.stdout.encoding)
cp437
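For what it's worth, the failing character reproduces in isolation. A quick interactive check (the character is the 'ş' from 'BeşiktaşJK', i.e. '\u015f') shows that the cp1252 codec simply has no mapping for it, while utf-8 handles it fine:
>>> '\u015f'.encode('cp1252')
UnicodeEncodeError: 'charmap' codec can't encode character '\u015f' in position 0: character maps to <undefined>
>>> '\u015f'.encode('utf-8')
b'\xc5\x9f'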
I also tried setting the console encoding to utf-16 with the command set PYTHONIOENCODING=utf-16, and I have installed the win-unicode-console package, but it did not solve my problem. For completeness, I will post the entire script below.
The problem started when I added the line league = html_tree.xpath('//td/a[@href]/text()', smart_strings=False), which scrapes the data from the "Information" table on the left side of the page.
There are a few other questions on here about unicode errors, and I have honestly tried every solution I was capable of understanding.
I am using the JetBrains PyCharm Community Edition IDE with Python 3.5 on Windows 10.
Any help would be greatly appreciated.
#
# This programme fetches price data and player attributes from the FIFA 17 Ultimate Team Market
# And writes them into a .csv file.
import csv
import requests
from lxml import html
import time
import os.path
import sys
#
# This creates a .csv file in a pre-specified directory to write the player data into
# Change: save_path and name_of_file
save_path = 'D:/Msc Finance/Thesis/Futbin Data/'
name_of_file = "futbin_data"
completeName = os.path.join(save_path, name_of_file+".csv")
outfile = open(completeName, "w", newline='')
#
# This generates a list of futbin.com URLs to feed into the script
# Change: amount_of_players to specify the number of futbin.com player pages to parse
amount_of_players = 16300
list_of_urls = []
for i in range(amount_of_players):
    id = i + 1
    url = "https://www.futbin.com/17/player/{0}".format(id)
    list_of_urls.append(url)
#
# This loop finds all the player data from each url in list_of_urls and stores them into a list
for url in list_of_urls:
    responses = requests.get(url)
    html_tree = html.fromstring(responses.content)
    name = html_tree.xpath('//span[@class = "header_name"]/text()', smart_strings=False)
    prices = html_tree.xpath('//span[@class ="bin_text"]/text()', smart_strings=False)
    attributes = html_tree.xpath('//td[@class ="table-row-text"]/text()', smart_strings=False)
    league = html_tree.xpath('//td/a[@href]/text()', smart_strings=False)
    position = html_tree.xpath('//div[@class ="pcdisplay-pos"]/text()', smart_strings=False)
    rating = html_tree.xpath('//div[@class ="pcdisplay-rat"]/text()', smart_strings=False)
    pace = html_tree.xpath('//div[@class ="pcdisplay-ovr1"]/text()', smart_strings=False)
    shot = html_tree.xpath('//div[@class ="pcdisplay-ovr2"]/text()', smart_strings=False)
    passing = html_tree.xpath('//div[@class ="pcdisplay-ovr3"]/text()', smart_strings=False)
    dribble = html_tree.xpath('//div[@class ="pcdisplay-ovr4"]/text()', smart_strings=False)
    defense = html_tree.xpath('//div[@class ="pcdisplay-ovr5"]/text()', smart_strings=False)
    physique = html_tree.xpath('//div[@class ="pcdisplay-ovr6"]/text()', smart_strings=False)
    # This merges all the player data together into one big list
    prices_attributes = prices + attributes + league + position + rating + pace + shot + passing + dribble + \
        defense + physique + name
    # This removes all instances of \n from the big list
    prices_attributes = [i.replace('\n', '') for i in prices_attributes]
    # This removes all blank spaces from the big list
    prices_attributes = [i.replace(' ', '') for i in prices_attributes]
    # In some instances the '//td[@class ="table-row-text"]/text()' XPath from attributes returns extra empty elements
    # This 'if' statement removes the extra elements to ensure all the columns in the .csv file still align properly
    if len(prices_attributes) > 40:
        prices_attributes.pop(25)
        prices_attributes.pop(30)
    #
    # This removes all the remaining empty elements from the big list. Not (12, 13, 14, 24, 25, 26) because:
    # Index numbers shift dynamically as the script removes elements from the list
    if prices_attributes:
        prices_attributes.pop(11)
        prices_attributes.pop(11)
        prices_attributes.pop(11)
        prices_attributes.pop(20)
        prices_attributes.pop(20)
        prices_attributes.pop(20)
    # Some URLs from list_of_urls no longer exist. These URLs yield empty lists: []
    # The 'if' statement below makes sure only non-empty lists are written to the .csv file
    if prices_attributes:
        writer = csv.writer(outfile)
        writer.writerows([prices_attributes])
    # This fixes the delay between queries to 0.1 seconds
    time.sleep(0.1)
    # This prints the loop's % progress to the Python console
    sys.stdout.write("\r%d%%" % ((100 / amount_of_players) * (list_of_urls.index(url) + 1)))
    sys.stdout.flush()
Answer 0 (score: 0):
From the traceback and the code, you are using Python 3 and opening the output file with the default encoding. locale.getpreferredencoding(False) is what gets used by default, which in your case is cp1252. Use utf-8-sig instead of plain utf8, because Excel assumes that a file without a byte order mark (BOM) signature is in the default encoding as well.
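You can verify what Python picks as the default on your machine with a quick interactive check (the 'cp1252' result matches the cp1252.py frame in your traceback):
>>> import locale
>>> locale.getpreferredencoding(False)
'cp1252'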
In your code, use:
outfile = open(completeName, 'w', newline='', encoding='utf-8-sig')
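As a sanity check, independent of the scraper, the string from your question writes out cleanly once the encoding is explicit (a minimal sketch; test.csv is just a throwaway file name):
import csv

# utf-8-sig writes a BOM first, so Excel detects the file as UTF-8
# instead of assuming the locale's default codec (cp1252 here).
with open('test.csv', 'w', newline='', encoding='utf-8-sig') as f:
    csv.writer(f).writerow(['BeşiktaşJK'])
No UnicodeEncodeError is raised, because UTF-8 can represent every Unicode character.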