The following script raises "UnicodeEncodeError: 'ascii' codec can't encode character '\xf8' in position 118: ordinal not in range(128)".
I have not been able to find a good explanation for it.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

results = {}
for page_num in range(0, 1000, 20):
    address = 'https://www.proff.no/nyetableringer?industryCode=p441&fromDate=22.01.2007&location=Nord-Norge&locationId=N&offset=' + str(page_num) + '&industry=Entreprenører'
    html = urlopen(address)
    soup = BeautifulSoup(html, 'lxml')
    table = soup.find_all(class_='table-condensed')
    output = pd.read_html(str(table))[0]
    results[page_num] = output

df = pd.concat([v for v in results.values()], axis=0)
Answer 0 (score: 1)
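You are opening the URL with the standard library, which forces the address to be encoded as ASCII. A non-ASCII character such as ø in the address therefore raises a Unicode error.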
Lines 1116-1117 of http/client.py:
# Non-ASCII characters should have been eliminated earlier
self._output(request.encode('ascii'))
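The ASCII encoding step quoted above can be reproduced without any network access. This is only a minimal sketch: the `request_line` string is a shortened, made-up stand-in for the request that http/client.py builds, not the exact text it sends.

```python
# Hypothetical, shortened request line containing the same non-ASCII character (ø).
request_line = "GET /nyetableringer?industry=Entreprenører HTTP/1.1"

error_message = None
try:
    # Same kind of call as in http/client.py: encode the request as ASCII.
    request_line.encode("ascii")
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(error_message)
```

The reported position differs from the one in the question because this string is shorter, but the codec and the offending character '\xf8' are the same.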
As an alternative to urllib.request, the third-party requests library works well.
import requests
# page_num comes from the loop in the question
address = 'https://www.proff.no/nyetableringer?industryCode=p441&fromDate=22.01.2007&location=Nord-Norge&locationId=N&offset=' + str(page_num) + '&industry=Entreprenører'
html = requests.get(address).text
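If you would rather stay with urllib.request, one possible approach is to percent-encode the non-ASCII part of the query string before opening the URL. The sketch below uses `urllib.parse.quote` from the standard library; the helper name `build_address` is made up for illustration, and it assumes the server accepts the UTF-8 percent-encoded form of ø.

```python
from urllib.parse import quote

# Same base URL as in the question, without the final industry parameter.
BASE = ('https://www.proff.no/nyetableringer?industryCode=p441'
        '&fromDate=22.01.2007&location=Nord-Norge&locationId=N&offset=')

def build_address(page_num):
    """Hypothetical helper: return an ASCII-only URL for the given offset."""
    # quote() percent-encodes the UTF-8 bytes of ø as %C3%B8.
    return BASE + str(page_num) + '&industry=' + quote('Entreprenører')

address = build_address(0)
# The address now encodes to ASCII without raising UnicodeEncodeError.
encoded = address.encode('ascii')
print(address)
```

The resulting address can be passed to urlopen in place of the original string inside the loop.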