How to avoid UnicodeEncodeError '\xf8' when scraping a table with BeautifulSoup

Date: 2017-09-29 10:22:38

Tags: python-3.x web-scraping beautifulsoup

The following script raises "UnicodeEncodeError: 'ascii' codec can't encode character '\xf8' in position 118: ordinal not in range(128)".

I have not been able to find a good explanation for this.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

results = {}

for page_num in range(0, 1000, 20):
    address = 'https://www.proff.no/nyetableringer?industryCode=p441&fromDate=22.01.2007&location=Nord-Norge&locationId=N&offset=' + str(page_num) + '&industry=Entreprenører' 

    html = urlopen(address)
    soup = BeautifulSoup(html, 'lxml')
    table = soup.find_all(class_='table-condensed')
    output = pd.read_html(str(table))[0]
    results[page_num] = output


df = pd.concat([v for v in results.values()], axis = 0)

1 Answer:

Answer 0 (score: 1)

You are opening the URL with the standard library's urllib.request. That library forces the request target to be encoded as ASCII, so a non-ASCII character such as ø raises a UnicodeEncodeError.

Lines 1116-1117 of http/client.py:

    # Non-ASCII characters should have been eliminated earlier
    self._output(request.encode('ascii'))
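
If you prefer to stay with the standard library, one workaround is to percent-encode the non-ASCII characters with urllib.parse.quote before calling urlopen. This is a minimal sketch; the safe character set below is an assumption about which URL delimiters should be left untouched:

from urllib.parse import quote
from urllib.request import urlopen
from bs4 import BeautifulSoup

address = ('https://www.proff.no/nyetableringer?industryCode=p441'
           '&fromDate=22.01.2007&location=Nord-Norge&locationId=N'
           '&offset=0&industry=Entreprenører')

# Percent-encode everything outside ASCII (ø becomes %C3%B8) while keeping
# the URL delimiters listed in 'safe' as-is.
safe_address = quote(address, safe=':/?&=')

html = urlopen(safe_address)
soup = BeautifulSoup(html, 'lxml')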

As an alternative to urllib.request, the third-party requests library works well:

import requests

# page_num is the same loop variable as in the question's script
address = 'https://www.proff.no/nyetableringer?industryCode=p441&fromDate=22.01.2007&location=Nord-Norge&locationId=N&offset=' + str(page_num) + '&industry=Entreprenører'
html = requests.get(address).text
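
Putting it together, the original loop rewritten with requests might look like the sketch below (untested against the live site, which may have changed since the question was asked):

import requests
from bs4 import BeautifulSoup
import pandas as pd

results = {}

for page_num in range(0, 1000, 20):
    address = ('https://www.proff.no/nyetableringer?industryCode=p441'
               '&fromDate=22.01.2007&location=Nord-Norge&locationId=N'
               '&offset=' + str(page_num) + '&industry=Entreprenører')

    # requests percent-encodes the non-ASCII 'ø' in the URL for us
    html = requests.get(address).text
    soup = BeautifulSoup(html, 'lxml')
    table = soup.find_all(class_='table-condensed')
    output = pd.read_html(str(table))[0]
    results[page_num] = output

df = pd.concat(list(results.values()), axis=0)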