Question

The code is:

from bs4 import BeautifulSoup

# Jupyter/IPython shell escape: download the page to a local file
!wget -q -O 'boroughs.html' "https://en.wikipedia.org/wiki/List_of_London_boroughs"

with open('boroughs.html', encoding='utf-8-sig') as fp:
    soup = BeautifulSoup(fp,"lxml")


data = []
table = soup.find("table", { "class" : "wikitable sortable" })
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    data.append(cols)  # note: empty strings are kept, nothing is filtered here
data

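After some research, I added encoding='utf-8-sig' to the open() call, but I still see \ufeff characters in the output.

What puzzles me is that I even tried

df = df.replace(u'\ufeff', '')

after loading the data into a pandas DataFrame, and the characters are still there.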
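
A possible explanation for the DataFrame.replace attempt: without regex=True, pandas replaces only cells whose whole value equals the search string, so a cell that merely contains \ufeff is left untouched. The following is a minimal sketch under that assumption; the column name `coords` and the sample values are made up for illustration and are not taken from the scraped table.

```python
import pandas as pd

# Hypothetical sample standing in for the scraped table.
df = pd.DataFrame({"coords": ["51.5607°N\ufeff / \ufeff0.1557°E", "plain"]})

# Without regex=True, only cells equal to '\ufeff' as a whole are replaced,
# so the first cell still contains the character afterwards.
unchanged = df.replace("\ufeff", "")

# With regex=True, the pattern is matched inside each string value.
cleaned = df.replace("\ufeff", "", regex=True)

print(repr(unchanged.loc[0, "coords"]))
print(repr(cleaned.loc[0, "coords"]))
```

If the regex form does not suit the data, calling .str.replace on the affected column should be an equivalent per-column alternative.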

Answer 1

Try using utf8 instead:

from lxml import html

with open('boroughs.html', encoding='utf8') as fp:
    doc = html.fromstring(fp.read())

    data = []
    rows = doc.xpath("//table/tbody/tr")
    for row in rows:
        cols = row.xpath("./td/text()")
        cols = [col.strip() for col in cols if col.strip()]
        data.append(cols)
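
A side note on the codec choice: utf-8-sig skips a byte order mark only at the very start of the input, and plain utf8 strips nothing at all. The sketch below uses made-up bytes (not the actual Wikipedia page) to show that a U+FEFF in the middle of the text survives decoding with either codec, which suggests the characters in the question may need to be removed after parsing rather than at decode time.

```python
# Made-up bytes: a BOM at the start plus a U+FEFF in the middle of the text.
raw = "\ufeffstart\ufeffmiddle".encode("utf-8")

with_sig = raw.decode("utf-8-sig")  # drops only the leading BOM
with_utf8 = raw.decode("utf-8")     # keeps both occurrences

print(repr(with_sig))
print(repr(with_utf8))
```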

Answer 2

I tried your code on Python 3.6.1 with a simple str.replace(u'\ufeff', '') applied to each cell, and it seems to work.

Tested code:

import os
from bs4 import BeautifulSoup

os.system('wget -q -O "boroughs.html" "https://en.wikipedia.org/wiki/List_of_London_boroughs"')

with open('boroughs.html', encoding='utf-8-sig') as fp:
    soup = BeautifulSoup(fp,"lxml")

data = []
table = soup.find("table", { "class" : "wikitable sortable" })
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    data.append([col.replace(u'\ufeff', '') for col in cols])
print(data)

Output before the replacement:

[[], ['Barking and Dagenham [note 1]', '', '', 'Barking and Dagenham London Borough Council', 'Labour', 'Town Hall, 1 Town Square', '13.93', '194,352', '51°33′39″N 0°09′21″E\ufeff / \ufeff51.5607°N 0.1557°E\ufeff / 51.5607; 0.1557\ufeff (Barking and Dagenham)', '25'], ...]

Output after the replacement:

[[], ['Barking and Dagenham [note 1]', '', '', 'Barking and Dagenham London Borough Council', 'Labour', 'Town Hall, 1 Town Square', '13.93', '194,352', '51°33′39″N 0°09′21″E / 51.5607°N 0.1557°E / 51.5607; 0.1557 (Barking and Dagenham)', '25'], ...]
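
The per-cell replacement above could also be wrapped in a small helper that runs over the already collected rows. This is only a sketch: the names `STRIP_TABLE` and `clean_rows` are hypothetical, the sample row is abbreviated from the output above, and dropping empty rows (such as the leading `[]` from the header row) is an added assumption rather than part of the answer.

```python
# Translation table for str.translate: mapping a code point to None deletes it.
STRIP_TABLE = {0xFEFF: None}


def clean_rows(rows):
    """Return a copy of rows with U+FEFF removed from every cell.

    Rows that contain no cells at all are dropped.
    """
    cleaned = []
    for row in rows:
        cells = [cell.translate(STRIP_TABLE).strip() for cell in row]
        if cells:
            cleaned.append(cells)
    return cleaned


# Abbreviated, illustrative sample in the shape of the scraped data.
sample = [
    [],
    ["51.5607°N 0.1557°E\ufeff / 51.5607; 0.1557\ufeff (Barking and Dagenham)", "25"],
]
result = clean_rows(sample)
print(result)
```

Using a translation table makes it easy to add further invisible code points later without chaining several replace calls.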

How to get rid of \ufeff in a parsed HTML page
