The code is:
from bs4 import BeautifulSoup

# Jupyter shell escape: download the page to a local file
!wget -q -O 'boroughs.html' "https://en.wikipedia.org/wiki/List_of_London_boroughs"

with open('boroughs.html', encoding='utf-8-sig') as fp:
    soup = BeautifulSoup(fp, "lxml")

data = []
table = soup.find("table", {"class": "wikitable sortable"})
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    data.append([col for col in cols])  # copies the row; empty strings are kept
data
After some research, I added encoding='utf-8-sig' to the open() call,
but the output still contains the characters \ufeff.
What puzzles me is that I even tried
df = df.replace(u'\ufeff', '')
after loading the data into a pandas DataFrame,
and the characters are still there.
Answer 0 (score: 0)
Try using utf8 instead:
from lxml import html

with open('boroughs.html', encoding='utf8') as fp:
    doc = html.fromstring(fp.read())

data = []
rows = doc.xpath("//table/tbody/tr")
for row in rows:
    # Text nodes that are direct children of each <td>
    cols = row.xpath("./td/text()")
    # Strip whitespace and drop empty strings
    cols = [col.strip() for col in cols if col.strip()]
    data.append(cols)
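For context on the encoding choice: as far as I know, utf-8-sig differs from utf8 only in that it drops a single byte order mark at the very start of the stream. A U+FEFF that appears later in the text is ordinary content (a zero-width no-break space) and survives under either codec. The sketch below illustrates this with a made-up sample string and a temporary file; the sample text and variable names are assumptions for illustration and are not taken from the Wikipedia page.

```python
import os
import tempfile

# Hypothetical sample: a leading BOM plus a U+FEFF embedded in the content.
sample = "\ufeffE\ufeff / 51.5607"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as fp:
        fp.write(sample)

    # Plain utf-8 keeps every U+FEFF, including the leading one.
    with open(path, encoding="utf-8") as fp:
        as_utf8 = fp.read()

    # utf-8-sig strips only the leading BOM.
    with open(path, encoding="utf-8-sig") as fp:
        as_utf8_sig = fp.read()

# Removing the embedded characters takes an explicit replace.
cleaned = as_utf8_sig.replace("\ufeff", "")

print(repr(as_utf8))
print(repr(as_utf8_sig))
print(repr(cleaned))
```

If this is what is happening with boroughs.html, switching between utf8 and utf-8-sig would not by itself remove the \ufeff characters inside the table cells.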
Answer 1 (score: 0)
I tried your code on Python 3.6.1 with a simple str.replace(u'\ufeff', '')
applied to each cell, and it seems to work.
Tested code:
import os
from bs4 import BeautifulSoup

os.system('wget -q -O "boroughs.html" "https://en.wikipedia.org/wiki/List_of_London_boroughs"')

with open('boroughs.html', encoding='utf-8-sig') as fp:
    soup = BeautifulSoup(fp, "lxml")

data = []
table = soup.find("table", {"class": "wikitable sortable"})
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    # Remove every U+FEFF inside each cell string
    data.append([col.replace(u'\ufeff', '') for col in cols])
print(data)
Output before the replacement:
[[], ['Barking and Dagenham [note 1]', '', '', 'Barking and Dagenham London Borough Council', 'Labour', 'Town Hall, 1 Town Square', '13.93', '194,352', '51°33′39″N 0°09′21″E\ufeff / \ufeff51.5607°N 0.1557°E\ufeff / 51.5607; 0.1557\ufeff(Barking and Dagenham)', '25'], ...]
Output after the replacement:
[[], ['Barking and Dagenham [note 1]', '', '', 'Barking and Dagenham London Borough Council', 'Labour', 'Town Hall, 1 Town Square', '13.93', '194,352', '51°33′39″N 0°09′21″E / 51.5607°N 0.1557°E / 51.5607; 0.1557(Barking and Dagenham)', '25'], ...]
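A possible reason the df.replace(u'\ufeff', '') call from the question had no visible effect: by default, pandas DataFrame.replace with a string matches whole cell values, not substrings, so a cell that merely contains \ufeff is left alone. Passing regex=True makes it substitute inside each string. The following is a minimal sketch on a made-up one-cell frame; the column name and sample value are assumptions for illustration.

```python
import pandas as pd

# Hypothetical one-cell frame mimicking the coordinates column.
df = pd.DataFrame({"coords": ["0°09′21″E\ufeff / \ufeff51.5607°N"]})

# Default behaviour: only cells exactly equal to "\ufeff" would be replaced.
whole_cell = df.replace("\ufeff", "")

# regex=True: the pattern is substituted wherever it occurs inside a string.
substring = df.replace("\ufeff", "", regex=True)

print("\ufeff" in whole_cell.loc[0, "coords"])  # still present
print("\ufeff" in substring.loc[0, "coords"])   # removed
```

Cleaning each string before building the DataFrame, as in the tested code above, sidesteps the issue entirely.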