How to get rid of \ufeff in a parsed HTML page

Time: 2019-06-30 15:05:54

Tags: python beautifulsoup

The code is:

from bs4 import BeautifulSoup

!wget -q -O 'boroughs.html' "https://en.wikipedia.org/wiki/List_of_London_boroughs"

with open('boroughs.html', encoding='utf-8-sig') as fp:
    soup = BeautifulSoup(fp,"lxml")


data = []
table = soup.find("table", { "class" : "wikitable sortable" })
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    data.append([col for col in cols])  # copies the row as-is; empty values are not removed
data

After some research, I added encoding='utf-8-sig' to the open() call, but I still see the characters \ufeff in the output:

What confuses me is that I even tried

df = df.replace(u'\ufeff', '')

after loading the data into a pandas DataFrame, but the characters are still there.
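One possible explanation for the pandas attempt: called with plain strings, DataFrame.replace only swaps cells whose whole value equals the search string, so a \ufeff embedded in a longer cell is left alone. The sketch below uses a made-up one-row frame (the column names and values are assumptions, not the scraped table) and passes regex=True so the match is applied inside each string.

```python
import pandas as pd

# Made-up sample data; the column names are illustrative only.
df = pd.DataFrame({
    "coords": ["51.5607; 0.1557\ufeff (Barking and Dagenham)"],
    "area": ["13.93"],
})

# Without regex=True, only cells equal to exactly '\ufeff' would be replaced.
exact = df.replace("\ufeff", "")

# With regex=True, the pattern is substituted wherever it occurs in a string.
substr = df.replace("\ufeff", "", regex=True)

print("\ufeff" in exact.loc[0, "coords"])
print("\ufeff" in substr.loc[0, "coords"])
```

The first check reports that the character is still present, the second that it is gone.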

2 answers:

Answer 0: (score: 0)

Try using utf8 instead:

from lxml import html

with open('boroughs.html', encoding='utf8') as fp:
    doc = html.fromstring(fp.read())

    data = []
    rows = doc.xpath("//table/tbody/tr")
    for row in rows:
        cols = row.xpath("./td/text()")
        cols = [col.strip() for col in cols if col.strip()]
        data.append(cols)
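For reference, here is a small sketch of how the two codecs treat U+FEFF, using a made-up string rather than the Wikipedia page. As far as the codec behaviour goes, utf-8-sig only drops a byte order mark at the very start of the input, and utf-8 drops nothing, so a \ufeff that sits inside the page text survives with either encoding.

```python
# Made-up text: one U+FEFF at the very start (a BOM) and one in the middle.
text = "\ufeffstart 0.1557\ufeff (Barking and Dagenham)"
raw = text.encode("utf-8")

as_utf8 = raw.decode("utf-8")      # keeps both occurrences
as_sig = raw.decode("utf-8-sig")   # strips only the leading one

print(as_utf8.count("\ufeff"))  # 2
print(as_sig.count("\ufeff"))   # 1
```

This suggests the characters in the scraped table are part of the page content itself, so they have to be removed from the extracted strings rather than through the encoding argument.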

Answer 1: (score: 0)

I tried your code with a simple str.replace(u'\ufeff', '') on Python 3.6.1, and it seems to work.

Tested code:

import os
from bs4 import BeautifulSoup

os.system('wget -q -O "boroughs.html" "https://en.wikipedia.org/wiki/List_of_London_boroughs"')

with open('boroughs.html', encoding='utf-8-sig') as fp:
    soup = BeautifulSoup(fp,"lxml")

data = []
table = soup.find("table", { "class" : "wikitable sortable" })
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    data.append([col.replace(u'\ufeff', '') for col in cols])
print(data)

Output before the replacement:

[[], ['Barking and Dagenham [note 1]', '', '', 'Barking and Dagenham London Borough Council', 'Labour', 'Town Hall, 1 Town Square', '13.93', '194,352', '51°33′39″N 0°09′21″E\ufeff / \ufeff51.5607°N 0.1557°E\ufeff / 51.5607; 0.1557\ufeff (Barking and Dagenham)', '25'], ...]

Output after the replacement:

[[], ['Barking and Dagenham [note 1]', '', '', 'Barking and Dagenham London Borough Council', 'Labour', 'Town Hall, 1 Town Square', '13.93', '194,352', '51°33′39″N 0°09′21″E / 51.5607°N 0.1557°E / 51.5607; 0.1557 (Barking and Dagenham)', '25'], ...]
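If the same cleanup is needed in more than one place, the replacement can be pulled into a small helper. This is only a sketch: clean_cell and clean_rows are hypothetical names, the sample rows are made up, and str.translate is used here in place of str.replace (either should give the same result for a single character).

```python
# Hypothetical helpers; only the standard library is used.
DROP_FEFF = {0xFEFF: None}  # mapping a code point to None makes translate() delete it


def clean_cell(text):
    """Delete every U+FEFF in a cell and trim surrounding whitespace."""
    return text.translate(DROP_FEFF).strip()


def clean_rows(rows):
    """Clean each cell and skip rows without any cells (such as header rows)."""
    return [[clean_cell(cell) for cell in row] for row in rows if row]


# Made-up rows shaped like the scraped table data.
sample = [[], ["Barking and Dagenham", "0.1557\ufeff (Barking and Dagenham)", " 25 "]]
print(clean_rows(sample))
```

Skipping empty rows also removes the leading [] that the header row produces, since that row contains th cells rather than td cells.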