Question

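I have installed Python 2.7.13, pip and beautifulsoup on Win10. I want to convert a large file containing HTML entities to Unicode characters, but I don't know how to go about it (I don't know much about Python). The file content looks like this: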

<b>&#947;&#941;&#961;&#969;&#957;</b>, <i>&#959;&#957;&#964;&#959;&#962;, &#8001;</i>, Wurzel <i>&#915;&#917;&#929;</i>, verwandt mit <i>&#947;&#941;&#961;&#945;&#962;, &#947;&#949;&#961;&#945;&#961;&#972;&#962;, &#947;&#949;&#961;&#945;&#953;&#972;&#962;</i>

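I can do small portions with EmEditor (using Edit > Encode/Decode Selection > HTML/XML Character Reference to Unicode), but it is too slow and cannot cope with converting a large file.

I would be glad of any (offline) solution for this.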

Answer 1

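This is HTML encoding; try this:

import io
from HTMLParser import HTMLParser  # Python 2 module; on Python 3 use html.unescape instead

h = HTMLParser()
# Read the input as text (assuming the file is UTF-8 or plain ASCII)
with io.open("myfile.txt", encoding="utf-8") as f:
    new_file_content = h.unescape(f.read())
# Write as UTF-8: the unescaped text contains non-ASCII characters,
# which a plain open(..., 'w') cannot write on Python 2
with io.open("newfile.txt", "w", encoding="utf-8") as new_file:
    new_file.write(new_file_content)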
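
For a very large file it may help to avoid reading everything into memory at once. The following is a minimal sketch rather than a definitive solution: it assumes Python 3 (where `html.unescape` is available), UTF-8 input, and that no single entity is split across a line break. The function name `unescape_file` is made up for illustration.

```python
import html
import io


def unescape_file(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path, replacing HTML entities with characters.

    The file is processed one line at a time, so memory use stays small
    even when the input is large.
    """
    with io.open(src_path, encoding=encoding) as src, \
            io.open(dst_path, "w", encoding=encoding) as dst:
        for line in src:
            dst.write(html.unescape(line))
```

It would be called as `unescape_file("myfile.txt", "newfile.txt")`, using the same file names as in the snippet above.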

Answer 2

You can decode the data when you read it by appending .decode('utf-8'), as in site_read = site_download.read().decode('utf-8'). Just add it to the end of the line where you read the file in!
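
A small sketch of what the `.decode('utf-8')` step does and what it leaves untouched, assuming Python 3. The `raw` bytes are made-up sample data standing in for the result of a binary read. Decoding turns bytes into a Unicode string, but numeric character references such as `&#947;` remain in the text until they are unescaped separately.

```python
import html

# Bytes as they might come back from a binary read (hypothetical sample data).
raw = b"<b>&#947;&#941;&#961;&#969;&#957;</b>"

# bytes -> str: the character references are still present afterwards.
decoded = raw.decode("utf-8")

# Character references -> actual characters.
converted = html.unescape(decoded)

print(decoded)
print(converted)
```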

Answer 3

import bs4

html = '''<b>&#947;&#941;&#961;&#969;&#957;</b>, <i>&#959;&#957;&#964;&#959;&#962;, &#8001;</i>, Wurzel <i>&#915;&#917;&#929;</i>, verwandt mit <i>&#947;&#941;&#961;&#945;&#962;, &#947;&#949;&#961;&#945;&#961;&#972;&#962;, &#947;&#949;&#961;&#945;&#953;&#972;&#962;</i>'''

soup = bs4.BeautifulSoup(html, 'lxml')

Out:

<html><body><b>γέρων</b>, <i>οντος, ὁ</i>, Wurzel <i>ΓΕΡ</i>, verwandt mit <i>γέρας, γεραρός, γεραιός</i></body></html>

Documentation:

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))  # you can open your file here

soup = BeautifulSoup("<html>data</html>")

First, the document is converted to Unicode, and HTML entities are converted to Unicode characters.
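
A small sketch of that conversion on a fragment of the question's data, assuming `bs4` is installed. The `'html.parser'` backend is a choice made here for illustration: it ships with Python and, unlike the lxml output shown earlier, does not wrap the fragment in `<html><body>`.

```python
from bs4 import BeautifulSoup

fragment = "<b>&#947;&#941;&#961;&#969;&#957;</b>, Wurzel <i>&#915;&#917;&#929;</i>"

# Parsing converts the character references to Unicode characters.
soup = BeautifulSoup(fragment, "html.parser")

markup = str(soup)      # markup with the tags kept
text = soup.get_text()  # text only, with the tags removed

print(markup)
print(text)
```

Keep in mind that when the markup is written back out, characters such as `&` and `<` are escaped again by default, so this route suits files that are really HTML rather than arbitrary text.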

Answer 4

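Thanks for your help. I did manage to do this easily with the latest version of EmEditor, and it turned out to be very fast:

Select the text > Edit > Encode/Decode Selection > HTML/XML Character Reference to Unicode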

Convert an HTML entity file to Unicode (using BeautifulSoup and Python?)
