Question

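I have run into a lot of problems with BeautifulSoup while trying to run text analysis on Project Gutenberg files (see here). I have almost all of my code sorted out, but one last problem has me stumped: how to get a clean text file after removing some redundant text from the version I cleaned with BeautifulSoup. Let me explain: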

Step 1: I extract the text, minus the HTML junk, while recording the title of the text:

from bs4 import BeautifulSoup
import re

### Opens saved html file
html = open("/filepath/Jane_Eyre_Test.htm")

### Cleans html file
soup = BeautifulSoup(html, 'html.parser')


title = re.findall(r'<title>(.*?)</title>',soup.get_text())

Step 2: I get rid of the boilerplate Gutenberg license text so that it does not skew the analysis:

s1 = '***START OF THE PROJECT GUTENBERG EBOOK '+title[0].upper()+'***'

s2 = '***END OF THE PROJECT GUTENBERG EBOOK '+title[0].upper()+'***'

main_text = soup.get_text()[(soup.get_text().index(s1)+len(s1)):soup.get_text().index(s2)]

Step 3: I open a text file to write the results to:

#### Opens blank text file
f = open('filepath/'+title[0]+'.txt', 'w')
f.write(main_text)

Now the problem arises: when I do this, the resulting text file is full of formatting tags, for example:

&lt;/pre&gt; &lt;p&gt;&lt;a name="startoftext"&gt;&lt;/a&gt;&lt;/p&gt; &lt;p&gt;Transcribed from the 1897 Service &amp;amp; Paton edition by David Price, email ccx074@pglaf.org&lt;/p&gt;

But when I try to clean it using Beautiful Soup,

main_text1 = BeautifulSoup(main_text, 'html.parser')
f.write(main_text1.get_text())

the result is not much better:

</pre> <p><a name="startoftext"></a></p> <p>Transcribed from the 1897
Service &amp; Paton edition by David Price, email ccx074@pglaf.org</p>

This is despite the fact that [...] produces a correctly formatted text file. I suspect I am missing some important distinction between text format and HTML format; if so, any pointers are appreciated. Of course, any solution that gets rid of the formatting tags in the text would be even more welcome.
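One possible reading of the distinction asked about here, offered as a minimal sketch rather than a diagnosis: if the saved file stores its markup as entity-escaped text (for example &lt;p&gt; instead of a real p tag), a single get_text() pass has no tags to strip and only decodes the entities. The sample string below is hypothetical and is not taken from the actual Gutenberg file.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: markup stored as entity-escaped text, not as real tags.
escaped = '&lt;p&gt;Service &amp;amp; Paton edition&lt;/p&gt;'

# First pass: there are no real tags yet, so get_text() only decodes entities.
first_pass = BeautifulSoup(escaped, 'html.parser').get_text()

# Second pass: the decoded string now contains real tags, which get stripped.
second_pass = BeautifulSoup(first_pass, 'html.parser').get_text()

print(first_pass)
print(second_pass)
```

Under this assumption, each parsing pass removes exactly one layer of escaping, which would be consistent with tags still being visible after one or two passes.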

Answer 1

Try the following approach; get_text() should work fine on the soup object:

from bs4 import BeautifulSoup
import re

with open('Jane_Eyre_Test.htm') as f_jane_html:
    soup = BeautifulSoup(f_jane_html, "html.parser")

a = soup.find('a', attrs={"name" : "startoftext"})
text = a.parent.parent.get_text()

start = re.escape("***START OF THE PROJECT GUTENBERG EBOOK JANE EYRE***")
end = re.escape("***END OF THE PROJECT GUTENBERG EBOOK")
text = re.search('{}(.*){}'.format(start, end), text, re.S).group(1)

with open('Jane_Eyre.txt', 'w') as f_jane_text:
    f_jane_text.write(text)

This gives you a file whose beginning and end look like this:

Transcribed from the 1897 Service & Paton edition by David

Price, email ccx074@pglaf.org
JANE EYRE

AN AUTOBIOGRAPHY
by
.
.
.
I come quickly!’ and hourly I more eagerly

respond,—‘Amen; even so come, Lord

Jesus!’”

The HTML used for testing was taken from Jane Eyre, by Charlotte Bronte.

The test file was created as follows:

import requests

r = requests.get("http://www.gutenberg.org/files/1260/1260-h/1260-h.htm")

# r.content is bytes, so the file is opened in binary mode
with open('Jane_Eyre_Test.htm', 'wb') as f_jane_eyre:
    f_jane_eyre.write(r.content)
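The marker-based slicing used in this answer can be tried without the HTML file. The following is a minimal sketch on a made-up string; the helper name extract_between and the sample text are assumptions for illustration, not part of the answer. Unlike the answer's code, it uses a non-greedy match (stopping at the first end marker) and returns None instead of raising when a marker is missing.

```python
import re


def extract_between(text, start_marker, end_marker):
    """Return the text between two literal markers, or None if not found."""
    # re.escape makes the asterisks in the markers match literally.
    pattern = '{}(.*?){}'.format(re.escape(start_marker), re.escape(end_marker))
    # re.S lets "." match newlines, so the body may span many lines.
    match = re.search(pattern, text, re.S)
    return match.group(1) if match else None


# Hypothetical stand-in for the text returned by get_text().
sample = (
    "License header\n"
    "***START OF THE PROJECT GUTENBERG EBOOK JANE EYRE***\n"
    "JANE EYRE\n"
    "AN AUTOBIOGRAPHY\n"
    "***END OF THE PROJECT GUTENBERG EBOOK JANE EYRE***\n"
    "License footer\n"
)

body = extract_between(
    sample,
    "***START OF THE PROJECT GUTENBERG EBOOK JANE EYRE***",
    "***END OF THE PROJECT GUTENBERG EBOOK",
)
print(body.strip())
```

Returning None makes it easy to detect a file whose markers are worded differently, which may matter when processing more than one Gutenberg title.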

Answer 2

You can get a clean text file after removing the redundant text. You can follow this:

>>> # x1 is assumed to be an iterable of text lines
>>> with open("Book_titles.txt", "w") as file:
...     for line in x1:
...         file.write(line)
...         file.write('\n')
...
>>>
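Since x1 is not defined in the answer, here is a self-contained sketch of the same write loop. The list named lines and the temporary output directory are assumptions made so the example can run anywhere.

```python
import os
import tempfile

# Hypothetical stand-in for the answer's undefined x1.
lines = ["Jane Eyre", "Wuthering Heights"]

# Write into a temporary directory so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "Book_titles.txt")
with open(path, "w", encoding="utf-8") as f:
    for line in lines:
        f.write(line + "\n")

# Read the file back to confirm each entry ended up on its own line.
with open(path, encoding="utf-8") as f:
    written = f.read()

print(written)
```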

Get a clean text file from the BeautifulSoup HTML parser
