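Getting a clean text file from the BeautifulSoup HTML parser

Asked: 2017-09-05 09:41:10

Tags: python html python-3.x beautifulsoup

I have run into a lot of problems with BeautifulSoup while trying to do text analysis on Project Gutenberg files (see here). I have almost all of my code sorted out, but one last problem has me stumped: how to get a clean text file after removing some redundant text from the version cleaned by BeautifulSoup. Let me explain: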

Step 1: I extract the text, minus the HTML junk, while recording the title of the text:

from bs4 import BeautifulSoup
import re

### Opens saved html file
html = open("/filepath/Jane_Eyre_Test.htm")

### Cleans html file
soup = BeautifulSoup(html, 'html.parser')

title = re.findall(r'<title>(.*?)</title>',soup.get_text())

Step 2: I get rid of the boilerplate Gutenberg license text so that it doesn't skew the analysis:

s1 = '***START OF THE PROJECT GUTENBERG EBOOK '+title[0].upper()+'***'

s2 = '***END OF THE PROJECT GUTENBERG EBOOK '+title[0].upper()+'***'

main_text = soup.get_text()[(soup.get_text().index(s1)+len(s1)):soup.get_text().index(s2)]

Step 3: I open a text file to write the result into:

#### Opens blank text file
f = open('filepath/'+title[0]+'.txt', 'w')
f.write(main_text)

Now, here is where the problem arises: when I do this, the resulting text file is full of format tags, for example:

&lt;/pre&gt; &lt;p&gt;&lt;a name="startoftext"&gt;&lt;/a&gt;&lt;/p&gt; &lt;p&gt;Transcribed from the 1897 Service &amp;amp; Paton edition by David Price, email ccx074@pglaf.org&lt;/p&gt;

But when I try to use Beautiful Soup to clean it out with

main_text1 = BeautifulSoup(main_text, 'html.parser')
f.write(main_text1.get_text())

the result is not much better:

</pre> <p><a name="startoftext"></a></p> <p>Transcribed from the 1897
Service &amp; Paton edition by David Price, email ccx074@pglaf.org</p>

despite the fact that

[...]

produces a properly formatted text file. I suspect I am missing some important distinction between text format and HTML format; if so, any pointers are appreciated. Of course, any solution that gets rid of the format tags in the text would be even more welcome.
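The two outputs quoted above differ by exactly one level of entity escaping, which suggests that each parsing pass decodes entities but only removes tags that are already literal markup. The following is a minimal sketch of that mechanism, not a diagnosis of the actual file. It uses the standard-library `html.parser` instead of BeautifulSoup so that it runs without third-party packages, and `TextExtractor`, `strip_tags`, and the sample string are hypothetical names and data introduced for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; inside text nodes
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of markup, dropping any real tags."""
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()  # flush any buffered text
    return "".join(extractor.parts)


# Hypothetical sample, modelled on the escaped output quoted in the question.
escaped = "&lt;p&gt;Service &amp;amp; Paton edition&lt;/p&gt;"

once = strip_tags(escaped)  # entities decoded; the tags are now literal markup
twice = strip_tags(once)    # a second pass removes those tags
```

If the saved file really is escaped in this way, one extra parsing pass per level of escaping would be needed before the tags disappear.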

2 answers:

Answer 0 (score: 1)

Try the following approach; get_text() should work fine on the soup object:

from bs4 import BeautifulSoup
import re

# Parse the saved HTML file
with open('Jane_Eyre_Test.htm') as f_jane_html:
    soup = BeautifulSoup(f_jane_html, "html.parser")

# Find the <a name="startoftext"> anchor and take the text of its grandparent element
a = soup.find('a', attrs={"name": "startoftext"})
text = a.parent.parent.get_text()

# Keep only what lies between the Gutenberg START and END markers
start = re.escape("***START OF THE PROJECT GUTENBERG EBOOK JANE EYRE***")
end = re.escape("***END OF THE PROJECT GUTENBERG EBOOK")
text = re.search('{}(.*){}'.format(start, end), text, re.S).group(1)

# Write the extracted text to a plain-text file
with open('Jane_Eyre.txt', 'w') as f_jane_text:
    f_jane_text.write(text)

This gives you a file that begins and ends as follows:

Transcribed from the 1897 Service & Paton edition by David

Price, email ccx074@pglaf.org
JANE EYRE

AN AUTOBIOGRAPHY
by
.
.
.
I come quickly!’ and hourly I more eagerly

respond,—‘Amen; even so come, Lord

Jesus!’”
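The extraction step in this answer hard-codes the book title in the START marker. A minimal sketch of a title-independent variant follows. It assumes the markers have the shape used in the answer's strings and that the START marker sits on a single line. `extract_body`, `START_RE`, and `END_RE` are hypothetical names, not part of any library.

```python
import re

# Assumed marker shapes, based on the strings used in the answer's code.
START_RE = re.compile(r"\*\*\*\s*START OF THE PROJECT GUTENBERG EBOOK[^\n]*?\*\*\*")
END_RE = re.compile(r"\*\*\*\s*END OF THE PROJECT GUTENBERG EBOOK")


def extract_body(text):
    """Return the text between the START and END markers, without the markers."""
    start = START_RE.search(text)
    if start is None:
        raise ValueError("START marker not found")
    # Look for the END marker only after the START marker
    end = END_RE.search(text, start.end())
    if end is None:
        raise ValueError("END marker not found")
    return text[start.end():end.start()].strip()
```

Raising `ValueError` when a marker is missing is a design choice made here so that a file with different boilerplate fails loudly instead of silently producing an empty result.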

The HTML used to test the code in this answer was taken from Jane Eyre, by Charlotte Bronte.

The test file was created as follows:

import requests

r = requests.get("http://www.gutenberg.org/files/1260/1260-h/1260-h.htm")

# r.content is bytes, so under Python 3 the file must be opened in binary mode
with open('Jane_Eyre_Test.htm', 'wb') as f_jane_eyre:
    f_jane_eyre.write(r.content)
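Because `r.content` is a bytes object, how the file is written and later decoded matters under Python 3. The sketch below illustrates the round trip without touching the network. `save_bytes_then_read_text` is a hypothetical helper, `sample_content` is a made-up stand-in for `r.content`, and UTF-8 is an assumption that should be checked against the real page's encoding.

```python
from pathlib import Path


def save_bytes_then_read_text(raw, path, encoding="utf-8"):
    """Write raw bytes to path, then read the file back as decoded text."""
    target = Path(path)
    target.write_bytes(raw)  # binary write: the bytes are stored unchanged
    return target.read_text(encoding=encoding)  # decode explicitly when reading


# Hypothetical stand-in for r.content (the body of an HTTP response, as bytes).
sample_content = "<p>Service &amp; Paton \u2018edition\u2019</p>".encode("utf-8")
```

Passing the encoding explicitly when reading avoids depending on the platform default, which may differ between machines.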

Answer 1 (score: 0)

You can get a clean text file after removing the redundant text. You can follow this approach:

>>> # x1 is not defined in this answer; it is assumed to be an iterable of strings
>>> with open("Book_titles.txt", "w") as file:
...     for line in x1:
...         file.write(line)
...         file.write('\n')
...
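The same idea can be written as a small reusable function. This is a minimal sketch, assuming the input is an iterable of strings. `write_lines` and `book_titles` are hypothetical names, with `book_titles` standing in for the undefined `x1` above.

```python
def write_lines(path, lines):
    """Write each string in lines to path, one per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


# Hypothetical input standing in for the undefined x1 in the answer above.
book_titles = ["Jane Eyre", "Wuthering Heights"]
```

Calling `write_lines("Book_titles.txt", book_titles)` would then produce a file with one title per line.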