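While trying to run text analysis on Project Gutenberg files, I have run into a lot of problems with BeautifulSoup (see here). I have almost all of my code sorted out, but one last problem has me stumped: how to get a clean text file after removing some redundant text from the version I cleaned with BeautifulSoup. Let me explain: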
Step 1: I extract the text, minus the HTML junk, while recording the title of the text:
from bs4 import BeautifulSoup
import re
### Opens saved html file
html = open("/filepath/Jane_Eyre_Test.htm")
### Cleans html file
soup = BeautifulSoup(html, 'html.parser')
title = re.findall(r'<title>(.*?)</title>',soup.get_text())
Step 2: Get rid of the boilerplate Gutenberg license text so that it does not skew the analysis:
s1 = '***START OF THE PROJECT GUTENBERG EBOOK '+title[0].upper()+'***'
s2 = '***END OF THE PROJECT GUTENBERG EBOOK '+title[0].upper()+'***'
main_text = soup.get_text()[(soup.get_text().index(s1)+len(s1)):soup.get_text().index(s2)]
Step 3: Open a text file and write the results to it:
#### Opens blank text file
f = open('filepath/'+title[0]+'.txt', 'w')
f.write(main_text)
Now the problem: when I do this, the resulting text file is full of formatting tags, for example:
&lt;/pre&gt; &lt;p&gt;&lt;a name="startoftext"&gt;&lt;/a&gt;&lt;/p&gt; &lt;p&gt;Transcribed from the 1897 Service &amp;amp; Paton edition by David Price, email ccx074@pglaf.org&lt;/p&gt;
But when I try to clean it up with BeautifulSoup,
main_text1 = BeautifulSoup(main_text, 'html.parser')
f.write(main_text1.get_text())
the result is not much better.
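This is despite the fact that
</pre> <p><a name="startoftext"></a></p> <p>Transcribed from the 1897
Service & Paton edition by David Price, email ccx074@pglaf.org</p>
produces a properly formatted text file. I suspect I am missing some important distinction between text format and HTML format; if so, any pointers are appreciated. Of course, any solution that gets rid of the formatting tags in the text would be even more welcome.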
Answer 0 (score: 1)
Try the following approach; get_text() should work fine on the soup object:
from bs4 import BeautifulSoup
import re

# Parse the saved HTML file
with open('Jane_Eyre_Test.htm') as f_jane_html:
    soup = BeautifulSoup(f_jane_html, "html.parser")

# Locate the anchor that marks the start of the book text
a = soup.find('a', attrs={"name" : "startoftext"})
text = a.parent.parent.get_text()

# Keep only what lies between the Gutenberg start and end markers
start = re.escape("***START OF THE PROJECT GUTENBERG EBOOK JANE EYRE***")
end = re.escape("***END OF THE PROJECT GUTENBERG EBOOK")
text = re.search('{}(.*){}'.format(start, end), text, re.S).group(1)

with open('Jane_Eyre.txt', 'w') as f_jane_text:
    f_jane_text.write(text)
This gives you a file that begins and ends as follows:
Transcribed from the 1897 Service & Paton edition by David
Price, email ccx074@pglaf.org
JANE EYRE
AN AUTOBIOGRAPHY
by
.
.
.
I come quickly!’ and hourly I more eagerly
respond,—‘Amen; even so come, Lord
Jesus!’”
The HTML used to test this was taken from Jane Eyre, by Charlotte Bronte.
The test file was created as follows:
import requests

r = requests.get("http://www.gutenberg.org/files/1260/1260-h/1260-h.htm")

# r.content is bytes, so the file is opened in binary mode
with open('Jane_Eyre_Test.htm', 'wb') as f_jane_eyre:
    f_jane_eyre.write(r.content)
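The extraction above hard-codes the book title in the start marker. As a minimal sketch, the text between the markers could be pulled out without knowing the title. It assumes the markers have the `***START OF THE PROJECT GUTENBERG EBOOK ...***` shape seen in this file; the wording differs between Gutenberg files, so the patterns are an assumption. The helper name `extract_body` and the sample string are hypothetical and not part of the answer.

```python
import re

# Assumed marker shape; real Gutenberg files may word these lines differently.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^*\n]*\*\*\*"
)
END_RE = re.compile(r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK")


def extract_body(text):
    """Return the text between the start and end markers, or None if either is missing."""
    start = START_RE.search(text)
    if start is None:
        return None
    # Only look for the end marker after the start marker.
    end = END_RE.search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()].strip()


# Hypothetical sample standing in for the output of soup.get_text().
sample = (
    "License header\n"
    "***START OF THE PROJECT GUTENBERG EBOOK JANE EYRE***\n"
    "There was no possibility of taking a walk that day.\n"
    "***END OF THE PROJECT GUTENBERG EBOOK JANE EYRE***\n"
    "License footer\n"
)
print(extract_body(sample))
```

Returning None when a marker is missing lets the caller decide what to do with files that do not follow the assumed layout, instead of failing on a missing match.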
Answer 1 (score: 0)
You can get a clean text file after removing the redundant text. You can follow this:
>>> # x1 is not defined in this answer; it is assumed to be an iterable of text lines
>>> with open("Book_titles.txt", "w") as file:
...     for line in x1:
...         file.write(line)
...         file.write('\n')
...
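The loop above depends on `x1`, which the answer does not define. A minimal self-contained sketch of the same idea follows. It assumes `x1` is a list of strings; the sample values and the temporary file location are hypothetical, chosen only so the example can be run anywhere.

```python
import os
import tempfile

# Hypothetical sample data standing in for x1.
titles = ["JANE EYRE", "WUTHERING HEIGHTS"]

# Write into a fresh temporary directory so no existing file is touched.
path = os.path.join(tempfile.mkdtemp(), "Book_titles.txt")

with open(path, "w", encoding="utf-8") as out_file:
    for line in titles:
        out_file.write(line + "\n")

# Read the file back to confirm there is one entry per line.
with open(path, encoding="utf-8") as in_file:
    written = in_file.read().splitlines()

print(written)
```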