How do I convert HTML files into human-readable txt files?

Date: 2011-08-23 03:05:42

Tags: python

I have a lot of HTML files that look like this:

<font face="Garmond,Helvetica,Times" size="2" color="#330066">
<b>
 Summary:
</b>
&nbsp;According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.
<br />
<br />
On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations.  On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed.  The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.
<br />
<hr width="50%" align="left" />
INDUSTRY CLASSIFICATION:
<br />
<b>
 SIC Code:
</b>
0000
<br />
<b>
 Sector:
</b>
N/A
<br />
<b>
 Industry:
</b>
N/A
<br />
</font>

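What I want to do is take the text in the middle of the file and convert it to a human-readable format. In this example, that is:

According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.

On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations.  On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed.  The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.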

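I know I have to do three things:

  1. Take out the text in the middle of the file
  2. Replace "<br />" with "\n"
  3. Replace "&nbsp;" with " " (a single space)

I know the last two are simple, just a matter of using the replace method in Python, but I don't know how to achieve the first one.

I know about regular expressions and BeautifulSoup, but I don't know how to apply them to this problem.

Can anyone help me?

Thanks, and sorry for my poor English.

@Paul: I only want the Summary section. My teacher (who knows little about computers) gave me a lot of HTML files and asked me to convert them into a format suitable for data mining (my teacher is trying to use SAS for this). I don't know SAS, but I think it is probably meant for processing large numbers of txt files, so I want to convert these HTML files into plain txt files.

@Owen: I need to process a lot of HTML files, and I don't think this problem is hard to handle, so I want to solve it directly in Python.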

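3 answers:

Answer 0: (score: 3)

You can use Scrapely.

Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.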

http://github.com/scrapy/scrapely

Answer 1: (score: 2)

To accomplish this task, you can use a Python library called lxml.

  • First, download and install lxml

Now try running the following code:

from lxml.html import fromstring

html = '''
<font face="Garmond,Helvetica,Times" size="2" color="#330066">
<b>
 Summary:
</b>
&nbsp;According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.
<br />
<br />
On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations.  On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed.  The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.
<br />
<hr width="50%" align="left" />
INDUSTRY CLASSIFICATION:
<br />
<b>
 SIC Code:
</b>
0000
<br />
<b>
 Sector:
</b>
N/A
<br />
<b>
 Industry:
</b>
N/A
<br />
</font>
'''

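# Parse the fragment and flatten it to plain text (entities such as &nbsp; are decoded).
htmlElement = fromstring(html)
textContent = htmlElement.text_content()
# Keep only the text between the "Summary:" label and the "INDUSTRY CLASSIFICATION:" heading.
result = textContent.split('\n\n Summary:\n\n')[1].split('\n\nINDUSTRY CLASSIFICATION:\n\n')[0]

# print() with parentheses works on both Python 2 and Python 3.
print(result)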

This code works as long as '\n\n Summary:\n\n' comes immediately before the desired text and '\n\nINDUSTRY CLASSIFICATION:\n\n' comes immediately after it.
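
If lxml is not available, the three steps listed in the question can also be sketched with only the standard library. This is a minimal sketch, not a general HTML parser. It assumes every file has the same layout as the sample: a bold "Summary:" label followed later by an <hr> tag. The names extract_summary and convert_folder, and the UTF-8 encoding, are assumptions made for illustration.

```python
import html
import re
from pathlib import Path


def extract_summary(page: str) -> str:
    """Return the text between the 'Summary:' label and the first <hr> tag."""
    # Step 1: capture everything after the bold "Summary:" label, up to <hr ...>.
    match = re.search(r"Summary:\s*</b>(.*?)<hr\b", page,
                      flags=re.DOTALL | re.IGNORECASE)
    if match is None:
        raise ValueError("no Summary section found")
    body = match.group(1)
    # Line breaks in the HTML source carry no meaning, so collapse them first.
    body = re.sub(r"\s*\n\s*", " ", body)
    # Step 2: turn <br>, <br/> and <br /> into real newlines.
    body = re.sub(r"\s*<br\s*/?>\s*", "\n", body, flags=re.IGNORECASE)
    # Drop any other tags left inside the section.
    body = re.sub(r"<[^>]+>", "", body)
    # Step 3: decode entities; &nbsp; becomes U+00A0, which is mapped to a space.
    body = html.unescape(body).replace("\xa0", " ")
    return body.strip()


def convert_folder(src: Path, dst: Path) -> None:
    """Write one .txt file per .html file in src (hypothetical folder layout)."""
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(src.glob("*.html")):
        # Assumption: files are UTF-8; undecodable bytes are replaced.
        page = path.read_text(encoding="utf-8", errors="replace")
        try:
            text = extract_summary(page)
        except ValueError:
            continue  # skip files that do not follow the sample layout
        (dst / (path.stem + ".txt")).write_text(text, encoding="utf-8")
```

Calling convert_folder(Path("html_files"), Path("txt_files")) would then process a whole directory in one go; both folder names are placeholders. Regular expressions are fragile on arbitrary HTML, so this is only reasonable because the files share one fixed template.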

Answer 2: (score: 1)

The closest thing is converting the HTML to reStructuredText, which you can try online; the output is as follows.

 **Summary:** According to the complaint filed January 04, 2011, over a
six-week period in December 2007 and January 2008, six healthcare
related hedge funds managed by Defendant FrontPoint Partners LLC
(“FrontPoint”) sold more than six million shares of Human Genome
Sciences, Inc. (“HGSI”) common stock while their portfolio manager
possessed material negative non-public information concerning the HGSI’s
clinical trial for the drug Albumin Interferon Alfa 2-a.
 On March 2, 2011, the plaintiffs filed a First Amended Class Action
Complaint, amending the named defendants and securities violations. On
March 22, 2011, a motion for appointment as lead plaintiff and for
approval of selection of lead counsel was filed. The defendants
responded to the First Amended Complaint by filing a motion to dismiss
on March 28, 2011.

--------------

INDUSTRY CLASSIFICATION:
 **SIC Code:** 0000
 **Sector:** N/A
 **Industry:** N/A
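
A similar whole-document conversion can be approximated locally with the standard library's html.parser module. The sketch below is an assumption-laden illustration, not the converter used above: the class name TextExtractor and the function html_to_text are made up, bold markers are not reproduced, and script or style blocks are not filtered out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, turning <br> into newlines and <hr> into a rule line."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; automatically.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags like <br /> also arrive here via handle_startendtag.
        if tag == "br":
            self.parts.append("\n")
        elif tag == "hr":
            self.parts.append("\n--------------\n")

    def handle_data(self, data):
        # Newlines in the source are layout only; keep them as plain spaces.
        self.parts.append(data.replace("\r", " ").replace("\n", " "))


def html_to_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    raw = "".join(parser.parts)
    # Collapse runs of whitespace (including U+00A0) within each output line.
    lines = [" ".join(line.split()) for line in raw.split("\n")]
    return "\n".join(lines).strip()
```

Unlike the split-based approach, this keeps the whole document, so the Summary section would still need to be cut out afterwards if only that part is wanted.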