How to scrape the text content of multiple divs

Time: 2015-10-05 14:32:49

Tags: python html web-scraping

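At this URL I need to scrape only the text content under the "References" h3. I tried this code, but I cannot get the text in the same order as it appears on the HTML page.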

I want to get back an array containing each line under References, with the text content only and no HTML tags.

1 answer:

Answer 0 (score: 1)

As suggested in the comments, BeautifulSoup makes this very simple:

In [2]: from bs4 import BeautifulSoup

In [3]: import urllib2

In [4]: url = "http://www.dlib.org/dlib/november14/brook/11brook.html"

In [5]: soup = BeautifulSoup(urllib2.urlopen(url))

In [6]: for h3 in soup.find_all("h3"):
   ...:     print(h3.text)
   ...:     
D-Lib Magazine
The Social, Political and Legal Aspects of Text and Data Mining (TDM)
Abstract
1. Introduction
2. Copyright, database right, licences and TDM
3. Recent changes to UK law
4. What can politicians and policy makers do? 
5. Publishers are not embracing opportunities of TDM
6. How can publishers help TDM researchers?
7. Awareness among academics and a technological gap 
8. Conclusion
Notes
References
About the Authors
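
The session above only lists the headings. One possible way to go from there to the array the question asks for is to locate the "References" h3 and collect the text of the elements that follow it until the next h3. The following is a minimal sketch, not a tested solution for the actual page: it assumes the reference entries are siblings of the h3, which may not match the real markup. `SAMPLE_HTML` and `texts_under_heading` are hypothetical names introduced for illustration, and the sample markup stands in for the downloaded page so the example runs without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<h3>Notes</h3>
<p>A note.</p>
<h3>References</h3>
<p>[1] First <i>reference</i>.</p>
<p>[2] Second reference.</p>
<h3>About the Authors</h3>
<p>Author bio.</p>
</body></html>
"""


def texts_under_heading(html, heading="References"):
    """Return the text of each element between the given h3 and the next h3."""
    soup = BeautifulSoup(html, "html.parser")

    # Find the h3 whose visible text matches the heading exactly.
    start = None
    for h3 in soup.find_all("h3"):
        if h3.get_text(strip=True) == heading:
            start = h3
            break
    if start is None:
        return []

    lines = []
    # find_next_siblings() yields following tags in document order,
    # so the page order is preserved.
    for sibling in start.find_next_siblings():
        if sibling.name == "h3":
            break  # the next section begins here
        # Drop the tags and collapse runs of whitespace.
        text = " ".join(sibling.get_text().split())
        if text:
            lines.append(text)
    return lines


references = texts_under_heading(SAMPLE_HTML)
print(references)
```

If the entries on the real page are nested inside a wrapper div rather than sitting next to the h3, the loop would need to start from that wrapper instead.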