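BeautifulSoup: extract a sentence if it contains a keyword

Time: 2017-03-13 12:43:10

Tags: python html web-scraping beautifulsoup

I want to process an HTML website (for example this one: http://www.uni-bremen.de/mscmarbiol/) and save every sentence that contains the string 'research'.

This is just an example of the code I use to extract all the text from the website.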

from bs4 import BeautifulSoup

html_page = "example.html"  # I saved the page locally as example.html

with open(html_page, "r") as html:
    soup = BeautifulSoup(html, "lxml")
    text_group = soup.get_text()  # all visible text of the page as one string

print(text_group)

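What is the best way to export only the sentences that contain the word 'research'?

Is there a more elegant way than using .split with separators? Can something be done with "re"?

Thank you very much for your help, as I am very new to this topic.

Best regards,

Trgovec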

3 answers:

Answer 0: (score: 1)

Once you have the soup, you can try:

# soup.descendants yields both tags and text nodes
for tag in soup.descendants:
    if tag.string and 'research' in tag.string:
        print(tag.string)

An alternative that uses XPath is faster, and you already have lxml installed:

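from lxml import etree
# html_page is the local file name defined in the question
with open(html_page, "rb") as html:
    tree = etree.parse(html, parser=etree.HTMLParser())
# Leading text of every element whose first text node contains 'research'
matches = [e.text for e in tree.xpath("//*[contains(text(), 'research')]")]
print(matches)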

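Answer 1: (score: 1)

Given that "sentence" is not strictly defined in a document, it sounds like you need a tool that splits plain text into sentences.

The NLTK package is well suited to this kind of thing. You will want to do something like this: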
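import nltk
# sent_tokenize needs NLTK's Punkt tokenizer data to be downloaded first.
# `text` is the plain text of the page, e.g. text_group from the question.
sentences = nltk.sent_tokenize(text)
result = [sentence for sentence in sentences if "research" in sentence]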

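It is not perfect (it does not understand that the "M.Sc." in your document is not a sentence on its own), but sentence segmentation is a deceptively complex task, and this is about as good as you will get.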
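Since the question also asks about "re", here is a minimal sketch of a regex-only alternative that needs nothing beyond the standard library. The function name sentences_with_keyword and the splitting rule (a sentence ends at '.', '!' or '?' followed by whitespace) are assumptions made for illustration, not part of any library, and the rule is cruder than NLTK's tokenizer.

```python
import re

# Naive rule (an assumption, not a real sentence tokenizer): a sentence
# ends at '.', '!' or '?' when that character is followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_with_keyword(text, keyword):
    """Return the sentences of `text` that contain `keyword` as a substring."""
    # get_text() output is full of newlines and indentation; collapse it first.
    normalized = " ".join(text.split())
    return [s for s in SENTENCE_END.split(normalized) if keyword in s]


sample = (
    "Many courses are taught in the laboratories.\n"
    "This sets the scene for direct links to research.   Field trips are offered!"
)
print(sentences_with_keyword(sample, "research"))
```

On a real page, text_group from the question would be passed in place of sample. The abbreviation problem mentioned above is worse here: "The M.Sc. programme is research-orientated." is cut after "M.Sc.", so only the fragment "programme is research-orientated." comes back.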

Answer 2: (score: 0)

In [65]: soup.find_all(name=['p', 'li'], text=re.compile(r'research'))
Out[65]: 
[<p class="bodytext">The M.Sc. programme Marine Biology is strongly research-orientated. The graduates are trained to develop hypotheses-driven research concepts and to design appropriate experimental approaches in order to answer profound questions related to the large field of marine ecosystem and organism functioning and of potential impacts of local, regional and global environmental change. 
 </p>,
 <p class="bodytext">Many courses are actually taught in the laboratories and facilities of the institutes benefiting from cutting-edge research infrastructure and first-hand contact to leading experts. This unique context sets the scene for direct links from current state of research to academic training.</p>,
 <li>Training in state-of-the-art methodologies by leading research teams.</li>,
 <li>Advanced courses in different university departments and associated research institutions.</li>,
 <li>Field trips, excursions or even the opportunity to participate in research expeditions. </li>,
 <p class="bodytext">The University of Bremen and the associated research institutions offer a variety of opportunities to continue an academic career as Ph.D. candidate.
 </p>,
 <p class="bodytext">Employment opportunities for Marine Biologists exist worldwide at institutions committed to research and development, in the fishing and aquaculture industry as well as in the environmental conservation and management sector at governmental agencies or within NGOs and IGOs. Marine biologists also work at museums, zoological gardens, and aquaria. Additional employment opportunities for marine biologists include adjacent fields such as media (i.e. scientific journalism), eco-consulting, environmental impact assessments, and eco-tourism business. Marine biologists are also employed in the commercial and industrial sector, for instance for "Blue Biotechnology", coastal zone management and the sustainable use of marine resources.</p>]
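The pattern r'research' used above is a case-sensitive substring match, so it misses a capitalized "Research" and also accepts longer words such as "researcher". The sketch below compares it with a whole-word, case-insensitive variant on plain strings. The whole_word pattern and the sample strings are illustrative assumptions, and whether the stricter pattern is wanted depends on the page.

```python
import re

plain = re.compile(r"research")                           # pattern from the answer
whole_word = re.compile(r"\bresearch\b", re.IGNORECASE)   # hypothetical refinement

samples = [
    "Training by leading research teams.",
    "Research expeditions are offered.",
    "Each researcher has a desk.",
]

# For each sample, record whether each pattern finds a match.
plain_hits = [bool(plain.search(s)) for s in samples]
whole_word_hits = [bool(whole_word.search(s)) for s in samples]

for s, p, w in zip(samples, plain_hits, whole_word_hits):
    print(p, w, s)
```

A compiled pattern like whole_word should be usable in the find_all call in place of re.compile(r'research'), assuming whole-word matching is what is wanted.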