Python: Parsing text between keywords

Asked: 2015-10-28 04:14:25

Tags: python regex web-scraping beautifulsoup

I am trying to parse the text on a certain type of web page with BeautifulSoup, using the following code:

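import re
from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup

html = urlopen('http://english.hani.co.kr/arti/english_edition/e_national/714507.html').read()
soup = BeautifulSoup(html, 'html.parser')  # parse the page so it can be searched
content = str(soup.find("div", class_="article-contents"))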

My goal is to extract at least the first sentence, or the first few sentences, of the first paragraph.

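Since the paragraphs are not wrapped in <p> tags, my best strategy so far is to look in content for the text between </h4> and <p> (which happens to be the first paragraph).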

Here is what the target text looks like:

<div class="article-contents"> <div class="article-alignC"> <table class="photo-view-area"> <tr> <td> <img alt="" border="0" src="http://img.hani.co.kr/imgdb/resize/2015/1024/00542577201_20151024.JPG" style="width:590px;"/> </td> </tr> </table> </div> <h4></h4> (this is the content I want to parse, between </h4> and <p>) <p>

I have tried doing this with BeautifulSoup directly and with regular expressions, but so far without success.

1 answer:

Answer 0: (score: 3)

Locate the h4 element, then use find_next_sibling() to get the first text node that follows it as a sibling:

h4 = soup.select_one("div.article-contents > h4")
print(h4.find_next_sibling(text=True))

Prints:

US scholar argues that any government attempt to impose single view of history is misguided On Oct. 19, the Hankyoreh’s Washington correspondent conducted on interview with phone and email with William North, chair of the history department at Carleton University in Minnesota. The main topic of the discussion was the efforts of the administration of South Korean President Park Geun-hye to take over the production of history textbooks. 

Well, actually, just using .next_sibling is enough:

print(h4.next_sibling)
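
For anyone who wants to try this without fetching the page, here is a minimal offline sketch of the same idea. SAMPLE_HTML is a made-up, trimmed-down stand-in for the article markup, and first_text_after_h4 and first_sentences are hypothetical helper names, not part of BeautifulSoup. The sentence splitter is deliberately naive and is only an assumption about what "first sentence" should mean.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Hypothetical stand-in for the real page: same structure, invented text.
SAMPLE_HTML = """
<div class="article-contents">
  <div class="article-alignC">
    <table class="photo-view-area"><tr><td><img alt="" src="photo.jpg"/></td></tr></table>
  </div>
  <h4></h4>First sentence of the article. Second sentence follows here.<p>Later paragraph.</p>
</div>
"""


def first_text_after_h4(html):
    """Return the first non-blank text node following the first <h4>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    h4 = soup.select_one("div.article-contents > h4")
    if h4 is None:
        return None
    # Walk the siblings after <h4>; skip tags and whitespace-only strings.
    for sibling in h4.next_siblings:
        if isinstance(sibling, NavigableString) and sibling.strip():
            return sibling.strip()
    return None


def first_sentences(text, count=1):
    """Naive split on '.', '!' or '?' followed by whitespace (an assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(parts[:count])


paragraph = first_text_after_h4(SAMPLE_HTML)
print(first_sentences(paragraph, 1))  # -> First sentence of the article.
```

Looping over next_siblings instead of reading .next_sibling once guards against the case where the node right after the h4 is only whitespace. Note that the naive splitter would cut the real article text at abbreviations such as "Oct. 19", so a proper sentence tokenizer may be a better fit there.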