Getting multiple repeated lines with regular expressions in Python

Date: 2019-03-24 07:12:11

Tags: python regex

I am new to RegEx and have a very large text file; a small portion of it is shown below.

I want to use RegEx to extract only the "Synopsis" text from it:

<div class="hbk-preamble " id="preamble-APG5180">
<div class="hbk-preamble-entry">
<div class="hbk-preamble-icon hbk-preamble-icon_mode"></div>
<p class="hbk-preamble-heading">Offered</p>
<p><a href="index-bylocation-city-melbourne.html">City (Melbourne)</a></p><ul class="hbk-preamble-list__offerings"><li>Summer semester A 2019 (Flexible)</li></ul><p><a href="index-bylocation-clayton.html">Clayton</a></p><ul class="hbk-preamble-list__offerings"><li>First semester 2019 (On-campus)</li></ul>
</div>
</div>
<div class="notes">
<p class="hbk-heading hdg_6">Notes</p>
<p></p><ul>
<li>The unit may be offered as part of the <a class="hbk-screen-url" href="http://www.monash.edu/students/courses/arts/summer-program.html">Summer Arts Program</a><span class="hbk-print-url">Summer Arts Program (<a href="http://www.monash.edu/students/courses/arts/summer-program.html">http://www.monash.edu/students/courses/arts/summer-program.html</a>)</span>.</li>
<li>For more information please visit the <a class="hbk-screen-url" href="https://www.anzsog.edu.au/">ANZSOG webpage</a><span class="hbk-print-url">ANZSOG webpage (<a href="https://www.anzsog.edu.au/">https://www.anzsog.edu.au/</a>)</span>.</li>
</ul>
</div>
<h2 class="hbk-heading">Synopsis</h2>
<div>
<p>The media is one of the most important components of any political society. In a liberal democracy like Australia, its role and function have profound implications for the conduct of politics, the nature of democracy and public policy outcomes. In this unit, the relationship between the media, politics and public policy is studied from three broad perspectives. First, the politics of the media is investigated from the perspective of liberal democratic theory in order to understand the role of news media on the policy debate. Second, the political economy of the media is investigated. Particular emphasis is on the structure and operation of media organisations and journalists and how political news is covered. Third, the unit undertakes a study of the relationship between the media and political actors. Particular emphasis is on the use of public relations and 'spin doctors' in managing the media as well as the utilisation of political advertising and strategic political communication by governments and political agents.</p>
</div>
<h2 class="hbk-heading">Outcomes</h2>
<div>
<p>Upon successful completion of the unit students should have:</p>
<ol princestart="0" start="1" type="1">

I need the synopsis text for every section in the text file. How should I go about it?

So far I have read my text file using read and readlines, but I cannot work out a pattern to start from.

2 Answers:

Answer 0: (score: 1)

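First, I will not answer your question directly. I think your question is an XY problem. In your case you are dealing with HTML, and there are many powerful tools built for exactly that.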

Take a look at Python's BeautifulSoup:

from bs4 import BeautifulSoup

# content is the HTML text, e.g. the string read from your file
soup = BeautifulSoup(content, 'html.parser')

You can then extract whatever you need from this soup.

Now, back to your question: if you still want to use a regular expression, https://regex101.com can help you build and test it:

Demo: https://regex101.com/r/AcozoW/1

<p.*?Notes.*?<li>(.+?)<\/li>
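
The pattern above targets the list items under "Notes". As a minimal sketch of the same idea applied to the "Synopsis" sections the question asks for, the snippet below assumes every synopsis is a <div> that directly follows an <h2 class="hbk-heading">Synopsis</h2> heading, as in the sample. The sample string, the SYNOPSIS_RE name, and the extract_synopses helper are illustrative assumptions, and a regex like this may break if the markup varies.

```python
import re

# Hypothetical sample standing in for the contents of the large text file.
html = """<h2 class="hbk-heading">Synopsis</h2>
<div>
<p>First synopsis text.</p>
</div>
<h2 class="hbk-heading">Outcomes</h2>
<div>
<p>Upon successful completion of the unit students should have:</p>
</div>
<h2 class="hbk-heading">Synopsis</h2>
<div>
<p>Second synopsis text.</p>
</div>
"""

# Match the Synopsis heading, then lazily capture the contents of the <div>
# that follows it. re.DOTALL lets "." match newlines as well.
SYNOPSIS_RE = re.compile(
    r'<h2 class="hbk-heading">Synopsis</h2>\s*<div>\s*(.*?)\s*</div>',
    re.DOTALL,
)


def extract_synopses(text):
    """Return the synopsis text of every section, with inner tags stripped."""
    blocks = SYNOPSIS_RE.findall(text)
    return [re.sub(r"<[^>]+>", "", block).strip() for block in blocks]


synopses = extract_synopses(html)
print(synopses)
```

For a real file, the html string would be replaced by the result of reading the file, for example with open(path).read(), where path is whatever your file is called.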

Answer 1: (score: 1)

I suggest using the beautifulsoup package for this. You could try something like the following:

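import requests
from bs4 import BeautifulSoup

# Fetch the page; replace the placeholder with the real address
data = requests.get('put website address here')
soup = BeautifulSoup(data.text, 'html.parser')

# Print the text of every <h2 class="hbk-heading"> heading
for i in soup.find_all('h2', {'class':'hbk-heading'}):
    print(i.text.strip())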
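
The loop above prints the heading text itself. To get the synopsis body, one possible extension is to step from each "Synopsis" heading to the <div> that follows it. This is only a sketch under the assumption that the markup matches the sample in the question; the sample string and the extract_synopses helper are hypothetical, and it parses a string rather than fetching a URL because the question works from a local text file.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the contents of the large text file.
html = """<h2 class="hbk-heading">Synopsis</h2>
<div>
<p>First synopsis text.</p>
</div>
<h2 class="hbk-heading">Outcomes</h2>
<div>
<p>Upon successful completion of the unit students should have:</p>
</div>
<h2 class="hbk-heading">Synopsis</h2>
<div>
<p>Second synopsis text.</p>
</div>
"""


def extract_synopses(markup):
    """Return the text of the <div> following each 'Synopsis' heading."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for heading in soup.find_all("h2", class_="hbk-heading"):
        # Skip other headings such as "Outcomes".
        if heading.get_text(strip=True) != "Synopsis":
            continue
        # The synopsis body is assumed to be the next <div> sibling.
        body = heading.find_next_sibling("div")
        if body is not None:
            results.append(body.get_text(" ", strip=True))
    return results


synopses = extract_synopses(html)
print(synopses)
```

Unlike the regex approach, this does not depend on exact whitespace or attribute order in the heading tag, which is the main reason both answers recommend a parser.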