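Find all text until the next regex match

Time: 2015-09-14 21:32:06

Tags: python regex

I'm trying to collect all text up to the next regex match in Python. The data is a debate transcript published online.

Currently I iterate over all the p tags, identify the ones that carry a speaker label, and then append all consecutive text without a speaker label to the previous match.

I'm not sure whether this is the best way to proceed, or whether it would be easier to simply search for and group all the text in one pass. At the moment, I can only get the lines that begin with at least three capital letters.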

import re
import requests as rq
from bs4 import BeautifulSoup as bs

r = rq.get('http://www.cbsnews.com/news/transcript-of-the-2015-gop-debate-9-pm/')
b = bs(r.text, 'html.parser')
# attrs must be a dict mapping attribute name to value
debatetext = b.find('div', attrs={'class': 'entry'}).findAll('p')
pattern = re.compile(r'[A-Z][A-Z][A-Z].*:')
for line in debatetext:
    # print only the paragraphs that contain a speaker label
    if pattern.search(line.text) is not None:
        print(line)

Sample text

<p> BUSH:  Here's what I believe.  I believe we're at the verge of the greatest time to be alive in this world.  </p>
<p>   But Washington is holding us back.  How we tax, how we regulate. We're not embracing the energy revolution in our midst, a broken immigration system that has been politicized rather than turning it into an economic driver.  </p>
<p>   We're not protecting and preserving our entitlement system or reforming for the next generation.  All these things languish while we have politicians in Washington using these as wedge issues.  </p>
<p>   Here's my commitment to you, because I did it as Florida.  We can fix these things.  We can grow economically and restore America's leadership in the world, so that everybody has a chance to rise up.  I humbly ask for your vote, whenever you're gonna get to vote, whenever the primary is.  Thank you all very much.  </p> 

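Ideally, I'd like to append the three lines without "BUSH:" to the first statement, or else prefix each of those lines with "BUSH:" (or the name of whichever candidate is speaking).

Edit: larger sample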

    <div class="entry" itemprop="articleBody" id="article-entry">...


<p>   CARSON:  -- extremely effectively.</p>
<p>   (APPLAUSE)</p>
<p>   BAIER:  Gentlemen, the next series of questions deals with ObamaCare and the role of the federal government.</p>
<p>   Mr. Trump, ObamaCare is one of the things you call a disaster.</p>
<p>   TRUMP:  A complete disaster, yes.</p>
<p>   BAIER:  Saying it needs to be repealed and replaced.</p>
<p>   TRUMP:  Correct.</p>
<p>   BAIER:  Now, 15 years ago, uncalled yourself a liberal on health care.  You were for a single-payer system, a Canadian-style system.</p>
<p>   Why were you for that then and why aren't you for it now?  TRUMP:  First of all, I'd like to just go back to one.  In July of 2004, I came out strongly against the war with Iraq, because it was going to destabilize the Middle East.  And I'm the only one on this stage that knew that and had the vision to say it.  And that's exactly what happened.</p>
<p>   BAIER:  But on ObamaCare...</p>
<p>   TRUMP:  And the Middle East became totally destabilized.  So I just want to say.</p>
<p>   As far as single payer, it works in Canada.  It works incredibly well in Scotland.  It could have worked in a different age, which is the age you're talking about here.</p>
<p>   What I'd like to see is a private system without the artificial lines around every state.  I have a big company with thousands and thousands of employees.  And if I'm negotiating in New York or in New Jersey or in California, I have like one bidder.  Nobody can bid.</p>
<p>   You know why?</p>
<p>   Because the insurance companies are making a fortune because they have control of the politicians, of course, with the exception of the politicians on this stage.</p>
<p>   But they have total control of the politicians.  They're making a fortune.</p>
<p>   Get rid of the artificial lines and you will have...</p>
<p>   (BUZZER NOISE)</p>
<p>   TRUMP:  -- yourself great plans.  And then we have to take care of the people that can't take care of themselves.  And I will do that through a different system.</p>
<p>   (CROSSTALK)</p>
<p>   BAIER:  Mr. Trump, hold up one second.</p>
<p>   PAUL:  I've got a news flash...</p>

2 answers:

Answer 0: (score: 1)

I reworked my regular expression slightly, so it now looks like this:

pattern = re.compile(r'([A-Z]+):(.*)')

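The + gives me one or more capital letters, so this is just a cleanup of the previous regex. I also changed it to create capture groups: the first is the run of capital letters before the ':', and the second is any text after the ':'.

Now the second group (group(0) is the whole match, group(1) is the name, and group(2) is the text after it) can be appended to a dictionary keyed by name, and consecutive text can be appended as well.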
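As a quick check of what the capture groups return, here is a minimal sketch that applies the pattern to two paragraphs from the question's sample. The variable names `whole`, `name`, `blurb`, and `no_match` are only illustrative.

```python
import re

pattern = re.compile(r'([A-Z]+):(.*)')

# One labelled paragraph taken from the sample transcript in the question.
line = "   BAIER:  But on ObamaCare..."

m = pattern.search(line)
whole = m.group(0)   # the entire match, label included
name = m.group(1)    # the run of capital letters before the colon
blurb = m.group(2)   # everything after the colon, leading spaces included

print(repr(whole))
print(repr(name))
print(repr(blurb.strip()))

# A continuation paragraph has no "CAPS:" label, so search() returns None.
no_match = pattern.search("   You know why?")
print(no_match)
```

Note that `search` looks anywhere in the paragraph, so a label in the middle of a paragraph also matches, and the text before it is not part of any group.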

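To handle the statements that follow this initial regex match but have no speaker label of their own, I used a state machine. Note that this only works because I assume that all text following a regex match belongs to the speaker found by that match.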

d = {}       # speaker name -> list of text blocks
name = ''
blurb = ''
state = 0    # becomes 1 once the first real speaker has been seen
for line in debatetext:
    m = re.search(pattern, line.text)
    if m:
        name = m.group(1)
        blurb = m.group(2)
        # skip past speakers section with all caps at beginning
        if name != 'SPEAKERS':
            state = 1
            if name in d:
                d[name].append(blurb)
            else:
                d[name] = [blurb]
    else:
        # no speaker label: the text belongs to the most recent speaker
        if state:
            d[name].append(line.text)
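This took some IRL help, but I think this solution works well in this case and may be useful to others. I used it to parse the second debate and it worked well. I may modify it so that statements are stored in order, so that I can do some correlation analysis together with Twitter data.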
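One possible way to keep the statements in order is to collect them in a list rather than a dictionary keyed by name. The following is only a sketch under that assumption: `paragraphs` is a hypothetical stand-in for the text of each p tag, and `statements` is an illustrative name, not something from the code above.

```python
import re

pattern = re.compile(r'([A-Z]+):(.*)')

# Hypothetical stand-in for [p.text for p in debatetext];
# the lines themselves come from the question's sample.
paragraphs = [
    "   BAIER:  But on ObamaCare...",
    "   TRUMP:  And the Middle East became totally destabilized.  So I just want to say.",
    "   You know why?",
    "   (BUZZER NOISE)",
    "   PAUL:  I've got a news flash...",
]

statements = []  # [speaker, text] pairs in transcript order
for text in paragraphs:
    m = pattern.search(text)
    if m:
        # a labelled paragraph starts a new statement
        statements.append([m.group(1), m.group(2).strip()])
    elif statements:
        # no label: attach the text to the most recent statement
        statements[-1][1] += " " + text.strip()

for speaker, text in statements:
    print("%s -> %s" % (speaker, text))
```

As in the dictionary version, stage directions such as (BUZZER NOISE) carry no label and therefore end up attached to the previous speaker; they would need separate filtering if that matters for the analysis.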

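Answer 1: (score: 0)

Regarding "I'm not sure whether this is the best way to proceed, or whether it would be easier to simply search for and group all the text in one pass": the "best" way is the one that you understand and that solves the problem. This is quick and dirty, but it should get you started.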

import pprint

test_data="""    <div class="entry" itemprop="articleBody" id="article-entry">...


<p>   CARSON:  -- extremely effectively.</p>
<p>   (APPLAUSE)</p>
<p>   BAIER:  Gentlemen, the next series of questions deals with ObamaCare and the role of the federal government.</p>
<p>   Mr. Trump, ObamaCare is one of the things you call a disaster.</p>
<p>   TRUMP:  A complete disaster, yes.</p>
<p>   BAIER:  Saying it needs to be repealed and replaced.</p>
<p>   TRUMP:  Correct.</p>
<p>   BAIER:  Now, 15 years ago, uncalled yourself a liberal on health care.  You were for a single-payer system, a Canadian-style system.</p>
<p>   Why were you for that then and why aren't you for it now?  TRUMP:  First of all, I'd like to just go back to one.  In July of 2004, I came out strongly against the war with Iraq, because it was going to destabilize the Middle East.  And I'm the only one on this stage that knew that and had the vision to say it.  And that's exactly what happened.</p>
<p>   BAIER:  But on ObamaCare...</p>
<p>   TRUMP:  And the Middle East became totally destabilized.  So I just want to say.</p>
<p>   As far as single payer, it works in Canada.  It works incredibly well in Scotland.  It could have worked in a different age, which is the age you're talking about here.</p>
<p>   What I'd like to see is a private system without the artificial lines around every state.  I have a big company with thousands and thousands of employees.  And if I'm negotiating in New York or in New Jersey or in California, I have like one bidder.  Nobody can bid.</p>
<p>   You know why?</p>
<p>   Because the insurance companies are making a fortune because they have control of the politicians, of course, with the exception of the politicians on this stage.</p>
<p>   But they have total control of the politicians.  They're making a fortune.</p>
<p>   Get rid of the artificial lines and you will have...</p>
<p>   (BUZZER NOISE)</p>
<p>   TRUMP:  -- yourself great plans.  And then we have to take care of the people that can't take care of themselves.  And I will do that through a different system.</p>
<p>   (CROSSTALK)</p>
<p>   BAIER:  Mr. Trump, hold up one second.</p>
<p>   PAUL:  I've got a news flash...</p>"""

## look for 3 capital letters
## assume every line starts with "<p>" (so won't test for it)

one_group = []
for record in test_data.split("\n"):
    record = record.strip()
    if record:
        split_rec = record.split()
        # a new speaker starts when the token after "<p>" begins with capitals
        found = len(split_rec) > 1
        if found:
            for ltr in split_rec[1][:3]:
                if ltr < "A" or ltr > "Z":
                    found = False

        ## found new name so print previous block
        if found and one_group:
            pprint.pprint(one_group)
            print()
            one_group = []
        one_group.append(record)

## last group
print(one_group)
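For comparison, the "search and group all the text in one pass" idea from the question could be sketched with `re.split`, which keeps the captured speaker names in its result. This is not the approach used in the code above. It assumes that a run of three or more capitals followed by a colon is always a speaker label, and the names `text`, `parts`, and `turns` are illustrative.

```python
import re

# Paragraph texts joined into one string; lines are from the question's sample.
text = " ".join([
    "BAIER:  Now, 15 years ago, uncalled yourself a liberal on health care.",
    "Why were you for that then and why aren't you for it now?  "
    "TRUMP:  First of all, I'd like to just go back to one.",
    "BAIER:  But on ObamaCare...",
])

# Splitting on a captured group keeps the names in the result:
# [text before first label, name1, speech1, name2, speech2, ...]
parts = re.split(r'\b([A-Z]{3,}):', text)

names = parts[1::2]
speeches = [s.strip() for s in parts[2::2]]
turns = list(zip(names, speeches))

for name, speech in turns:
    print("%s -> %s" % (name, speech))
```

Because the split runs over the joined text, a label that appears in the middle of a paragraph (the TRUMP line in the sample) starts a new turn instead of being folded into the previous speaker. A header such as the SPEAKERS section would also be treated as a label and would need to be skipped separately.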