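How to write a regex that matches an unordered list and the paragraph preceding it

Asked: 2016-01-31 04:37:33

Tags: python regex

I have a BeautifulSoup object that I have converted to a string, and I want to pull out every bulleted list together with the paragraph immediately preceding it. An example is the following string: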

...
    <p><strong><strong> </strong></strong>It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:</p>
    <ul>
    <li>You are experiencing a decrease in sales and customers</li>
    <li>If your brand design does not reflect what you deliver</li>
    <li>If you want to attract a new target audience</li>
    <li>Management change</li>
    <li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
    </ul>
...

I am using the following regex:

re.findall('<p>.*</p>\n<ul>.*</ul>', string)

However, it returns an empty list. What is the best way to do this?

2 answers:

Answer 0 (score: 1)

Don't use regex to parse HTML!
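For reference, the pattern in the question most likely returns nothing for two reasons: by default `.` does not match newlines, and `\n<ul>` leaves no room for the indentation before `<ul>`. The sketch below uses a shortened, hypothetical copy of the sample markup (the `html` variable is not from the question) to show the original pattern failing and a patched one matching:

```python
import re

# Shortened stand-in for the string in the question (hypothetical sample).
html = """
    <p>Consider rebranding if:</p>
    <ul>
    <li>You are experiencing a decrease in sales and customers</li>
    <li>Management change</li>
    </ul>
"""

# Original pattern: '.' stops at newlines, and '\n<ul>' does not allow
# for the spaces that precede '<ul>' on the next line.
original = re.findall('<p>.*</p>\n<ul>.*</ul>', html)

# Patched pattern: re.DOTALL lets '.' span lines, '\s*' absorbs the
# newline and indentation, and '.*?' keeps each match as short as possible.
patched = re.findall(r'<p>.*?</p>\s*<ul>.*?</ul>', html, flags=re.DOTALL)

print(original)
print(len(patched))
```

Even the patched pattern is fragile: with several paragraphs before a list, `<p>.*?</p>` can start at an earlier `<p>` and run across the paragraphs in between, and attributes on the tags would break the match entirely.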

BeautifulSoup can do everything you want easily, elegantly, and correctly:

>>> import bs4
>>> soup = bs4.BeautifulSoup(r"""
    <p><strong><strong> </strong></strong>It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:</p>
    <ul>
    <li>You are experiencing a decrease in sales and customers</li>
    <li>If your brand design does not reflect what you deliver</li>
    <li>If you want to attract a new target audience</li>
    <li>Management change</li>
    <li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
    </ul>
""", "html.parser")
>>> bulleted_lists = soup.find_all('ul')
>>> uls_with_ps = [(ul.find_previous('p'), ul) for ul in bulleted_lists]

To see what is happening:

>>> bulleted_lists
[<ul>
<li>You are experiencing a decrease in sales and customers</li>
<li>If your brand design does not reflect what you deliver</li>
<li>If you want to attract a new target audience</li>
<li>Management change</li>
<li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
</ul>]

>>> bulleted_lists[0].find_previous('p')
<p><strong><strong> </strong></strong>It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:</p>
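One caveat: `find_previous('p')` returns the nearest earlier `<p>` anywhere in the document, even when other elements sit between it and the list. If only paragraphs directly before a list should count, a stricter check on the previous sibling may be closer to what the question asks. This is a minimal sketch on made-up markup (the `html` sample and the `pairs` name are assumptions, not from the question):

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first list follows a heading, the second a paragraph.
html = """
<p>Unrelated intro paragraph.</p>
<h2>A heading</h2>
<ul><li>List with no paragraph directly before it</li></ul>
<p>Consider rebranding if:</p>
<ul><li>Management change</li></ul>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for ul in soup.find_all("ul"):
    # Passing True matches any tag, so whitespace text nodes between
    # elements are skipped and we get the nearest preceding sibling tag.
    prev = ul.find_previous_sibling(True)
    if prev is not None and prev.name == "p":
        pairs.append((prev, ul))

for p, ul in pairs:
    print(p.get_text(strip=True))
    print([li.get_text(strip=True) for li in ul.find_all("li")])
```

Here the first list is skipped because a heading sits directly before it, whereas `find_previous('p')` would have paired it with the unrelated intro paragraph.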

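Answer 1 (score: 0)

Why use regex when BeautifulSoup can handle HTML like this on its own? CSS selectors work well here: div.Mother div.Son ul li selects every div with class Mother, then every div with class Son inside those, then each ul inside those, and finally every li inside each ul.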

from bs4 import BeautifulSoup as bs

data = """

    <body>
    <div class="Mother" >
        <div class="Son" >
            <p><strong><strong> </strong></strong>It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:</p>
            <ul>
                <li>You are experiencing a decrease in sales and customers</li>
                <li>If your brand design does not reflect what you deliver</li>
                <li>If you want to attract a new target audience</li>
                <li>Management change</li>
                <li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
            </ul>
        </div>
    </div>
</body>

"""

soup = bs(data, 'lxml')
# Print all the text inside each div.Son (the paragraph and the list items)
for item in soup.select('div.Mother div.Son'):
    print(item.text.strip())
print("=" * 100)
# Print only the text of each li
for li in soup.select('div.Mother div.Son ul li'):
    print(li.text.strip())

Output:

It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:

You are experiencing a decrease in sales and customers
If your brand design does not reflect what you deliver
If you want to attract a new target audience
Management change
19 Questions to Ask Yourself Before You Start Rebranding
====================================================================================================
You are experiencing a decrease in sales and customers
If your brand design does not reflect what you deliver
If you want to attract a new target audience
Management change
19 Questions to Ask Yourself Before You Start Rebranding
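The selectors above collect text but do not pair each list with its paragraph. One possible way to do that with selectors is the adjacent sibling combinator, `p + ul`, which matches only a `ul` that directly follows a `p`. The following is a sketch on hypothetical markup (the `html` sample and the `results` name are illustrative, not from the question):

```python
from bs4 import BeautifulSoup

# Hypothetical sample: only the first list directly follows a paragraph.
html = """
<div class="Son">
  <p>Consider rebranding if:</p>
  <ul><li>Management change</li><li>New target audience</li></ul>
  <h2>Other</h2>
  <ul><li>Not preceded by a paragraph</li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
# 'p + ul' matches a ul whose immediately preceding sibling element is a p.
for ul in soup.select("div.Son p + ul"):
    p = ul.find_previous_sibling("p")
    items = [li.get_text(strip=True) for li in ul.select("li")]
    results.append((p.get_text(strip=True), items))

print(results)
```

The second list is left out because a heading, not a paragraph, comes directly before it.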