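I have a BeautifulSoup object that I have converted to a string, and I want to pull out every bulleted list together with the paragraph that immediately precedes it. An example is the following string: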
...
<p><strong><strong> </strong></strong>It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:</p>
<ul>
<li>You are experiencing a decrease in sales and customers</li>
<li>If your brand design does not reflect what you deliver</li>
<li>If you want to attract a new target audience</li>
<li>Management change</li>
<li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
</ul>
...
I am using the following regular expression:
re.findall('<p>.*</p>\n<ul>.*</ul>', string)
However, it returns an empty list. What is the best way to do this?
Answer 0 (score: 1)
Don't use regular expressions to parse HTML!
BeautifulSoup can do everything you want easily, elegantly, and correctly:
>>> import bs4
>>> soup = bs4.BeautifulSoup(r"""
<p><strong><strong> </strong></strong>It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:</p>
<ul>
<li>You are experiencing a decrease in sales and customers</li>
<li>If your brand design does not reflect what you deliver</li>
<li>If you want to attract a new target audience</li>
<li>Management change</li>
<li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
</ul>
""", "html.parser")
>>> bulleted_lists = soup.find_all('ul')
>>> uls_with_ps = [(ul.find_previous('p'), ul) for ul in bulleted_lists]
To see what is happening:
>>> bulleted_lists
[<ul>
<li>You are experiencing a decrease in sales and customers</li>
<li>If your brand design does not reflect what you deliver</li>
<li>If you want to attract a new target audience</li>
<li>Management change</li>
<li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
</ul>]
>>> bulleted_lists[0].find_previous('p')
<p><strong><strong> </strong></strong>It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:</p>
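One caveat: the previous-`<p>` lookup above returns the nearest earlier `<p>` in document order, which is not necessarily adjacent to the list. If only paragraphs that sit directly before a list are wanted, one option is to check the preceding sibling tag instead. A minimal sketch, assuming `bs4` is installed and using the built-in `html.parser`; the helper name `lists_with_leading_paragraph` and the sample HTML are made up for illustration:

```python
from bs4 import BeautifulSoup


def lists_with_leading_paragraph(html):
    """Return (p, ul) pairs where the <p> is the tag directly before the <ul>."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for ul in soup.find_all("ul"):
        # Passing True matches any tag, so whitespace text nodes between
        # the elements are skipped and we get the closest preceding tag.
        prev = ul.find_previous_sibling(True)
        if prev is not None and prev.name == "p":
            pairs.append((prev, ul))
    return pairs


# Hypothetical sample: the first list follows a heading, the second a paragraph.
sample = """
<p>Intro paragraph</p>
<h2>Heading</h2>
<ul><li>orphan list</li></ul>
<p>Consider rebranding if:</p>
<ul><li>Sales are falling</li><li>Management change</li></ul>
"""

pairs = lists_with_leading_paragraph(sample)
for p, ul in pairs:
    print(p.get_text(strip=True))
    print([li.get_text(strip=True) for li in ul.find_all("li")])
```

Here the list under the heading is skipped, because the tag right before it is an `<h2>` rather than a `<p>`.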
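Answer 1 (score: 0)
Why do you need a regex when BeautifulSoup can handle any kind of HTML? It is better to use CSS selectors here. The selector div.Mother div.Son ul li means: select all divs with the class name Mother, then within them select all divs with the class name Son, then within those select each ul, and finally select all li elements inside that ul.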
from bs4 import BeautifulSoup as bs
data = """
<body>
<div class="Mother" >
<div class="Son" >
<p><strong><strong> </strong></strong>It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:</p>
<ul>
<li>You are experiencing a decrease in sales and customers</li>
<li>If your brand design does not reflect what you deliver</li>
<li>If you want to attract a new target audience</li>
<li>Management change</li>
<li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
</ul>
</div>
</div>
</body>
"""
soup = bs(data,'lxml')
# Grab all the text inside each div.Son (the paragraph and the list)
for item in soup.select('div.Mother div.Son'):
    print(item.text.strip())
print("=" * 100)
# Grab just the li elements
for li in soup.select('div.Mother div.Son ul li'):
    print(li.text.strip())
Output:
It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:
You are experiencing a decrease in sales and customers
If your brand design does not reflect what you deliver
If you want to attract a new target audience
Management change
19 Questions to Ask Yourself Before You Start Rebranding
====================================================================================================
You are experiencing a decrease in sales and customers
If your brand design does not reflect what you deliver
If you want to attract a new target audience
Management change
19 Questions to Ask Yourself Before You Start Rebranding
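The selectors above return the whole div or only the list items. To pair each list with the paragraph directly before it, the CSS adjacent-sibling combinator `+` may be a closer fit. A rough sketch, assuming a reasonably recent `bs4` whose `select()` supports sibling combinators; the sample HTML and variable names are invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one list preceded by a <p>, one preceded by an <h2>.
html = """
<div class="Mother"><div class="Son">
<p>Consider rebranding if:</p>
<ul><li>Sales are falling</li><li>Management change</li></ul>
<h2>Other</h2>
<ul><li>no paragraph before this list</li></ul>
</div></div>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
# "p + ul" matches each <ul> whose immediately preceding sibling element is a <p>.
for ul in soup.select("div.Son p + ul"):
    # The selector guarantees the nearest preceding sibling tag is a <p>.
    p = ul.find_previous_sibling("p")
    items = [li.get_text(strip=True) for li in ul.find_all("li")]
    results.append((p.get_text(strip=True), items))

for paragraph, items in results:
    print(paragraph)
    for text in items:
        print(" -", text)
```

The second list is not selected, since the element right before it is a heading.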