import re

desc = re.compile('<ul class="descShort bullet">(.*)</ul>', re.DOTALL)
findDesc = re.findall(desc, link_source)
for i in findDesc:
    print(i)
'''
<ul class="descShort bullet">
Sleek and distinctive, these eye-catching ornaments will be the star of your holiday decor. These unique glass icicle ornaments are individually handcrafted by artisans in India.
</ul>
'''
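I am trying to extract the description between the <ul class="descShort bullet"> tag and the closing </ul>. I am looking for a solution using regex as well as BeautifulSoup.
Answer 0: (score: 1)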
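First, parsing HTML/XML with regular expressions is generally considered a bad idea, so using a parser such as BeautifulSoup really is the better choice.
What you want is something like the following: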
# BeautifulSoup 3 import; with the newer bs4 package use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup
text = """
<ul class="descShort bullet">text1</ul>
<a href="example.com">test</a>
<ul class="descShort bullet">one more</ul>
<ul class="other">text2</ul>
"""
soup = BeautifulSoup(text)
# to get the contents of all <ul> tags:
for tag in soup.findAll('ul'):
    print(tag.contents[0])

# to get the contents of <ul> tags w/ attribute class="descShort bullet":
for tag in soup.findAll('ul', {'class': 'descShort bullet'}):
    print(tag.contents[0])
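
Since the question also asks for a regex version, here is a minimal sketch of one. The pattern in the question uses a greedy (.*), which together with re.DOTALL runs from the first opening tag to the last </ul> in the page; a non-greedy (.*?) stops at the nearest closing tag instead. The names html, desc_pattern and descriptions are just illustrative, and the sketch assumes the class attribute is written exactly as shown and that the <ul> elements are not nested. For anything less regular than that, the parser approach above is the safer one.

```python
import re

# Sample markup reused from the answer above (illustrative input only).
html = """
<ul class="descShort bullet">text1</ul>
<a href="example.com">test</a>
<ul class="descShort bullet">one more</ul>
<ul class="other">text2</ul>
"""

# Non-greedy (.*?) stops at the first closing </ul>;
# re.DOTALL lets "." match newlines inside the element.
desc_pattern = re.compile(r'<ul class="descShort bullet">(.*?)</ul>', re.DOTALL)

# Strip surrounding whitespace, since real descriptions often sit on their own line.
descriptions = [match.strip() for match in desc_pattern.findall(html)]

for description in descriptions:
    print(description)
```

The call to strip() matters for the markup in the question, where the description text is preceded and followed by a newline.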