Let me start by showing the three kinds of strings I'm dealing with:
"<h1>Money Shake</h1><p>Money<br>Money<br>MORE MONEY</p><p>Take money and stuff in blender.</p><p>Blend.</p>"
"<h1>Money Shake</h1><p>Posted by Gordon Gekko</p><p>Money<br>Money<br>MORE MONEY</p><p>Take money and stuff in blender.</p><p>Blend.</p>"
"<h1>Money Shake</h1><p>Posted by Gordon Gekko</p><p>They're great</p><p>Yield: KA-CHING</p><p>Money<br>Money<br>MORE MONEY</p><p>Take money and stuff in blender.</p><p>Blend.</p>"
Basically, what I want to do is pull out the block that contains the ingredients:
"<p>Money<br>Money<br>MORE MONEY</p>"
Here's the regex I'm using:
re.search(r'<p>[^</p>](.*)<br>(.*?)</p>', string, re.I)
When I use it on the first and second strings, it does exactly what I want and returns this match:
"<p>Money<br>Money<br>MORE MONEY</p>"
But when I use it on the third string, it returns this match instead:
"<p>They're great</p><p>Yield: KA-CHING</p><p>Money<br>Money<br>MORE MONEY</p>"
What did I screw up?
@Blender
Hi Blender, this is how I'm grabbing the block I want. I'm sure there's a better way, but bear in mind I'm only two weeks into Python/programming:
def get_ingredients(soup):
    for p in soup.find_all('p'):
        if p.find('br'):
            return p

ingredients = get_ingredients(soup)
p_list = soup.find_all('p')
ingredient_index = p_list.index(ingredients)
junk = []
junk += p_list[:ingredient_index]
instructions = []
instructions += p_list[ingredient_index+1:]
Answer 0 (score: 3)
Just use a proper HTML parser. It's more intuitive than a regex, and it actually works:
# May need to install it:
# pip install BeautifulSoup4
from bs4 import BeautifulSoup
soup = BeautifulSoup("""
<h1>Money Shake</h1>
<p>Posted by Gordon Gekko</p>
<p>They're great</p>
<p>Yield: KA-CHING</p>
<p>
Money
<br>
Money
<br>
MORE MONEY
</p>
<p>Take money and stuff in blender.</p>
<p>Blend.</p>
""", "html.parser")  # explicit parser avoids bs4's parser-guessing warning

def get_ingredients(soup):
    for p in soup.find_all('p'):
        if p.find('br'):
            return p.find_all(text=True)
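As for what went wrong with the original regex: `[^</p>]` is a character class, so it excludes a single character out of `<`, `/`, `p` and `>` rather than the sequence `</p>`. The second string most likely only worked because "Posted" starts with `P`, which the class rejects under `re.I`. After that, the greedy `(.*)` happily runs across `</p>` boundaries until the last `<br>`. Below is a minimal sketch that reproduces this on the third string using only the standard `re` module. The `tempered` pattern is my own illustration and not something from the question; it keeps the match inside one paragraph for these samples, but a parser is still the sturdier choice for real HTML.

```python
import re

# Third sample string from the question.
html = ("<h1>Money Shake</h1><p>Posted by Gordon Gekko</p>"
        "<p>They're great</p><p>Yield: KA-CHING</p>"
        "<p>Money<br>Money<br>MORE MONEY</p>"
        "<p>Take money and stuff in blender.</p><p>Blend.</p>")

# Original pattern: [^</p>] rejects ONE character out of <, /, p, >,
# and the greedy (.*) is free to run across </p> boundaries.
original = re.compile(r'<p>[^</p>](.*)<br>(.*?)</p>', re.I)

# Hypothetical alternative: (?!</p>). matches any character that does
# not start a closing </p>, so the match cannot leave its paragraph.
tempered = re.compile(r'<p>(?:(?!</p>).)*?<br>(?:(?!</p>).)*?</p>', re.I)

original_match = original.search(html).group(0)
tempered_match = tempered.search(html).group(0)

print(original_match)  # starts at "<p>They're great</p>", as in the question
print(tempered_match)  # only the paragraph that contains <br>
```

The negative lookahead is checked before every character consumed, which is what stops the match at the first `</p>` instead of the last one.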