Question

I have some lines like these in an HTML page:

<div>
    <p class="match"> this sentence should match </p> 
    some text
    <a class="a"> some text </a>  
</div>
<div> 
    <p class="match"> this sentence shouldnt match</p> 
    some text
    <a class ="b"> some text </a> 
</div>

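I want to extract the line inside <p class="match">, but only when the same div also contains <a class="a">.

Here is what I have done so far (I first find the div blocks that contain <a class="a">, then iterate over the results to find the sentence inside <p class="match">):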

import re
file_to_r = open("a")

regex_div = re.compile(r'<div>.+"a".+?</div>', re.DOTALL)

regex_match = re.compile(r'<p class="match">(.+)</p>')
for m in regex_div.findall(file_to_r.read()):
    print(regex_match.findall(m))

But I wonder whether there is another (still efficient) way to do this in a single step?

Answer 1

Use an HTML parser, such as BeautifulSoup.

Find the a tag with class a, then use find_previous_sibling to find the p tag with class match:

from bs4 import BeautifulSoup

data = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldn't match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

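# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(data, "html.parser")
a = soup.find('a', class_='a')
# print() is a function in Python 3
print(a.find_previous_sibling('p', class_='match').text)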

Prints:

this sentence should match
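
If a third-party dependency is not an option, roughly the same idea can be sketched with the standard library's html.parser instead of BeautifulSoup. This is a minimal sketch rather than a drop-in replacement: the class name MatchExtractor and its attributes are hypothetical, and it assumes the divs are not nested.

```python
from html.parser import HTMLParser


class MatchExtractor(HTMLParser):
    """Collect <p class="match"> text from divs that also contain <a class="a">.

    Hypothetical helper for illustration; assumes divs are not nested.
    """

    def __init__(self):
        super().__init__()
        self.results = []     # sentences from qualifying divs
        self._in_p = False    # True while inside <p class="match">
        self._texts = []      # <p class="match"> texts seen in the current div
        self._has_a = False   # True once the current div shows <a class="a">

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            self._texts, self._has_a = [], False
        elif tag == "p" and "match" in classes:
            self._in_p = True
            self._texts.append("")
        elif tag == "a" and "a" in classes:
            self._has_a = True

    def handle_data(self, data):
        if self._in_p:
            self._texts[-1] += data

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False
        elif tag == "div":
            # Keep the collected sentences only if the div had <a class="a">
            if self._has_a:
                self.results.extend(t.strip() for t in self._texts)
            self._texts, self._has_a = [], False


html = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldn't match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

parser = MatchExtractor()
parser.feed(html)
parser.close()
print(parser.results)
```

Unlike the find() call above, this walks every div in the document, so it collects one sentence per qualifying div rather than only the first.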

Also see why you should avoid using regular expressions to parse HTML:

RegEx match open tags except XHTML self-contained tags

Answer 2

You should use an HTML parser, but if you still want to use a regex, you can use something like this:

<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>
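
As a rough illustration, the pattern above can be applied from Python as follows. The re.DOTALL flag is an assumption on my part: without it, `.*?` cannot cross the newline that sits between </a> and </div> in the sample markup. The variable names are illustrative only.

```python
import re

html = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldn't match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

# Pattern copied from the answer; re.DOTALL is an added assumption so that
# `.*?` can span the line break before </div>.
pattern = re.compile(
    r'<div>\s*<p class="match">([\w\s]+)</p>[\w\s]+(?=<a class="a").*?</div>',
    re.DOTALL,
)

# findall returns the single capture group; strip the padding spaces
matches = [m.strip() for m in pattern.findall(html)]
print(matches)
```

Note that `[\w\s]+` only accepts word characters and whitespace, so a sentence containing punctuation would not be captured by this pattern.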

Working demo

Answer 3

 <div>\s*\n\s*.*?<p class=.*?>(.*?)<\/p>\s*\n\s*.*?\s*\n\s*(?=(\<a class=\"a\"\>))

You can use this.

See the demo:

http://regex101.com/r/lK9iD2/7
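
For reference, here is a small sketch of how this pattern might be used from Python. Because the lookahead contains a second capture group, re.findall would return tuples, so the sketch uses finditer and reads group(1) only. The variable names are illustrative, and the pattern relies on the exact line layout of the sample markup.

```python
import re

html = """
<div>
    <p class="match"> this sentence should match </p>
    some text
    <a class="a"> some text </a>
</div>
<div>
    <p class="match"> this sentence shouldn't match</p>
    some text
    <a class ="b"> some text </a>
</div>
"""

# Pattern copied from the answer, with the leading space dropped
pattern = re.compile(
    r'<div>\s*\n\s*.*?<p class=.*?>(.*?)<\/p>\s*\n\s*.*?\s*\n\s*(?=(\<a class=\"a\"\>))'
)

# group(1) is the paragraph text; group(2) is the <a class="a"> tag
# captured inside the lookahead
sentences = [m.group(1).strip() for m in pattern.finditer(html)]
print(sentences)
```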

Find a paragraph and use a regex to find a string within that paragraph