I am trying to use the following code:
import re
from bs4 import BeautifulSoup
htmlsource1 = """<div class="small-12 columns ">
<h5 class="clsname1 large-text seq2">text1</h5>
<h5 class="clsname1 small-text seq1">text2</h5>
<h5 class="clsname1 seq1 small-text clsname2">text3</h5>
</div>"""
soup = BeautifulSoup(htmlsource1, "html.parser")
interesting_h5s = soup.find_all('h5', class_=re.compile('^(?=.*\bsmall-text\b)(?=.*\bseq1\b).*$'))
for h5 in interesting_h5s:
    print(h5)
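My goal is to extract the h5 elements that carry both the 'small-text' and 'seq1' classes (in any order), but for some reason it does not work, even though the regular expression tests positively at http://pythex.org.
For the regular expression, I adapted the answer given in "Regex to match string containing two names in any order".
Thanks for any suggestions.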
Answer 0 (score: 0)
You should use an HTML parsing tool, but it seems you have creative control over the HTML, so the possible edge cases will be limited.
<h5(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sclass=['"](?=[^"]*\bsmall-text\b)(?=[^"]*\bseq1\b)([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>(.*?)</h5>
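This regular expression matches h5 tags whose class attribute contains both small-text and seq1 in any order. It captures the class value as group 1 and the tag's inner text as group 2.
Live demo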
https://regex101.com/r/fR0mT7/2
Sample text
Note the difficult edge cases in the last two h5 tags.
<div class="small-12 columns ">
<h5 class="clsname1 large-text seq2">text1</h5>
<h5 class="clsname1 small-text seq1">text2</h5>
<h5 class="clsname1 seq1 small-text clsname2">text3</h5>
<h5 onmouseover=' class="small-text seq1" ; ' class="clsname1 large-text seq2">text4</h5>
<h5 onmouseover=' class="small-text seq1" ; ' class="clsname1 small-text seq1">text5</h5>
</div>
Sample matches
[0][0] = <h5 class="clsname1 small-text seq1">text2</h5>
[0][1] = clsname1 small-text seq1
[0][2] = text2
[1][0] = <h5 class="clsname1 seq1 small-text clsname2">text3</h5>
[1][1] = clsname1 seq1 small-text clsname2
[1][2] = text3
[2][0] = <h5 onmouseover=' class="small-text seq1" ; ' class="clsname1 small-text seq1">text5</h5>
[2][1] = clsname1 small-text seq1
[2][2] = text5
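To try the pattern from Python, a minimal sketch along the following lines should reproduce the sample matches listed above. It uses only the standard re module. The names H5_PATTERN and sample are placeholders introduced here and are not part of the original answer. Note that re.findall returns only the two capture groups, so each tuple corresponds to the [n][1] and [n][2] entries.

```python
import re

# Pattern copied verbatim from the answer. A raw triple-quoted string keeps
# the backslashes and both kinds of quote characters intact.
H5_PATTERN = re.compile(
    r"""<h5(?=\s)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sclass=['"](?=[^"]*\bsmall-text\b)(?=[^"]*\bseq1\b)([^"]*)['"]?)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?>(.*?)</h5>"""
)

# Sample text from the answer, including the two edge cases where a fake
# class="..." is embedded inside an onmouseover attribute.
sample = """<div class="small-12 columns ">
<h5 class="clsname1 large-text seq2">text1</h5>
<h5 class="clsname1 small-text seq1">text2</h5>
<h5 class="clsname1 seq1 small-text clsname2">text3</h5>
<h5 onmouseover=' class="small-text seq1" ; ' class="clsname1 large-text seq2">text4</h5>
<h5 onmouseover=' class="small-text seq1" ; ' class="clsname1 small-text seq1">text5</h5>
</div>"""

# Each result is a (class value, inner text) tuple.
matches = H5_PATTERN.findall(sample)
for class_value, inner_text in matches:
    print(class_value, "->", inner_text)
```

With the sample text above, this should list text2, text3 and text5 and skip text1 and text4.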
NODE EXPLANATION
----------------------------------------------------------------------
<h5 '<h5'
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the least amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"] any character except: ''', '"'
----------------------------------------------------------------------
[^\s>]* any character except: whitespace (\n,
\r, \t, \f, and " "), '>' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
)*? end of grouping
----------------------------------------------------------------------
\s whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
class= 'class='
----------------------------------------------------------------------
['"] any character of: ''', '"'
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
----------------------------------------------------------------------
small-text 'small-text'
----------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
----------------------------------------------------------------------
seq1 'seq1'
----------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
['"]? any character of: ''', '"' (optional
(matching the most amount possible))
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=' '=\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
=" '="'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
[^'"\s]* any character except: ''', '"',
whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
\s? whitespace (\n, \r, \t, \f, and " ")
(optional (matching the most amount
possible))
----------------------------------------------------------------------
\/? '/' (optional (matching the most amount
possible))
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
</h5> '</h5>'
----------------------------------------------------------------------
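The core of the pattern is the pair of look-aheads that each scan the class value independently, which is what makes the order of the two names irrelevant. The sketch below isolates that idea on bare class strings; BOTH_CLASSES and the sample values are illustrative names chosen here. One caveat, stated as a general property of \b rather than something the answer discusses: a hyphen is a non-word character, so \bsmall-text\b also accepts a longer name such as small-text-extra.

```python
import re

# Two look-aheads anchored at the same position: each one scans the whole
# value on its own, so the two names may appear in either order.
BOTH_CLASSES = re.compile(r'^(?=.*\bsmall-text\b)(?=.*\bseq1\b)')

values = [
    "clsname1 large-text seq2",           # neither name is present
    "clsname1 small-text seq1",           # both names, one order
    "clsname1 seq1 small-text clsname2",  # both names, the other order
    "seq1 small-text-extra",              # caveat: '-' forms a word boundary
]

results = [bool(BOTH_CLASSES.search(v)) for v in values]
print(results)
```

If the hyphen caveat matters for your class names, splitting the value on whitespace and comparing whole tokens may be a safer check than \b.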
Answer 1 (score: 0)
Following the "Disable special "class" attribute handling" post, the problem was solved by adding the following lines of code:
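from bs4.builder import HTMLParserTreeBuilder
# Stop treating "class" as a multi-valued attribute, so its value stays one string
bb = HTMLParserTreeBuilder()
bb.cdata_list_attributes["*"].remove("class")
# bs is presumably the HTML source string (htmlsource1 in the question)
soup = BeautifulSoup(bs, "html.parser", builder=bb)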
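Once class is kept as a single string, a look-ahead pattern is matched against the whole attribute value. One detail worth checking in that case: in a plain (non-raw) Python string literal, \b is the backspace character rather than a regex word boundary, so such a pattern is normally written as a raw string. The following sketch uses only the standard re module to show the difference; the variable names are illustrative.

```python
import re

class_value = "clsname1 seq1 small-text clsname2"

# Plain literal: Python converts "\b" to a backspace character (\x08)
# before the regex engine ever sees the pattern.
plain = re.compile('^(?=.*\bsmall-text\b)(?=.*\bseq1\b).*$')

# Raw literal: the backslash is preserved, so \b reaches the regex engine
# as a word-boundary assertion.
raw = re.compile(r'^(?=.*\bsmall-text\b)(?=.*\bseq1\b).*$')

plain_matches = plain.search(class_value) is not None
raw_matches = raw.search(class_value) is not None
print(plain_matches, raw_matches)
```

Online testers such as pythex receive the pattern text directly, without Python's string-literal processing, which is one possible reason a pattern can pass there and still fail when pasted into a plain string literal.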