Question

str="<p class=\"drug-subtitle\"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>"

br=re.match("<p> class=\"drug-subtitle\"[^>]*>(.*?)</p>",str)

br returns None.

What is the error in the regular expression I am using?

Answer 1

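The corrected regular expression is the one below. Look at the second line, where I point to the spot that broke your pattern. I used findall() so that all matched groups are easy to print.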

print(re.findall('<p class="drug-subtitle"[^>]*>(.*?)</p>',input))
                    ^ you had a > character here
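
As a rough illustration of the fix, the sketch below runs the question's pattern and the corrected one against a shortened copy of the input. It assumes Python 3; the names `broken`, `fixed`, and `inner` are only for illustration.

```python
import re

# Shortened copy of the HTML from the question (brand list trimmed for brevity).
html = ('<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation '
        '(al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA</i></p>')

# Pattern from the question: the stray ">" after "<p" does not occur in the input.
broken = re.compile(r'<p> class="drug-subtitle"[^>]*>(.*?)</p>')
# Corrected pattern: the stray ">" is removed.
fixed = re.compile(r'<p class="drug-subtitle"[^>]*>(.*?)</p>')

print(broken.match(html))  # None

m = fixed.match(html)
inner = m.group(1) if m else None  # everything between <p ...> and </p>
print(inner)
```

Note that re.match only looks at the start of the string; if the paragraph can appear anywhere in a larger page, re.search or re.findall is probably the better fit.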

However, BeautifulSoup is an easy choice for this kind of task:

from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

input='''
<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>
'''
soup = BeautifulSoup(input, "html.parser")  # name the parser explicitly
br = soup.find("p", {"class": "drug-subtitle"})
print(br)

Answer 2

I very strongly recommend using a DOM parser library such as lxml, together with cssselect, to do this.

Example:

>>> from lxml.html import fromstring
>>> html = """<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>"""
>>> doc = fromstring(html)
>>> "".join(filter(None, (e.text for e in doc.cssselect(".drug-subtitle")[0])))
'Generic Name:Brand Names:Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA'
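
If installing a third-party parser is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. This is a swapped-in technique, not the lxml approach above; the class name `SubtitleText` and the shortened input are assumptions made for the example.

```python
from html.parser import HTMLParser


class SubtitleText(HTMLParser):
    """Collect the text inside the first <p class="drug-subtitle"> element."""

    def __init__(self):
        super().__init__()
        self.inside = False  # True while we are inside the target <p>
        self.done = False    # True once the target <p> has been closed
        self.parts = []      # text fragments seen inside the target <p>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and not self.done and "drug-subtitle" in classes:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "p" and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


# Shortened copy of the HTML from the question.
html = ('<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation '
        '(al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA</i></p>')

parser = SubtitleText()
parser.feed(html)
parser.close()
text = "".join(parser.parts)
print(text)
```

Unlike the generator in the lxml session, which reads only each child's .text, this sketch also keeps the text that follows a child tag, such as the pronunciation after the first <b>.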

Answer 3

If you are given the input:

'<p class="drug-subtitle"><b>Generic Name:</b> albuterol inhalation (al BYOO ter all)<br><b>Brand Names:</b> <i>Accuneb, ProAir HFA, Proventil, Proventil HFA, ReliOn Ventolin HFA, Ventolin HFA</i></p>'

and you want to check whether:

<p class="drug-subtitle"> .. some items here .. </p>

exists in your input, the regular expression to use is:

\<p\sclass=\"drug-subtitle\"[^>]*>(.*?)\<\/p\>

Description:

\< matches the character < literally
p matches the character p literally (case sensitive)
\s match any white space character [\r\n\t\f ]
class= matches the characters class= literally (case sensitive)
\" matches the character " literally
drug-subtitle matches the characters drug-subtitle literally (case sensitive)
\" matches the character " literally
[^>]* match a single character not present in the list below
    Quantifier: Between zero and unlimited times, as many times as possible,
               giving back as needed.
    > a single character in the list > literally (case sensitive)
> matches the character > literally
1st Capturing group (.*?)
    .*? matches any character (except newline)
        Quantifier: Between zero and unlimited times, as few times as possible,
                    expanding as needed.
\< matches the character < literally
\/ matches the character / literally
p matches the character p literally (case sensitive)
\> matches the character > literally
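
The breakdown above can also be carried inside the pattern itself. The sketch below, assuming Python 3, writes the same regular expression with re.VERBOSE so that each piece has its own comment; the names `TAG_RE`, `sample`, and `contents` are made up for the example.

```python
import re

# Same pattern as above, split across lines. With re.VERBOSE, unescaped
# whitespace and "#" comments inside the pattern are ignored.
TAG_RE = re.compile(r'''
    \<p \s                     # literal "<p" followed by one whitespace character
    class=\"drug-subtitle\"    # the class attribute, quotes included
    [^>]*                      # any further characters that are not ">"
    >                          # end of the opening tag
    (.*?)                      # group 1: the contents, as few characters as possible
    \<\/p\>                    # literal closing tag
''', re.VERBOSE)

# Hypothetical, shortened input for illustration.
sample = '<p class="drug-subtitle"><b>Generic Name:</b> albuterol</p>'
m = TAG_RE.search(sample)
contents = m.group(1) if m else None
print(contents)
```

Because . does not match a newline by default, a paragraph that spans several lines would also need the re.DOTALL flag.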

So the problems in your regular expression are:

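In <p, there should be no ">" before the class attribute.
In </p>, I escaped "<", "/" and ">" by putting a "\" before each. Python's re module does not require these escapes, but they are harmless.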
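
A quick way to see whether the backslashes before "<", "/" and ">" change anything is to run the pattern with and without them. This is a minimal check assuming Python 3's re module; `escaped`, `plain`, and `sample` are illustrative names.

```python
import re

# Pattern with the extra backslashes, and the same pattern without them.
escaped = r'\<p\sclass=\"drug-subtitle\"[^>]*>(.*?)\<\/p\>'
plain = r'<p\sclass="drug-subtitle"[^>]*>(.*?)</p>'

# Hypothetical, shortened input for illustration.
sample = '<p class="drug-subtitle"><b>Brand Names:</b> <i>Accuneb</i></p>'

escaped_hit = re.match(escaped, sample)
plain_hit = re.match(plain, sample)

# Both patterns should match and capture the same text.
same = (escaped_hit is not None and plain_hit is not None
        and escaped_hit.group(1) == plain_hit.group(1))
print(same)
```

Other regex flavours give some of these escapes a special meaning, so this check speaks only for Python.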

Using regular expressions to match HTML tags in Python
