
Question

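I have been trying for a while now to write a regex that can find an attribute of an HTML tag, but every attempt seems to fail in one way or another.

I am using a regex because loading BeautifulSoup takes too long when all I need is to check a single HTML tag.

Here is an example of the tag/attribute that needs to be checked: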

<meta content="http://domain.com/path/path/file.jpg" rnd_attr="blah blah"      
   property="og:image"/>

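How can a regex retrieve the content attribute of this tag while making sure the property is "og:image"?

Sorry if this question is a bit naive, or if such a regex is impossible or simply too hard to write.

BONUS: Besides BeautifulSoup, what other fast/viable DOM parsing options are there in Python?

Thanks.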

Answer 1

Have you actually benchmarked it and found that BeautifulSoup is the bottleneck?

content = soup.find('meta', property='og:image').get('content')

You could also use lxml, which is much faster:

import lxml.html

root = lxml.html.fromstring(html)  # Use .parse() on a file-like object instead

content = root.xpath('/html/head/meta[@property="og:image"][1]/@content')
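On the BONUS question, the standard library's html.parser is another option when installing a third-party parser is not desirable. The following is only a minimal sketch and not something proposed in this thread: OgImageParser and og_images are hypothetical names, and it assumes the document is already available as a string.

```python
from html.parser import HTMLParser


class OgImageParser(HTMLParser):
    """Collects the content value of every <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr_map = dict(attrs)  # attribute names arrive lower-cased
        if attr_map.get("property") == "og:image" and "content" in attr_map:
            self.images.append(attr_map["content"])


def og_images(html):
    """Return the og:image content values found in an HTML string."""
    parser = OgImageParser()
    parser.feed(html)
    parser.close()
    return parser.images


sample = '''<meta content="http://domain.com/path/path/file.jpg" rnd_attr="blah blah"
   property="og:image"/>'''
print(og_images(sample))
```

Whether this is faster than BeautifulSoup or lxml for a given page would still need to be measured.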

Answer 2

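Description

This expression will:

find meta tags that have the attribute property="og:image"
avoid some very difficult edge cases
capture the value of the content attribute
allow the attributes to appear in any order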

<meta(?=\s|>)(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sproperty=(?:'og:image'|"og:image"|og:image))(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scontent=('[^']*'|"[^"]*"|[^'"][^\s>]*))(?:[^'">=]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>


Example

In this live demo, note the difficult edge cases in the first two meta tags of the sample text: http://www.rubular.com/r/YY70uaGPLE

Sample text

<meta info=' content="DontFindMe" ' content="http://domain.com/path/path/file1.jpg" random_attr="blah blah"      
   property="og:image"/>
<meta content="http://domain.com/path/path/file2.jpg" random_attr="blah blah"      
   property="og:image"/>
<meta random_attr="blah blah"   property='og:image' content="foo'"   />

Matches

[0][0] = <meta info=' content="DontFindMe" ' content="http://domain.com/path/path/file1.jpg" random_attr="blah blah"      
   property="og:image"/>
[0][1] = "http://domain.com/path/path/file1.jpg"


[1][0] = <meta content="http://domain.com/path/path/file2.jpg" random_attr="blah blah"      
   property="og:image"/>
[1][1] = "http://domain.com/path/path/file2.jpg"


[2][0] = <meta random_attr="blah blah"   property='og:image' content="foo'"   />
[2][1] = "foo'"
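Since the question is about Python, here is a rough sketch of how the expression could be applied with the re module. The names OG_IMAGE_RE and find_og_images are hypothetical, the single-quoted 'og:image' alternative is written with its closing quote, and stripping the surrounding quotes from the captured value is an extra step that the expression itself does not perform.

```python
import re

# The expression from this answer, split across lines only for readability.
OG_IMAGE_RE = re.compile(
    r"""<meta(?=\s|>)"""
    r"""(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\sproperty=(?:'og:image'|"og:image"|og:image))"""
    r"""(?=(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?\scontent=('[^']*'|"[^"]*"|[^'"][^\s>]*))"""
    r"""(?:[^'">=]*|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>"""
)


def find_og_images(html):
    """Return the content values of og:image meta tags, without quotes."""
    values = []
    for match in OG_IMAGE_RE.finditer(html):
        value = match.group(1)
        # Group 1 still carries its quotes when the value was quoted.
        if value[:1] in ('"', "'"):
            value = value[1:-1]
        values.append(value)
    return values


SAMPLE = '''<meta info=' content="DontFindMe" ' content="http://domain.com/path/path/file1.jpg" random_attr="blah blah"
   property="og:image"/>
<meta content="http://domain.com/path/path/file2.jpg" random_attr="blah blah"
   property="og:image"/>
<meta random_attr="blah blah"   property='og:image' content="foo'"   />
'''

print(find_og_images(SAMPLE))
```

Compiling the pattern once and reusing it avoids recompiling it for every page that is checked.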

Answer 3

Using Scrapy:

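from scrapy.selector import Selector
# response is the Response object that Scrapy passes to a spider callback
sel = Selector(response)
fb_description = sel.xpath('//meta[@property="og:description"]/@content').extract()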

Regex in Python to find an attribute of a certain HTML tag?

3 answers:
