Python regular expression for HTML parsing (BeautifulSoup)

Time: 2008-09-10 21:49:54

Tags: python regex screen-scraping

I want to grab the value of a hidden input field in HTML.

<input type="hidden" name="fooId" value="12-3456789-1111111111" />

I want to write a regular expression in Python that returns the value of fooId, given that I know the line in the HTML follows this format:

<input type="hidden" name="fooId" value="**[id is here]**" />

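Can someone provide an example in Python that parses the HTML for the value?

7 answers:

Answer 0: (score: 27)

For this particular case, BeautifulSoup is harder to write than a regex, but it is much more robust... I'm just contributing the BeautifulSoup example, given that you already know which regex to use :-)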

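from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with bs4 use: from bs4 import BeautifulSoup

# Or retrieve it from the web, etc.
html_data = open('/yourwebsite/page.html', 'r').read()

# Create the soup object from the HTML data
soup = BeautifulSoup(html_data)
# Find the proper tag. The attributes go in a dict because find()'s first
# positional parameter is itself called "name", so name='fooId' would clash.
fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})
value = fooId.attrs[2][1]  # The value of the third attribute of the desired tag (BeautifulSoup 3 only)
                           # or index it directly via fooId['value']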

Answer 1: (score: 18)

I agree with Vinko that BeautifulSoup is the way to go. However, I suggest using fooId['value'] to get the attribute, rather than relying on value being the third attribute.

from BeautifulSoup import BeautifulSoup
#Or retrieve it from the web, etc.
html_data = open('/yourwebsite/page.html','r').read()
#Create the soup object from the HTML data
soup = BeautifulSoup(html_data)
fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'}) #Find the proper tag
value = fooId['value'] #The value attribute

Answer 2: (score: 8)

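import re

# inputHTML holds the page text; here, the line from the question
inputHTML = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
# Capture the contents of the value attribute of the fooId field
reg = re.compile(r'<input type="hidden" name="fooId" value="([^"]*)" />')
value = reg.search(inputHTML).group(1)
print('Value is %s' % value)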

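Answer 3: (score: 5)

Parsing is one of those areas where you really don't want to roll your own if you can avoid it, because you'll be chasing down edge cases and bugs for years to come.

I'd recommend using BeautifulSoup. It has a very good reputation and, judging from the docs, looks pretty easy to use.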
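To illustrate the point about edge cases, here is a minimal sketch that uses Python 3's standard-library html.parser instead of BeautifulSoup (a swapped-in technique, chosen only to avoid a third-party dependency). The class HiddenInputFinder, the helper find_hidden_values, and the sample markup are hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser


class HiddenInputFinder(HTMLParser):
    """Collect the value of every hidden <input> whose name matches (hypothetical helper)."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this,
        # and it also routes self-closing tags like <input ... /> through here.
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name") == self.field_name:
            self.values.append(attributes.get("value"))


def find_hidden_values(html_text, field_name):
    """Return the values of all matching hidden inputs, in document order."""
    finder = HiddenInputFinder(field_name)
    finder.feed(html_text)
    finder.close()
    return finder.values


# Made-up sample: same field with different attribute order and case,
# plus one occurrence inside a comment.
sample = """<html><body>
<input type="hidden" name="fooId" value="12-3456789-1111111111" />
<INPUT NAME="fooId" value="reordered" TYPE="hidden">
<!-- <input type="hidden" name="fooId" value="commented out" /> -->
</body></html>"""

print(find_hidden_values(sample, "fooId"))
# prints ['12-3456789-1111111111', 'reordered']
```

Because the parser normalizes tag and attribute names and reports comments separately, the reordered upper-case tag is found and the commented-out one is skipped, which a fixed-format regex would not manage.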

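Answer 4: (score: 1)

Pyparsing is a good interim step between BeautifulSoup and regex. It is more robust than plain regexes, since its HTML tag parsing handles variations in case, whitespace, and attribute presence/absence/order, but it is simpler than BS for this kind of basic tag extraction.

Your example is especially simple, since everything you are looking for is in the attributes of the opening "input" tag. Here is a pyparsing example showing several variations on your input tag that would give a regex fits, and also showing how NOT to match a tag if it is inside a comment: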

html = """<html><body>
<input type="hidden" name="fooId" value="**[id is here]**" />
<blah>
<input name="fooId" type="hidden" value="**[id is here too]**" />
<input NAME="fooId" type="hidden" value="**[id is HERE too]**" />
<INPUT NAME="fooId" type="hidden" value="**[and id is even here TOO]**" />
<!--
<input type="hidden" name="fooId" value="**[don't report this id]**" />
-->
<foo>
</body></html>"""

from pyparsing import makeHTMLTags, withAttribute, htmlComment

# use makeHTMLTags to create tag expression - makeHTMLTags returns expressions for
# opening and closing tags, we're only interested in the opening tag
inputTag = makeHTMLTags("input")[0]

# only want input tags with special attributes
inputTag.setParseAction(withAttribute(type="hidden", name="fooId"))

# don't report tags that are commented out
inputTag.ignore(htmlComment)

# use searchString to skip through the input 
foundTags = inputTag.searchString(html)

# dump out first result to show all returned tags and attributes
print(foundTags[0].dump())
print("")

# print out the value attribute for all matched tags
for inpTag in foundTags:
    print(inpTag.value)

Prints:

['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
- empty: True
- name: fooId
- startInput: ['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
  - empty: True
  - name: fooId
  - type: hidden
  - value: **[id is here]**
- type: hidden
- value: **[id is here]**

**[id is here]**
**[id is here too]**
**[id is HERE too]**
**[and id is even here TOO]**

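You can see that pyparsing not only matches these unpredictable variations, it also returns the data in an object that makes it easy to read out the individual tag attributes and their values.

Answer 5: (score: 0)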

/<input type="hidden" name="fooId" value="([\d-]+)" \/>/
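As a rough sketch of applying this pattern from Python (the question asks for Python, and the slash delimiters above are not Python syntax), assuming the page contains the line exactly in the stated format. The names FOO_ID_PATTERN, page, and foo_id are made up for illustration.

```python
import re

# The pattern above without the / delimiters, written as a raw string so
# that \d reaches the regex engine unchanged. The \/ escape is not needed
# in Python.
FOO_ID_PATTERN = re.compile(r'<input type="hidden" name="fooId" value="([\d-]+)" />')

# Made-up page fragment for demonstration.
page = """<form>
<input type="hidden" name="fooId" value="12-3456789-1111111111" />
</form>"""

# search() scans the whole text; it returns None when nothing matches,
# so check before calling group().
match = FOO_ID_PATTERN.search(page)
foo_id = match.group(1) if match else None
print(foo_id)
# prints 12-3456789-1111111111
```

Note that [\d-]+ only accepts digits and hyphens, so a value containing letters would not match.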

Answer 6: (score: 0)

/<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*\/>/

>>> import re
>>> s = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
>>> re.match(r'<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*/>', s).groups()
('fooId', '12-3456789-1111111111')
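One caveat worth sketching: re.match returns None when the line does not fit the pattern, so calling .groups() directly raises AttributeError. The snippet below is a minimal illustration, assuming Python 3. The function parse_hidden_input and the reordered sample line are hypothetical and exist only for this example.

```python
import re

# Same pattern as above, compiled once as a raw string.
PATTERN = re.compile(
    r'<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*/>'
)


def parse_hidden_input(line):
    """Return (name, value), or None when the line does not fit the pattern."""
    match = PATTERN.match(line)
    return match.groups() if match else None


same_order = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
# Made-up variant: identical attributes, different order.
reordered = '<input name="fooId" type="hidden" value="12-3456789-1111111111" />'

print(parse_hidden_input(same_order))
# prints ('fooId', '12-3456789-1111111111')
print(parse_hidden_input(reordered))
# prints None
```

The second call shows the limit of this approach: the pattern fixes the attribute order, so it only works as long as the HTML really does follow the exact format given in the question.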