Extracting the field names of an HTML form - Python

Time: 2011-08-02 11:00:23

Tags: python parsing

Suppose there is a link "http://www.someHTMLPageWithTwoForms.com", which is basically an HTML page with two forms (say Form 1 and Form 2). I have code like this...

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer
h = httplib2.Http('.cache')
response, content = h.request('http://www.someHTMLPageWithTwoForms.com')
for field in BeautifulSoup(content, parseOnlyThese=SoupStrainer('input')):
    if field.has_key('name'):
        print field['name']

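This returns all the field names belonging to both Form 1 and Form 2 of the HTML page. Is there any way I can get only the field names belonging to a particular form (say, only Form 2)?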

4 answers:

Answer 0 (score: 3)

If the page has only 2 forms, you can try this:

from BeautifulSoup import BeautifulSoup

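forms = BeautifulSoup(content).findAll('form')
# forms[1] is the second form. Search it for <input> tags instead of
# iterating its direct children, which also include plain text nodes.
for field in forms[1].findAll('input'):
    if field.has_key('name'):
        print field['name']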

If it is not simply about the second form, be more specific and select the form by its id or class attribute:

from BeautifulSoup import BeautifulSoup

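forms = BeautifulSoup(content).findAll(attrs={'id' : 'yourFormId'})
# forms[0] is the first element with that id. Search it for <input> tags
# instead of iterating its direct children.
for field in forms[0].findAll('input'):
    if field.has_key('name'):
        print field['name']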
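
For readers on Python 3 without BeautifulSoup 3, the same idea (group the input names by the form that encloses them, then pick a form by position or id) can be sketched with the standard library's html.parser. This is a swapped-in technique, not the code from the answer; the FormFieldCollector class and the SAMPLE markup are hypothetical names made up for the illustration.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect the names of <input> elements, grouped by enclosing <form>."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one dict per <form>, in document order
        self._current = None   # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"id": attrs.get("id"), "fields": []}
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and attrs.get("name"):
            # Only inputs that sit inside a form and carry a name are kept.
            self._current["fields"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


# Hypothetical sample page standing in for `content` from the question.
SAMPLE = """
<form id="form1"><input name="a" type="text"></form>
<form id="form2"><p><input name="b"><input name="c" /></p></form>
"""

collector = FormFieldCollector()
collector.feed(SAMPLE)

# Select by position (second form) or build a lookup keyed by form id.
second_form_fields = collector.forms[1]["fields"]
fields_by_id = {form["id"]: form["fields"] for form in collector.forms}
print(second_form_fields)
```

Because the collector tracks which form is currently open, inputs nested inside other tags (the p element in the sample) are still attributed to the right form.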

Answer 1 (score: 1)

If you have the attribute name and value, you can search by them:

from BeautifulSoup import BeautifulStoneSoup
xml = '<person name="Bob"><parent rel="mother" name="Alice">'
xmlSoup = BeautifulStoneSoup(xml)

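# A bare name= keyword is taken as the tag name, so this matches nothing
xmlSoup.findAll(name="Alice")
# []

# Pass the attribute through attrs= instead
xmlSoup.findAll(attrs={"name" : "Alice"})
# [<parent rel="mother" name="Alice"></parent>]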

Answer 2 (score: 0)

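Using lxml for this kind of parsing is also pretty easy (and I personally prefer it over BeautifulSoup because of its XPath support). For example, the following snippet will print all field names (if any) belonging to the form named "form2":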

# you can ignore this part, it's only here for the demo
from StringIO import StringIO
HTML = StringIO("""
<html>
<body>
    <form name="form1" action="/foo">
        <input name="uselessInput" type="text" />
    </form>
    <form name="form2" action="/bar">
        <input name="firstInput" type="text" />
        <input name="secondInput" type="text" />
    </form>
</body>
</html>
""")

# here goes the useful code
import lxml.html
tree = lxml.html.parse(HTML)  # parse() accepts a file-like object or a URL
root = tree.getroot()
for form in root.xpath('//form[@name="form2"]'):
    for field in form.getchildren():
        if 'name' in field.keys():
            print field.get('name')
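
Note that getchildren() only returns the direct children of the form, so inputs wrapped in a p, div or table would be skipped. A descendant search avoids that. The sketch below uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages; this is an assumption made for the illustration, and it only works on well-formed markup. The PAGE sample is hypothetical. With lxml, the comparable XPath expression would be '//form[@name="form2"]//input[@name]'.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; the inputs of form2 are wrapped in a <p>.
PAGE = """
<html>
<body>
    <form name="form1" action="/foo">
        <input name="uselessInput" type="text" />
    </form>
    <form name="form2" action="/bar">
        <p>
            <input name="firstInput" type="text" />
            <input name="secondInput" type="text" />
        </p>
        <input type="submit" />
    </form>
</body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# ".//form[@name='form2']" selects the form by its name attribute;
# "//input[@name]" then matches inputs at any depth that have a name.
inputs = root.findall(".//form[@name='form2']//input[@name]")
names = [field.get("name") for field in inputs]
print(names)
```

The submit button has no name attribute, so the [@name] predicate leaves it out, which matches the "if 'name' in field.keys()" check in the snippet above.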

Answer 3 (score: 0)

If you have the lxml and cssselect Python packages installed:

from lxml import html
def parse_form(form):
    tree = html.fromstring(form)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data