Question

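I want to get all GET and POST parameters from a web page. Suppose I have some web page. I can already get all the links from it. But if the page accepts input parameters (GET and POST), how can I get them? My algorithm looks like this: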

find every <form method="GET">...</form> (or method="POST") block in the web page;
then for each form found:
     get its <input> fields and construct the request
     then save it somewhere

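My goal is to write a crawler that gets all links and all GET and POST parameters from a website, then saves them somewhere for further analysis. My algorithm is quite simple, so I would like to know whether there are other ways to do this in Python. Can you recommend any Python libraries?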

Answer 1

How about something like this to get you started? It extracts the forms and the attributes of their inputs:

import urllib2

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, Python 2

# Download the page and parse it
s = urllib2.urlopen('http://stackoverflow.com/questions/10614974/how-to-get-post-and-get-parameters-from-web-page-in-python').read()
soup = BeautifulSoup(s)

# Print each form's action and method, then the attributes of its <input> fields
forms = soup.findAll('form')
for form in forms:
  print 'form action: %s (%s)' % (form['action'], form['method'])
  inputs = form.findAll('input')
  for input in inputs:
    print "  -> %s" % (input.attrs)

Output (for this page):

form action: /search (get)
  -> [(u'autocomplete', u'off'), (u'name', u'q'), (u'class', u'textbox'), (u'placeholder', u'search'), (u'tabindex', u'1'), (u'type', u'text'), (u'maxlength', u'140'), (u'size', u'28'), (u'value', u'')]
form action: /questions/10614974/answer/submit (post)
  -> [(u'id', u'fkey'), (u'name', u'fkey'), (u'type', u'hidden'), (u'value', u'923d3d8b45bbca57cbf0b126b2eb9342')]
  -> [(u'id', u'author'), (u'name', u'author'), (u'type', u'text')]
  -> [(u'id', u'display-name'), (u'name', u'display-name'), (u'type', u'text'), (u'size', u'30'), (u'maxlength', u'30'), (u'value', u''), (u'tabindex', u'105')]
  -> [(u'id', u'm-address'), (u'name', u'm-address'), (u'type', u'text'), (u'size', u'40'), (u'maxlength', u'100'), (u'value', u''), (u'tabindex', u'106')]
  -> [(u'id', u'home-page'), (u'name', u'home-page'), (u'type', u'text'), (u'size', u'40'), (u'maxlength', u'200'), (u'value', u''), (u'tabindex', u'107')]
  -> [(u'id', u'submit-button'), (u'type', u'submit'), (u'value', u'Post Your Answer'), (u'tabindex', u'110')]
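
If you would rather avoid a third-party dependency, roughly the same idea can be sketched in Python 3 with only the standard library. This is a minimal sketch and not part of the snippet above: the names `FormCollector` and `extract_parameters` and the `http://example.com/` base URL are made up for illustration. It assumes a missing `method` means GET and a missing `action` means the page's own URL, and it also records the query-string (GET) parameters of every link it finds.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlsplit


class FormCollector(HTMLParser):
    """Hypothetical helper: collects forms with their fields, plus link query parameters."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.forms = []        # one dict per <form>
        self.links = []        # (absolute URL, query parameters) per <a href>
        self._current = None   # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # Assumption: no method means GET, no action means the page itself
            self._current = {
                "action": urljoin(self.base_url, attrs.get("action") or ""),
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea") and self._current is not None:
            # Only named fields are sent with the request
            if attrs.get("name"):
                self._current["fields"].append((attrs["name"], attrs.get("value") or ""))
        elif tag == "a" and attrs.get("href"):
            url = urljoin(self.base_url, attrs["href"])
            self.links.append((url, parse_qs(urlsplit(url).query)))

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def extract_parameters(html, base_url):
    """Return (forms, links) found in an HTML string."""
    parser = FormCollector(base_url)
    parser.feed(html)
    return parser.forms, parser.links


if __name__ == "__main__":
    sample = (
        '<a href="/search?q=python&amp;page=2">results</a>'
        '<form action="/login" method="post">'
        '<input name="user" type="text">'
        '<input name="token" type="hidden" value="abc">'
        '</form>'
    )
    forms, links = extract_parameters(sample, "http://example.com/")
    for form in forms:
        print(form["method"], form["action"], form["fields"])
    for url, params in links:
        print(url, params)
```

Each entry in `forms` gives you the method, the absolute target URL, and the named fields with their default values, which is enough to reconstruct the request later. Forms that are built or modified by JavaScript will not show up this way, since only the static HTML is parsed.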

How to get POST and GET parameters from a web page in Python
