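I want to get all GET and POST parameters from a web page. Suppose there is some web page. I can get all the links from this page. But if the page takes input parameters (GET and POST), how can I get them? My algorithm is like this: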
find all strings of the form <form method="GET">...</form> in the web page;
then, for each match:
    get the <input> fields and construct a request;
    then save it somewhere.
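The steps above can be sketched with an HTML parser instead of string matching, which also catches forms whose attributes are in a different order or case. This is only a minimal sketch using the standard library's `html.parser`; the names `FormCollector`, `extract_forms` and the `sample` markup are made up for illustration and are not from any library.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collect each <form> with its method, action and <input> fields."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per <form> found
        self._current = None  # the form currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                # HTML falls back to GET when the method attribute is missing
                "method": (attrs.get("method") or "get").upper(),
                "action": attrs.get("action") or "",
                "inputs": [],
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            # Keep every attribute of the input (name, type, value, ...)
            self._current["inputs"].append(attrs)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def extract_forms(html):
    """Return a list of form descriptions found in an HTML string."""
    parser = FormCollector()
    parser.feed(html)
    parser.close()
    return parser.forms


# Hypothetical markup, only to show the shape of the result
sample = """
<form method="GET" action="/search"><input name="q" type="text" value=""></form>
<form method="post" action="/submit"><input name="fkey" type="hidden" value="abc"></form>
"""
for form in extract_forms(sample):
    print(form["method"], form["action"], [i.get("name") for i in form["inputs"]])
```

Note that this only looks at `<input>`; `<select>` and `<textarea>` also contribute parameters and would need the same treatment.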
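My goal is to write a crawler that collects all links and GET and POST parameters from a website and saves them somewhere for further analysis. My algorithm is quite simple, so I'd like to know whether there are other approaches (in Python). Can you recommend any Python libraries?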
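For the link side of such a crawler, GET parameters are already visible in the query string of each `href`. A rough sketch, assuming the standard library's `html.parser` and `urllib.parse` are enough for the pages in question; `LinkCollector`, `link_get_parameters`, the `page` markup and the `example.com` base URL are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, parse_qsl


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def link_get_parameters(html, base_url):
    """Return each link as an absolute URL plus its GET parameters."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    results = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        parts = urlsplit(url)
        # keep_blank_values so that "?q=" still reports the parameter q
        params = parse_qsl(parts.query, keep_blank_values=True)
        results.append({"url": url, "path": parts.path, "params": params})
    return results


# Hypothetical page fragment and base URL
page = '<a href="/search?q=python&amp;page=2">search</a> <a href="about">about</a>'
for entry in link_get_parameters(page, "http://example.com/docs/"):
    print(entry["url"], entry["params"])
```

Links built by JavaScript will not show up this way; that would need a real browser engine.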
Answer 0 (score: 0)
How about something like this to get you started? It extracts the forms and their input attributes:
# Python 2 with BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup

s = urllib2.urlopen('http://stackoverflow.com/questions/10614974/how-to-get-post-and-get-parameters-from-web-page-in-python').read()
soup = BeautifulSoup(s)
# Find every <form> on the page
forms = soup.findAll('form')
for form in forms:
    print 'form action: %s (%s)' % (form['action'], form['method'])
    # List the attributes of each <input> inside the form
    inputs = form.findAll('input')
    for field in inputs:
        print " -> %s" % (field.attrs)
Output (for this page):
form action: /search (get)
-> [(u'autocomplete', u'off'), (u'name', u'q'), (u'class', u'textbox'), (u'placeholder', u'search'), (u'tabindex', u'1'), (u'type', u'text'), (u'maxlength', u'140'), (u'size', u'28'), (u'value', u'')]
form action: /questions/10614974/answer/submit (post)
-> [(u'id', u'fkey'), (u'name', u'fkey'), (u'type', u'hidden'), (u'value', u'923d3d8b45bbca57cbf0b126b2eb9342')]
-> [(u'id', u'author'), (u'name', u'author'), (u'type', u'text')]
-> [(u'id', u'display-name'), (u'name', u'display-name'), (u'type', u'text'), (u'size', u'30'), (u'maxlength', u'30'), (u'value', u''), (u'tabindex', u'105')]
-> [(u'id', u'm-address'), (u'name', u'm-address'), (u'type', u'text'), (u'size', u'40'), (u'maxlength', u'100'), (u'value', u''), (u'tabindex', u'106')]
-> [(u'id', u'home-page'), (u'name', u'home-page'), (u'type', u'text'), (u'size', u'40'), (u'maxlength', u'200'), (u'value', u''), (u'tabindex', u'107')]
-> [(u'id', u'submit-button'), (u'type', u'submit'), (u'value', u'Post Your Answer'), (u'tabindex', u'110')]
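To go from output like this to the "construct request and save it" part of the question, one possible next step is sketched below in Python 3. It assumes each form has already been reduced to a plain dict with `method`, `action` and `inputs` keys; that layout, the `build_request` helper and the shortened `page_url` are assumptions made for this example, not something BeautifulSoup produces directly.

```python
import json
from urllib.parse import urlencode, urljoin, urlsplit


def build_request(page_url, form):
    """Turn one extracted form into a plain dict describing a request."""
    method = (form.get("method") or "get").upper()
    target = urljoin(page_url, form.get("action") or "")
    # Inputs without a name are not submitted, so skip them
    fields = [(i["name"], i.get("value") or "")
              for i in form.get("inputs", []) if i.get("name")]
    request = {"method": method, "url": target, "params": fields}
    if method == "GET":
        # GET parameters replace the query string of the action URL
        request["url"] = urlsplit(target)._replace(query=urlencode(fields)).geturl()
    else:
        # POST parameters go into a form-encoded body
        request["body"] = urlencode(fields)
    return request


# Assumed dict layout, filled with values taken from the output above
page_url = "http://stackoverflow.com/questions/10614974/"
forms = [
    {"method": "get", "action": "/search",
     "inputs": [{"name": "q", "type": "text", "value": ""}]},
    {"method": "post", "action": "/questions/10614974/answer/submit",
     "inputs": [{"name": "fkey", "type": "hidden",
                 "value": "923d3d8b45bbca57cbf0b126b2eb9342"},
                {"id": "submit-button", "type": "submit",
                 "value": "Post Your Answer"}]},
]
saved = [build_request(page_url, f) for f in forms]
print(json.dumps(saved, indent=2))
```

Dumping to JSON keeps the result easy to load again for later analysis; `json.dump` with an open file would write the same structure to disk.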