How do I iterate over BeautifulSoup results to get the action (link) of every form on a page?

Time: 2018-04-22 10:11:18

Tags: python python-2.7 beautifulsoup python-requests

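I wrote the following to scrape the actions from the forms on a page, to save myself from clicking through them one at a time: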

import requests
from bs4 import BeautifulSoup 
import re

with requests.Session() as c:
    url = 'https://website.com/login'
    EMAIL = ''
    PASSWORD = ''
    c.get(url)
    login_data = dict(email=EMAIL, password=PASSWORD)
    c.post(url, data=login_data, headers={"Referer": 
    "https://website.com/"})
    page = c.get('https://website.com/dashboard')


parser = BeautifulSoup(page.content, 'html.parser')
forms = parser.find('form').get('action')

When I run this, I only get the result from the first form. Being able to iterate so that I get the results for all forms would be one solution.

I can change the find call to

parser.find_all('form')

which returns all the forms, but instead of usable links I get

<form accept-charset="UTF-8" action="https://website.com/action" method="GET">
<input class="button" type="submit" value="action"/>
</form>    

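These are stored in a Python list, so being able to iterate over them and strip everything except the URL would be another solution. They always have the same format: the URL length varies slightly, but the content before and after it is always the same.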

If I try to use

parser.find_all_next('form').get('action')

I get the following error:

Traceback (most recent call last):
  File "scrape.py", line 16, in <module>
    forms = parser.find_all_next('form').get('action')
  File "/home/username/.local/lib/python2.7/site-packages/bs4/element.py", line 1807, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'get'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

1 Answer:

Answer 0: (score: 2)

You just need to loop over parser.find_all('form'), get the action attribute of each element, and store it in a list. A list comprehension does this in one line.

import requests
from bs4 import BeautifulSoup

with requests.Session() as c:
    url = 'https://website.com/login'
    EMAIL = ''
    PASSWORD = ''
    c.get(url)
    login_data = dict(email=EMAIL, password=PASSWORD)
    # Log in first so the dashboard request is authenticated.
    c.post(url, data=login_data, headers={"Referer": "https://website.com/"})
    page = c.get('https://website.com/dashboard')

parser = BeautifulSoup(page.content, 'html.parser')
# find_all() returns a list-like ResultSet; read 'action' from each form tag.
forms = [f.get('action') for f in parser.find_all('form')]

The list of all URLs is now stored in the forms variable.
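
To try the same idea without logging in anywhere, here is a minimal sketch that runs the list comprehension against an inline HTML string. The sample markup, the form_actions helper name, and the https://website.com/other URL are made up for illustration; only the first form is taken from the question. It also skips forms that have no action attribute, since Tag.get() returns None in that case, which may or may not be what you want.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the dashboard page.
SAMPLE_HTML = """
<form accept-charset="UTF-8" action="https://website.com/action" method="GET">
<input class="button" type="submit" value="action"/>
</form>
<form action="https://website.com/other" method="POST">
<input class="button" type="submit" value="other"/>
</form>
<form method="GET">
<input class="button" type="submit" value="no action"/>
</form>
"""


def form_actions(html):
    """Return the action attribute of every form that has one."""
    soup = BeautifulSoup(html, 'html.parser')
    # Tag.get() returns None when the attribute is missing, so skip those forms.
    return [f.get('action') for f in soup.find_all('form') if f.get('action')]


actions = form_actions(SAMPLE_HTML)
print(actions)
```

The error in the question comes from calling .get() on the ResultSet itself. Looping over the ResultSet and calling .get() on each individual tag, as above, avoids it.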