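I'm having trouble with BeautifulSoup. Here is what I'm trying to do:
For every form in each HTML page I read, I want to get the URL that its "action" attribute points to.
Here is my code: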
from bs4 import BeautifulSoup

def myfunction(path):
    # Retrieve the HTML files from a folder
    pages = find_files(path, '.html')  # as a list
    for page in pages:
        stream = open(page, "r")
        soup = BeautifulSoup(stream, "lxml")
        formsoup = soup.find('form', attrs={"method": u"post"})
        if formsoup is not None:
            # This is the call that comes back empty in the output
            action = soup.find('form', attrs={"method": u"post"}).findAll("action")
            print("Action is => %s\n" % action)
            print("Source: %s\ncode: %s\n\n\n\n\n" % (page, formsoup))
        stream.close()
Here is the result I get:
Action is => []
Source: mysource.html
code: <form accept-charset="UTF-8" action="http://actionIshouldget.com/" id="user-login" method="post"><div><div class="form-item form-type-textfield form-item-name">
[... about 20 lines omitted that are not relevant here]
And here is the result I should get:
Action is => http://actionIshouldget.com/
Source: mysource.html
code: <form accept-charset="UTF-8" action="http://actionIshouldget.com/" id="user-login" method="post"><div><div class="form-item form-type-textfield form-item-name">
[... about 20 lines omitted that are not relevant here]
I did not manage to get it working with for form in soup.find('form', attrs={"method":u"post"}),
nor with regular expressions.
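Answer 0 (score: 0)
findAll() looks for child elements inside the structure you already have; in your case it searches for <action> elements, not for the form's action attribute.
Have you tried this?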
formsoup = soup.find('form', attrs={"method":u"post"})
formsoup['action']
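To cover every form on a page rather than only the first one, something along these lines might work. This is only a sketch: the inline HTML string, the variable names, and the use of the built-in "html.parser" (instead of "lxml") are assumptions made so the example is self-contained. It uses find_all(), the bs4 spelling of findAll(), and reads the attribute with .get(), which returns None rather than raising KeyError when a form has no action.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for one of the HTML files.
html = """
<form action="http://actionIshouldget.com/" id="user-login" method="post"><div></div></form>
<form action="/search" method="get"></form>
<form id="no-action" method="post"></form>
"""

# "html.parser" ships with Python; "lxml" would also work if installed.
soup = BeautifulSoup(html, "html.parser")

# find_all("action") searches for <action> tags, so nothing matches here.
no_tags = soup.find_all("action")

# Attributes live on the tag itself; .get() yields None if the attribute is absent.
post_actions = [
    form.get("action")
    for form in soup.find_all("form", attrs={"method": "post"})
]

print(no_tags)
print(post_actions)
```

In the original function, the same loop over soup.find_all('form', attrs={"method": "post"}) could replace the single soup.find(...) call inside the for page in pages loop.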