BeautifulSoup: getting a value from a class

Date: 2015-12-07 13:33:06

Tags: python html css regex beautifulsoup

I'm having trouble with BeautifulSoup. Here is what I am trying to do:

  

For every form in every HTML page I read, I want to get the URL that its "action" attribute points to.

Here is my code:

def myfunction(path):
    from bs4 import BeautifulSoup

    # Retrieve the HTML files from a folder
    pages = find_files(path, '.html')  # returns a list
    for page in pages:
        stream = open(page, "r")
        soup = BeautifulSoup(stream, "lxml")
        formsoup = soup.find('form', attrs={"method":u"post"})
        if formsoup is not None:
            action = soup.find('form', attrs={"method":u"post"}).findAll("action")
            print "Action is => %s\n" % action
            print ("Source: %s\ncode: %s\n\n\n\n\n" % (page, formsoup))
        stream.close()

Here is the result I get:

Action is => []

    Source: mysource.html
    code: <form accept-charset="UTF-8" action="http://actionIshouldget.com/" id="user-login" method="post"><div><div class="form-item form-type-textfield form-item-name">
[... hiding about ~20 lines that are useless for me]

Here is the result I should get:

Action is => http://actionIshouldget.com/

    Source: mysource.html
    code: <form accept-charset="UTF-8" action="http://actionIshouldget.com/" id="user-login" method="post"><div><div class="form-item form-type-textfield form-item-name">
[... hiding about ~20 lines that are useless for me]

I could not get it to work using for form in soup.find('form', attrs={"method":u"post"}) or with a regular expression.

1 answer:

Answer 0 (score: 0)

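findAll() looks for child elements inside the structure you already have. In your case it searches for <action> elements, and since "action" is an attribute of the <form> tag rather than a tag of its own, the result is an empty list.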

Have you tried this?

formsoup = soup.find('form', attrs={"method":u"post"})
formsoup['action']
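
To cover every form on a page rather than only the first, the same attribute lookup can be applied inside a loop over find_all(). The following is a minimal sketch, not the asker's actual setup: it is written for Python 3, uses the built-in "html.parser" instead of "lxml" so it runs without extra dependencies, and the HTML string and the helper name post_form_actions are made up for illustration. It uses .get("action") rather than ['action'] because indexing raises a KeyError when a form has no action attribute, while .get() returns None.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the (trimmed) output in the question.
HTML = """
<form accept-charset="UTF-8" action="http://actionIshouldget.com/" id="user-login" method="post"><div></div></form>
<form method="get" action="/search"></form>
<form method="post"></form>
"""


def post_form_actions(markup):
    """Return the action URL of every <form method="post"> that has one."""
    soup = BeautifulSoup(markup, "html.parser")
    actions = []
    # find_all() returns every matching tag; find() would return only the first.
    for form in soup.find_all("form", attrs={"method": "post"}):
        # "action" is an attribute of the tag, so read it like a dict key.
        # .get() returns None instead of raising KeyError when it is missing.
        action = form.get("action")
        if action is not None:
            actions.append(action)
    return actions


# Searching for a tag named "action" finds nothing, which explains the [] result.
no_such_tags = BeautifulSoup(HTML, "html.parser").find_all("action")

print(post_form_actions(HTML))
```

In the original function, the same loop would replace the soup.find(...) call inside the for page in pages block, printing each action as it is found.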