With this Python code I can get the entire HTML source:
import io
import mechanize
import lxml.html

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [("User-agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.13) Gecko/20101206 Ubuntu/10.10 (maverick) Firefox/3.6.13")]
# Open the login URL.
sign_in = br.open("http://target.co.uk")
# Select the form by its index. There is only one form in this example, so nr=0.
# If the form has a name attribute, br.select_form(name=...) can be used instead.
br.select_form(nr=0)
# "username" is the form field that takes the username/email value.
br["username"] = "myusername"
# "password" is the form field that takes the password value.
br["password"] = "myp4sw0rd"
# Submit the login credentials.
logged_in = br.submit()
# Read the body of the page we are redirected to after login.
logincheck = logged_in.read()
if b"logout" in logincheck:
    print("Login success, you just logged in.")
else:
    print("Login failed")
# Other URLs can be opened the same way once logged in.
coding1_content = br.open("https://www.target.co.uk/levels/coding/1").read()
# The response body is a byte string, so wrap it in BytesIO for lxml.
tree = lxml.html.parse(io.BytesIO(coding1_content))
for ta in tree.findall(".//textarea"):
    if not ta.get("name"):
        print(ta.text)
if b"textarea" in coding1_content:
    print("Textarea found.")
else:
    print("Textarea not found.")
But what I need is the content of the first textarea tag, the one that has no name attribute. My HTML source looks like this:
........
........
<textarea>this, is, what, i, want</textarea>
<textarea name="answer">i don't need it</textarea>
........
........
Any help would be greatly appreciated.
Answer 0 (score: 1)
According to the lxml documentation, you can access the forms of an HTML object through its forms attribute:
from lxml.html import fromstring

form_page = fromstring('''some html code with a <form>''')
form = form_page.forms[0]  # the first form on the page
form.fields  # the form's fields
See more here: http://lxml.de/lxmlhtml.html -> Forms
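As a minimal sketch of how the forms attribute applies to this question, the snippet below runs it against the sample markup from the question. It assumes that form.fields is keyed by field name, so the unnamed textarea does not show up there and has to be reached by walking the form's elements instead. The variable names are illustrative only.

```python
import lxml.html

# Sample markup taken from the question.
html = """<html><body><form>
<textarea>this, is, what, i, want</textarea>
<textarea name="answer">i don't need it</textarea>
</form></body></html>"""

page = lxml.html.fromstring(html)
form = page.forms[0]  # first form on the page

# Names of the fields lxml exposes through form.fields.
named_fields = sorted(form.fields.keys())

# Walk the textarea elements inside the form and keep those without a name.
unnamed_texts = [ta.text for ta in form.iter("textarea") if ta.get("name") is None]

print(named_fields)
print(unnamed_texts)
```

Iterating the form element directly keeps the search scoped to that one form, which may matter if the page contains several forms.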
Answer 1 (score: 0)
If the HTML is
<html>
<body>
<form>
<textarea>this, is, what, i, want</textarea>
<textarea name="answer">i don't need it</textarea>
</form>
</body>
</html>
you can get the content of the textarea like this:
import io
import lxml.html

html = "..."
tree = lxml.html.parse(io.StringIO(html))
for ta in tree.findall(".//textarea"):
    if not ta.get("name"):
        print(ta.text)
Output:
this, is, what, i, want
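Since the question asks only for the first unnamed textarea, the loop can stop at the first match. The helper below is a rough sketch of that idea; first_unnamed_textarea and sample_html are hypothetical names, and it uses fromstring() with iter() rather than parse() with findall(), purely to keep the example short.

```python
import lxml.html


def first_unnamed_textarea(html):
    """Return the text of the first <textarea> without a name, or None."""
    root = lxml.html.fromstring(html)
    # Generator of texts for textareas whose name attribute is missing or empty.
    matches = (ta.text for ta in root.iter("textarea") if not ta.get("name"))
    # next() stops at the first match and falls back to None if there is none.
    return next(matches, None)


sample_html = """<html><body><form>
<textarea>this, is, what, i, want</textarea>
<textarea name="answer">i don't need it</textarea>
</form></body></html>"""

print(first_unnamed_textarea(sample_html))
```

Returning None when nothing matches lets the caller tell "no such textarea" apart from a page that was never fetched.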
Answer 2 (score: 0)
Another possible way to get all <textarea> elements that have no name HTML attribute is to use the xpath() method:
.....
for t in tree.xpath(".//textarea[not(@name)]"):
    print(t.text)
While findall() supports only a subset of the XPath language, xpath() has full XPath 1.0 support. For example, as this particular case shows, xpath() supports not() while findall() does not.
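To make the xpath() approach concrete, here is a small self-contained sketch run against the markup from the question. It shows two variations that rely on XPath 1.0 features: selecting the text nodes directly with text(), and picking only the first match with a positional predicate. The variable names are illustrative.

```python
import lxml.html

# Sample markup taken from the question.
html = """<html><body><form>
<textarea>this, is, what, i, want</textarea>
<textarea name="answer">i don't need it</textarea>
</form></body></html>"""

root = lxml.html.fromstring(html)

# not(@name) keeps only textarea elements that lack a name attribute;
# /text() returns their text nodes instead of the elements.
all_unnamed = root.xpath(".//textarea[not(@name)]/text()")

# The parenthesised [1] selects the first match in document order.
first = root.xpath("(.//textarea[not(@name)])[1]")
first_text = first[0].text if first else None

print(all_unnamed)
print(first_text)
```

Note that xpath() always returns a list here, so the result needs an emptiness check before indexing.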