I am trying to retrieve all the data from an HTML output using the following code:
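import base64
import urllib2

import lxml.html as LH

# Build an opener that sends HTTP traffic through the authenticated proxy.
proxy_auth = "http://" + proxyUser + ":" + proxyPass + "@" + proxyHost
proxy_handler = urllib2.ProxyHandler({"http": proxy_auth})
opener = urllib2.build_opener(proxy_handler)
opener = urllib2.build_opener()  # note: this replaces the proxy-enabled opener above
urllib2.install_opener(opener)

# Request the adapter list using HTTP Basic authentication.
request = urllib2.Request("http://" + iserver + "/invoke/pub.art/listRegisteredAdapters")
base64string = base64.encodestring('%s:%s' % (login, password)).replace('\n', '')
request.add_header("Authorization", "Basic %s" % base64string)
response = urllib2.urlopen(request)
html = response.read()

# Parse the page and take the text of every <td>, in document order.
doc = LH.fromstring(html)
tds = (td.text_content() for td in doc.xpath("//td"))
print html

# Walk the cells two at a time, treating each pair as (label, value).
for td, val in zip(*[tds]*2):
    if td == "adapterTypeName":
        adapterTypeName = val
        print adapterTypeName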
This is the raw HTML output:
<BODY bgcolor=#dddddd>
<TABLE bgcolor=#dddddd border=1>
<TR>
<TD valign="top"><B>registeredAdapterList</B></TD>
<TD>
<TABLE>
<TR>
<TD><TABLE bgcolor=#dddddd border=1>
<TR>
<TD valign="top"><B>adapterTypeName</B></TD>
<TD>SAPAdapter</TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
<TD><TABLE bgcolor=#dddddd border=1>
<TR>
<TD valign="top"><B>adapterTypeName</B></TD>
<TD>SMSCAdapter</TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
<TD><TABLE bgcolor=#dddddd border=1>
<TR>
<TD valign="top"><B>adapterTypeName</B></TD>
<TD>PRTServerAdapter</TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
<TD><TABLE bgcolor=#dddddd border=1>
<TR>
<TD valign="top"><B>adapterTypeName</B></TD>
<TD>com.vf.bdp.BDPAdapter</TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
<TD><TABLE bgcolor=#dddddd border=1>
<TR>
<TD valign="top"><B>adapterTypeName</B></TD>
<TD>SiebelAdapter</TD>
</TR>
</TABLE>
</TD>
</TR>
<TR>
<TD><TABLE bgcolor=#dddddd border=1>
<TR>
<TD valign="top"><B>adapterTypeName</B></TD>
<TD>JDBCAdapter</TD>
</TR>
</TABLE>
</TD>
</TR>
</TABLE>
</TD>
</TR>
</TABLE>
</BODY>
What I expect is to retrieve the following fields:
SAPAdapter
SMSCAdapter
PRTServerAdapter
com.vf.bdp.BDPAdapter
SiebelAdapter
JDBCAdapter
Instead, I only get:
SMSCAdapter
com.vf.bdp.BDPAdapter
JDBCAdapter
Since I am new to Python, I can't tell what is going wrong here.
Answer 0 (score: 1)
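Your problem is the XPath expression, which is too permissive. //td also matches elements you don't actually want, such as the outer cells that wrap the nested tables, so the (label, value) pairing falls out of step. Try printing the results to see what I mean.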
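To see why only every other name comes out, the pairing step can be reproduced without any HTML parsing. The sketch below is an illustration only: the `cells` list is hand-written to mirror the order in which `//td` would plausibly return cells for the sample HTML (two outer cells, then a wrapper cell, a label cell and a value cell per adapter), the wrapper strings are placeholders for whatever `text_content()` really returns, and `pair_up` is a hypothetical helper that does the same thing as `zip(*[tds]*2)`.

```python
# Adapter names taken from the sample HTML in the question.
adapters = ["SAPAdapter", "SMSCAdapter", "PRTServerAdapter",
            "com.vf.bdp.BDPAdapter", "SiebelAdapter", "JDBCAdapter"]

# Assumed document order of the <td> cells matched by //td:
# two outer cells, then (wrapper, label, value) for each adapter.
cells = ["registeredAdapterList", "<outer wrapper cell>"]
for name in adapters:
    cells.extend(["<wrapper cell for %s>" % name, "adapterTypeName", name])


def pair_up(items):
    """Group items into consecutive pairs, like zip(*[iterator]*2)."""
    it = iter(items)
    return list(zip(it, it))


# Keep the value of every pair whose first element is the label.
found = [val for label, val in pair_up(cells) if label == "adapterTypeName"]
print(found)
```

Because each adapter contributes three cells rather than two, the label only lands in the first slot of a pair for every second adapter, which matches the three names the question reports.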
It looks to me like you want the text of every td element that has no child elements. A simple way to do that is:
doc = LH.fromstring(html)
for td in doc.xpath('//td[not(*)]/text()'):
    print td
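For anyone who wants to try this without access to the server, here is a self-contained sketch that runs the same expression against a shortened copy of the sample HTML (the first three adapters only). It assumes lxml is installed; the `sample` string and the variable names are mine. The second query, which selects the cell following each `adapterTypeName` label, is an alternative I am suggesting, not part of the answer above.

```python
import lxml.html as LH

# Shortened copy of the HTML from the question (first three adapters).
sample = """
<BODY bgcolor=#dddddd>
<TABLE bgcolor=#dddddd border=1>
<TR>
<TD valign="top"><B>registeredAdapterList</B></TD>
<TD>
<TABLE>
<TR><TD><TABLE bgcolor=#dddddd border=1>
<TR><TD valign="top"><B>adapterTypeName</B></TD><TD>SAPAdapter</TD></TR>
</TABLE></TD></TR>
<TR><TD><TABLE bgcolor=#dddddd border=1>
<TR><TD valign="top"><B>adapterTypeName</B></TD><TD>SMSCAdapter</TD></TR>
</TABLE></TD></TR>
<TR><TD><TABLE bgcolor=#dddddd border=1>
<TR><TD valign="top"><B>adapterTypeName</B></TD><TD>PRTServerAdapter</TD></TR>
</TABLE></TD></TR>
</TABLE>
</TD>
</TR>
</TABLE>
</BODY>
"""

doc = LH.fromstring(sample)

# The expression from the answer: text of <td> cells with no child elements.
leaf_texts = [t.strip() for t in doc.xpath('//td[not(*)]/text()')]
print(leaf_texts)

# Alternative (my suggestion): key on the label, then take the next cell.
by_label = [t.strip() for t in doc.xpath(
    '//td[b="adapterTypeName"]/following-sibling::td/text()')]
print(by_label)
```

The first query relies on the value cells being the only leaf cells in the page. The label-based query may hold up better if the page later gains other leaf cells, at the cost of hard-coding the label text.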