I am trying to parse the HTML output returned by a URL call, but I am struggling to get it working.
I am using the following code:
import urllib2
import base64 as b64
import lxml.html as LH
request = urllib2.Request('http://%s%s' % (fInput[1], fInput[2]))
base64string = b64.encodestring('%s:%s' % (fInput[3], fInput[4])).replace('\n', '')
request.add_header("Authorization", "Basic %s" % base64string)
response = urllib2.urlopen(request)
html = response.read()
root = LH.fromstring(html)
sibling_content = lambda x: [b.getparent().getnext().text_content() for b in root.cssselect("td b:contains('{0}')".format(x))]
fields = ['groupList','namelist']
for result in zip(*[sibling_content(field) for field in fields]):
    print result
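When I print the output, the result is:
('Admins', '\nme\nmyself\nirene')
('guests', '\nhin\nhinself\nbacon')
The output I need is a list like this, so that I can insert it into a database:
['Admins', 'me', 'myself', 'irene'], ['guests', 'hin', 'hinself', 'bacon']
Below is the HTML returned by the HTTP request I send: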
<BODY bgcolor="#dddddd">
<TABLE bgcolor="#dddddd" border="1">
<TR>
<TD valign="top"><B>MainList</B></TD>
<TD>
<TABLE>
<TR>
<TD>
<TABLE bgcolor="#dddddd" border="1">
<TR>
<TD valign="top"><B>groupList</B></TD>
<TD>Admins</TD>
</TR>
<TR>
<TD valign="top"><B>namelist</B></TD>
<TD>
<TABLE>
<TR>
<TD>me</TD>
</TR>
<TR>
<TD>myself</TD>
</TR>
<TR>
<TD>irene</TD>
</TR>
</TABLE>
</TD>
</TR>
</TABLE>
<TABLE bgcolor="#dddddd" border="1">
<TR>
<TD valign="top"><B>groupList</B></TD>
<TD>guests</TD>
</TR>
<TR>
<TD valign="top"><B>namelist</B></TD>
<TD>
<TABLE>
<TR>
<TD>hin</TD>
</TR>
<TR>
<TD>hinself</TD>
</TR>
<TR>
<TD>bacon</TD>
</TR>
</TABLE>
</TD>
</TR>
</TABLE>
</TD>
</TR>
</TABLE>
</TD>
</TR>
</TABLE>
</BODY>
Any ideas on how to get this working?
Thanks in advance.
Answer 0 (score: 3)
I am a little confused by your question. Are you just asking how to turn
('Members', '\nme\nmyself\nirene')('Members_2', '\nhin\nhinself\nbacon')
into
['Members', 'me', 'myself', 'irene'], ['Members_2', 'hin', 'hinself', 'bacon']
That is straightforward:
>>> x = [('Members', '\nme\nmyself\nirene'), ('Members_2', '\nhin\nhinself\nbacon')]
>>> [[y[0]] + y[1].splitlines()[1:] for y in x]
[['Members', 'me', 'myself', 'irene'], ['Members_2', 'hin', 'hinself', 'bacon']]
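The `[1:]` slice relies on every names string starting with exactly one newline. If that is not guaranteed, a slightly more defensive variant could filter out blank lines instead. This is only a sketch in Python 3; `to_rows` is a hypothetical helper name, not something from the question:

```python
def to_rows(pairs):
    """Turn (group, newline-separated names) pairs into flat lists."""
    rows = []
    for group, names in pairs:
        # Keep only non-blank lines, trimmed of surrounding whitespace.
        members = [line.strip() for line in names.splitlines() if line.strip()]
        rows.append([group] + members)
    return rows


pairs = [('Members', '\nme\nmyself\nirene'),
         ('Members_2', '\nhin\nhinself\nbacon')]
print(to_rows(pairs))
```

Each inner list then holds the group name followed by its members, which is the shape asked for in the question.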
FYI, you should take a look at the requests Python module. Instead of all the urllib2 boilerplate, it lets you just write
requests.get(url, auth=(user, password))
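If you would rather stay with the standard library, note that in Python 3 `urllib2` became `urllib.request` and `base64.encodestring` was removed in 3.9. The hand-built header from the question could then be sketched roughly as follows. `basic_auth_request` is a hypothetical helper name, and the URL and credentials are placeholders, not values from the question:

```python
import base64
import urllib.request


def basic_auth_request(url, user, password):
    """Build a urllib request carrying an HTTP Basic Authorization header."""
    credentials = '{0}:{1}'.format(user, password).encode('utf-8')
    # b64encode returns bytes with no trailing newline, so no replace() is needed.
    token = base64.b64encode(credentials).decode('ascii')
    request = urllib.request.Request(url)
    request.add_header('Authorization', 'Basic {0}'.format(token))
    return request


# Placeholder host and credentials, for illustration only; nothing is sent here.
request = basic_auth_request('http://example.com/list', 'user', 'secret')
print(request.get_header('Authorization'))
```

Passing the returned object to `urllib.request.urlopen` would perform the actual call.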
Answer 1 (score: 1)
Use xpath:
root = LH.fromstring(html)
[t.xpath('.//td[not(contains(.,"\n"))]/text()')
    for t in root.xpath('.//table[@bgcolor="#dddddd"]')]
Output:
[['Admins', 'me', 'myself', 'irene'], ['guests', 'hin', 'hinself', 'bacon']]
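One thing to watch: in the HTML as posted, the outer MainList table also carries bgcolor="#dddddd", so a selector on that attribute alone may match it too. A possible alternative is to keep only the tables whose own rows hold a groupList label. The sketch below uses the standard library in Python 3 and assumes the markup is well-formed enough for `xml.etree.ElementTree`, which real-world HTML often is not (lxml is more forgiving). `group_rows` and `SAMPLE` are hypothetical names, and `SAMPLE` is a compacted copy of the HTML from the question:

```python
import xml.etree.ElementTree as ET

# Compacted copy of the HTML from the question (tag case preserved).
SAMPLE = """<BODY>
<TABLE bgcolor="#dddddd" border="1"><TR>
<TD valign="top"><B>MainList</B></TD>
<TD><TABLE><TR><TD>
<TABLE bgcolor="#dddddd" border="1">
<TR><TD valign="top"><B>groupList</B></TD><TD>Admins</TD></TR>
<TR><TD valign="top"><B>namelist</B></TD><TD><TABLE>
<TR><TD>me</TD></TR><TR><TD>myself</TD></TR><TR><TD>irene</TD></TR>
</TABLE></TD></TR>
</TABLE>
<TABLE bgcolor="#dddddd" border="1">
<TR><TD valign="top"><B>groupList</B></TD><TD>guests</TD></TR>
<TR><TD valign="top"><B>namelist</B></TD><TD><TABLE>
<TR><TD>hin</TD></TR><TR><TD>hinself</TD></TR><TR><TD>bacon</TD></TR>
</TABLE></TD></TR>
</TABLE>
</TD></TR></TABLE></TD>
</TR></TABLE>
</BODY>"""


def group_rows(markup):
    """Return one [group, name, name, ...] list per table that has a groupList row."""
    root = ET.fromstring(markup)
    rows = []
    for table in root.iter('TABLE'):
        fields = {}
        # findall('TR') only sees the table's direct rows, not nested ones.
        for tr in table.findall('TR'):
            cells = tr.findall('TD')
            if len(cells) != 2:
                continue
            label = cells[0].find('B')
            if label is None or label.text is None:
                continue
            # Collect every non-blank text fragment inside the value cell.
            values = [t.strip() for t in cells[1].itertext() if t.strip()]
            fields[label.text.strip()] = values
        if 'groupList' in fields:
            rows.append(fields['groupList'] + fields.get('namelist', []))
    return rows


print(group_rows(SAMPLE))
```

The outer table is skipped because its only label is MainList, and the inner name tables are skipped because their rows have a single cell.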
Answer 2 (score: 0)
I think stripping the whitespace from the text content should work. So in your code, you could add a strip call to this line:
sibling_content = lambda x: [b.getparent().getnext().text_content().strip() for b in root.cssselect("td b:contains('{0}')".format(x))]
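Keep in mind that `strip()` only trims leading and trailing whitespace, so the names would most likely still come back as one newline-joined string and need a split afterwards. A small Python 3 illustration, using the value from the question's printed output:

```python
text = '\nme\nmyself\nirene'  # value taken from the question's printed output

stripped = text.strip()
print(repr(stripped))  # the leading newline is gone, the inner ones remain

names = stripped.split('\n')
print(names)
```

Splitting on '\n' rather than on any whitespace keeps names that contain spaces in one piece.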