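I am trying to parse an HTML table with lxml. Although rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()') returns results, I only want to extract a cell's contents when the cell starts with one of the variables from my config file. For example, if a <td> starts with "Street 1", I want the contents of the <span> tag inside that <td>. That way I can build a tuple of tuples (handling None values) that I can then store in a database.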
lxml_parse.py
import lxml.html as lh
doc=open('test.htm', 'r')
outhtml=lh.parse(doc)
doc.close()
rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()')
print(rows)
TEST.HTM
<tr>
<td></td>
<td colspan="2">
Street 1:<span class="required"> *</span><br />
<span class="boldred">2100 5th Ave</span>
</td>
<td colspan="2">
Street 2:<br />
<span class="boldred">Ste 202</span>
</td>
</tr>
<tr>
<td></td>
<td>
City:<span class="required"> *</span><br />
<span class="boldred">NYC</span>
</td>
<td>
State:<br />
<SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
</td>
<td>
Country:<span class="required"> *</span><br />
<SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">USA</SPAN>
</td>
<td>
Zip:<br />
<span class="boldred">10022</span>
</td>
</tr>
Output
$ python lxml_parse.py
['2100 5th Ave', 'Ste 202', 'NYC', 'NY', 'USA', '10022']
Where I run into trouble is parsing for a set of variables:
import lxml.html as lh
desiredvars = ['Street 1','Street 2','City', 'State', 'Zip']
doc=open('test.htm', 'r')
outhtml=lh.parse(doc)
doc.close()
myresultset = ((var, outhtml.xpath('//tr/td[child::*[text()=var]]/span[@class="boldred"]/text()')) for var in desiredvars)
print(myresultset)
Answer 0 (score: 1)
To produce this dictionary:
{'City:': 'NYC',
'Zip:': '10022',
'Street 1:': '2100 5th Ave',
'Country:': 'USA',
'State:': 'NY',
'Street 2:': 'Ste 202'}
you can use the code below. It is then easy to query the dictionary for the values you want:
import lxml.html as lh
test = '''<tr>
<td></td>
<td colspan="2">
Street 1:<span class="required"> *</span><br />
<span class="boldred">2100 5th Ave</span>
</td>
<td colspan="2">
Street 2:<br />
<span class="boldred">Ste 202</span>
</td>
</tr>
<tr>
<td></td>
<td>
City:<span class="required"> *</span><br />
<span class="boldred">NYC</span>
</td>
<td>
State:<br />
<SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
</td>
<td>
Country:<span class="required"> *</span><br />
<SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">USA</SPAN>
</td>
<td>
Zip:<br />
<span class="boldred">10022</span>
</td>
</tr>'''
outhtml = lh.fromstring(test)
ks = [ k.strip() for k in outhtml.xpath('//tr/td/text()') if k.strip() != '' ]
vs = outhtml.xpath('//tr/td/span[@class="boldred"]/text()')
result = dict( zip(ks,vs) )
print(result)
Answer 1 (score: 0)
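One caveat with zipping two separate lists: the labels and values only line up while every labelled cell contains exactly one boldred span. A minimal sketch of a per-cell variant follows, assuming the label is always the text before the first child element of the <td>. The names html, root and label are mine, and the sample is a trimmed copy of the table with the Street 2 value removed on purpose to show the None case.

```python
import lxml.html as lh

# Trimmed-down copy of the sample table; the Street 2 cell has no value
# on purpose, to show how a missing span is handled (an assumption made
# for illustration, not part of the original test.htm).
html = '''<table>
<tr>
<td></td>
<td>Street 1:<span class="required"> *</span><br />
<span class="boldred">2100 5th Ave</span></td>
<td>Street 2:<br /></td>
<td>City:<br /><span class="boldred">NYC</span></td>
</tr>
</table>'''

root = lh.fromstring(html)

result = {}
for td in root.xpath('//tr/td'):
    # td.text is the text before the first child element, e.g. "Street 1:".
    label = (td.text or '').strip()
    if not label:
        continue  # skip empty spacer cells
    # Look for the value inside this cell only, so label and value stay paired.
    values = td.xpath('span[@class="boldred"]/text()')
    result[label] = values[0] if values else None

print(result)
```

Because each value is looked up relative to its own cell, a missing span gives None for that key instead of shifting every later value by one.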
lxml_tempsofsol.py :
import lxml.html as lh
desiredvars = ['Street 1','Street 2','City', 'State', 'Zip']
doc=open('test.htm', 'r')
outhtml=lh.parse(doc)
doc.close()
myresultset = ((var, outhtml.xpath('//tr/td[contains(text(), "%s")]/span[@class="boldred"]/text()'%(var))[0]) for var in desiredvars)
for each in myresultset:
    print(each)
Output
$ python lxml_tempsofsol.py
('Street 1', '2100 5th Ave')
('Street 2', 'Ste 202')
('City', 'NYC')
('State', 'NY')
('Zip', '10022')
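Note that indexing with [0] raises IndexError as soon as one of the desired variables has no matching cell or no value. Since the question asks for a tuple of tuples that handles None values, here is a rough sketch of one way to do that. The helper first_or_none, the reduced sample row and the extra 'Fax' variable are my own assumptions for illustration.

```python
import lxml.html as lh

# Reduced sample row: Zip has a label but no value, and 'Fax' (a made-up
# variable) does not appear in the table at all.
html = '''<table><tr>
<td>City:<br /><span class="boldred">NYC</span></td>
<td>State:<br /><span class="boldred2"></span><br /><span class="boldred">NY</span></td>
<td>Zip:<br /></td>
</tr></table>'''
root = lh.fromstring(html)

desiredvars = ['City', 'State', 'Zip', 'Fax']


def first_or_none(matches):
    """Return the first XPath match, or None when nothing matched."""
    return matches[0] if matches else None


myresultset = tuple(
    (var, first_or_none(root.xpath(
        '//tr/td[contains(text(), "%s")]/span[@class="boldred"]/text()' % var)))
    for var in desiredvars
)

for each in myresultset:
    print(each)
```

Wrapping the generator in tuple() also gives a real tuple of tuples, which can be handed to a database call more than once.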
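Answer 2 (score: 0)
I searched for the same thing and found that your question has no "correct" answer, so I'll add a few points: child::* is wrong because you are searching for text directly inside the <td/>; text() already selects the text child nodes. Also, a bare var inside the XPath string is not your Python variable, so the value has to be passed in as an XPath variable ($var). With that in mind, your corrected code looks like this: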
import lxml.html as lh
desiredvars = ['Street 1','Street 2','City', 'State', 'Zip']
doc=open('test.htm', 'r')
outhtml=lh.parse(doc)
doc.close()
myresultset = [(var, outhtml.xpath('//tr/td[contains(text(), $var)]/span[@class="boldred"]/text()', var=var)) for var in desiredvars]
print(myresultset)
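Since the question asks for cells that start with a variable, contains() can be swapped for starts-with(). The sketch below is only one possible variant: it assumes the label is the first text node of the <td> and uses normalize-space() to drop the leading newline. The names html, root and query are mine, and the extra prefix 'Street' is added just to show that a prefix can match more than one cell.

```python
import lxml.html as lh

# Two cells copied from the sample table, wrapped in <table> for parsing.
html = '''<table><tr>
<td>
Street 1:<span class="required"> *</span><br />
<span class="boldred">2100 5th Ave</span>
</td>
<td>
Street 2:<br />
<span class="boldred">Ste 202</span>
</td>
</tr></table>'''
root = lh.fromstring(html)

# text() in a string context uses the first text node of the <td>;
# normalize-space() trims the surrounding whitespace before the prefix test.
query = ('//tr/td[starts-with(normalize-space(text()), $var)]'
         '/span[@class="boldred"]/text()')

desiredvars = ['Street 1', 'Street 2', 'Street']
# The Python value is bound to $var through a keyword argument, so no
# string formatting or manual quoting is needed.
myresultset = [(var, root.xpath(query, var=var)) for var in desiredvars]
print(myresultset)
```

Each entry holds a list of matches, so a broad prefix such as 'Street' returns both values in document order.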