I'm learning XPath and want to extract the URLs embedded in the HTML below. I've tried variations of @"//table[contains(@option, 'value')]" without success.
<body>
<div id="Wrapper">
<div id="header">
<span id="logoHolder">
<a href="http://www.foo.com">
<img src="/templates/blank_j15/images/nexus_logo.png" width="167" height="65" border="0"/>
</a>
</span>
<span style="float: left; padding-top: 27px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; color: rgb(0, 182, 222); ">Embracing Diversity. Challenging Minds.</span>
<span id="searchHolder">
<div style="clear: both; "/>
<div id="IE_P_space"/>
<div id="arttotalmenucontent_138" class="hidden">
<script type="text/javascript">
<table cellspacing="0" cellpadding="0" border="0" width="100%" id="wrapper_cont_table">
<tbody>
<tr>
<tr>
<tr>
<td valign="top" id="wrapper_cont_leftNav">
<div class="leftnavCont">
<p>
<select onchange="nl(this.value)" size="8">
<option value="/images/download/newsletter/connect04_300911.pdf">Connect 04: 30/09/2011</option>
<option value="/images/download/newsletter/connect03_230911.pdf">Connect 03: 23/09/2011</option>
<option value="/images/download/newsletter/connect02_150911.pdf">Connect 02: 15/09/2011</option>
<option value="/images/download/newsletter/connect01_120911.pdf">Connect 01: 12/09/2011</option>
</p>
Answer 0 (score: 1)
//p/select/option/@value
seems to work for me.
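A minimal sketch of why the expression in the question finds nothing while the one above does. It assumes lxml is installed and uses a trimmed-down copy of the markup from the question; the variable names are illustrative only. `@option` asks for an attribute named `option` on the `<table>` itself, which does not exist, whereas the URLs live in the `value` attribute of the `<option>` elements.

```python
from lxml import etree

# Trimmed-down copy of the markup from the question (only the relevant nodes).
html = """
<table id="wrapper_cont_table"><tr><td>
<p>
<select size="8">
<option value="/images/download/newsletter/connect04_300911.pdf">Connect 04: 30/09/2011</option>
<option value="/images/download/newsletter/connect03_230911.pdf">Connect 03: 23/09/2011</option>
</select>
</p>
</td></tr></table>
"""

doc = etree.HTML(html)

# The question's expression: no <table> has an attribute called "option",
# so contains() is evaluated against an empty string and nothing matches.
original_matches = doc.xpath("//table[contains(@option, 'value')]")

# Select the value attribute of every <option> under <p>/<select>.
values = doc.xpath("//p/select/option/@value")

# Same result, but anchored on the table as the question attempted.
anchored = doc.xpath("//table[@id='wrapper_cont_table']//option/@value")

print(original_matches)
print(values)
print(anchored)
```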
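I suspect the problem lies with the XPath library you're using. It took me a long time to track down the source of your sample.
Here is a working example using my preferred XML library, lxml.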
#!/usr/bin/env python3
import os
from urllib.request import urlopen

from lxml import etree

filename = 'sample.html'
url = 'http://www.foo.example/index.php?option=com_content&view=article&id=186&Itemid=301'

# Some simple caching for a test script: reuse the local copy if it exists.
if os.path.exists(filename):
    with open(filename, 'rb') as f:
        data = f.read()
else:
    data = urlopen(url).read()
    with open(filename, 'wb') as f:
        f.write(data)

# Parse with lxml's lenient HTML parser and print each option's value attribute.
doc = etree.HTML(data)
for v in doc.xpath('//p/select/option/@value'):
    print(v)
Output:
/images/download/newsletter/connect04_300911.pdf
/images/download/newsletter/connect03_230911.pdf
/images/download/newsletter/connect02_150911.pdf
/images/download/newsletter/connect01_120911.pdf
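The extracted values are site-relative paths, so they still need to be joined to the site's address to become full URLs. A minimal sketch using the standard library's `urllib.parse.urljoin`; the base address is an assumption taken from the logo link in the question's HTML, so substitute the real host.

```python
from urllib.parse import urljoin

# Assumption: site root taken from the <a href> in the question's markup.
base_url = "http://www.foo.com/"

# Two of the values returned by the XPath query.
values = [
    "/images/download/newsletter/connect04_300911.pdf",
    "/images/download/newsletter/connect03_230911.pdf",
]

# urljoin resolves each path against the base, as a browser would.
absolute = [urljoin(base_url, v) for v in values]

for link in absolute:
    print(link)
```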