The example below reflects data similar to what I'm using (I can't show my live data due to company policy). It is pulled from this answer and this answer.
My goal is to extract the text of each <a> element as well as the link itself.
from lxml import html
post1 = """<p><code>Integer.parseInt</code> <em>couldn't</em> do the job, unless you were happy to lose data. Think about what you're asking for here.</p>

<p>Try <a href="http://docs.oracle.com/javase/7/docs/api/java/lang/Long.html#parseLong%28java.lang.String%29"><code>Long.parseLong(String)</code></a> or <a href="http://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html#BigInteger%28java.lang.String%29"><code>new BigInteger(String)</code></a> for really big integers.</p>

"""
post2 = """
<p><code>Integer.parseInt</code> <em>couldn't</em> do the job, unless you were happy to lose data. Think about what you're asking for here.</p>

<p>Try <a href="http://docs.oracle.com/javase/7/docs/api/java/lang/Long.html#parseLong%28java.lang.String%29"><code>Long.parseLong(String)</code></a> or <a href="http://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html#BigInteger%28java.lang.String%29"><code>new BigInteger(String)</code></a> for really big integers.</p>

"""
doc = html.fromstring(post1)
for link in doc.xpath('//a'):
    print link.text, link.get('href')
Unfortunately, this returns the following:
None http://docs.oracle.com/javase/7/docs/api/java/lang/Long.html#parseLong%28java.lang.String%29
None http://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html#BigInteger%28java.lang.String%29
Notice that link.text is None. This is because each link wraps a <code> block.
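A minimal sketch of that behaviour, assuming a made-up one-link fragment (not taken from the posts above). It uses the standard library's xml.etree.ElementTree rather than lxml, on the assumption that lxml elements follow the same .text convention: .text holds only the text that comes before the first child element. It is written for Python 3, so print is a function here.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-link fragment, made up for illustration.
fragment = '<p>Try <a href="http://example.com/"><code>parseLong</code></a> now.</p>'

link = ET.fromstring(fragment).find("a")

# .text holds only the text that precedes the first child element.
direct_text = link.text                # nothing precedes <code> inside <a>
nested_text = link.find("code").text   # the visible text lives on the child
all_text = "".join(link.itertext())    # text of the element and its descendants

print(repr(direct_text), repr(nested_text), repr(all_text))
```

The link's own .text is empty because the visible text belongs to the nested <code> element, not to the <a> element itself.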
If I use post2, the correct result is returned:
PROJ.4 http://trac.osgeo.org/proj/
OpenSceneGraph http://www.openscenegraph.org/
How can I modify the loop to handle both standard links (post2) and links that contain other elements (post1)?
Answer 0 (score: 1)
Change
print link.text, link.get('href')
to
print link.text_content(), link.get('href')
Then your output will be
Long.parseLong(String) http://docs.oracle.com/javase/7/docs/api/java/lang/Long.html#parseLong%28java.lang.String%29
new BigInteger(String) http://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html#BigInteger%28java.lang.String%29
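This handles both post1 and post2, as requested, because text_content() returns the text of the element together with the text of all its descendants.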
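As a rough sketch of the same idea in one place, the snippet below runs text_content() over a plain link and a link that wraps a <code> element. The fragment, its example.com URLs, and the extract_links helper are hypothetical and only for illustration. It assumes lxml is installed and uses Python 3's print function rather than the Python 2 print statement shown above.

```python
from lxml import html

# Hypothetical fragment mixing a plain link and a link that wraps <code>.
snippet = """
<p>See <a href="http://example.com/plain">plain text</a> or
<a href="http://example.com/nested"><code>nested()</code></a>.</p>
"""


def extract_links(markup):
    """Return (text, href) pairs for every <a> element in the markup."""
    doc = html.fromstring(markup)
    # text_content() gathers the text of the element and all its descendants,
    # so it works whether or not the link wraps another element.
    return [(a.text_content().strip(), a.get("href")) for a in doc.xpath("//a")]


links = extract_links(snippet)
for text, href in links:
    print(text, href)
```

The strip() call is an optional extra that trims whitespace around the link text; drop it if leading or trailing spaces matter.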