Question

The example below reflects data similar to what I'm using (I can't show my live data due to company policy). It is pulled from this answer and this answer.

My goal is to extract the text of each <a> element as well as the link itself.

from lxml import html

post1 = """<p><code>Integer.parseInt</code> <em>couldn't</em> do the job, unless you were happy to lose data. Think about what you're asking for here.</p>&#xA;&#xA;<p>Try <a href="http://docs.oracle.com/javase/7/docs/api/java/lang/Long.html#parseLong%28java.lang.String%29"><code>Long.parseLong(String)</code></a> or <a href="http://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html#BigInteger%28java.lang.String%29"><code>new BigInteger(String)</code></a> for really big integers.</p>&#xA;
"""

post2 = """
<p><code>Integer.parseInt</code> <em>couldn't</em> do the job, unless you were happy to lose data. Think about what you're asking for here.</p>&#xA;&#xA;<p>Try <a href="http://docs.oracle.com/javase/7/docs/api/java/lang/Long.html#parseLong%28java.lang.String%29"><code>Long.parseLong(String)</code></a> or <a href="http://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html#BigInteger%28java.lang.String%29"><code>new BigInteger(String)</code></a> for really big integers.</p>&#xA;
"""
doc = html.fromstring(post1)
for link in doc.xpath('//a'):
    print link.text, link.get('href')

Unfortunately, this returns the following:

None http://docs.oracle.com/javase/7/docs/api/java/lang/Long.html#parseLong%28java.lang.String%29
None http://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html#BigInteger%28java.lang.String%29

Note that my link.text is empty (None). This is because the link wraps a <code> block.

If I use post2 instead, the correct result is returned:
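
The behaviour can be reproduced in isolation. In lxml's element model, `.text` holds only the text that appears before an element's first child; text inside a child belongs to that child. The sketch below uses a made-up fragment (the example.com URLs are placeholders, not taken from the posts above) and is written with Python 3 `print()` rather than the Python 2 print statement used in the question.

```python
from lxml import html

# Hypothetical fragment: the first link wraps a <code> element,
# the second link contains plain text only.
fragment = (
    '<p><a href="http://example.com/wrapped"><code>wrapped()</code></a> and '
    '<a href="http://example.com/plain">plain</a></p>'
)

doc = html.fromstring(fragment)
wrapped, plain = doc.xpath('//a')

# .text is the text before the first child element; here there is none.
print(repr(wrapped.text))      # None
# The visible text lives on the nested <code> child instead.
print(repr(wrapped[0].text))   # 'wrapped()'
# With no child elements, .text is the whole link text.
print(repr(plain.text))        # 'plain'
```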

PROJ.4 http://trac.osgeo.org/proj/
OpenSceneGraph http://www.openscenegraph.org/

How can I modify the loop to handle both plain links (post2) and links that wrap other elements (post1)?

Answer 1

Change

print link.text, link.get('href')

to

print link.text_content(), link.get('href')

Then your output will be

Long.parseLong(String) http://docs.oracle.com/javase/7/docs/api/java/lang/Long.html#parseLong%28java.lang.String%29
new BigInteger(String) http://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html#BigInteger%28java.lang.String%29

for both post1 and post2, as requested.
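
Put together as a complete loop, the change might look like the sketch below. It uses a shortened stand-in for post1 (the same two links, with the surrounding prose trimmed) and Python 3 `print()`; the helper name `extract_links` is made up for illustration. `text_content()` is available because the elements come from `lxml.html`; `link.xpath('string()')` is an XPath way to get the same concatenated text.

```python
from lxml import html

# Shortened stand-in for post1: the same two links, each wrapping a <code> element.
post = (
    '<p>Try <a href="http://docs.oracle.com/javase/7/docs/api/java/lang/Long.html'
    '#parseLong%28java.lang.String%29"><code>Long.parseLong(String)</code></a> or '
    '<a href="http://docs.oracle.com/javase/7/docs/api/java/math/BigInteger.html'
    '#BigInteger%28java.lang.String%29"><code>new BigInteger(String)</code></a> '
    'for really big integers.</p>'
)


def extract_links(markup):
    """Return (text, href) pairs for every <a> element in an HTML fragment."""
    doc = html.fromstring(markup)
    # text_content() joins the text of the element and all of its descendants,
    # so it works whether or not the link wraps other elements.
    return [(link.text_content(), link.get('href')) for link in doc.xpath('//a')]


links = extract_links(post)
for text, href in links:
    print(text, href)

# XPath alternative: string() also yields the concatenated descendant text.
doc = html.fromstring(post)
xpath_texts = [link.xpath('string()') for link in doc.xpath('//a')]
```

For a link that contains only plain text, `text_content()` returns the same string as `.text`, so a single loop covers both kinds of link.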
