Question

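I am running xpath to filter an XML feed for elements with the "item" tag. From the resulting list I take the first result and run xpath on it to filter for the "title" tag. However, when I filter for "title", I get a title from the XML that is not inside an "item" tag. Since I am running the xpath on the "item" result set, this behavior is unexpected. Can anyone tell me what is going on here?

See the following code using xpath.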

from urllib.request import urlopen
from lxml import etree
url = 'https://www.sec.gov/Archives/edgar/monthly/xbrlrss-2018-02.xml'
data = urlopen(url)
xml = data.read()
parser = etree.XMLParser(remove_blank_text=True, huge_tree=True)
root = etree.XML(xml, parser=parser)
items = root.xpath("//item")
first_item = items[0]
title = first_item.xpath("//title")[0].text
print(title)
#'All XBRL Data Submitted to the SEC for 2018-02'

I expect the first item to be the following:

<item>
<title>DST SYSTEMS INC (0000714603) (Filer)</title>
<link>http://www.sec.gov/Archives/edgar/data/714603/000071460318000013/0000714603-18-000013-index.htm</link>
<guid>http://www.sec.gov/Archives/edgar/data/714603/000071460318000013/0000714603-18-000013-xbrl.zip</guid>
<enclosure url="http://www.sec.gov/Archives/edgar/data/714603/000071460318000013/0000714603-18-000013-xbrl.zip" length="470442" type="application/zip" />
<description>10-K</description>
<pubDate>Wed, 28 Feb 2018 17:29:39 EST</pubDate>
<edgar:xbrlFiling xmlns:edgar="http://www.sec.gov/Archives/edgar"></item>

Instead, when I run title = first_item.xpath("//title")[0].text, I get the title 'All XBRL Data Submitted to the SEC for 2018-02'.

That title comes from:

<channel>
<title>All XBRL Data Submitted to the SEC for 2018-02</title>
<link>http://www.sec.gov/spotlight/xbrl/filings-and-feeds.shtml</link>
<atom:link xmlns:atom="http://www.w3.org/2005/Atom" href="http://www.sec.gov/Archives/edgar/monthly/xbrlrss-2018-02.xml" rel="self" type="application/rss+xml" />
<description>This is a list all of the filings containing XBRL for 2018-02</description>
<language>en-us</language>
<pubDate>Wed, 28 Feb 2018 00:00:00 EST</pubDate>
<lastBuildDate>Wed, 28 Feb 2018 00:00:00 EST</lastBuildDate>

But I ran the xpath on the item, which itself came from root.xpath("//item"). I don't understand why I am not getting the expected result 'DST SYSTEMS INC (0000714603) (Filer)'.

Answer 1

Instead of:

title = first_item.xpath("//title")[0].text

Use:

title = first_item.xpath("title")[0].text

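The difference is the "//" before "title".

The reason is that "//title" selects all title elements, no matter where they are in the document, even when the query is run on a single element. Using just "title" selects only the children of the current node that are named "title".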
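To make the difference concrete, here is a minimal sketch that runs the absolute and relative forms side by side on the same element. It assumes lxml is installed and uses a small inline document (the SAMPLE constant is a hypothetical, trimmed-down stand-in for the SEC feed, so no network access is needed). The ".//title" form is not mentioned in the answer; it is shown only as the relative counterpart of "//title" for cases where the title is not a direct child.

```python
from lxml import etree

# Hypothetical, trimmed-down stand-in for the SEC feed (not the real file).
SAMPLE = b"""<rss version="2.0">
  <channel>
    <title>All XBRL Data Submitted to the SEC for 2018-02</title>
    <item>
      <title>DST SYSTEMS INC (0000714603) (Filer)</title>
      <description>10-K</description>
    </item>
  </channel>
</rss>"""

root = etree.XML(SAMPLE)
first_item = root.xpath("//item")[0]

# A leading "//" starts at the document root, even when called on an element,
# so the first match is the channel-level title.
absolute_title = first_item.xpath("//title")[0].text

# A bare name is relative: only direct children of first_item are matched.
child_title = first_item.xpath("title")[0].text

# ".//" is also relative: any descendant of first_item is matched.
descendant_title = first_item.xpath(".//title")[0].text

print(absolute_title)
print(child_title)
print(descendant_title)
```

The same relative form works when looping over every item, for example [item.xpath("title")[0].text for item in root.xpath("//item")].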

xpath on an etree element produces unexpected results
