Question

I want to parse XML from a website. Can anyone help me?

Here is the XML; I only want to get the information out of it.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
<loc>
http://www.habergazete.com/haber-detay/1/69364/cAYKUR-3-bin-500-personel-alimi-yapacagini-duyurdu-cAYKUR-3-bin-500-personel-alim-sarlari--2015-01-29.html
</loc>
<news:news>
<news:publication>
<news:name>Haber Gazete</news:name>
<news:language>tr</news:language>
</news:publication>
<news:publication_date>2015-01-29T15:04:01+02:00</news:publication_date>
<news:title>
ÇAYKUR 3 bin 500 personel alımı yapacağını duyurdu! (ÇAYKUR 3 bin 500 personel alım şarları)
</news:title>
</news:news>
<image:image>
<image:loc>
http://www.habergazete.com/resimler/haber/haber_detay/611x395-alim-54c8f335b176e-1422536816.jpg
</image:loc>
</image:image>
</url>

I tried to parse it with this code, but it gives null:

from http import client
import xml.etree.ElementTree as ET

conn = client.HTTPConnection("www.habergazete.com")
conn.request("GET", "/sitemaps/1/haberler.xml")
response = conn.getresponse()
xmlData = response.read()
conn.close()
root = ET.fromstring(xmlData)
print(root.findall("loc"))

Any suggestions?

Thanks :)

Answer 1

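First, the XML you show is not well-formed, so parsing it should raise an exception: it is missing the final closing '</urlset>'. I suspect you are not showing us the actual XML you are trying to parse.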
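
To see what that exception looks like, here is a minimal sketch. The shortened `truncated` string and the example.com URL are made-up stand-ins for the XML in the question, not the real sitemap:

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the question's XML: <urlset> is never closed.
truncated = '<urlset><url><loc>http://example.com/a.html</loc></url>'

try:
    ET.fromstring(truncated)
    parsed_ok = True
except ET.ParseError as exc:
    # ElementTree refuses input that is not well-formed.
    parsed_ok = False
    print("not well-formed:", exc)

# Appending the missing end tag makes the same data parseable.
root = ET.fromstring(truncated + '</urlset>')
print(root.tag)
```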

Once you fix that (for example by parsing xmlData + b'</urlset>', if the XML data really is being truncated somehow), you run into a namespace problem, which is easy to show:

>>> et.tostring(root)
b'<ns0:urlset xmlns:ns0="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:ns1="http://www.google.com/schemas/sitemap-news/0.9" xmlns:ns2="http://www.google.com/schemas/sitemap-image/1.1">\n<ns0:url>\n<ns0:loc>\nhttp://www.habergazete.com/haber-detay/1/69364/cAYKUR-3-bin-500-personel-alimi-yapacagini-duyurdu-cAYKUR-3-bin-500-personel-alim-sarlari--2015-01-29.html\n</ns0:loc>\n<ns1:news>\n<ns1:publication>\n<ns1:name>Haber Gazete</ns1:name>\n<ns1:language>tr</ns1:language>\n</ns1:publication>\n<ns1:publication_date>2015-01-29T15:04:01+02:00</ns1:publication_date>\n<ns1:title>\n&#199;AYKUR 3 bin 500 personel al&#305;m&#305; yapaca&#287;&#305;n&#305; duyurdu! (&#199;AYKUR 3 bin 500 personel al&#305;m &#351;arlar&#305;)\n</ns1:title>\n</ns1:news>\n<ns2:image>\n<ns2:loc>\nhttp://www.habergazete.com/resimler/haber/haber_detay/611x395-alim-54c8f335b176e-1422536816.jpg\n</ns2:loc>\n</ns2:image>\n</ns0:url></ns0:urlset>'

Yes, that is a very long string, but early on in it you will see:

<ns0:loc>

This shows that the loc you are looking for is actually marked as belonging to namespace 0 (the ns0: prefix).
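
Another way to spot this, as a rough sketch: print the `tag` of each element, since ElementTree stores the namespace URI in braces in front of the local name. The `sample` string below is a trimmed, hypothetical version of the sitemap, not the real data:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sitemap with only the default namespace.
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>http://example.com/a.html</loc></url>'
    '</urlset>'
)
root = ET.fromstring(sample)

# iter() walks the root and all of its descendants in document order.
tags = [elem.tag for elem in root.iter()]
for tag in tags:
    print(tag)
```

Every tag comes back in the `{uri}name` form, which is why a bare "loc" never matches.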

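Third, the docs at https://docs.python.org/2/library/xml.etree.elementtree.html explain this carefully, and I quote:

Element.findall() finds only elements with a tag which are direct children of the current element.

The emphasis is mine: with a plain tag name you only find tags that are direct children of urlset, not tags of general descendants (children of children, and so on).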
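
The direct-children rule is easy to check in isolation. This is only a sketch on a made-up three-level document without namespaces, so that the nesting is the only thing in play:

```python
import xml.etree.ElementTree as ET

# Hypothetical three-level document without namespaces, to isolate the effect.
root = ET.fromstring('<a><b><c/></b></a>')

# b is a direct child of a, so a plain tag name finds it.
children = root.findall('b')

# c is a grandchild of a, so a plain tag name misses it.
grandchildren = root.findall('c')

print(len(children), len(grandchildren))
```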

So, spell out the namespace and use a bit of XPath syntax to search recursively:

>>> root.findall('.//{http://www.sitemaps.org/schemas/sitemap/0.9}loc')
[<Element '{http://www.sitemaps.org/schemas/sitemap/0.9}loc' at 0x1022a50e8>]

...and you finally find the element you want.
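
If writing out the full URI every time gets tedious, `findall()` also accepts a prefix-to-URI mapping as its second argument. The following is a sketch under assumptions: the prefix `sm`, the trimmed `sample` document and the example.com URL are my own illustrative choices, not part of the original sitemap:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sitemap; the newlines around the URL mimic the question's XML.
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<url><loc>\nhttp://example.com/a.html\n</loc></url>'
    '</urlset>'
)
root = ET.fromstring(sample)

# "sm" is an arbitrary prefix chosen for this sketch; only the URI has to match.
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# The <loc> text is surrounded by newlines, so strip it before use.
urls = [loc.text.strip() for loc in root.findall('.//sm:loc', ns)]
print(urls)
```

The same mapping could carry the news and image namespaces from the question, so that paths like './/news:title' resolve the same way.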

BTW, some of us find BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/bs4/doc/) easier to use for XML parsing tasks when we don't need the extra speed of etree.
