Get the text between two closing tags in XML - Python

Asked: 2016-08-22 01:52:58

Tags: python xml

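I downloaded my Foursquare data, which comes in KML format. I'm parsing it as XML with Python, and I can't figure out how to get the text between the closing </a> tag and the closing </description> tag. (This is the text I typed in when checking in; in the example below it is "- FINALLY HERE!! With Sonya and co", including the hyphen.)

Here is a sample of the data.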

<Placemark>
  <name>hummus grill</name>
  <description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>
  <updated>Tue, 24 Jan 12 17:14:00 +0000</updated>
  <published>Tue, 24 Jan 12 17:14:00 +0000</published>
  <visibility>1</visibility>
  <Point>
    <extrude>1</extrude>
    <altitudeMode>relativeToGround</altitudeMode>
    <coordinates>-75.20104383595685,39.9528387056977</coordinates>
  </Point>
</Placemark>

So far I've been able to get the latitude/longitude, published date, name, and link with code similar to this:

latitudes = []
longitudes = []

for d in dom.getElementsByTagName('coordinates'):
    #Break them up into latitude and longitude
    coords = d.firstChild.data.split(',')
    longitudes.append(float(coords[0]))
    latitudes.append(float(coords[1]))

I tried the following (the data also starts with the header shown below, and I haven't figured out how to deal with that yet):

for d in dom.getElementsByTagName('description'):
    description.append(d.firstChild.data.encode('utf-8'))

<?xml version="1.0" encoding="UTF-8"?>
<kml><Folder><name>foursquare checkin history </name><description>foursquare checkin history </description>

I then accessed it with d.firstChild.nextSibling.firstChild.data.encode('utf-8'), but that only gives me "hummus grill", which I'm assuming is the text between the a tags (rather than the text from the name tag).

2 answers:

Answer 0 (score: 0)

Have you tried using substrings?

Let's say all of your XML is in a variable "foo", for example.

foo = '<description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>'

You can extract this data by printing the following.

foo[foo.index('</a>')+4:foo.index('</description>')]

This gives you what you want.

- FINALLY HERE!! With Sonya and co

Once you read the text as a plain substring, it is easier to manipulate.
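As a minimal sketch of the substring approach in this answer, the slice can be wrapped in a small helper that returns None when either marker is missing, instead of raising ValueError the way str.index does. The helper name text_after_link is hypothetical, and the sketch assumes each description holds at most one </a>; a description with no link, such as the Folder-level header in the question, yields None.

```python
def text_after_link(xml_fragment, open_marker='</a>', close_marker='</description>'):
    """Return the text between open_marker and close_marker, or None.

    Hypothetical helper: this is a plain string search, not an XML parse.
    """
    start = xml_fragment.find(open_marker)
    if start == -1:
        return None
    start += len(open_marker)
    end = xml_fragment.find(close_marker, start)
    if end == -1:
        return None
    return xml_fragment[start:end]


foo = ('<description>@<a href="https://foursquare.com/v/hummus-grill/'
       '4aab4f71f964a520625920e3">hummus grill</a>'
       '- FINALLY HERE!! With Sonya and co</description>')

print(text_after_link(foo))
```

Because this is string slicing rather than parsing, XML entities such as &amp; stay encoded in the result.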

Answer 1 (score: 0)

The following works for me:

In [44]: description = []

In [45]: for d in dom.getElementsByTagName('description'):
   ....:     description.append(d.firstChild.nextSibling.nextSibling.data.encode('utf-8'))
   ....:     

In [46]: description
Out[46]: ['- FINALLY HERE!! With Sonya and co']

Or, if you want the entire text inside the description tag:

from xml.dom import Node
from xml.dom.minidom import parseString

def getText(node, recursive=False):
    """
    Get all the text associated with this node.
    With recursive == True, all text from child nodes is retrieved.
    """
    L = ['']
    for n in node.childNodes:
        if n.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            L.append(n.data)
        else:
            if not recursive:
                return None
            # Pass the flag down so that nested elements are handled too
            L.append(getText(n, recursive))
    return ''.join(L)

dom = parseString("""<Placemark>
  <name>hummus grill</name>
  <description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>
  <updated>Tue, 24 Jan 12 17:14:00 +0000</updated>
  <published>Tue, 24 Jan 12 17:14:00 +0000</published>
  <visibility>1</visibility>
  <Point>
    <extrude>1</extrude>
    <altitudeMode>relativeToGround</altitudeMode>
    <coordinates>-75.20104383595685,39.9528387056977</coordinates>
  </Point>
</Placemark>""")

description = []
for d in dom.getElementsByTagName('description'):
    description.append(getText(d, recursive=True))

print(description)

This will print a list whose single item is the full description text:

@hummus grill- FINALLY HERE!! With Sonya and co
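The sibling chain d.firstChild.nextSibling.nextSibling used in this answer assumes every description has exactly the shape text, <a>, text. The Folder-level description from the question's header has no <a> child, so the chain would fail there. A minimal sketch of a more tolerant lookup, staying with minidom, might look like the following. The helper name text_after_first_link is hypothetical, and the sample document is assembled from the question's header and Placemark with the closing tags added as an assumption.

```python
from xml.dom import Node
from xml.dom.minidom import parseString


def text_after_first_link(description):
    """Join the text nodes that follow the first <a> child, or return None.

    Hypothetical helper; assumes the check-in comment is the text that
    comes after the link inside <description>.
    """
    seen_link = False
    parts = []
    for child in description.childNodes:
        if child.nodeType == Node.ELEMENT_NODE and child.tagName == 'a':
            seen_link = True
        elif seen_link and child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            parts.append(child.data)
    return ''.join(parts) if seen_link else None


# Sample assembled for illustration; closing tags are an assumption.
sample = parseString(
    '<kml><Folder><name>foursquare checkin history </name>'
    '<description>foursquare checkin history </description>'
    '<Placemark><name>hummus grill</name>'
    '<description>@<a href="https://foursquare.com/v/hummus-grill/'
    '4aab4f71f964a520625920e3">hummus grill</a>'
    '- FINALLY HERE!! With Sonya and co</description>'
    '</Placemark></Folder></kml>')

comments = []
for d in sample.getElementsByTagName('description'):
    text = text_after_first_link(d)
    if text is not None:  # skips the Folder-level header description
        comments.append(text)

print(comments)
```

Only the Placemark description contributes an entry here; the header description is skipped because it contains no link.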