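I downloaded my Foursquare data, which comes in KML format. I'm parsing it as an XML file with Python, and I can't figure out how to get the text between the closing a tag and the closing description tag. (It's the text I typed in when I checked in; in the example below it is "- FINALLY HERE!! With Sonya and co", hyphen included.)
Here's a sample of the data.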
<Placemark>
<name>hummus grill</name>
<description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>
<updated>Tue, 24 Jan 12 17:14:00 +0000</updated>
<published>Tue, 24 Jan 12 17:14:00 +0000</published>
<visibility>1</visibility>
<Point>
<extrude>1</extrude>
<altitudeMode>relativeToGround</altitudeMode>
<coordinates>-75.20104383595685,39.9528387056977</coordinates>
</Point>
</Placemark>
So far I've been able to get the latitude/longitude, published date, name, and link with code similar to this:
latitudes = []
longitudes = []
for d in dom.getElementsByTagName('coordinates'):
    # Break them up into latitude and longitude (KML stores longitude first)
    coords = d.firstChild.data.split(',')
    longitudes.append(float(coords[0]))
    latitudes.append(float(coords[1]))
I tried this (the data begins with the header shown after the code, which I haven't figured out how to deal with yet):
description = []
for d in dom.getElementsByTagName('description'):
    description.append(d.firstChild.data.encode('utf-8'))
<?xml version="1.0" encoding="UTF-8"?>
<kml><Folder><name>foursquare checkin history </name><description>foursquare checkin history </description>:
I then tried accessing it via d.firstChild.nextSibling.firstChild.data.encode('utf-8'), but that only gives me "hummus grill", which I'm assuming is the text between the a tags (rather than the name tag).
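One way to see why that expression returns the link text is to list the child nodes of the description element. The following is a minimal sketch, assuming a trimmed copy of the sample Placemark above; SAMPLE, child_names, link_text and comment are illustrative names, not part of the original code.

```python
from xml.dom.minidom import parseString

# Trimmed copy of the sample Placemark (assumption: only the description matters here).
SAMPLE = (
    '<Placemark><description>@'
    '<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">'
    'hummus grill</a>- FINALLY HERE!! With Sonya and co'
    '</description></Placemark>'
)

dom = parseString(SAMPLE)
desc = dom.getElementsByTagName('description')[0]

# The description has three children: a text node, the <a> element, a text node.
child_names = [child.nodeName for child in desc.childNodes]
print(child_names)

# firstChild is the "@" text, its nextSibling is <a>, and <a>'s firstChild
# is the link text, which is why the expression yields "hummus grill".
link_text = desc.firstChild.nextSibling.firstChild.data
print(link_text)

# The check-in comment is the text node that follows the <a> element.
link = desc.getElementsByTagName('a')[0]
comment = link.nextSibling.data
print(comment)
```

Seen this way, the comment is a sibling of the a element rather than one of its children, so one more nextSibling step (instead of firstChild) reaches it.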
Answer 0 (score: 0)
Have you tried using substrings?
Let's say all of your XML is in a variable named "foo", for example.
foo = '<description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>'
You can extract this data by printing the following.
foo[foo.index('</a>')+4:foo.index('</description>')]
This gives you the text you are after.
- FINALLY HERE!! With Sonya and co
Just read up on substrings and you'll be able to manipulate text much more easily.
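A slightly more defensive sketch of the same substring idea follows. The helper name text_after_link is hypothetical, and it assumes each fragment contains at most one link. Because it treats the XML as a plain string, escaped entities such as &amp;amp; would be returned unescaped only by a real parser, not by this approach.

```python
def text_after_link(fragment):
    """Return the text between '</a>' and '</description>', or None if either is missing."""
    # str.partition splits on the first occurrence and never raises,
    # unlike str.index, which raises ValueError when the marker is absent.
    _, found_a, rest = fragment.partition('</a>')
    if not found_a:
        return None
    text, found_desc, _ = rest.partition('</description>')
    if not found_desc:
        return None
    return text


foo = '<description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>'
print(text_after_link(foo))

# A description without a link (such as the Folder header) yields None.
print(text_after_link('<description>foursquare checkin history </description>'))
```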
Answer 1 (score: 0)
The following works for me:

In [44]: description = []

In [45]: for d in dom.getElementsByTagName('description'):
   ....:     description.append(d.firstChild.nextSibling.nextSibling.data.encode('utf-8'))
   ....:

In [46]: description
Out[46]: ['- FINALLY HERE!! With Sonya and co']

Or, if you want the entire text inside the description tag:

from xml.dom.minidom import parseString

def getText(node, recursive=False):
    """
    Get all the text associated with this node.
    With recursive == True, all text from child nodes is retrieved
    """
    L = ['']
    for n in node.childNodes:
        # Text and CDATA children contribute their data directly
        if n.nodeType in (n.TEXT_NODE, n.CDATA_SECTION_NODE):
            L.append(n.data)
        else:
            # Any other child: give up unless recursion was requested
            if not recursive:
                return None
            L.append(getText(n, recursive=True))
    return ''.join(L)

dom = parseString("""<Placemark>
<name>hummus grill</name>
<description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>
<updated>Tue, 24 Jan 12 17:14:00 +0000</updated>
<published>Tue, 24 Jan 12 17:14:00 +0000</published>
<visibility>1</visibility>
<Point>
<extrude>1</extrude>
<altitudeMode>relativeToGround</altitudeMode>
<coordinates>-75.20104383595685,39.9528387056977</coordinates>
</Point>
</Placemark>""")

description = []
for d in dom.getElementsByTagName('description'):
    description.append(getText(d, recursive=True))
print(description)

Under Python 3, this will print:

['@hummus grill- FINALLY HERE!! With Sonya and co']
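The question also mentions a Folder-level description at the top of the file that has no link inside it, so the nextSibling chain would fail there. Below is a rough sketch of one way to skip it, assuming check-ins always live inside Placemark elements. The function name checkin_comments and the shortened KML string are illustrative, not taken from the real export.

```python
from xml.dom.minidom import parseString

# Shortened stand-in for the real file: the Folder header plus one Placemark.
KML = """<?xml version="1.0" encoding="UTF-8"?>
<kml><Folder><name>foursquare checkin history </name><description>foursquare checkin history </description>
<Placemark>
<name>hummus grill</name>
<description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>
</Placemark>
</Folder></kml>"""


def checkin_comments(dom):
    """Collect the text that follows the <a> link in each Placemark description."""
    comments = []
    # Restricting the search to Placemark elements skips the Folder description.
    for placemark in dom.getElementsByTagName('Placemark'):
        for d in placemark.getElementsByTagName('description'):
            links = d.getElementsByTagName('a')
            tail = links[0].nextSibling if links else None
            # Only keep the sibling if it really is a text node.
            if tail is not None and tail.nodeType == tail.TEXT_NODE:
                comments.append(tail.data)
            else:
                comments.append('')
    return comments


dom = parseString(KML)
comments = checkin_comments(dom)
print(comments)
```

Check-ins with no typed comment come back as empty strings, so the list stays aligned with the latitude and longitude lists built per Placemark.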