Question

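I downloaded my Foursquare data, which comes in KML format. I am parsing it as an XML file with Python and can't figure out how to get the text between the closing </a> tag and the closing </description> tag. (This is the text I typed when checking in; in the example below it is "FINALLY HERE!! With Sonya and co", together with the leading hyphen.)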

Here is an example of the data.

<Placemark>
  <name>hummus grill</name>
  <description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>
  <updated>Tue, 24 Jan 12 17:14:00 +0000</updated>
  <published>Tue, 24 Jan 12 17:14:00 +0000</published>
  <visibility>1</visibility>
  <Point>
    <extrude>1</extrude>
    <altitudeMode>relativeToGround</altitudeMode>
    <coordinates>-75.20104383595685,39.9528387056977</coordinates>
  </Point>
</Placemark>

So far I have been able to get the latitude/longitude, published date, name, and link with code similar to this:

latitudes = []
longitudes = []

for d in dom.getElementsByTagName('coordinates'):
    #Break them up into latitude and longitude
    coords = d.firstChild.data.split(',')
    longitudes.append(float(coords[0]))
    latitudes.append(float(coords[1]))

I tried this (the data also starts with the header shown below, and I haven't figured out how to deal with that yet):

for d in dom.getElementsByTagName('description'):
    description.append(d.firstChild.data.encode('utf-8'))

<?xml version="1.0" encoding="UTF-8"?>
<kml><Folder><name>foursquare checkin history </name><description>foursquare checkin history </description>:

I then tried accessing it via d.firstChild.nextSibling.firstChild.data.encode('utf-8'), but that only gives me "hummus grill", which I assume is the text between the a tags (rather than the text from the name tag).

Answer 1

Have you tried using substrings?

Let's say all of your XML is in a variable named "foo", for example.

foo = '<description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>'

You can extract the data by printing the following.

foo[foo.index('</a>')+4:foo.index('</description>')]

This gives you what you want.

- FINALLY HERE!! With Sonya and co

Read up on substrings and you will be able to manipulate the text more easily.
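The slicing above raises ValueError when either marker is missing. A minimal sketch of the same substring idea wrapped in a function, assuming each fragment holds at most one link; the name text_after_link is hypothetical, and str.partition is used in place of index so a missing marker yields None instead of an exception:

```python
def text_after_link(fragment):
    """Return the text between '</a>' and '</description>', or None if either marker is missing."""
    # partition returns an empty separator when the marker is not found
    _, sep, rest = fragment.partition('</a>')
    if not sep:
        return None
    text, sep, _ = rest.partition('</description>')
    if not sep:
        return None
    return text


foo = '<description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>'
print(text_after_link(foo))
```

Because this works on the raw string, XML entities such as &amp; stay undecoded; a parser-based approach avoids that.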

Answer 2

The following works for me:

In [44]: description = []

In [45]: for d in dom.getElementsByTagName('description'):
   ....:     description.append(d.firstChild.nextSibling.nextSibling.data.encode('utf-8'))
   ....:     

In [46]: description
Out[46]: ['- FINALLY HERE!! With Sonya and co']

Or, if you want the entire text in the description tag:

from xml.dom import Node
from xml.dom.minidom import parseString

def getText(node, recursive=False):
    """
    Get all the text associated with this node.
    With recursive == True, all text from child nodes is retrieved.
    """
    L = ['']
    for n in node.childNodes:
        if n.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            L.append(n.data)
        else:
            if not recursive:
                return None
            # Pass the flag along so deeper levels are collected too
            L.append(getText(n, recursive))
    return ''.join(L)

dom = parseString("""<Placemark>
  <name>hummus grill</name>
  <description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>
  <updated>Tue, 24 Jan 12 17:14:00 +0000</updated>
  <published>Tue, 24 Jan 12 17:14:00 +0000</published>
  <visibility>1</visibility>
  <Point>
    <extrude>1</extrude>
    <altitudeMode>relativeToGround</altitudeMode>
    <coordinates>-75.20104383595685,39.9528387056977</coordinates>
  </Point>
</Placemark>""")

description = []
for d in dom.getElementsByTagName('description'):
    description.append(getText(d, recursive=True))

print(description)

This prints a list whose single entry is the full text, '@hummus grill- FINALLY HERE!! With Sonya and co'.
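The question also mentions the Folder-level header, whose own description element has no a child and would trip up a fixed firstChild/nextSibling chain. A minimal sketch, assuming the check-in comment is always the text that follows the a element inside a Placemark's description; the helper name comment_after_link and the shortened sample document KML are hypothetical, not part of the original data or answers:

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Shortened sample: the header from the question plus one Placemark (assumed layout)
KML = """<?xml version="1.0" encoding="UTF-8"?>
<kml><Folder><name>foursquare checkin history </name><description>foursquare checkin history </description>
<Placemark>
  <name>hummus grill</name>
  <description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>
</Placemark>
</Folder></kml>"""


def comment_after_link(description):
    """Join the text nodes that follow the last <a> child; '' if there is no <a>."""
    parts = []
    seen_link = False
    for child in description.childNodes:
        if child.nodeType == Node.ELEMENT_NODE and child.tagName == 'a':
            seen_link = True
            parts = []  # keep only text that comes after the link
        elif seen_link and child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
            parts.append(child.data)
    return ''.join(parts)


dom = parseString(KML)
comments = []
# Restricting the search to Placemark elements skips the Folder-level description
for placemark in dom.getElementsByTagName('Placemark'):
    for d in placemark.getElementsByTagName('description'):
        comments.append(comment_after_link(d))

print(comments)
```

Walking the child nodes instead of chaining nextSibling means a description without a link simply yields an empty string rather than raising AttributeError.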

Get text between two closing tags in XML - Python
