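I'm parsing an XML document and using XPath to read the values of different elements. At the moment this works fine for collecting all the elements into lists. However, child elements are not always present under every parent element (though they are present under some!), and when parsing the XML to build a dataframe for insertion into a database I need to know which parent each value belongs to. So I would like to iterate over the elements and fetch the values I need one at a time. I'm not sure how to do this, because currently every iteration gives me the complete list. The elements I'm extracting are nested at different levels.
The XML I'm parsing is a Garmin TCX file. Short example: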
<?xml version="1.0" encoding="UTF-8"?>
<TrainingCenterDatabase
xsi:schemaLocation="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2 http://www.garmin.com/xmlschemas/TrainingCenterDatabasev2.xsd"
xmlns:ns5="http://www.garmin.com/xmlschemas/ActivityGoals/v1"
xmlns:ns3="http://www.garmin.com/xmlschemas/ActivityExtension/v2"
xmlns:ns2="http://www.garmin.com/xmlschemas/UserProfile/v2"
xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:ns4="http://www.garmin.com/xmlschemas/ProfileExtension/v1">
<Activities>
<Activity Sport="Running">
<Id>2018-10-10T14:10:10.000Z</Id>
<Lap StartTime="2018-10-10T14:10:10.000Z">
<TotalTimeSeconds>343.0</TotalTimeSeconds>
<DistanceMeters>1000.0</DistanceMeters>
<MaximumSpeed>3.694999933242798</MaximumSpeed>
<Calories>51</Calories>
<AverageHeartRateBpm>
<Value>136</Value>
</AverageHeartRateBpm>
<MaximumHeartRateBpm>
<Value>162</Value>
</MaximumHeartRateBpm>
<Intensity>Active</Intensity>
<TriggerMethod>Manual</TriggerMethod>
<Track>
<Trackpoint>
<Time>2018-10-10T14:10:10.000Z</Time>
<Position>
<LatitudeDegrees>52.17917550355196</LatitudeDegrees>
<LongitudeDegrees>6.532441098242998</LongitudeDegrees>
</Position>
<AltitudeMeters>-0.20000000298023224</AltitudeMeters>
<DistanceMeters>0.0</DistanceMeters>
<HeartRateBpm>
<Value>94</Value>
</HeartRateBpm>
<Extensions>
<ns3:TPX>
<ns3:Speed>0.04699999839067459</ns3:Speed>
<ns3:RunCadence>7</ns3:RunCadence>
</ns3:TPX>
</Extensions>
</Trackpoint>
<Trackpoint>
<Time>2018-10-10T14:10:11.000Z</Time>
<Position>
<LatitudeDegrees>52.17917634174228</LatitudeDegrees>
<LongitudeDegrees>6.532444199547172</LongitudeDegrees>
</Position>
<AltitudeMeters>0.0</AltitudeMeters>
<DistanceMeters>0.23000000417232513</DistanceMeters>
<HeartRateBpm>
<Value>95</Value>
</HeartRateBpm>
<Extensions>
<ns3:TPX>
<ns3:Speed>0.0</ns3:Speed>
<ns3:RunCadence>7</ns3:RunCadence>
</ns3:TPX>
</Extensions>
</Trackpoint>
<Trackpoint>
<Time>2018-10-10T14:10:12.000Z</Time>
<Position>
<LatitudeDegrees>52.17917206697166</LatitudeDegrees>
<LongitudeDegrees>6.532468926161528</LongitudeDegrees>
</Position>
<AltitudeMeters>0.0</AltitudeMeters>
<DistanceMeters>1.9700000286102295</DistanceMeters>
<Extensions>
<ns3:TPX>
<ns3:Speed>0.0</ns3:Speed>
<ns3:RunCadence>7</ns3:RunCadence>
</ns3:TPX>
</Extensions>
</Trackpoint>
<Trackpoint>
<Time>2018-10-10T14:10:13.000Z</Time>
<Position>
<LatitudeDegrees>52.17916024848819</LatitudeDegrees>
<LongitudeDegrees>6.5325202234089375</LongitudeDegrees>
</Position>
<AltitudeMeters>0.0</AltitudeMeters>
<DistanceMeters>5.679999828338623</DistanceMeters>
<HeartRateBpm>
<Value>96</Value>
</HeartRateBpm>
<Extensions>
<ns3:TPX>
<ns3:Speed>0.08399999886751175</ns3:Speed>
<ns3:RunCadence>7</ns3:RunCadence>
</ns3:TPX>
</Extensions>
</Trackpoint>
<Trackpoint>
<Time>2018-10-10T14:10:14.000Z</Time>
<Position>
<LatitudeDegrees>52.17914817854762</LatitudeDegrees>
<LongitudeDegrees>6.532532041892409</LongitudeDegrees>
</Position>
<AltitudeMeters>0.0</AltitudeMeters>
<DistanceMeters>7.150000095367432</DistanceMeters>
<HeartRateBpm>
<Value>98</Value>
</HeartRateBpm>
<Extensions>
<ns3:TPX>
<ns3:Speed>0.10300000011920929</ns3:Speed>
<ns3:RunCadence>10</ns3:RunCadence>
</ns3:TPX>
</Extensions>
</Trackpoint>
The code that currently works returns all the values in the file as lists:
from lxml import etree, objectify
from os import listdir
from os.path import isfile, join

def tcxParse(tcxFile):
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.parse(tcxFile, parser)
    root = tree.getroot()

    # Strip namespaces so the XPath expressions below can use bare tag names
    for elem in root.iter():  # iter() replaces the deprecated getiterator()
        if not hasattr(elem.tag, 'find'): continue  # skip comments and processing instructions
        i = elem.tag.find('}')
        if i >= 0:
            elem.tag = elem.tag[i + 1:]
    objectify.deannotate(root, cleanup_namespaces=True)

    # Check whether we are dealing with .tcx or another format
    if tcxFile.lower().endswith('.tcx'):
        tcxParse.activity = tree.xpath('//*[@Sport]/@Sport')
        tcxParse.HR = list(map(int, tree.xpath('//Track/Trackpoint/HeartRateBpm/Value/text()')))
        tcxParse.Time = tree.xpath('//Time/text()')
        tcxParse.Speed = list(map(float, tree.xpath('//Track/Trackpoint/Extensions/TPX/Speed/text()')))
        tcxParse.Cadence = list(map(int, tree.xpath('//Track/Trackpoint/Extensions/TPX/RunCadence/text()')))
        tcxParse.Lat = list(map(float, tree.xpath('//Track/Trackpoint/Position/LatitudeDegrees/text()')))
        tcxParse.Lon = list(map(float, tree.xpath('//Track/Trackpoint/Position/LongitudeDegrees/text()')))
        tcxParse.Alt = list(map(float, tree.xpath('//Track/Trackpoint/AltitudeMeters/text()')))
        tcxParse.Distance = list(map(float, tree.xpath('//Track/Trackpoint/DistanceMeters/text()')))
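I know I can iterate over the elements with tree.iter(), but I'm not sure how to get one value at a time instead of the whole list.
To be clear: for example, the current output of tcxParse.HR is:
94,95,96,98
but I need it to be
94,95,nan,96,98
because HeartRateBpm is missing from the third Trackpoint element.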
Answer 0 (score: 1)
As I understand it, you need to iterate over the <Trackpoint> elements inside <Track>.
I suggest doing it like this:
trackpoints = [{
    'HR': tp.findtext('HeartRateBpm/Value'),
    'Time': tp.findtext('Time'),
    'Speed': tp.findtext('Extensions/TPX/Speed'),
    'Cadence': tp.findtext('Extensions/TPX/RunCadence'),
    'Lat': tp.findtext('Position/LatitudeDegrees'),
    'Lon': tp.findtext('Position/LongitudeDegrees'),
    'Alt': tp.findtext('AltitudeMeters'),
    'Distance': tp.findtext('DistanceMeters')
}
for tp in tree.xpath('//Track/Trackpoint')]
For the XML snippet from the question (with <HeartRateBpm> removed from the second <Trackpoint>), trackpoints will contain the following list:
[{'HR': '94', 'Time': '2018-10-10T14:10:10.000Z', 'Speed': '0.04699999839067459', 'Cadence': '7', 'Lat': '52.17917550355196', 'Lon': '6.532441098242998', 'Alt': '-0.20000000298023224', 'Distance': '0.0'},
{'HR': None, 'Time': '2018-10-10T14:10:11.000Z', 'Speed': '0.0', 'Cadence': '7', 'Lat': '52.17917634174228', 'Lon': '6.532444199547172', 'Alt': '0.0', 'Distance': '0.23000000417232513'}]
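To get from the None placeholders in the list above to the nan values the question asks for, the strings can be converted while the dicts are built. The following is only a minimal sketch under a few assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages (findtext returns None for a missing element in both), the sample XML is a shortened, namespace-free stand-in for the question's data, and the helper names to_number and parse_trackpoints are hypothetical.

```python
import math
import xml.etree.ElementTree as ET

# Shortened, namespace-free stand-in for the question's data (an assumption);
# the second Trackpoint has no HeartRateBpm element.
SAMPLE = """
<Track>
  <Trackpoint>
    <Time>2018-10-10T14:10:10.000Z</Time>
    <HeartRateBpm><Value>94</Value></HeartRateBpm>
    <DistanceMeters>0.0</DistanceMeters>
  </Trackpoint>
  <Trackpoint>
    <Time>2018-10-10T14:10:12.000Z</Time>
    <DistanceMeters>1.9700000286102295</DistanceMeters>
  </Trackpoint>
  <Trackpoint>
    <Time>2018-10-10T14:10:13.000Z</Time>
    <HeartRateBpm><Value>96</Value></HeartRateBpm>
    <DistanceMeters>5.679999828338623</DistanceMeters>
  </Trackpoint>
</Track>
"""

def to_number(text, cast=float):
    """Hypothetical helper: cast the text, or return NaN if the element was missing."""
    return math.nan if text is None else cast(text)

def parse_trackpoints(track):
    """Hypothetical helper: one dict per Trackpoint, so missing children stay aligned."""
    return [{
        "Time": tp.findtext("Time"),
        "HR": to_number(tp.findtext("HeartRateBpm/Value"), int),
        "Distance": to_number(tp.findtext("DistanceMeters")),
    } for tp in track.iter("Trackpoint")]

rows = parse_trackpoints(ET.fromstring(SAMPLE))
hr = [row["HR"] for row in rows]
print(hr)  # [94, nan, 96]
```

Because every row is built from a single Trackpoint, a missing child only affects that row. A list of dicts like rows can also be passed directly to pandas.DataFrame if a dataframe is the end goal.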