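Display an XML element's value if the character count exceeds a given threshold

Time: 2014-06-15 10:46:10

Tags: python xml awk xml-parsing

I have a bunch of large XML documents containing geospatial information (KML, if anyone is interested), arranged like this: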

<Placemark><SimpleData name="species">Unique number</SimpleData> ... coordinates</Placemark>

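I want to list all species IDs for which the total number of characters between the Placemark tags exceeds a given threshold of 1,000,000. The following AWK script reports which lines break the limit: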

for kmlfile in *.kml; do
    echo "Processing $kmlfile"
    awk -- '/<Placemark>/,/<\/Placemark>/ { if (length() > 10000) { printf("Line %d has %d characters\n", NR, length()); } }' "$kmlfile"
done

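But I can't figure out how to make it print the species ID instead of the line number. Any ideas on how to do this in AWK, Python, or whatever else you prefer?

Here is what a document looks like: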

<Document xmlns="http://www.opengis.net/kml/2.2">
    <Folder><name>Export_Output02</name>
        <Placemark>
            <Style><LineStyle><color>ff0000ff</color></LineStyle><PolyStyle><fill>0</fill></PolyStyle></Style>
            <ExtendedData><SchemaData schemaUrl="#Export_Output02">
                <SimpleData name="species">1312</SimpleData>
                <SimpleData name="area">7848012</SimpleData>
                <SimpleData name="irrep_area">0.00000012742</SimpleData>
                <SimpleData name="groupID">2</SimpleData>
            </SchemaData></ExtendedData>
            <MultiGeometry>
                <Polygon>
                    <outerBoundaryIs>
                        <LinearRing>
                            <coordinates>-57.843052746056827,-33.032934004012787 -57.825312079170494,-33.089724736921667 -57.888494029914156,-33.073777852969904 -57.843052746056827,-33.032934004012787</coordinates>
                        </LinearRing>
                    </outerBoundaryIs>
                </Polygon>
                <Polygon>
                    <outerBoundaryIs>
                        <LinearRing>
                            <coordinates>-57.635769389832561,-33.032934004012787 -57.618028722946228,-33.089724736921667 -57.681210673689904,-33.073777852969904 -57.635769389832561,-33.032934004012787</coordinates>
                        </LinearRing>
                    </outerBoundaryIs>
                </Polygon>
            </MultiGeometry>
        </Placemark>
    </Folder>
</Document>

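Sample of a complete file: link to GDrive

[Edit] I should add that this particular limit on the number of characters in a "Placemark" is imposed by Google Maps Fusion Tables. Each Placemark describes one specific feature on the map, and a map may contain many features. If any Placemark exceeds the 1M character limit, conversion to a Fusion Table fails.

1 answer:

Answer 0 (score: 0)

I came up with a rough Python script that does the job. It is certainly not the best approach, so if you have a better one I would be glad to see it. Also, the way I extract the species ID is quite ugly; suggestions for making it prettier are welcome too.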

import glob
from collections import namedtuple

Placemark = namedtuple('Placemark', 'found no_characters specie_id end_idx')


def GetPlacemark(input_file, start):
    """Return the first Placemark at or after index `start` in the text."""
    # Match the full tags so that a closing tag is never mistaken for
    # the opening tag of the next Placemark.
    start_idx = input_file.find('<Placemark>', start)
    end_idx = input_file.find('</Placemark>', start)
    if start_idx == -1 or end_idx == -1:
        return Placemark(False, -1, -1, -1)
    no_characters = end_idx - start_idx
    # The species ID is the text between '>' and '<' that follows
    # name="species"; every Placemark is assumed to have this field.
    specie_name_idx = input_file.find('species', start_idx, end_idx)
    specie_id_start_idx = input_file.find('>', specie_name_idx)
    specie_id_end_idx = input_file.find('<', specie_id_start_idx)
    # Slice the text that was passed in, not the global `data`.
    specie_id = int(input_file[specie_id_start_idx + 1:specie_id_end_idx])
    return Placemark(True, no_characters, specie_id, end_idx)


path_to_kml = glob.glob('*.kml')
for kml_file in path_to_kml:
    print('Processing ' + kml_file)
    with open(kml_file, "r") as myfile:
        data = myfile.read().replace('\n', '')

    placemarks = []
    current_idx = 0

    # Walk through the file one Placemark at a time.
    while True:
        mark = GetPlacemark(data, current_idx)
        if mark.found:
            placemarks.append(mark)
            current_idx = mark.end_idx + 1
        else:
            break

    for placemark in placemarks:
        if placemark.no_characters > 1000000:
            print('Specie %d has %d characters' % (placemark.specie_id, placemark.no_characters))
    print('Done\n')
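
One possibly tidier way to pull out the species ID is to let regular expressions do the index arithmetic. The following is only a sketch, not a tested replacement for the script above: the function name `oversized_species` and both patterns are my own, and it assumes that Placemark elements are never nested and that the species field is written exactly as in the sample document. Unlike the script above, it counts the opening and closing tags and any newlines as part of the length.

```python
import glob
import re

# Assumption: Placemark elements are never nested, so a non-greedy match
# from an opening tag to the next closing tag covers exactly one element.
PLACEMARK_RE = re.compile(r'<Placemark\b.*?</Placemark>', re.DOTALL)
# Assumption: the species field is written as in the sample document.
SPECIES_RE = re.compile(
    r'<SimpleData\s+name="species"\s*>\s*(.*?)\s*</SimpleData>', re.DOTALL)


def oversized_species(kml_text, limit=1000000):
    """Return (species_id, length) for every Placemark longer than `limit`."""
    results = []
    for match in PLACEMARK_RE.finditer(kml_text):
        block = match.group(0)
        if len(block) > limit:
            species = SPECIES_RE.search(block)
            # species_id stays a string; None means no species field was found.
            results.append((species.group(1) if species else None, len(block)))
    return results


if __name__ == '__main__':
    for kml_file in glob.glob('*.kml'):
        with open(kml_file) as handle:
            text = handle.read()
        for species_id, length in oversized_species(text):
            print('%s: species %s has %d characters' % (kml_file, species_id, length))
```

For files with namespaces, attributes on the Placemark tag, or unusual formatting, a real XML parser would be more robust than regular expressions, but the serialized length it reports may not match the raw character count in the file.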