How do I extract data from KML / XML?

Time: 2013-05-31 16:02:44

Tags: php ruby xml nokogiri hpricot

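I have some data that was converted from a KML file to XML, and I'm curious how to use PHP or Ruby to get at things like the neighborhood names and coordinates. I know how this works when the values have tags wrapped around them, like this: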

<cities>
  <neighborhood>Gotham</neighborhood>
</cities>

Unfortunately, the data is formatted like this:

<SimpleData name="neighborhd">Colgate Center</SimpleData>

instead of

<neighborhd>Colgate Center</neighborhd>

Here is the KML source:

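How can I use PHP or Ruby to extract data from something like this? I installed some Ruby gems for parsing XML, but XML is simply something I haven't worked with before.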

1 answer:

Answer 0: (score: 1)

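Your XML as posted is not well-formed (it is cut off partway through the second Placemark), but Nokogiri will try to repair it.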

Here is how to check for invalid XML / XHTML / HTML, and how to rewrite the part you want.

Here is the setup:

require 'nokogiri'

doc = Nokogiri.XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
  <Document>
    <Schema name="Sample_Neighborhoods_Samples" id="Sample_Neighborhoods_Samples">
      <SimpleField type="int" name="nid"/>
      <SimpleField type="string" name="neighborhd"/>
      <SimpleField type="string" name="place"/>
      <SimpleField type="string" name="placecode"/>
      <SimpleField type="string" name="nbr_type"/>
      <SimpleField type="string" name="po_name"/>
      <SimpleField type="string" name="metro"/>
      <SimpleField type="string" name="country"/>
      <SimpleField type="string" name="state"/>
      <SimpleField type="string" name="statefips"/>
      <SimpleField type="string" name="county"/>
      <SimpleField type="string" name="countyfips"/>
      <SimpleField type="string" name="mcd"/>
      <SimpleField type="string" name="mcdfips"/>
      <SimpleField type="string" name="cbsa"/>
      <SimpleField type="string" name="cbsacode"/>
      <SimpleField type="string" name="cbsatype"/>
      <SimpleField type="double" name="cenlat"/>
      <SimpleField type="double" name="cenlon"/>
      <SimpleField type="int" name="color"/>
      <SimpleField type="string" name="ncs_code"/>
      <SimpleField type="string" name="release"/>
    </Schema>
    <Style id="KMLSTYLER_6">
      <LabelStyle>
        <scale>1.0</scale>
      </LabelStyle>
      <LineStyle>
        <colorMode>normal</colorMode>
      </LineStyle>
      <PolyStyle>
        <color>7f4080ff</color>
        <colorMode>random</colorMode>
      </PolyStyle>
    </Style>
    <name>Sample_Neighborhoods_NYC</name>
    <visibility>1</visibility>
    <Folder id="kml_ft_Sample_Neighborhoods_Samples">
      <name>Sample_Neighborhoods_Samples</name>
      <Folder id="kml_ft_Sample_Neighborhoods_Samples_Sample_Neighborhoods_NYC">
        <name>Sample_Neighborhoods_NYC</name>
        <Placemark id="kml_1">
          <name>Colgate Center</name>
          <Snippet> </Snippet>
          <styleUrl>#KMLSTYLER_6</styleUrl>
          <ExtendedData>
            <SchemaData schemaUrl="#Sample_Neighborhoods_Samples">
              <SimpleData name="nid">7086</SimpleData>
              <SimpleData name="neighborhd">Colgate Center</SimpleData>
              <SimpleData name="place">Jersey City</SimpleData>
              <SimpleData name="placecode">36000</SimpleData>
              <SimpleData name="nbr_type">S</SimpleData>
              <SimpleData name="po_name">JERSEY CITY</SimpleData>
              <SimpleData name="metro">New York City, NY</SimpleData>
              <SimpleData name="country">USA</SimpleData>
              <SimpleData name="state">NJ</SimpleData>
              <SimpleData name="statefips">34</SimpleData>
              <SimpleData name="county">Hudson</SimpleData>
              <SimpleData name="countyfips">34017</SimpleData>
              <SimpleData name="mcd">Jersey City</SimpleData>
              <SimpleData name="mcdfips">36000</SimpleData>
              <SimpleData name="cbsa">New York-Northern New Jersey-Long Island, NY-NJ-PA</SimpleData>
              <SimpleData name="cbsacode">35620</SimpleData>
              <SimpleData name="cbsatype">Metro</SimpleData>
              <SimpleData name="cenlat">40.7145135000001</SimpleData>
              <SimpleData name="cenlon">-74.0343385</SimpleData>
              <SimpleData name="color">1</SimpleData>
              <SimpleData name="ncs_code">40910000</SimpleData>
              <SimpleData name="release">1.12.2</SimpleData>
            </SchemaData>
          </ExtendedData>
          <Polygon>
            <outerBoundaryIs>
              <LinearRing>
                <coordinates>-74.036628,40.712211,0 -74.0357779999999,40.7120810000001,0                     -74.035535,40.7122010000001,0 -74.0348299999999,40.71209,0 -74.034903,40.711804,0 -74.033761,40.7116560000001,0 -74.0334089999999,40.7121090000001,0 -74.032996,40.7141330000001,0 -74.0331899999999,40.7141790000001,0 -74.032656,40.7162500000001,0 -74.032231,40.716194,0 -74.032049,40.716908,0 -74.033871,40.7170370000001,0 -74.035629,40.7173710000001,0 -74.035669,40.7171650000001,0 -74.036009,40.715335,0 -74.036325,40.713625,0 -74.036482,40.7123580000001,0 -74.036628,40.712211,0 </coordinates>
              </LinearRing>
            </outerBoundaryIs>
          </Polygon>
        </Placemark>
        <Placemark id="kml_2">
          <name>Colgate Center</name>
          <Snippet> </Snippet>
          <ExtendedData>
EOT

Here is how to see whether there were errors. Any time errors is not empty, the parser ran into a problem with the document.

puts doc.errors

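Here is one way to find the SimpleData nodes throughout the document. I prefer CSS accessors over XPath for readability. Sometimes XPath is the better choice because it allows finer-grained searches. It is worth learning both.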

doc.search('ExtendedData SimpleData').each do |simple_data|
  node_name = simple_data['name']
  puts "<%s>%s</%s>" % [node_name, simple_data.text.strip, node_name]
end

Here is the output of puts doc.errors followed by the SimpleData loop:

Premature end of data in tag ExtendedData line 87
Premature end of data in tag Placemark line 84
Premature end of data in tag Folder line 44
Premature end of data in tag Folder line 42
Premature end of data in tag Document line 3
Premature end of data in tag kml line 2
<nid>7086</nid>
<neighborhd>Colgate Center</neighborhd>
<place>Jersey City</place>
<placecode>36000</placecode>
<nbr_type>S</nbr_type>
<po_name>JERSEY CITY</po_name>
<metro>New York City, NY</metro>
<country>USA</country>
<state>NJ</state>
<statefips>34</statefips>
<county>Hudson</county>
<countyfips>34017</countyfips>
<mcd>Jersey City</mcd>
<mcdfips>36000</mcdfips>
<cbsa>New York-Northern New Jersey-Long Island, NY-NJ-PA</cbsa>
<cbsacode>35620</cbsacode>
<cbsatype>Metro</cbsatype>
<cenlat>40.7145135000001</cenlat>
<cenlon>-74.0343385</cenlon>
<color>1</color>
<ncs_code>40910000</ncs_code>
<release>1.12.2</release>

I wasn't trying to modify the DOM, but that is easy to do:

doc.search('ExtendedData SimpleData').each do |simple_data|
  node_name = simple_data['name']
  simple_data.replace("<%s>%s</%s>" % [node_name, simple_data.text.strip, node_name])
end

puts doc.to_xml

After running that replacement, this is the affected part of the doc.to_xml output:

<ExtendedData>
  <SchemaData schemaUrl="#Sample_Neighborhoods_Samples">
    <nid>7086</nid>
    <neighborhd>Colgate Center</neighborhd>
    <place>Jersey City</place>
    <placecode>36000</placecode>
    <nbr_type>S</nbr_type>
    <po_name>JERSEY CITY</po_name>
    <metro>New York City, NY</metro>
    <country>USA</country>
    <state>NJ</state>
    <statefips>34</statefips>
    <county>Hudson</county>
    <countyfips>34017</countyfips>
    <mcd>Jersey City</mcd>
    <mcdfips>36000</mcdfips>
    <cbsa>New York-Northern New Jersey-Long Island, NY-NJ-PA</cbsa>
    <cbsacode>35620</cbsacode>
    <cbsatype>Metro</cbsatype>
    <cenlat>40.7145135000001</cenlat>
    <cenlon>-74.0343385</cenlon>
    <color>1</color>
    <ncs_code>40910000</ncs_code>
    <release>1.12.2</release>
  </SchemaData>
</ExtendedData>