How do I parse nested XML inside a text file using Spark RDDs?

Time: 2017-10-07 16:24:36

Tags: apache-spark

I have a record with embedded XML, such as:

1234^12^999^`<row><ab key="someKey" value="someValue"/><ab key="someKey1" value="someValue1"/></row>`^23232

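A plain XML file is easy to parse with Scala's XML support, or even with the Databricks XML format, but how do I parse XML that is embedded in delimited text?

The XML field can be extracted like this: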

val top5duration = data.map(line => line.split("\\^")).filter(fields => fields(2) == "100").map(fields => fields(4))

But what should I do if I want to extract the value for each key?

4 answers:

Answer 0 (score: 0)

  

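Question: How do I handle nested XML elements? How would I access them?

To flatten a nested structure, you can use explode.

Example: suppose I want every title (string type) / authors (WrappedArray) combination. This can be achieved with explode: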


schema :

root
 |-- title: string (nullable = true)
 |-- author: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- initial: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- lastName: string (nullable = true)
show()

+--------------------+--------------------+
|               title|              author|
+--------------------+--------------------+
|Proper Motions of...|[[WrappedArray(J,...|
|Catalogue of 2055...|[[WrappedArray(J,...|
|                null|                null|
|Katalog von 3356 ...|[[WrappedArray(J)...|
|Astrographic Cata...|[[WrappedArray(P)...|
|Astrographic Cata...|[[WrappedArray(P)...|
|Results of observ...|[[WrappedArray(H,...|
|      AGK3 Catalogue|[[WrappedArray(W)...|
|Perth 70: A Catal...|[[WrappedArray(E)...|


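// Spark 1.x Java API; in Spark 2.x and later the Java type is Dataset<Row>
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.functions;

// Produce one row per author by exploding the author array, then select the struct fields
DataFrame exploded = src.select(src.col("title"), functions.explode(src.col("author")).as("auth"))
                    .select("title", "auth.initial", "auth.lastName");
// Reorder the columns
exploded = exploded.select(exploded.col("initial"),
                        exploded.col("title").as("title"),
                        exploded.col("lastName"));

exploded.printSchema();

exploded.show();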


root
 |-- initial: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- title: string (nullable = true)
 |-- lastName: string (nullable = true)

+-------+--------------------+-------------+
|initial|               title|     lastName|
+-------+--------------------+-------------+
| [J, H]|Proper Motions of...|      Spencer|
|    [J]|Proper Motions of...|      Jackson|
| [J, H]|Catalogue of 2055...|      Spencer|

Sample XML file:

<?xml version='1.0' ?>
<!DOCTYPE datasets SYSTEM "http://www.cs.washington.edu/research/projects/xmltk/xmldata/data/nasa/dataset_053.dtd">
<datasets>
 <dataset subject="astronomy" xmlns:xlink="http://www.w3.org/XML/XLink/0.9">
  <title>Proper Motions of Stars in the Zone Catalogue -40 to -52 degrees
of 20843 Stars for 1900</title>
  <altname type="ADC">1005</altname>
  <altname type="CDS">I/5</altname>
  <altname type="brief">Proper Motions in Cape Zone Catalogue -40/-52</altname>
  <reference>
   <source>
    <other>
     <title>Proper Motions of Stars in the Zone Catalogue -40 to -52 degrees
of 20843 Stars for 1900</title>
     <author>
      <initial>J</initial>
      <initial>H</initial>
      <lastName>Spencer</lastName>
     </author>
     <author>
      <initial>J</initial>
      <lastName>Jackson</lastName>
     </author>
     <name>His Majesty's Stationery Office, London</name>
     <publisher>???</publisher>
     <city>???</city>
     <date>
      <year>1936</year>
     </date>
    </other>
   </source>
  </reference>
  <keywords parentListURL="http://messier.gsfc.nasa.gov/xml/keywordlists/adc_keywords.html">
   <keyword xlink:href="Positional_data.html">Positional data</keyword>
   <keyword xlink:href="Proper_motions.html">Proper motions</keyword>
  </keywords>
  <descriptions>
   <description>
    <para>This catalog, listing the proper motions of 20,843 stars
    from the Cape Astrographic Zones, was compiled from three series of
    photographic plates. The plates were taken at the Royal Observatory,
    Cape of Good Hope, in the following years: 1892-1896, 1897-1910,
    1923-1928. Data given include centennial proper motion, photographic
    and visual magnitude, Harvard spectral type, Cape Photographic
    Durchmusterung (CPD) identification, epoch, right ascension and
    declination for 1900.</para>
   </description>
   <details/>
  </descriptions>
  <tableHead>
   <tableLinks>
    <tableLink xlink:href="czc.dat">
     <title>The catalogue</title>
    </tableLink>
   </tableLinks>
   <fields>
    <field>
     <name>---</name>
     <definition>Number 5</definition>
     <units>---</units>
    </field>
    <field>
     <name>CZC</name>
     <definition>Catalogue Identification Number</definition>
     <units>---</units>
    </field>
    <field>
     <name>Vmag</name>
     <definition>Visual Magnitude</definition>
     <units>mag</units>
    </field>
    <field>
     <name>RAh</name>
     <definition>Right Ascension for 1900 hours</definition>
     <units>h</units>
    </field>
    <field>
     <name>RAm</name>
     <definition>Right Ascension for 1900 minutes</definition>
     <units>min</units>
    </field>
    <field>
     <name>RAcs</name>
     <definition>Right Ascension seconds in 0.01sec 1900</definition>
     <units>0.01s</units>
    </field>
    <field>
     <name>DE-</name>
     <definition>Declination Sign</definition>
     <units>---</units>
    </field>
    <field>
     <name>DEd</name>
     <definition>Declination for 1900 degrees</definition>
     <units>deg</units>
    </field>
    <field>
     <name>DEm</name>
     <definition>Declination for 1900 arcminutes</definition>
     <units>arcmin</units>
    </field>
    <field>
     <name>DEds</name>
     <definition>Declination for 1900 arcseconds</definition>
     <units>0.1arcsec</units>
    </field>
    <field>
     <name>Ep-1900</name>
     <definition>Epoch -1900</definition>
     <units>cyr</units>
    </field>
    <field>
     <name>CPDZone</name>
     <definition>Cape Photographic
                                        Durchmusterung Zone</definition>
     <units>---</units>
    </field>
    <field>
     <name>CPDNo</name>
     <definition>Cape Photographic Durchmusterung Number</definition>
     <units>---</units>
    </field>
    <field>
     <name>Pmag</name>
     <definition>Photographic Magnitude</definition>
     <units>mag</units>
    </field>
    <field>
     <name>Sp</name>
     <definition>HD Spectral Type</definition>
     <units>---</units>
    </field>
    <field>
     <name>pmRAs</name>
     <definition>Proper Motion in RA
      <footnote>
       <para>the relation is   pmRA = 15 * pmRAs * cos(DE)
    if pmRAs is expressed in s/yr and pmRA in arcsec/yr</para>
      </footnote>
     </definition>
     <units>0.1ms/yr</units>
    </field>
    <field>
     <name>pmRA</name>
     <definition>Proper Motion in RA</definition>
     <units>mas/yr</units>
    </field>
    <field>
     <name>pmDE</name>
     <definition>Proper Motion in Dec</definition>
     <units>mas/yr</units>
    </field>
   </fields>
  </tableHead>
  <history>
   <ingest>
    <creator>
     <lastName>Julie Anne Watko</lastName>
     <affiliation>SSDOO/ADC</affiliation>
    </creator>
    <date>
     <year>1995</year>
     <month>Nov</month>
     <day>03</day>
    </date>
   </ingest>
  </history>
  <identifier>I_5.xml</identifier>
 </dataset>
 <dataset subject="astronomy" xmlns:xlink="http://www.w3.org/XML/XLink/0.9">
  <title>Catalogue of 20554 Faint Stars in the Cape Astrographic Zone -40 to -52 Degrees
for the Equinox of 1900.0</title>
  <altname type="ADC">1006</altname>
  <altname type="CDS">I/6</altname>
  <altname type="brief">Cape 20554 Faint Stars, -40 to -52, 1900.0</altname>
  <reference>
   <source>
    <other>
     <title>Catalogue of 20554 Faint Stars in the Cape Astrographic Zone -40 to -52 Degrees
for the Equinox of 1900.0</title>
     <author>
      <initial>J</initial>
      <initial>H</initial>
      <lastName>Spencer</lastName>
     </author>
     <author>
      <initial>J</initial>
      <lastName>Jackson</lastName>
     </author>
     <name>His Majesty's Stationery Office, London</name>
     <publisher>???</publisher>
     <city>???</city>
     <date>
      <year>1939</year>
     </date>
     <bibcode>1939HMSO..C......0S</bibcode>
    </other>
   </source>
  </reference>
  <keywords parentListURL="http://messier.gsfc.nasa.gov/xml/keywordlists/adc_keywords.html">
   <keyword xlink:href="Positional_data.html">Positional data</keyword>
   <keyword xlink:href="Proper_motions.html">Proper motions</keyword>
  </keywords>
  <descriptions>
   <description>
    <para>This catalog contains positions, precessions, proper motions, and
  photographic magnitudes for 20,554 stars.  These were derived from
  photographs taken at the Royal Observatory, Cape of Good Hope between 1923
  and 1928.  It covers the astrographic zones -40 degrees to -52 degrees of
  declination.  The positions are given for epoch 1900 (1900.0).  It includes
  spectral types for many of the stars listed.  It extends the earlier
  catalogs derived from the same plates to fainter magnitudes.  The
  computer-readable version consists of a single data table.</para>
    <para>The stated probable error for the star positions is 0.024 seconds of time
  (R.A.) and 0.25 seconds of arc (dec.) for stars with one determination,
  0.017 seconds of time, and 0.18 seconds of arc for two determinations, and
  0.014 / 0.15 for stars with three determinations.</para>
    <para>The precession and secular variations were derived from Newcomb's constants.</para>
    <para>The authors quote probable errors of the proper motions in both coordinates
  of 0.008 seconds of arc for stars with one determination, 0.0055 seconds for
  stars with two determinations, and 0.0044 for stars with three.</para>
    <para>The photographic magnitudes were derived from the measured diameters on the
  photographic plates and from the magnitudes given in the Cape Photographic
  Durchmusterung.</para>
    <para>The spectral classification of the cataloged stars was done with the
  assistance of Annie Jump Cannon of the Harvard College Observatory.</para>
    <para>The user should consult the source reference for more details of the
  measurements and reductions.  See also the notes in this document for
  additional information on the interpretation of the entries.</para>
   </description>
   <details/>
  </descriptions>
  <tableHead>
   <tableLinks>
    <tableLink xlink:href="faint.dat">
     <title>Data</title>
    </tableLink>
   </tableLinks>
   <fields>
    <field>
     <name>ID</name>
     <definition>Cape Number</definition>
     <units>---</units>
    </field>
    <field>
     <name>rem</name>
     <definition>Remark
      <footnote>
       <para>A = Astrographic Star
   F = Faint Proper Motion Star
   N = Other Note</para>
      </footnote>
     </definition>
     <units>---</units>
    </field>
    <field>
     <name>CPDZone</name>
     <definition>Cape Phot. Durchmusterung (CPD) Zone
      <footnote>
       <para>All CPD Zones are negative. - signs are not included in data.
        "0" in column 8 signifies Astrographic Plate instead of CPD.</para>
      </footnote>
     </definition>
     <units>---</units>
    </field>
    <field>
     <name>CPD</name>
     <definition>CPD Number or Astrographic Plate
      <footnote>
       <para>See also note on CPDZone.
        Astrographic plate listed "is the more southerly on which the
        star occurs." Thus, y-coordinate is positive wherever possible.</para>
      </footnote>
     </definition>
     <units>---</units>
    </field>
    <field>
     <name>n_CPD</name>
     <definition>[1234] Remarks
      <footnote>
       <para>A number from 1-4 appears in this byte for double stars where
    the same CPD number applies to more than one star.</para>
      </footnote>
     </definition>
     <units>---</units>
    </field>
    <field>
     <name>mpg</name>
     <definition>Photographic Magnitude
      <footnote>
       <para>The Photographic Magnitude is "determined from the CPD Magnitude
        and the diameter on the Cape Astrographic Plates by means of the
        data given in the volume on the Magnitudes of Stars in the Cape
        Zone Catalogue."
    A null value (99.9) signifies a variable star.</para>
      </footnote>
     </definition>
     <units>mag</units>
    </field>
    <field>
     <name>RAh</name>
     <definition>Mean Right Ascension hours 1900</definition>
     <units>h</units>
    </field>
    <field>
     <name>RAm</name>
     <definition>Mean Right Ascension minutes 1900</definition>
     <units>min</units>
    </field>
    <field>
     <name>RAs</name>
     <definition>Mean Right Ascension seconds 1900</definition>
     <units>s</units>
    </field>
    <field>
     <name>DEd</name>
     <definition>Mean Declination degrees 1900</definition>
     <units>deg</units>
    </field>
    <field>
     <name>DEm</name>
     <definition>Mean Declination arcminutes 1900</definition>
     <units>arcmin</units>
    </field>
    <field>
     <name>DEs</name>
     <definition>Mean Declination arcseconds 1900</definition>
     <units>arcsec</units>
    </field>
    <field>
     <name>N</name>
     <definition>Number of Observations</definition>
     <units>---</units>
    </field>
    <field>
     <name>Epoch</name>
     <definition>Epoch +1900</definition>
     <units>yr</units>
    </field>
    <field>
     <name>pmRA</name>
     <definition>Proper Motion in RA seconds of time</definition>
     <units>s/a</units>
    </field>
    <field>
     <name>pmRAas</name>
     <definition>Proper Motion in RA arcseconds</definition>
     <units>arcsec/a</units>
    </field>
    <field>
     <name>pmDE</name>
     <definition>Proper Motion in Dec arcseconds</definition>
     <units>arcsec/a</units>
    </field>
    <field>
     <name>Sp</name>
     <definition>HD Spectral Type</definition>
     <units>---</units>
    </field>
   </fields>
  </tableHead>
  <history>
   <ingest>
    <creator>
     <lastName>Julie Anne Watko</lastName>
     <affiliation>SSDOO/ADC</affiliation>
    </creator>
    <date>
     <year>1996</year>
     <month>Mar</month>
     <day>26</day>
    </date>
   </ingest>
  </history>
  <identifier>I_6.xml</identifier>
 </dataset>
 <dataset subject="astronomy" xmlns:xlink="http://www.w3.org/XML/XLink/0.9">
  <title>Proper Motions of 1160 Late-Type Stars</title>
  <altname type="ADC">1014</altname>
  <altname type="CDS">I/14</altname>
  <altname type="brief">Proper Motions of 1160 Late-Type Stars</altname>
  <reference>
   <source>
    <journal>
     <title>Proper Motions of 1160 Late-Type Stars</title>
     <author>
      <initial>H</initial>
      <initial>J</initial>
      <lastName>Fogh Olsen</lastName>
     </author>
     <name>Astron. Astrophys. Suppl. Ser.</name>
     <volume>2</volume>
     <pageno>69</pageno>
     <date>
      <year>1970</year>
     </date>
     <bibcode>1970A&amp;AS....2...69O</bibcode>
    </journal>
   </source>
   <related>
    <holding role="similar">II/38 : Stars observed photoelectrically by Dickow et al.
     <xlink:simple href="II/38"/>
    </holding>Fogh Olsen H.J. 1970, Astron. Astrophys. Suppl. Ser., 2, 69.
   Fogh Olsen H.J. 1970, Astron. Astrophys., Suppl. Ser., 1, 189.</related>
  </reference>
  <keywords parentListURL="http://messier.gsfc.nasa.gov/xml/keywordlists/adc_keywords.html">
   <keyword xlink:href="Proper_motions.html">Proper motions</keyword>
  </keywords>
  <descriptions>
   <description>
    <para>Improved proper motions for the 1160 stars contained in the photometric
   catalog by Dickow et al. (1970) are presented. Most of the proper motions
   are from the GC, transferred to the system of FK4. For stars not included
   in the GC, preliminary AGK or SAO proper motions are given. Fogh Olsen
   (Astron. Astrophys. Suppl. Ser., 1, 189, 1970) describes the method of
   improvement. The mean errors of the centennial proper motions increase with
   increasing magnitude. In Right Ascension, these range from 0.0043/cos(dec)
   for very bright stars to 0.096/cos(dec) for the faintest stars. In Dec-
   lination, the range is from 0.065 to 1.14.</para>
   </description>
   <details/>
  </descriptions>
  <tableHead>
   <tableLinks>
    <tableLink xlink:href="pmlate.dat">
     <title>Proper motion data</title>
    </tableLink>
   </tableLinks>
   <fields>
    <field>
     <name>No</name>
     <definition>Number
      <footnote>
       <para>Henry Draper or Bonner Durchmusterung number</para>
      </footnote>
     </definition>
     <units>---</units>
    </field>
    <field>
     <name>pmRA</name>
     <definition>Centennial Proper Motion RA</definition>
     <units>s/ca</units>
    </field>
    <field>
     <name>pmDE</name>
     <definition>Centennial Proper Motion Dec</definition>
     <units>arcsec/ca</units>
    </field>
    <field>
     <name>RV</name>
     <definition>Radial Velocity</definition>
     <units>km/s</units>
    </field>
   </fields>
  </tableHead>
  <history>
   <ingest>
    <creator>
     <lastName>Julie Anne Watko</lastName>
     <affiliation>ADC</affiliation>
    </creator>
    <date>
     <year>1996</year>
     <month>Jun</month>
     <day>03</day>
    </date>
   </ingest>
  </history>
  <identifier>I_14.xml</identifier>
 </dataset>
 <dataset subject="astronomy" xmlns:xlink="http://www.w3.org/XML/XLink/0.9">
  <title>Katalog von 3356 Schwachen Sternen fuer das Aequinoktium 1950
+89 degrees</title>
  <altname type="ADC">1016</altname>
  <altname type="CDS">I/16</altname>
  <altname type="brief">Catalog of 3356 Faint Stars, 1950</altname>
  <reference>
   <source>
    <other>
     <title>Katalog von 3356 Schwachen Sternen fuer das Aequinoktium 1950
+89 degrees</title>
     <author>
      <initial>J</initial>
      <lastName>Larink</lastName>
     </author>
     <author>
      <initial>A</initial>
      <lastName>Bohrmann</lastName>
     </author>
     <author>
      <initial>H</initial>
      <lastName>Kox</lastName>
     </author>
     <author>
      <initial>J</initial>
      <lastName>Groeneveld</lastName>
     </author>
     <author>
      <initial>H</initial>
      <lastName>Klauder</lastName>
     </author>
     <name>Verlag der Sternwarte, Hamburg-Bergedorf</name>
     <publisher>???</publisher>
     <city>???</city>
     <date>
      <year>1955</year>
     </date>
     <bibcode>1955</bibcode>
    </other>
   </source>
  </reference>
  <keywords parentListURL="http://messier.gsfc.nasa.gov/xml/keywordlists/adc_keywords.html">
   <keyword xlink:href="Fundamental_catalog.html">Fundamental catalog</keyword>
   <keyword xlink:href="Positional_data.html">Positional data</keyword>
   <keyword xlink:href="Proper_motions.html">Proper motions</keyword>
  </keywords>
  <descriptions>
   <description>
    <para>This catalog of 3356 faint stars was derived from meridian circle
   observations at the Bergedorf and Heidelberg Observatories. The
   positions are given for the equinox 1950 on the FK3 system. The stars
   are mainly between 8.0 and 10.0 visual magnitude. A few are brighter
   than 8.0 mag. The lower limit in brightness resulted from the visibility
   of the stars.</para>
   </description>
   <details>
    <para>All stars were observed at both the Heidelberg and Bergedorf
   Observatories. Normally, at each observatory, two observations were
   obtained with the clamp east and two with the clamp west. The mean
   errors are comparable for the two observatories with no significant
   systematic difference in the positions between them. The mean errors of
   the resulting positions should be approximated 0.011s/cos(dec) in right
   ascension and ).023" in declination.</para>
    <para>The proper motions were derived from a comparison with the catalog
   positions with the positions in the AGK2 and AGK2A with a 19 year
   baseline and from a comparison of new positions with those in Kuestner
   1900 with about a fifty year baseline.</para>
    <para>The magnitudes were taken from the AGK2. Most spectral types were
   determined by A. N. Vyssotsky. A few are from the Bergedorfer
   Spektraldurchmusterung.</para>
   </details>
  </descriptions>
  <tableHead>
   <tableLinks>
    <tableLink xlink:href="catalog.dat">
     <title>The catalog</title>
    </tableLink>
   </tableLinks>
   <fields>
    <field>
     <name>ID</name>
     <definition>Catalog number</definition>
     <units>---</units>
    </field>
    <field>
     <name>DMz</name>
     <definition>BD zone</definition>
     <units>---</units>
    </field>
    <field>
     <name>DMn</name>
     <definition>BD number</definition>
     <units>---</units>
    </field>
    <field>
     <name>mag</name>
     <definition>Photographic magnitude</definition>
     <units>mag</units>
    </field>
    <field>
     <name>Sp</name>
     <definition>Spectral class</definition>
     <units>---</units>
    </field>
    <field>
     <name>RAh</name>
     <definition>Right Ascension hours (1950)</definition>
     <units>h</units>
    </field>
    <field>
     <name>RAm</name>
     <definition>Right Ascension minutes (1950)</definition>
     <units>min</units>
    </field>
    <field>
     <name>RAs</name>
     <definition>Right Ascension seconds (1950)</definition>
     <units>s</units>
    </field>
    <field>
     <name>Pr-RA1</name>
     <definition>First order precession in RA per century</definition>
     <units>0.01s/a</units>
    </field>
    <field>
     <name>Pr-RA2</name>
     <definition>Second order precession in RA per century</definition>
     <units>0.0001s2/a2</units>
    </field>
    <field>
     <name>pmRA</name>
     <definition>Proper motion in RA from AGK2 positions</definition>
     <units>0.01s/a</units>
    </field>
    <field>
     <name>pmRA2</name>
     <definition>Proper motion in RA from Kuestner positions</definition>
     <units>0.01s/a</units>
    </field>
    <field>
     <name>DE-</name>
     <definition>Sign of declination (1950)</definition>
     <units>---</units>
    </field>
    <field>
     <name>DEd</name>
     <definition>Declination degrees (1950)</definition>
     <units>deg</units>
    </field>
    <field>
     <name>DEm</name>
     <definition>Declination minutes (1950)</definition>
     <units>arcmin</units>
    </field>
    <field>
     <name>DEs</name>
     <definition>Declination seconds (1950)</definition>
     <units>arcsec</units>
    </field>
    <field>
     <name>Pr-de1</name>
     <definition>First order precession in dec per century</definition>
     <units>arcsec/ha</units>
    </field>
    <field>
     <name>Pr-de2</name>
     <definition>Second order precession in dec per century</definition>
     <units>arcsec2/ha2</units>
    </field>
    <field>
     <name>pmdec</name>
     <definition>Proper motion in DE from AGK2 positions</definition>
     <units>arcsec/ha</units>
    </field>
    <field>
     <name>pmdec2</name>
     <definition>Proper motion in DE from Kuestner positions</definition>
     <units>arcsec/ha</units>
    </field>
    <field>
     <name>epoch</name>
     <definition>Epoch of observation - 1900.0</definition>
     <units>yr</units>
    </field>
    <field>
     <name>rem</name>
     <definition>Note for star in printed catalog
      <footnote>
       <para>1 = ma (blend?)
   3 = pr (preceding)
   4 = seq (following)
   5 = bor (northern)
   6 = au (southern)
   * = other note in printed volume (All notes in the printed volume have not
       been indicated in this version.)
   the printed volume sometimes has additional information on the systems with
   numerical remarks.</para>
      </footnote>
     </definition>
     <units>---</units>
    </field>
   </fields>
  </tableHead>
  <history>
   <ingest>
    <creator>
     <lastName>Nancy Grace Roman</lastName>
     <affiliation>ADC/SSDOO</affiliation>
    </creator>
    <date>
     <year>1996</year>
     <month>Feb</month>
     <day>01</day>
    </date>
   </ingest>
  </history>
  <identifier>I_16.xml</identifier>
 </dataset>
</datasets>

Answer 1 (score: 0)

If all you have is the XML as an RDD[String], you can use the Databricks utility class to convert it to a DataFrame:

com.databricks.spark.xml.XmlReader#xmlRdd
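
As a rough sketch of how that call might be used on the record from the question: the snippet below assumes the spark-xml package is on the classpath, that a local SparkSession is acceptable, and that the XML sits in field 3 wrapped in backticks as in the sample line. The object name and the `key`/`value` output column names are made up for illustration. The `_key` and `_value` column names rely on spark-xml's default attribute prefix, and the inferred schema can differ between versions, so it is worth checking the `printSchema()` output first. Newer spark-xml releases also offer an `xmlRdd` overload that takes a SparkSession.

```scala
import com.databricks.spark.xml.XmlReader
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object EmbeddedXmlToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("embedded-xml")
      .master("local[*]") // assumption: local run for experimentation
      .getOrCreate()

    // Sample record from the question
    val lines = spark.sparkContext.parallelize(Seq(
      """1234^12^999^`<row><ab key="someKey" value="someValue"/><ab key="someKey1" value="someValue1"/></row>`^23232"""
    ))

    // Split on a literal caret (split takes a regex) and strip the backticks around the XML
    val xmlRdd = lines.map(_.split("\\^")(3).stripPrefix("`").stripSuffix("`"))

    // Each RDD element is treated as one XML record
    val df = new XmlReader().xmlRdd(spark.sqlContext, xmlRdd)
    df.printSchema()

    // One output row per <ab> element; attributes carry the default "_" prefix
    df.select(explode(col("ab")).as("ab"))
      .select(col("ab._key").as("key"), col("ab._value").as("value"))
      .show(false)

    spark.stop()
  }
}
```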

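Answer 2 (score: 0)

You can parse the text file with SGML, using SGML's SHORTREF feature, which is meant for parsing mixed CSV such as yours as well as Wiki syntaxes. With SHORTREF, you declare text tokens that are to be replaced by other text (typically start-element and end-element tags).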

<!DOCTYPE data [
  <!ELEMENT data O O (field+)>
  <!ELEMENT field O O (#PCDATA|markup)>
  <!ELEMENT markup O O (row)>
  <!ELEMENT row - - (ab+)>
  <!ELEMENT ab - - (#PCDATA)>
  <!ENTITY start-field "<field>">
  <!SHORTREF in-data "^" start-field>
  <!USEMAP in-data data>
  <!ENTITY start-markup "<markup>">
  <!ENTITY end-markup "</markup>">
  <!SHORTREF in-field "`" start-markup>
  <!USEMAP in-field field>
  <!SHORTREF in-markup "`" end-markup>
  <!USEMAP in-markup markup>
]>
1234^12^999^`<row><ab key="someKey" value="someValue"/><ab key="someKey1" value="someValue1"/></row>`^23232

Parsing this with SGML results in the following:

<data>
  <field>1234</field>
  <field>12</field>
  <field>999</field>
  <field>
    <markup>
      <row>
        <ab key="someKey" value="someValue"/>
        <ab key="someKey1" value="someValue1"/>
      </row>
    </markup>
  </field>
  <field>23232</field>
</data>

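The SHORTREF and USEMAP declarations tell SGML to treat a caret character in the child content of data as a start-element tag for field, and to treat a backtick character in the child content of field as a start-element tag for markup. In the child content of markup, another backtick character ends the markup element.

SGML also infers omitted start-element and end-element tags from the O omission indicators and the content model rules.

Edit: To make this work without changing your data file (say, datafile.csv), do not include its content verbatim in the main SGML file. Instead, declare an entity for the file and place an entity reference to it, like so: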

<!DOCTYPE data [
  <!-- ... same declarations as above ... -->
  <!ENTITY datafile SYSTEM "datafile.csv">
]>
&datafile

SGML will pull the content of datafile.csv into the datafile entity and replace the &datafile entity reference with the file content.

Answer 3 (score: 0)

I tried parsing the data in question at the RDD level, without using explode (DataFrame). Please suggest any improvements.

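  1. Read the data as a text file and define a schema.
  2. Split each string on the ^ delimiter.
  3. Filter out bad records that do not conform to the schema.
  4. Match the data against the schema defined earlier.
  5. You now have the data in a tuple like the one below, and what remains is to parse the XML in the middle.

    (1234,12,999,"<row><ab key="someKey" value="someValue"/><ab key="someKey1" value="someValue1"/></row>", 23232)

  6. Use xml.attribute("key"), as it will return all the keys.

  7. If you need the value someValue and are not interested in someValue1, loop through this node sequence and apply a contains("key") filter to eliminate the other keys. I used the key Duration, which is present in my data.
  8. Apply the XPath-style selector \ "@value" to the result of the previous step to get the value.
  9. Similar question in Cloudera.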

    import scala.xml.XML

    // Case class describing the seven ^-delimited input fields
    case class stb(server_unique_id: Int, request_type: Int, event_id: Int, stb_timestamp: String, stb_xml: String, device_id: String, secondary_timestamp: String)

    // Read the data from the path supplied on the command line
    val data = spark.read.textFile(args(0)).rdd

    // Keep only lines that contain the ^ delimiter and split into exactly 7 fields
    val clean_Data = data.filter(line => line.trim.contains("^"))
      .map(line => line.split("\\^"))
      .filter(fields => fields.length == 7)

    // Match the schema, keeping records with event id 100 whose XML mentions Duration
    val tup_Map = clean_Data.map(f => stb(f(0).toInt, f(1).toInt, f(2).toInt, f(3), f(4), f(5), f(6)))
      .filter(rec => rec.event_id == 100 && rec.stb_xml.contains("Duration"))

    // The XML is in name-value format, so every element has the same attributes (n, v).
    // xmlnv holds the self-closing <nv> elements (8 entries such as duration and channel).
    val xml_Map = tup_Map.map { line =>
      val xmld = XML.loadString(line.stb_xml)
      val xmlnv = xmld \\ "nv"

      // For each wanted name, take the v attribute of the matching element
      var duration = 0
      for (i <- xmlnv.indices if xmlnv(i).attributes.toString.contains("Duration")) duration = (xmlnv(i) \\ "@v").text.toInt

      var channelNum = 0
      for (i <- xmlnv.indices if xmlnv(i).attributes.toString.contains("ChannelNumber")) channelNum = (xmlnv(i) \\ "@v").text.toInt

      var channelType = ""
      for (i <- xmlnv.indices if xmlnv(i).attributes.toString.contains("ChannelType")) channelType = (xmlnv(i) \\ "@v").text

      (duration, channelNum, channelType, line.device_id)
    }

    // Persist xml_Map for further operations
    xml_Map.persist()
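
The steps above can also be tried on the sample record from the question without a cluster. The following is a minimal sketch in plain Scala, assuming the scala-xml module is available (it is on the classpath in a Spark shell). The object name `EmbeddedXmlSketch`, the helper `keyValues`, and the default XML field index of 3 are illustrative choices that match the question's sample line and are not part of the original answer. Note that `String.split` takes a regular expression, so the caret has to be escaped as `"\\^"`; an unescaped `"^"` only matches the start of the input and does not split the record.

```scala
import scala.xml.XML

object EmbeddedXmlSketch {
  /** Splits one ^-delimited record and returns the key -> value pairs of its XML field. */
  def keyValues(line: String, xmlIndex: Int = 3): Map[String, String] = {
    // Escape the caret so that it is matched literally
    val fields = line.split("\\^")
    // The sample wraps the XML in backticks; strip them before parsing
    val xmlText = fields(xmlIndex).stripPrefix("`").stripSuffix("`")
    val row = XML.loadString(xmlText)
    // Every <ab> element carries its data in the key and value attributes
    (row \ "ab").map(ab => (ab \ "@key").text -> (ab \ "@value").text).toMap
  }

  def main(args: Array[String]): Unit = {
    val line = """1234^12^999^`<row><ab key="someKey" value="someValue"/><ab key="someKey1" value="someValue1"/></row>`^23232"""
    val kv = keyValues(line)
    println(kv)                // Map(someKey -> someValue, someKey1 -> someValue1)
    println(kv.get("someKey")) // Some(someValue)
  }
}
```

Inside a Spark job the same function could be applied per record, for example `data.map(line => EmbeddedXmlSketch.keyValues(line))`, after filtering out lines that do not have the expected number of fields.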