Nested XML file to pandas DataFrame

Date: 2018-03-22 21:54:27

Tags: python xml pandas dataframe

I am having trouble parsing an XML file for conversion into a pandas dataframe. Here is a sample entry:

<p>


 <persName id="t17200427-2-defend31" type="defendantName">
 Alice 
 Jones 
 <interp inst="t17200427-2-defend31" type="surname" value="Jones"/>
 <interp inst="t17200427-2-defend31" type="given" value="Alice"/>
 <interp inst="t17200427-2-defend31" type="gender" value="female"/>
 </persName> 

 , of <placeName id="t17200427-2-defloc7">St. Michael's Cornhill</placeName> 
 <interp inst="t17200427-2-defloc7" type="placeName" value="St. Michael's Cornhill"/>
 <interp inst="t17200427-2-defloc7" type="type" value="defendantHome"/>
 <join result="persNamePlace" targOrder="Y" targets="t17200427-2-defend31 t17200427-2-defloc7"/>, was indicted for <rs id="t17200427-2-off8" type="offenceDescription">
 <interp inst="t17200427-2-off8" type="offenceCategory" value="theft"/>
 <interp inst="t17200427-2-off8" type="offenceSubcategory" value="shoplifting"/>
 privately stealing a Bermundas Hat, value 10 s. out of the Shop of 

 <persName id="t17200427-2-victim33" type="victimName">
 Edward 
 Hillior 
 <interp inst="t17200427-2-victim33" type="surname" value="Hillior"/>
 <interp inst="t17200427-2-victim33" type="given" value="Edward"/>
 <interp inst="t17200427-2-victim33" type="gender" value="male"/>
 <join result="offenceVictim" targOrder="Y" targets="t17200427-2-off8 t17200427-2-victim33"/>
 </persName> 



 </rs> , on the <rs id="t17200427-2-cd9" type="crimeDate">21st of April</rs> 
 <join result="offenceCrimeDate" targOrder="Y" targets="t17200427-2-off8 t17200427-2-cd9"/> last. The Prosecutor's Servant deposed that the Prisner came into his Master's Shop and ask'd for a Hat of about 10 s. price; that he shewed several, and at last they agreed for one; but she said it was to go into the Country, and that she would stop into Bishopsgate-street. and if the Coach was not gone she would come and fetch it; that she went out of the Shop but he perceiving she could hardly walk fetcht her back again, and the Hat mentioned in the Indictment fell from between her Legs. Another deposed that he saw the former Evidence take the Hat from under her Petticoats. The Prisoner denyed the Fact, and called two Persons to her Reputation, who gave her a good Character, and said that she rented a House of 10 l. a Year in Petty France, at Westminster, but she had told the Justice that she liv'd in King-Street. The Jury considering the whole matter, found her <rs id="t17200427-2-verdict10" type="verdictDescription">
 <interp inst="t17200427-2-verdict10" type="verdictCategory" value="guilty"/>
 <interp inst="t17200427-2-verdict10" type="verdictSubcategory" value="theftunder1s"/>
 Guilty to the value of 10 d.
 </rs> 
 <rs id="t17200427-2-punish11" type="punishmentDescription">
 <interp inst="t17200427-2-punish11" type="punishmentCategory" value="transport"/>
 <join result="defendantPunishment" targOrder="Y" targets="t17200427-2-defend31 t17200427-2-punish11"/>
 Transportation
 </rs> .</p>

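I would like a dataframe with columns for gender, offence, and trial text. I have previously extracted all of the data into a dataframe, but I cannot get the text between the <p> tags.

Here is sample code: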

import xml.etree.ElementTree as ET
import pandas as pd

def table_of_cases(xml_file_name):
    tree = ET.ElementTree(file=xml_file_name)
    i = 1
    num = str(i)   # suffix used for repeated column names
    labels = []    # column names seen so far
    table = pd.DataFrame()
    # iter() replaces getiterator(), which was removed in Python 3.9
    for element in tree.iter():
        if element.tag == "persName":
            t = element.attrib['type']
            try:
                val = [element.attrib['value']]
                if t not in labels:
                    table[t] = val
                elif t + num not in labels:
                    table[t + num] = val
                else:
                    num = str(i + 1)
                    table[t + num] = val
            except KeyError:
                # element has no 'value' attribute; skip it
                pass
            labels = list(table.columns.values)
            num = str(i)

    return table

I have 1,000+ files in this same XML format that need to be combined into a single dataframe.

1 answer:

Answer 0: (score: 2)

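Because the XML is fairly complex, with text values spanning several nodes, consider XSLT, the special-purpose language designed to transform XML files, especially complex ones, into simpler ones.

Python's third-party module lxml can run XSLT 1.0 as well as XPath 1.0, so the transformed result can be parsed and migrated into a pandas dataframe. Alternatively, you can use external XSLT processors, which Python can call with subprocess.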
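As a rough sketch of the external-processor route, the snippet below shells out to xsltproc, the command-line tool that ships with libxslt. It assumes xsltproc is installed and on the PATH; the function names and the file names in the usage comment are hypothetical placeholders.

```python
import subprocess


def build_xsltproc_command(xsl_path, xml_path, out_path):
    """Build the argument list for: xsltproc -o OUT STYLESHEET SOURCE."""
    return ["xsltproc", "-o", out_path, xsl_path, xml_path]


def run_xsltproc(xsl_path, xml_path, out_path):
    """Run the transformation; raises CalledProcessError on a non-zero exit."""
    cmd = build_xsltproc_command(xsl_path, xml_path, out_path)
    subprocess.run(cmd, check=True)
    return out_path


# Usage (assumed file names):
# run_xsltproc("XSLT_Script.xsl", "Source.xml", "Output.xml")
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.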

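Specifically, the XSLT below extracts the necessary attributes from the defendant and the victim, plus the text value of the whole paragraph, using XPath's descendant::* axis from the root, assuming <p> is a child of the root.

XSLT (save as an .xsl file, which is a special .xml file)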

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/*">
    <xsl:apply-templates select="p"/>
  </xsl:template>

  <xsl:template match="p">
    <data>
      <defendantName><xsl:value-of select="normalize-space(descendant::persName[@type='defendantName'])"/></defendantName>
      <defendantGender><xsl:value-of select="descendant::persName[@type='defendantName']/interp[@type='gender']/@value"/></defendantGender>
      <offenceCategory><xsl:value-of select="descendant::interp[@type='offenceCategory']/@value"/></offenceCategory>
      <offenceSubCategory><xsl:value-of select="descendant::interp[@type='offenceSubcategory']/@value"/></offenceSubCategory>

      <victimName><xsl:value-of select="normalize-space(descendant::persName[@type='victimName'])"/></victimName>
      <victimGender><xsl:value-of select="descendant::persName[@type='victimName']/interp[@type='gender']/@value"/></victimGender>
      <verdictCategory><xsl:value-of select="descendant::interp[@type='verdictCategory']/@value"/></verdictCategory>
      <verdictSubCategory><xsl:value-of select="descendant::interp[@type='verdictSubcategory']/@value"/></verdictSubCategory>
      <punishmentCategory><xsl:value-of select="descendant::interp[@type='punishmentCategory']/@value"/></punishmentCategory>

      <trialText><xsl:value-of select="normalize-space(.)"/></trialText>
    </data>
  </xsl:template>       

</xsl:stylesheet>
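
To see what descendant:: expressions of this kind select before running a whole stylesheet, XPath 1.0 paths can be evaluated directly with lxml. This is only a sketch: the trimmed-down XML string, the shortened ids, and the <trial> wrapper element are assumptions standing in for a real file.

```python
import lxml.etree as et

# Trimmed-down stand-in for one trial; the <trial> root is an assumed wrapper.
sample = """<trial><p>
  <persName id="d1" type="defendantName"> Alice Jones
    <interp inst="d1" type="gender" value="female"/>
  </persName>
  , was indicted for <rs id="o1" type="offenceDescription">
    <interp inst="o1" type="offenceCategory" value="theft"/>
    privately stealing a Hat</rs> .</p></trial>"""

root = et.fromstring(sample)
p = root.find("p")

# string(...) returns the string value of the first matching node.
gender = p.xpath("string(descendant::persName[@type='defendantName']"
                 "/interp[@type='gender']/@value)")
offence = p.xpath("string(descendant::interp[@type='offenceCategory']/@value)")

# normalize-space(.) joins all descendant text of <p> and collapses whitespace.
trial_text = p.xpath("normalize-space(.)")

print(gender, offence)
print(trial_text)
```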

Python

import lxml.etree as et
import pandas as pd

# LOAD XML AND XSL
doc = et.parse("Source.xml")
xsl = et.parse("XSLT_Script.xsl")

# RUN TRANSFORMATION
transformer = et.XSLT(xsl)
result = transformer(doc)

# OUTPUT TO CONSOLE
print(result)

data = []
for i in result.xpath('/*'):
    inner = {}
    for j in i.xpath('*'):
        inner[j.tag] = j.text

    data.append(inner)

trial_df = pd.DataFrame(data)

print(trial_df)

For the 1,000 similar XML files, run this process in a loop and append each one-row trial_df dataframe to a list, then stack them with pd.concat.

XML output
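
A minimal sketch of that loop, assuming all of the XML files sit in one directory and share the stylesheet; the function names and the directory layout are hypothetical.

```python
import glob
import os

import lxml.etree as et
import pandas as pd


def xml_to_frame(xml_path, transformer):
    """Transform one XML file and return its <data> children as a one-row frame."""
    result = transformer(et.parse(xml_path))
    rows = [{child.tag: child.text for child in node} for node in result.xpath("/*")]
    return pd.DataFrame(rows)


def build_trials_frame(xml_dir, xsl_path):
    """Apply the stylesheet to every .xml file in xml_dir and stack the results."""
    transformer = et.XSLT(et.parse(xsl_path))  # compile the stylesheet once
    frames = [
        xml_to_frame(path, transformer)
        for path in sorted(glob.glob(os.path.join(xml_dir, "*.xml")))
    ]
    # pd.concat raises ValueError if no files were found (empty list).
    return pd.concat(frames, ignore_index=True)


# Usage (assumed paths):
# all_trials = build_trials_frame("trials", "XSLT_Script.xsl")
```

Collecting the frames in a list and calling pd.concat once is generally cheaper than growing a dataframe inside the loop.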

<?xml version="1.0"?>
<data>
  <defendantName>Alice Jones</defendantName>
  <defendantGender>female</defendantGender>
  <offenceCategory>theft</offenceCategory>
  <offenceSubCategory>shoplifting</offenceSubCategory>
  <victimName>Edward Hillior</victimName>
  <victimGender>male</victimGender>
  <verdictCategory>guilty</verdictCategory>
  <verdictSubCategory>theftunder1s</verdictSubCategory>
  <punishmentCategory>transport</punishmentCategory>
  <trialText>Alice Jones , of St. Michael's Cornhill, was indicted for privately stealing a Bermundas Hat, value 10 s. out of the Shop of Edward Hillior , on the 21st of April last. The Prosecutor's Servant deposed that the Prisner came into his Master's Shop and ask'd for a Hat of about 10 s. price; that he shewed several, and at last they agreed for one; but she said it was to go into the Country, and that she would stop into Bishopsgate-street. and if the Coach was not gone she would come and fetch it; that she went out of the Shop but he perceiving she could hardly walk fetcht her back again, and the Hat mentioned in the Indictment fell from between her Legs. Another deposed that he saw the former Evidence take the Hat from under her Petticoats. The Prisoner denyed the Fact, and called two Persons to her Reputation, who gave her a good Character, and said that she rented a House of 10 l. a Year in Petty France, at Westminster, but she had told the Justice that she liv'd in King-Street. The Jury considering the whole matter, found her Guilty to the value of 10 d. Transportation .</trialText>
</data>

Dataframe output

#   defendantGender defendantName offenceCategory offenceSubCategory  \
# 0          female   Alice Jones           theft        shoplifting   

#   punishmentCategory                                          trialText  \
# 0          transport  Alice Jones , of St. Michael's Cornhill, was i...   

#   verdictCategory verdictSubCategory victimGender      victimName  
# 0          guilty       theftunder1s         male  Edward Hillior
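
The question asked only for gender, offence, and trial text. Assuming the column names shown in the output above, those columns can be selected from trial_df by name; the one-row frame here is a hypothetical stand-in built from the values in that output, with the trial text shortened.

```python
import pandas as pd

# Stand-in for trial_df, using values from the output above (text shortened).
trial_df = pd.DataFrame([{
    "defendantGender": "female",
    "offenceCategory": "theft",
    "offenceSubCategory": "shoplifting",
    "trialText": "Alice Jones , of St. Michael's Cornhill, was indicted for ...",
}])

# Keep only the requested columns, in the requested order.
subset = trial_df[["defendantGender", "offenceCategory", "trialText"]]
print(subset.columns.tolist())
```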