I am having trouble parsing XML files to convert them into a pandas dataframe. Here is a sample entry:
<p>
<persName id="t17200427-2-defend31" type="defendantName">
Alice
Jones
<interp inst="t17200427-2-defend31" type="surname" value="Jones"/>
<interp inst="t17200427-2-defend31" type="given" value="Alice"/>
<interp inst="t17200427-2-defend31" type="gender" value="female"/>
</persName>
, of <placeName id="t17200427-2-defloc7">St. Michael's Cornhill</placeName>
<interp inst="t17200427-2-defloc7" type="placeName" value="St. Michael's Cornhill"/>
<interp inst="t17200427-2-defloc7" type="type" value="defendantHome"/>
<join result="persNamePlace" targOrder="Y" targets="t17200427-2-defend31 t17200427-2-defloc7"/>, was indicted for <rs id="t17200427-2-off8" type="offenceDescription">
<interp inst="t17200427-2-off8" type="offenceCategory" value="theft"/>
<interp inst="t17200427-2-off8" type="offenceSubcategory" value="shoplifting"/>
privately stealing a Bermundas Hat, value 10 s. out of the Shop of
<persName id="t17200427-2-victim33" type="victimName">
Edward
Hillior
<interp inst="t17200427-2-victim33" type="surname" value="Hillior"/>
<interp inst="t17200427-2-victim33" type="given" value="Edward"/>
<interp inst="t17200427-2-victim33" type="gender" value="male"/>
<join result="offenceVictim" targOrder="Y" targets="t17200427-2-off8 t17200427-2-victim33"/>
</persName>
</rs> , on the <rs id="t17200427-2-cd9" type="crimeDate">21st of April</rs>
<join result="offenceCrimeDate" targOrder="Y" targets="t17200427-2-off8 t17200427-2-cd9"/> last. The Prosecutor's Servant deposed that the Prisner came into his Master's Shop and ask'd for a Hat of about 10 s. price; that he shewed several, and at last they agreed for one; but she said it was to go into the Country, and that she would stop into Bishopsgate-street. and if the Coach was not gone she would come and fetch it; that she went out of the Shop but he perceiving she could hardly walk fetcht her back again, and the Hat mentioned in the Indictment fell from between her Legs. Another deposed that he saw the former Evidence take the Hat from under her Petticoats. The Prisoner denyed the Fact, and called two Persons to her Reputation, who gave her a good Character, and said that she rented a House of 10 l. a Year in Petty France, at Westminster, but she had told the Justice that she liv'd in King-Street. The Jury considering the whole matter, found her <rs id="t17200427-2-verdict10" type="verdictDescription">
<interp inst="t17200427-2-verdict10" type="verdictCategory" value="guilty"/>
<interp inst="t17200427-2-verdict10" type="verdictSubcategory" value="theftunder1s"/>
Guilty to the value of 10 d.
</rs>
<rs id="t17200427-2-punish11" type="punishmentDescription">
<interp inst="t17200427-2-punish11" type="punishmentCategory" value="transport"/>
<join result="defendantPunishment" targOrder="Y" targets="t17200427-2-defend31 t17200427-2-punish11"/>
Transportation
</rs> .</p>
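I want a dataframe with columns for gender, offence, and trial text. I have previously extracted all the data into a dataframe, but I cannot get the text between the <p> tags.
Here is some sample code: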
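import xml.etree.ElementTree as ET
import pandas as pd

def table_of_cases(xml_file_name):
    file = ET.ElementTree(file=xml_file_name)
    # iter() replaces getiterator(), which was removed in Python 3.9
    iterate = file.iter()
    i = 1
    table = pd.DataFrame()
    for element in iterate:
        if element.tag == "persName":
            t = element.attrib['type']
            try:
                # a missing 'value' attribute raises KeyError, silently skipped below
                val = [element.attrib['value']]
                if t not in labels:
                    table[t] = val
                elif t+num not in labels:
                    table[t+num] = val
                elif t+num in labels:
                    num = str(i+1)
                    table[t+num] = val
            except Exception:
                pass
        # column names seen so far, used to avoid overwriting repeated types
        labels = list(table.columns.values)
        num = str(i)
    return table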
I have roughly 1,000+ files in this same XML format to combine into a single dataframe.
Answer 0 (score: 2)
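Because the XML is fairly complex, with text values spanning across nodes, consider XSLT, the special-purpose language designed to transform XML files, especially complex ones, into simpler ones.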
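Python's third-party module lxml can run XSLT 1.0 scripts and XPath 1.0 expressions, so you can transform each file and then parse the result for migration into a pandas dataframe. Alternatively, you can use an external XSLT processor that Python calls with subprocess.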
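As a rough sketch of the external-processor route, the snippet below shells out to xsltproc, the command-line tool that ships with libxslt. It assumes xsltproc is installed and on PATH; the helper names and the Output.xml file name are made up for illustration.

```python
import shutil
import subprocess

def xsltproc_command(xsl_path, xml_path, out_path):
    """Build the argument list: xsltproc -o OUTPUT STYLESHEET SOURCE."""
    return ["xsltproc", "-o", out_path, xsl_path, xml_path]

def run_xsltproc(xsl_path, xml_path, out_path):
    """Run the transformation; raises CalledProcessError on a non-zero exit."""
    if shutil.which("xsltproc") is None:
        raise FileNotFoundError("xsltproc is not installed or not on PATH")
    subprocess.run(xsltproc_command(xsl_path, xml_path, out_path), check=True)
    return out_path

# Show the command that would be executed (nothing is run here)
cmd = xsltproc_command("XSLT_Script.xsl", "Source.xml", "Output.xml")
print(" ".join(cmd))
```

The transformed file can then be read back with lxml or the standard library and loaded into pandas in the same way as an in-process result.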
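Specifically, the XSLT below extracts the necessary attributes from both the defendant and the victim, along with the text value of the entire paragraph, by using XPath's descendant::* axis from the root, assuming <p> is a child of the root element.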
XSLT (save as an .xsl file, which is itself a special kind of .xml file)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/*">
    <xsl:apply-templates select="p"/>
  </xsl:template>

  <xsl:template match="p">
    <data>
      <defendantName><xsl:value-of select="normalize-space(descendant::persName[@type='defendantName'])"/></defendantName>
      <defendantGender><xsl:value-of select="descendant::persName[@type='defendantName']/interp[@type='gender']/@value"/></defendantGender>
      <offenceCategory><xsl:value-of select="descendant::interp[@type='offenceCategory']/@value"/></offenceCategory>
      <offenceSubCategory><xsl:value-of select="descendant::interp[@type='offenceSubcategory']/@value"/></offenceSubCategory>
      <victimName><xsl:value-of select="normalize-space(descendant::persName[@type='victimName'])"/></victimName>
      <victimGender><xsl:value-of select="descendant::persName[@type='victimName']/interp[@type='gender']/@value"/></victimGender>
      <verdictCategory><xsl:value-of select="descendant::interp[@type='verdictCategory']/@value"/></verdictCategory>
      <verdictSubCategory><xsl:value-of select="descendant::interp[@type='verdictSubcategory']/@value"/></verdictSubCategory>
      <punishmentCategory><xsl:value-of select="descendant::interp[@type='punishmentCategory']/@value"/></punishmentCategory>
      <trialText><xsl:value-of select="normalize-space(.)"/></trialText>
    </data>
  </xsl:template>
</xsl:stylesheet>
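To get a feel for what these descendant lookups select, here is a minimal standard-library approximation. It is not the XSLT approach itself. It uses xml.etree.ElementTree on a trimmed-down sample, where the <trial> wrapper element and the shortened ids are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample wrapped in a made-up <trial> root element
SAMPLE = """<trial><p>
<persName id="d1" type="defendantName">
Alice
Jones
<interp inst="d1" type="gender" value="female"/>
</persName>
, was indicted for <rs id="o1" type="offenceDescription">
<interp inst="o1" type="offenceCategory" value="theft"/>
privately stealing a Hat
</rs> .</p></trial>"""

root = ET.fromstring(SAMPLE)
p = root.find("p")

# ".//" is ElementTree's shorthand for searching all descendants
defendant = p.find(".//persName[@type='defendantName']")
gender = defendant.find("interp[@type='gender']").get("value")
offence = p.find(".//interp[@type='offenceCategory']").get("value")

# itertext() walks every descendant text node, including text that sits
# between child tags; split/join collapses whitespace like normalize-space()
trial_text = " ".join("".join(p.itertext()).split())

print(gender, offence)
print(trial_text)
```

The itertext() call is the piece that recovers text scattered between nested tags, which is what the trialText element captures in the stylesheet.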
Python
import lxml.etree as et
import pandas as pd

# LOAD XML AND XSL
doc = et.parse("Source.xml")
xsl = et.parse("XSLT_Script.xsl")

# RUN TRANSFORMATION
transformer = et.XSLT(xsl)
result = transformer(doc)

# OUTPUT TO CONSOLE
print(result)

data = []
for i in result.xpath('/*'):
    inner = {}
    for j in i.xpath('*'):
        inner[j.tag] = j.text
    data.append(inner)

trial_df = pd.DataFrame(data)
print(trial_df)
For the 1,000 similar XML files, run this process in a loop and append each one-row trial_df dataframe to a list, then stack the list with pd.concat.
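A minimal sketch of that loop follows. It assumes all source files sit in one directory with an .xml extension and that the stylesheet is stored under an .xsl extension so the glob does not pick it up. The function names are hypothetical.

```python
import glob
import os

import lxml.etree as et
import pandas as pd

def transform_to_frame(xml_path, transformer):
    """Apply the compiled XSLT to one file and return its rows as a dataframe."""
    result = transformer(et.parse(xml_path))
    if result.getroot() is None:  # the stylesheet produced no output
        return pd.DataFrame()
    rows = [{child.tag: child.text for child in node}
            for node in result.xpath('/*')]
    return pd.DataFrame(rows)

def build_trials_frame(xml_dir, xsl_path):
    """Transform every .xml file in xml_dir and stack the results."""
    transformer = et.XSLT(et.parse(xsl_path))  # compile once, reuse per file
    frames = []
    for xml_path in sorted(glob.glob(os.path.join(xml_dir, "*.xml"))):
        frames.append(transform_to_frame(xml_path, transformer))
    if not frames:  # pd.concat raises ValueError on an empty list
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Calling build_trials_frame("trials", "XSLT_Script.xsl"), where "trials" is a placeholder directory name, would return one row per transformed file. Collecting the frames in a list and concatenating once avoids repeatedly copying a growing dataframe.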
XML Output
<?xml version="1.0"?>
<data>
<defendantName>Alice Jones</defendantName>
<defendantGender>female</defendantGender>
<offenceCategory>theft</offenceCategory>
<offenceSubCategory>shoplifting</offenceSubCategory>
<victimName>Edward Hillior</victimName>
<victimGender>male</victimGender>
<verdictCategory>guilty</verdictCategory>
<verdictSubCategory>theftunder1s</verdictSubCategory>
<punishmentCategory>transport</punishmentCategory>
<trialText>Alice Jones , of St. Michael's Cornhill, was indicted for privately stealing a Bermundas Hat, value 10 s. out of the Shop of Edward Hillior , on the 21st of April last. The Prosecutor's Servant deposed that the Prisner came into his Master's Shop and ask'd for a Hat of about 10 s. price; that he shewed several, and at last they agreed for one; but she said it was to go into the Country, and that she would stop into Bishopsgate-street. and if the Coach was not gone she would come and fetch it; that she went out of the Shop but he perceiving she could hardly walk fetcht her back again, and the Hat mentioned in the Indictment fell from between her Legs. Another deposed that he saw the former Evidence take the Hat from under her Petticoats. The Prisoner denyed the Fact, and called two Persons to her Reputation, who gave her a good Character, and said that she rented a House of 10 l. a Year in Petty France, at Westminster, but she had told the Justice that she liv'd in King-Street. The Jury considering the whole matter, found her Guilty to the value of 10 d. Transportation .</trialText>
</data>
Dataframe Output
#   defendantGender defendantName offenceCategory offenceSubCategory  \
# 0          female   Alice Jones           theft        shoplifting
#   punishmentCategory                                          trialText  \
# 0          transport  Alice Jones , of St. Michael's Cornhill, was i...
#   verdictCategory verdictSubCategory victimGender      victimName
# 0          guilty       theftunder1s         male  Edward Hillior