Parsing multiple XML files into one CSV

Time: 2019-07-26 23:07:07

Tags: python r xml xml-parsing

I have multiple XML files in one folder and need to retrieve the information held in a few tags. Specifically, I want the contents of AbstractText Label="FINDINGS" and AbstractText Label="IMPRESSION", plus the parentImage id values, from every XML file (a sample file is shown below), and I want to store all of it in a single CSV that I can open in Excel.

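Edit: I would like to know how to get these details for all the .xml files in a single folder and write them to a single CSV, with the tag names as columns and the corresponding values from each file as rows.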

<?xml version="1.0" encoding="utf-8"?>
<eCitation>
    <meta type="rr"/>
    <uId id="CXR49"/>
    <pmcId id="49"/>
    <docSource>CXR</docSource>
    <IUXRId id="49"/>
    <licenseType>open-access</licenseType>
    <licenseURL>http://creativecommons.org/licenses/by-nc-nd/4.0/</licenseURL>
    <ccLicense>byncnd</ccLicense>
    <articleURL/>
    <articleDate>2013-08-01</articleDate>
    <articleType>XR</articleType>
    <publisher>Indiana University</publisher>
    <title>Indiana University Chest X-ray Collection</title>
    <note>The data are drawn from multiple hospital systems.</note>
    <specialty>pulmonary diseases</specialty>
    <subset>CXR</subset>
    <MedlineCitation Owner="Indiana University" Status="supplied by publisher">
        <Article PubModel="Electronic">
            <Journal>
                <JournalIssue>
                    <PubDate>
                        <Year>2013</Year>
                        <Month>08</Month>
                        <Day>01</Day>
                    </PubDate>
                </JournalIssue>
            </Journal>
            <ArticleTitle>Indiana University Chest X-ray Collection
</ArticleTitle>
            <Abstract>
                <AbstractText Label="COMPARISON">None.
</AbstractText>
                <AbstractText Label="INDICATION">XXXX-year-old with
osteoarthritis of the hip scheduled for total hip replacement.
Preoperative evaluation.
</AbstractText>
                <AbstractText Label="FINDINGS">The heart, pulmonary XXXX and
mediastinum are within normal limits. There is no pleural
effusion or pneumothorax. There is no focal air space opacity to
suggest a pneumonia. There are degenerative changes of the
thoracic spine. There is a calcified granuloma identified in the
right suprahilar region. The aorta is mildly tortuous and
ectatic. There is asymmetric right apical smooth pleural
thickening. There are severe degenerative changes of the XXXX.
</AbstractText>
                <AbstractText Label="IMPRESSION">No acute cardiopulmonary
disease.
</AbstractText>
            </Abstract>
            <Affiliation>Indiana University</Affiliation>
            <AuthorList CompleteYN="Y">
                <Author ValidYN="Y">
                    <LastName>Kohli</LastName>
                    <ForeName>Marc</ForeName>
                    <Initials>MD</Initials>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Rosenman</LastName>
                    <ForeName>Marc</ForeName>
                    <Initials>M</Initials>
                </Author>
            </AuthorList>
            <Language>eng</Language>
            <PublicationTypeList>
                <PublicationType>Radiology Report</PublicationType>
            </PublicationTypeList>
            <ArticleDate>
                <Year>2013</Year>
                <Month>08</Month>
                <Day>01</Day>
            </ArticleDate>
        </Article>
        <EssieArticleTitle>Indiana University Chest X-ray                 Collection</EssieArticleTitle>
        <IMedAuthor>Marc David Kohli MD</IMedAuthor>
        <IMedAuthor>Marc Rosenman M</IMedAuthor>
    </MedlineCitation>
    <MeSH>
        <major>Thoracic Vertebrae/degenerative</major>
        <major>Calcified Granuloma/lung/hilum/right</major>
        <major>Aorta/tortuous/mild</major>
        <major>Thickening/pleura/apex/right</major>
        <automatic>calcified granuloma</automatic>
        <automatic>degenerative change</automatic>
        <automatic>pleural thickening</automatic>
    </MeSH>
    <parentImage id="CXR49_IM-2110-1001">
        <figureId>F1</figureId>
        <caption>PA and lateral chest radiographs dated XXXX at XXXX hours.
</caption>
        <panel type="single">
            <url>/hadoop/storage/radiology/extract/CXR49_IM-2110-1001.jpg</url>
            <imgModality>7</imgModality>
            <region type="panel">
                <globalImageFeatures>
                    <CEDD>f2p0k1205</CEDD>
                    <ColorLayout>f1p0k137</ColorLayout>
                    <EdgeHistogram>f0p0k184</EdgeHistogram>
                    <FCTH>f4p0k2450</FCTH>
                    <SemanticContext60>f3p0k74</SemanticContext60>
                </globalImageFeatures>
            </region>
        </panel>
    </parentImage>
    <parentImage id="CXR49_IM-2110-2001">
        <figureId>F2</figureId>
        <caption>PA and lateral chest radiographs dated XXXX at XXXX hours.            </caption>
        <panel type="single">
            <url>/hadoop/storage/radiology/extract/CXR49_IM-2110-2001.jpg</url>
            <imgModality>7</imgModality>
            <region type="panel">
                <globalImageFeatures>
                    <CEDD>f2p0k710</CEDD>
                    <ColorLayout>f1p0k83</ColorLayout>
                    <EdgeHistogram>f0p0k1200</EdgeHistogram>
                    <FCTH>f4p0k369</FCTH>
                    <SemanticContext60>f3p0k18</SemanticContext60>
                </globalImageFeatures>
            </region>
        </panel>
    </parentImage>
</eCitation>

1 answer:

Answer 0 (score: 0)

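The following script reads each XML file, collects the required values in a list, and writes one row per file to a CSV. You can modify the code to suit your needs.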

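import glob
import os

import pandas as pd
from lxml import etree

# Placeholder folder: point this at the directory holding your .xml files
xmlfiles = glob.glob(os.path.join("path/to/xml_folder", "*.xml"))

def text_of(node):
    # Return the tag's text on one line, or "" when the tag is missing or empty
    return " ".join((node.text or "").split()) if node is not None else ""

rows = []
for f in xmlfiles:
    tree = etree.parse(f)
    findings = tree.find(".//AbstractText[@Label='FINDINGS']")
    impression = tree.find(".//AbstractText[@Label='IMPRESSION']")
    parentimages = tree.findall(".//parentImage")
    # Files can hold a different number of images, so join the ids into one cell
    image_ids = ";".join(i.attrib["id"] for i in parentimages)
    rows.append([text_of(findings), text_of(impression), image_ids])

df = pd.DataFrame(rows, columns=["findings", "impression", "parent_image_ids"])
df.to_csv("everything.csv", header=True, index=False)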
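
If lxml and pandas are not available, roughly the same extraction can be sketched with only the Python standard library. This is a minimal sketch, not a definitive implementation: the function names extract_record and folder_to_csv, the column names, and the choice to join several parentImage ids with ";" are assumptions made for illustration and do not come from the question.

```python
import csv
import glob
import os
import xml.etree.ElementTree as ET


def extract_record(path):
    """Return the FINDINGS text, IMPRESSION text and parentImage ids of one file."""
    root = ET.parse(path).getroot()

    def label_text(label):
        node = root.find(f".//AbstractText[@Label='{label}']")
        # Collapse line breaks inside the report text; a missing tag gives "".
        return " ".join((node.text or "").split()) if node is not None else ""

    image_ids = [img.get("id", "") for img in root.iter("parentImage")]
    return {
        "file": os.path.basename(path),
        "findings": label_text("FINDINGS"),
        "impression": label_text("IMPRESSION"),
        # Assumption: several image ids share one cell, separated by ";".
        "image_ids": ";".join(image_ids),
    }


def folder_to_csv(folder, out_path):
    """Write one CSV row per .xml file found directly inside `folder`."""
    fields = ["file", "findings", "impression", "image_ids"]
    paths = sorted(glob.glob(os.path.join(folder, "*.xml")))
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        for path in paths:
            writer.writerow(extract_record(path))
    return len(paths)
```

Calling folder_to_csv with the folder path and an output file name writes the CSV and returns the number of files processed. Keeping the file name in its own column makes it possible to trace each row back to its source report.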