Using Python with bs4 (lxml) to edit text inside XML tags

Asked: 2019-07-10 10:53:57

Tags: python xml beautifulsoup lxml

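I am new to Python, BS4 and the lxml parser.

I am trying to remove the last three characters from the XML Postcode tags in order to anonymise the data.

The current code runs without any errors, but the last three characters are not removed in the output XML file.

XML mock data: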

<?xml version="1.0" encoding="UTF-8"?>
<!-- Please note that this file is properly formed, and serves as an example of a file that will load into the ILR DC system.  The data is anonymised and does not refer to a real-world provider, learning delivery or learner.  Based on the ILR specification, version 2, dated April 2018-->
<Message xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="ESFA/ILR/2018-19" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ESFA/ILR/2018-19">
    <Header>
        <CollectionDetails>
            <Collection>ILR</Collection>
            <Year>1819</Year>
            <FilePreparationDate>2018-01-07</FilePreparationDate>
        </CollectionDetails>
        <Source>
            <ProtectiveMarking>OFFICIAL-SENSITIVE-Personal</ProtectiveMarking>
            <UKPRN>99999999</UKPRN>
            <SoftwareSupplier>SupplierName</SoftwareSupplier>
            <SoftwarePackage>SystemName</SoftwarePackage>
            <Release>1</Release>
            <SerialNo>01</SerialNo>
            <DateTime>2018-06-26T11:14:05</DateTime>
            <!-- This and the next element only appear in files generated by FIS -->
            <ReferenceData>Version5.0, LARS 2017-08-01</ReferenceData>
            <ComponentSetVersion>1</ComponentSetVersion>
        </Source>
    </Header>
    <SourceFiles>
        <!-- The SourceFiles group only appears in files generated by FIS -->
        <SourceFile>
            <SourceFileName>ILR-LLLLLLLL1819-20180626-144401-01.xml</SourceFileName>
            <FilePreparationDate>2018-06-26</FilePreparationDate>
            <SoftwareSupplier>Software Systems Inc.</SoftwareSupplier>
            <SoftwarePackage>GreatStuffMIS</SoftwarePackage>
            <Release>1</Release>
            <SerialNo>01</SerialNo>
            <DateTime>2018-06-26T11:14:05</DateTime>
        </SourceFile>
    </SourceFiles>
    <LearningProvider>
        <UKPRN>99999999</UKPRN>
    </LearningProvider>
    <!-- 16 yr old learner undertaking full time 16-19 (excluding apprenticeships) funded programme -->
    <Learner>
        <LearnRefNumber>16Learner</LearnRefNumber>
        <PMUKPRN>87654321</PMUKPRN>
        <CampId>1234ABCD</CampId>
        <ULN>1061484016</ULN>
        <FamilyName>Smith</FamilyName>
        <GivenNames>Jane</GivenNames>
        <DateOfBirth>1999-02-27</DateOfBirth>
        <Ethnicity>31</Ethnicity>
        <Sex>F</Sex>
        <LLDDHealthProb>2</LLDDHealthProb>
        <Accom>5</Accom>
        <PlanLearnHours>440</PlanLearnHours>
        <PlanEEPHours>100</PlanEEPHours>
        <MathGrade>NONE</MathGrade>
        <EngGrade>D</EngGrade>
        <PostcodePrior>BR1 7SS</PostcodePrior>
        <Postcode>BR1 7SS</Postcode>
        <AddLine1>The Street</AddLine1>
        <AddLine2>ToyTown</AddLine2>
        <LearnerFAM>
            <LearnFAMType>LSR</LearnFAMType>
            <LearnFAMCode>55</LearnFAMCode>
        </LearnerFAM>
        <LearnerFAM>
            <LearnFAMType>EDF</LearnFAMType>
            <LearnFAMCode>2</LearnFAMCode>
        </LearnerFAM>
        <LearnerFAM>
            <LearnFAMType>MCF</LearnFAMType>
            <LearnFAMCode>3</LearnFAMCode>
        </LearnerFAM>
        <LearnerFAM>
            <LearnFAMType>FME</LearnFAMType>
            <LearnFAMCode>2</LearnFAMCode>
        </LearnerFAM>
        <LearnerFAM>
            <LearnFAMType>PPE</LearnFAMType>
            <LearnFAMCode>2</LearnFAMCode>
        </LearnerFAM>

Current code:

#Importing BS4# 
from bs4 import BeautifulSoup

#Opening Original XML File, Setting soup to BS# 
with open("ILR_mock_data.xml", "r") as infile:
    xml_text = infile.read()

soup = BeautifulSoup(xml_text, 'xml')




#Postcode (Deleting last 3 digits)#
for postcode_tag in soup.find_all("Postcode"):
    postcode_tag.string[:-3]


with open("SEND_ME_TO_RCU.xml", "w") as outfile:
    outfile.write(soup.prettify())

I would like XML that contains

<Postcode>BR1 7SS</Postcode>

to end up with the new postcode

<Postcode>BR1</Postcode>

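2 answers:

Answer 0 (score: 0)

I solved the problem by assigning the sliced string back to the tag (slicing on its own returns a new string and leaves the tag unchanged):
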
for pripostcode_tag in soup.find_all("PostcodePrior"):   
    pripostcode_tag.string = pripostcode_tag.string[:-3]
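
A minimal sketch of the same fix applied to both postcode tags in one pass. The inline XML is a cut-down stand-in for the question's file, and `outward_code` is a hypothetical helper name introduced here. The `.strip()` is an addition to the answer above: `"BR1 7SS"[:-3]` yields `"BR1 "` with a trailing space, whereas the desired output is `BR1`. It assumes bs4 and lxml are installed, as in the question.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for ILR_mock_data.xml (assumption: same tag names and namespace)
xml_text = """<Message xmlns="ESFA/ILR/2018-19">
    <Learner>
        <PostcodePrior>BR1 7SS</PostcodePrior>
        <Postcode>BR1 7SS</Postcode>
    </Learner>
</Message>"""

soup = BeautifulSoup(xml_text, "xml")


def outward_code(postcode):
    """Drop the last three characters and any whitespace left behind."""
    return postcode[:-3].strip()


# find_all accepts a list of tag names, so both tags are handled in one loop
for tag in soup.find_all(["Postcode", "PostcodePrior"]):
    # .string is None for an empty tag, so skip those
    if tag.string is not None:
        # Assign back: slicing builds a new string and does not change the tag
        tag.string = outward_code(tag.string)

print(soup.find("Postcode"))
print(soup.find("PostcodePrior"))
```

Fixed-length slicing relies on every postcode ending in a three-character inward code; values in another format would need their own handling.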

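Answer 1 (score: 0)

The code below uses a simplified version of the XML (but should also work with the OP's XML, provided the tag name passed to findall is qualified with the document's default namespace). It does not use any external libraries.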

import xml.etree.ElementTree as ET

xml_sample = '''<r><Postcode>ACBDEF</Postcode></r>'''
root = ET.fromstring(xml_sample)
post_codes = root.findall('.//Postcode')
for pc in post_codes:
  pc.text = pc.text[:-3]
ET.dump(root)
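
The question's file declares a default namespace (`xmlns="ESFA/ILR/2018-19"`), so with ElementTree an unqualified `findall('.//Postcode')` would match nothing there. The following is a sketch of one way to adapt the standard-library approach, using a cut-down stand-in for the OP's XML. The name `ILR_NS` and the `.strip()` call (to drop the space that `[:-3]` leaves behind in `"BR1 "`) are additions for illustration, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the <Message> element in the question
ILR_NS = "ESFA/ILR/2018-19"

# Cut-down stand-in for the OP's file (assumption: same tag names)
xml_sample = f"""<Message xmlns="{ILR_NS}">
    <Learner>
        <PostcodePrior>BR1 7SS</PostcodePrior>
        <Postcode>BR1 7SS</Postcode>
    </Learner>
</Message>"""

# Serialise the namespace as the default one instead of an ns0: prefix
ET.register_namespace("", ILR_NS)

root = ET.fromstring(xml_sample)

for name in ("Postcode", "PostcodePrior"):
    # ElementTree stores namespaced tags as "{uri}localname"
    for pc in root.iter(f"{{{ILR_NS}}}{name}"):
        if pc.text:
            pc.text = pc.text[:-3].strip()

result = ET.tostring(root, encoding="unicode")
print(result)
```

To write the result to a file rather than a string, `ET.ElementTree(root).write(...)` with `encoding="UTF-8"` and `xml_declaration=True` is one option.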