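I am new to Python, BS4 and the lxml parser.
I am trying to remove the last three characters from the XML Postcode tag to anonymise the data.
The current code runs without any errors, but it does not remove the last three characters in the output XML file.
XML mock data: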
<?xml version="1.0" encoding="UTF-8"?>
<!-- Please note that this file is properly formed, and serves as an example of a file that will load into the ILR DC system. The data is anonymised and does not refer to a real-world provider, learning delivery or learner. Based on the ILR specification, version 2, dated April 2018-->
<Message xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="ESFA/ILR/2018-19" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ESFA/ILR/2018-19">
<Header>
<CollectionDetails>
<Collection>ILR</Collection>
<Year>1819</Year>
<FilePreparationDate>2018-01-07</FilePreparationDate>
</CollectionDetails>
<Source>
<ProtectiveMarking>OFFICIAL-SENSITIVE-Personal</ProtectiveMarking>
<UKPRN>99999999</UKPRN>
<SoftwareSupplier>SupplierName</SoftwareSupplier>
<SoftwarePackage>SystemName</SoftwarePackage>
<Release>1</Release>
<SerialNo>01</SerialNo>
<DateTime>2018-06-26T11:14:05</DateTime>
<!-- This and the next element only appear in files generated by FIS -->
<ReferenceData>Version5.0, LARS 2017-08-01</ReferenceData>
<ComponentSetVersion>1</ComponentSetVersion>
</Source>
</Header>
<SourceFiles>
<!-- The SourceFiles group only appears in files generated by FIS -->
<SourceFile>
<SourceFileName>ILR-LLLLLLLL1819-20180626-144401-01.xml</SourceFileName>
<FilePreparationDate>2018-06-26</FilePreparationDate>
<SoftwareSupplier>Software Systems Inc.</SoftwareSupplier>
<SoftwarePackage>GreatStuffMIS</SoftwarePackage>
<Release>1</Release>
<SerialNo>01</SerialNo>
<DateTime>2018-06-26T11:14:05</DateTime>
</SourceFile>
</SourceFiles>
<LearningProvider>
<UKPRN>99999999</UKPRN>
</LearningProvider>
<!-- 16 yr old learner undertaking full time 16-19 (excluding apprenticeships) funded programme -->
<Learner>
<LearnRefNumber>16Learner</LearnRefNumber>
<PMUKPRN>87654321</PMUKPRN>
<CampId>1234ABCD</CampId>
<ULN>1061484016</ULN>
<FamilyName>Smith</FamilyName>
<GivenNames>Jane</GivenNames>
<DateOfBirth>1999-02-27</DateOfBirth>
<Ethnicity>31</Ethnicity>
<Sex>F</Sex>
<LLDDHealthProb>2</LLDDHealthProb>
<Accom>5</Accom>
<PlanLearnHours>440</PlanLearnHours>
<PlanEEPHours>100</PlanEEPHours>
<MathGrade>NONE</MathGrade>
<EngGrade>D</EngGrade>
<PostcodePrior>BR1 7SS</PostcodePrior>
<Postcode>BR1 7SS</Postcode>
<AddLine1>The Street</AddLine1>
<AddLine2>ToyTown</AddLine2>
<LearnerFAM>
<LearnFAMType>LSR</LearnFAMType>
<LearnFAMCode>55</LearnFAMCode>
</LearnerFAM>
<LearnerFAM>
<LearnFAMType>EDF</LearnFAMType>
<LearnFAMCode>2</LearnFAMCode>
</LearnerFAM>
<LearnerFAM>
<LearnFAMType>MCF</LearnFAMType>
<LearnFAMCode>3</LearnFAMCode>
</LearnerFAM>
<LearnerFAM>
<LearnFAMType>FME</LearnFAMType>
<LearnFAMCode>2</LearnFAMCode>
</LearnerFAM>
<LearnerFAM>
<LearnFAMType>PPE</LearnFAMType>
<LearnFAMCode>2</LearnFAMCode>
</LearnerFAM>
Current code:
# Importing BS4
from bs4 import BeautifulSoup

# Opening the original XML file and parsing it into a BeautifulSoup object
with open("ILR_mock_data.xml", "r") as infile:
    xml_text = infile.read()

soup = BeautifulSoup(xml_text, 'xml')

# Postcode (deleting the last 3 characters)
for postcode_tag in soup.find_all("Postcode"):
    postcode_tag.string[:-3]

with open("SEND_ME_TO_RCU.xml", "w") as outfile:
    outfile.write(soup.prettify())
I want the XML that currently contains
<Postcode>BR1 7SS</Postcode>
to have the new postcode as
<Postcode>BR1</Postcode>
Answer 0 (score: 0)
Solved the problem by using:
for pripostcode_tag in soup.find_all("PostcodePrior"):
    pripostcode_tag.string = pripostcode_tag.string[:-3]
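The assignment is what makes the difference: `postcode_tag.string[:-3]` on its own only builds a new string and throws it away, because Python strings are immutable and slicing never changes the original. A minimal sketch with plain strings (no BS4 required) illustrates this. `outward_code` is a hypothetical helper name introduced here, not part of any library. The sketch also shows that slicing `BR1 7SS` leaves a trailing space, which `rstrip()` removes if the desired result is exactly `BR1`.

```python
def outward_code(postcode: str) -> str:
    """Drop the last three characters and any trailing whitespace left behind."""
    return postcode[:-3].rstrip()


postcode = "BR1 7SS"

# A bare slice expression: the result is computed and immediately discarded.
postcode[:-3]
unchanged = postcode              # still the full postcode

# Assigning the slice keeps the result, but note the trailing space.
sliced = postcode[:-3]

# Stripping the whitespace gives the value the question asks for.
cleaned = outward_code(postcode)

print(repr(unchanged), repr(sliced), repr(cleaned))
```

With BS4 the same idea would be applied as `tag.string = outward_code(tag.string)` inside the loop, assuming every matched tag actually contains text.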
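Answer 1 (score: 0)
The code below uses a simplified version of the XML (it should also work with the OP's XML, provided the tag lookup accounts for that file's default namespace, ESFA/ILR/2018-19). It does not use any external library.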
import xml.etree.ElementTree as ET

xml_sample = '''<r><Postcode>ACBDEF</Postcode></r>'''
root = ET.fromstring(xml_sample)
# Find every Postcode element anywhere below the root
post_codes = root.findall('.//Postcode')
for pc in post_codes:
    # Reassign the text, keeping everything except the last three characters
    pc.text = pc.text[:-3]
ET.dump(root)
Output
<r><Postcode>ACB</Postcode></r>
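One caveat when applying this to the full ILR file: the mock data declares a default namespace (`xmlns="ESFA/ILR/2018-19"`), and ElementTree stores such tags as `{ESFA/ILR/2018-19}Postcode`, so a plain `.//Postcode` lookup finds nothing. Below is a minimal sketch of a namespace-aware variant, assuming a trimmed-down sample in place of the real file. The `ilr` prefix and the `ILR_NS` constant are arbitrary names chosen here, not taken from the original post.

```python
import xml.etree.ElementTree as ET

ILR_NS = "ESFA/ILR/2018-19"  # default namespace declared on <Message> in the mock data

# Trimmed-down stand-in for the real file (assumption: same namespace, same tag names).
xml_sample = (
    '<Message xmlns="ESFA/ILR/2018-19">'
    "<Learner>"
    "<PostcodePrior>BR1 7SS</PostcodePrior>"
    "<Postcode>BR1 7SS</Postcode>"
    "</Learner>"
    "</Message>"
)

# Keep the namespace as the default one on output instead of an ns0: prefix.
ET.register_namespace("", ILR_NS)

root = ET.fromstring(xml_sample)
ns = {"ilr": ILR_NS}

# Without the namespace, nothing matches.
plain_matches = root.findall(".//Postcode")

# With the namespace mapping, both postcode fields are found and truncated.
for tag_name in ("Postcode", "PostcodePrior"):
    for element in root.findall(f".//ilr:{tag_name}", ns):
        if element.text:
            element.text = element.text[:-3].rstrip()

result = ET.tostring(root, encoding="unicode")
print(result)
```

For a file on disk, `ET.parse(path)` followed by `tree.write(out_path, encoding="UTF-8", xml_declaration=True)` would replace `fromstring` and `tostring`. Note that ElementTree drops XML comments when parsing by default, so the comments in the mock data would not survive the round trip.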