I am using BeautifulSoup with Python 3.4 to parse an XML file. My code is as follows:
from lxml import etree
from bs4 import BeautifulSoup
import psycopg2
import sys
import configparser
import re
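with open(informationFileName) as infoFP:
    # Parse with the lxml-backed "xml" parser and write the prettified result.
    # infoTableSoupFileName is an already-open, writable file object.
    infoTableSoup = BeautifulSoup(infoFP, "xml")
    infoTableSoupFileName.write(infoTableSoup.prettify())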
The file named by informationFileName contains the following text:
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?><informationTable xmlns:ns1="http://www.sec.gov/edgar/common" xmlns:ns11="http://www.sec.gov/edgar/statecodes" xmlns="http://www.sec.gov/edgar/document/thirteenf/informationtable" xsi:schemaLocation="http://www.sec.gov/edgar/common eis_Common.xsd http://www.sec.gov/edgar/statecodes eis_stateCodes.xsd http://www.sec.gov/edgar/document/thirteenf/informationtable eis_13FDocument.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><infoTable><nameOfIssuer>1ST SOURCE CORP</nameOfIssuer><titleOfClass>COM</titleOfClass><cusip>336901103</cusip><value>166</value><shrsOrPrnAmt><sshPrnamt>3364</sshPrnamt><sshPrnamtType>SH</sshPrnamtType></shrsOrPrnAmt><investmentDiscretion>SOLE</investmentDiscretion><otherManager>1</otherManager><votingAuthority><Sole>3364</Sole><Shared>0</Shared><None>0</None></votingAuthority></infoTable></informationTable>
</XML>
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>
Unfortunately, the whole XML document is on a single line. When I run the program, the prettified XML is truncated as follows:
<?xml version="1.0" encoding="utf-8"?>
<informationTable xmlns:ns1="htt"/>
However, if I reformat the XML as shown below before running the program, it is prettified correctly.
<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?><informationTable xmlns:ns1="http://www.sec.gov/edgar/common" xmlns:ns11="http://www.sec.gov/edgar/statecodes" xmlns="http://www.sec.gov/edgar/document/thirteenf/informationtable" xsi:schemaLocation="http://www.sec.gov/edgar/common eis_Common.xsd http://www.sec.gov/edgar/statecodes eis_stateCodes.xsd http://www.sec.gov/edgar/document/thirteenf/informationtable eis_13FDocument.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<infoTable>
<nameOfIssuer>1ST SOURCE CORP</nameOfIssuer>
<titleOfClass>COM</titleOfClass>
<cusip>336901103</cusip>
<value>166</value>
<shrsOrPrnAmt>
<sshPrnamt>3364</sshPrnamt>
<sshPrnamtType>SH</sshPrnamtType>
</shrsOrPrnAmt>
<investmentDiscretion>SOLE</investmentDiscretion>
<otherManager>1</otherManager>
<votingAuthority>
<Sole>3364</Sole>
<Shared>0</Shared>
<None>0</None>
</votingAuthority>
</infoTable>
</informationTable>
</XML> </TEXT> </DOCUMENT> </SEC-DOCUMENT>
Why does this happen? I did nothing except add some line breaks. I would have expected BeautifulSoup not to care how the file is laid out.
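One way to narrow this down, as a minimal sketch rather than a confirmed fix: the trailing </XML>, </TEXT>, </DOCUMENT> and </SEC-DOCUMENT> lines are not part of the XML document itself, so trimming the input down to the informationTable element before parsing would show whether they play a role. The helper name extract_xml_document and the shortened sample string are hypothetical, introduced only for illustration, and the standard-library parser is used here just to check that the trimmed text is well-formed.

```python
import xml.etree.ElementTree as ET


def extract_xml_document(raw_text, root_tag="informationTable"):
    """Return only the XML document, dropping anything after the root's closing tag.

    Hypothetical helper: assumes the root element is <informationTable>.
    """
    closing_tag = "</{}>".format(root_tag)
    start = raw_text.find("<?xml")
    if start == -1:
        # No XML declaration: fall back to the opening root tag.
        start = raw_text.find("<" + root_tag)
    end = raw_text.rfind(closing_tag)
    if start == -1 or end == -1:
        raise ValueError("could not locate the <{}> document".format(root_tag))
    return raw_text[start:end + len(closing_tag)]


# Shortened stand-in for the file contents (an assumption, not the real filing).
sample = (
    '<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>'
    '<informationTable xmlns="http://www.sec.gov/edgar/document/thirteenf/informationtable">'
    '<infoTable><nameOfIssuer>1ST SOURCE CORP</nameOfIssuer></infoTable>'
    '</informationTable>\n</XML>\n</TEXT>\n</DOCUMENT>\n</SEC-DOCUMENT>\n'
)

clean = extract_xml_document(sample)

# Sanity check with the standard-library parser: the trimmed text should be
# well-formed XML. Bytes are passed so the declared encoding is honoured.
root = ET.fromstring(clean.encode("iso-8859-1"))
```

The trimmed string could then be passed to BeautifulSoup(clean, "xml") in place of the open file object. If the prettified output is complete in that case, the stray closing tags (or the way the file object is read) are implicated; if it is still truncated, the cause lies elsewhere.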