如何从文本文件中删除奇怪的编码

时间:2015-02-19 11:08:13

标签: python html text beautifulsoup xbrl

我正在尝试处理像这样的文本文件:

http://www.sec.gov/Archives/edgar/data/789019/000119312514289961/0001193125-14-289961.txt

如果您看到文件的中间位置,则会出现以下内容:

</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EXCEL
<SEQUENCE>21
<FILENAME>Financial_Report.xlsx
<DESCRIPTION>IDEA: XBRL DOCUMENT
<TEXT>
begin 644 Financial_Report.xlsx
M4$L#!!0`!@`(````(0!):[_C#0,``+!)```3``@"6T-O;G1E;G1?5'EP97-=
M+GAM;""B!`(HH``"````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M````````````````````````````````````````````````````````````
M``````````````````````````````````````#,W,M.VT`4QO%]I;Z#Y6V5
M>([OK@@L>EFV2*4/,+4GQ,(W>08*;]^)N0BA%(2*U/^&B,2>\\6+G[+YSM')
M==\%5V:V[3AL0EFK,#!#/3;M<+X)?YY]795A8)T>&MV-@]F$-\:&)\?OWQV=
MW4S&!O[NP6["G7/3QRBR]<[TVJ['R0S^D^TX]]KY?^?S:-+UA3XW4:Q4'M7C
MX,S@5FY_1GA\]-EL]67G@B_7_NW;)+/I;!A\NKUP/VL3ZFGJVEH[GS2Z&IHG
M4U9W$];^SN4:NVLG^\''"*.#$_:?_'W`W7W?_:.9V\8$IWIVWW3O8T377?1[
MG"]^C>/%^OE##J0<M]NV-LU87_;^":SM-!O=V)TQKN_6R^NZU^UPG_N9^<O%
M-EI>Y(V#[+_?<O`K<\20'`DD1PK)D4%RY)`<!21'"<E107*(H@2AB"H44H5B
MJE!0%8JJ0F%5**X*!5:AR!I39(TILL8466.*K#%%UI@B:TR1-:;(&E-DC2FR
M)A19$XJL"476A")K0I$UH<B:4&1-*+(F%%D3BJPI1=:4(FM*D36ER)I29$TI
MLJ8465.*K"E%UI0B:T:1-:/(FE%DS2BR9A19,XJL&476C")K1I$UH\B:4V3-
M*;+F%%ESBJPY1=:<(FM.D36GR)I39,TILA8460N*K`5%UH(B:T&1M:#(6E!D
M+2BR%A19"XJL)476DB)K29&UI,A:4F0M*;*6%%E+BJPE1=:2(FM%D;6BR%I1
M9*THLE8462N*K!5%UHHB:T61M:+(*HI"JRB*K:(HN(JBZ"J*PJLHBJ^B*,"*
MH@@KBD*L*(RQH#H6QEA.(8O3R.)4LCB=+$XIB]/*XM2R,+TLP12S!-/,$DPU
M2S#=+,&4LP33SA),/4LP_2S!%+0$T]"2_U;1<GX?CHF6O__^`W8YYH6%+-;=
M=,:^\1*%VT-?FKS3LVE^N-EO#GKS`(_/?BZ'WZMS.H^3]1N&9O/ZIW"_0FA_
M]VKR!YG9M>9AB="A93P/$_UVHM</?+(-R.SW'S6F.3`[6O8M'?\!``#__P,`
M4$L#!!0`!@`(````(0"U53`C]0```$P"```+``@"7W)E;',O+G)E;',@H@0"

这好像是一个excel文件?还是XBRL文件?那是什么 ?我如何摆脱它(或以某种方式“处理”它?)这继续成千上万行,所以我猜它是一些附加文件的某些链接的编码?知道怎么处理吗?

我正在尝试在Python中使用BeautifulSoup:

from bs4 import BeautifulSoup

with open("textWithHtml.txt") as markup:
    soup = BeautifulSoup(markup.read())

with open("processedText.txt", "w") as f: 
    f.write(soup.get_text().encode('utf-8'))

但并非所有内容都被删除,而且我注意到在某些情况下甚至没有删除所有html标签..有时运行代码两次删除的内容比第一次运行BeautifulSoup代码时删除的内容更多..

2 个答案:

答案 0 :(得分:1)

您正在查看的编码是uuencode。在Python中,您可以使用uu模块对此blob进行解码,或者只使用stringdata.decode('uu')

uuencode是一种遗留格式,最初用于在电子邮件中嵌入二进制文件(当时只允许使用7位US-ASCII;该格式对于当时的大铁系统的互操作性也有一些让步。使用了他们自己令人困惑的字符编码)。现在,您希望在此角色中看到base64

I posted an answer to the followup question展示了如何在从文件句柄中读取或迭代一堆文本行时删除uuencode blob。

答案 1 :(得分:0)

使用此处提供的sed命令可以有效地解决问题:sed command - apply in all text (.txt) files of folder