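I have a script that downloads text-block data from the SEC's EDGAR database. It extracts the data accurately. However, the text contains runs of consecutive spaces (x20) and CRLFs (x0D x0A).
I need to be able to remove the commas and the extra CRLFs and spaces, then write the entire text content to a CSV file for later analysis.
I am not a Python programmer, but I am using Python for this task because the XBRL parser has a Python interface.
I need to do this for roughly 6,000 individual observations, so I don't want to attempt it by hand.
I have searched extensively, including buying and reading two Python textbooks, but I cannot work out how to edit the text before trying to write it to the CSV file.
Below is a representative printout of the raw data before it is written to the file. Note that there should be five comma-separated fields, with everything after the date written to a single cell.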
DocumentType EntityName CIK PeriodEndDate PPE_Policy 10-K CONOLOG CORP 23503 7/31/2012
Property and Equipment
carried
less allowances for depreciation. Depreciation is computed by
the straight-line method over the estimated useful lives of
the assets which range between three (3) and thirty-nine(39)
years. Depreciation was $16,560 and $14,598 for the fiscal
years ended July 31 2012 and 2011 respectively. Repairs and
maintenance expenditures which do not extend the useful lives
of the related assets are expensed as incurred. Gains and
losses on depreciable assets retired or sold are recognized
in the consolidated statement of operations in the year of
disposal</font></p>
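Answer 0 (score: 0)
I'm not sure what you have tried, but if you download the document and assign it to a variable, you can perform string operations on that document. For example (in pseudo-Python):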
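import csv
doc = downloaded_xbrl  # placeholder: the text block returned by the XBRL parser
# Drop commas, then collapse every run of spaces/CR/LF into a single space.
edited_doc = " ".join(doc.replace(",", "").split())
with open("output.csv", "a", newline="") as f:  # "output.csv" is a hypothetical file name
    csv.writer(f).writerow([edited_doc])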
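If it helps, the cleanup can be wrapped in a small function so it can be reused for all of the roughly 6,000 observations. This is only a sketch: the name `clean_text` is made up for illustration, and stripping leftover HTML tags such as `</font></p>` is my assumption based on the sample output, not something the question asked for.

```python
import re

def clean_text(raw):
    """Return raw with HTML tags and commas removed and whitespace collapsed."""
    # Assumption: tag remnants like </font></p> are unwanted; replace each with a space.
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    # Remove commas so the text cannot be mistaken for several CSV fields.
    no_commas = no_tags.replace(",", "")
    # split() with no arguments splits on any run of spaces, CR or LF;
    # joining with one space collapses each run and trims both ends.
    return " ".join(no_commas.split())

sample = "Depreciation was $16,560 and $14,598 for the fiscal\r\nyears   ended July 31 2012.</font></p>"
print(clean_text(sample))
# Depreciation was $16560 and $14598 for the fiscal years ended July 31 2012.
```

Note that replacing `'\x20'` with an empty string would delete every space and run the words together, which is why the sketch collapses runs of whitespace instead.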
Link to the Python docs: https://docs.python.org/2/library/string.html#string-formatting
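For the five-field layout described in the question, a sketch along these lines may be closer to what is needed. The function name `write_rows` and the file name `ppe_policy.csv` are hypothetical, and I am assuming each observation is already available as a tuple of five strings. The standard `csv` module quotes any field that contains a comma, so the policy text stays in a single cell even if the commas are left in.

```python
import csv
import io

def write_rows(rows, out):
    """Write each observation as one five-field CSV row."""
    writer = csv.writer(out)
    writer.writerow(["DocumentType", "EntityName", "CIK", "PeriodEndDate", "PPE_Policy"])
    for doc_type, entity, cik, period_end, policy in rows:
        # Collapse CR/LF and repeated spaces so the policy text sits on one line.
        writer.writerow([doc_type, entity, cik, period_end, " ".join(policy.split())])

rows = [("10-K", "CONOLOG CORP", "23503", "7/31/2012",
         "Depreciation was $16,560 and\r\n$14,598   for the fiscal years")]

# StringIO stands in for a real file here; for a file on disk, something like
# open("ppe_policy.csv", "w", newline="") could be passed instead (hypothetical name).
buf = io.StringIO()
write_rows(rows, buf)
print(buf.getvalue())
```

Opening the file with `newline=""` is what the `csv` documentation recommends, so that the writer controls the line endings itself.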