Question

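I have a script that downloads text-block data from the SEC's EDGAR database. The data is extracted accurately. However, the text contains multiple consecutive spaces (x20) and CRLFs (x0D x0A).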

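I need to be able to remove the commas and the extra CRLFs and spaces, and then write the entire text content to a CSV file for later analysis.

I am not a Python programmer, but I am using Python for this task because the XBRL parsing program has a Python interface.

I need to do this for approximately 6,000 individual observations, so I don't want to attempt it by hand.

I have searched extensively, including buying and reading two Python textbooks, but I cannot work out how to edit the text before trying to write it to the CSV file.

Below is a representative printout of the raw data before it is written to the file. Note that there should be 5 comma-separated fields, with everything after the date written to a single cell.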

DocumentType EntityName CIK PeriodEndDate PPE_Policy 10-K CONOLOG CORP 23503 7/31/2012

Property and Equipment

Property and equipment are carried at cost

                  less allowances for depreciation. Depreciation is computed by

                  the straight-line method over the estimated useful lives of

                  the assets which range between three (3) and thirty-nine(39)

                  years. Depreciation was $16,560 and $14,598 for the fiscal 

                  years ended July 31 2012 and 2011 respectively. Repairs and

                  maintenance expenditures which do not extend the useful lives

                  of the related assets are expensed as incurred. Gains and

                  losses on depreciable assets retired or sold are recognized

                  in the consolidated statement of operations in the year of

                  disposal</font></p>

Answer 1

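I'm not sure what you have already tried, but if you download the document and assign it to a variable, you can perform string operations on it. For example (in pseudo-Python):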

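doc = downloaded_xbrl  # the downloaded text block, held in a string
edited_doc = doc.replace('\x20', '')  # removes every x20 (space), replacing it with nothing
csv.write(edited_doc)  # pseudocode: write the edited text to the CSV file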
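Replacing every '\x20' with nothing would also join adjacent words, so for the cleanup described in the question it may be better to collapse each run of whitespace into a single space. The following is only a minimal sketch, assuming the downloaded text is already a Python string; the function name clean_text and the sample string are made up for illustration.

```python
def clean_text(text):
    """Remove commas and collapse runs of whitespace into single spaces."""
    no_commas = text.replace(',', '')
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, CR, LF) and ignores leading/trailing whitespace.
    return ' '.join(no_commas.split())


# Illustrative input modelled on the sample in the question.
raw = "years. Depreciation was $16,560 and $14,598 for the fiscal \r\n\r\n      years ended July 31 2012"
print(clean_text(raw))
# prints: years. Depreciation was $16560 and $14598 for the fiscal years ended July 31 2012
```

Note that removing commas also changes amounts such as $16,560 into $16560, which may or may not be what you want for later analysis.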

Link to the Python docs: https://docs.python.org/2/library/string.html#string-formatting
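For the writing step, Python's standard csv module can produce the five-field rows. It quotes any field that contains a comma, so the policy text can stay in a single cell even if the commas are left in place. This is a rough sketch assuming Python 3; the function name write_observations and the file name output.csv are placeholders, and the column names are taken from the sample printout in the question.

```python
import csv
import io

FIELDS = ["DocumentType", "EntityName", "CIK", "PeriodEndDate", "PPE_Policy"]


def write_observations(out_file, observations):
    """Write a header row, then one five-field row per observation."""
    writer = csv.writer(out_file)
    writer.writerow(FIELDS)
    for doc_type, entity, cik, period_end, policy in observations:
        # Collapse runs of spaces/CR/LF so the policy text fits in one cell.
        writer.writerow([doc_type, entity, cik, period_end, " ".join(policy.split())])


# Demonstration with an in-memory buffer. For a real file, use
# open("output.csv", "w", newline="") instead (placeholder file name).
buffer = io.StringIO()
write_observations(
    buffer,
    [("10-K", "CONOLOG CORP", "23503", "7/31/2012",
      "Depreciation was $16,560 and\r\n   $14,598")],
)
print(buffer.getvalue())
```

With roughly 6,000 observations, the same function can be called once with the full list (or a generator) of tuples, so nothing needs to be edited by hand.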

How can I use Python to edit a block of text in memory before writing it to a file?
