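I have been struggling with this for a few weeks. I am trying to parse a company's 10-K annual report. I downloaded the file from the SEC's FTP server, and this is what the 10-K looks like. It is an HTML file, so I wrote the following code to convert it to text: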
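import re
from bs4 import BeautifulSoup

# Read the raw filing and keep everything from the first <DOCUMENT> tag onward
with open("C:\\Users\\Downloads\\10Ks\\AbraxasPetroleum10K.txt", 'r') as actvtxt:
    txt = actvtxt.readlines()
ind = txt.index('<DOCUMENT>\n')
txt = txt[ind:]
x = ''.join(txt)  # readlines() keeps each line's trailing newline
soup = BeautifulSoup(x, 'html.parser')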
# Remove script and style elements so only visible text remains
for script in soup(["script", "style"]):
    script.extract()
text = soup.get_text()
# Strip each line, split on runs of two spaces, and drop empty chunks
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = '\n'.join(chunk for chunk in chunks if chunk)
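After getting the text, I need to extract the text of 'Item 7: Management's Discussion and Analysis of Financial Condition and Results of Operations'.
This link gives an idea of what I am talking about: actual 10K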
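One way to frame the goal is as slicing the text between two headings: from the "Item 7." heading up to the next "Item 7A." heading. The sketch below only illustrates that idea. The function name `extract_section`, the heading patterns, and the miniature sample text are all assumptions made for illustration and have not been checked against the real filing. It keeps the longest candidate span because a table of contents may repeat the same headings.

```python
import re

def extract_section(text, start_pattern, end_pattern):
    """Return the longest span that begins at start_pattern and stops just
    before the next end_pattern, or None if no such span exists."""
    flags = re.IGNORECASE | re.MULTILINE
    best = None
    for start in re.finditer(start_pattern, text, flags):
        end = re.search(end_pattern, text[start.end():], flags)
        if end is None:
            continue
        candidate = text[start.start():start.end() + end.start()]
        # A table-of-contents entry yields a short span; the body a long one
        if best is None or len(candidate) > len(best):
            best = candidate
    return best

# Hypothetical miniature filing: table of contents followed by the body
sample = "\n".join([
    "Item 7. Management's Discussion and Analysis",
    "Item 7A. Quantitative and Qualitative Disclosures",
    "Item 7. Management's Discussion and Analysis",
    "The following is a discussion of our consolidated financial condition.",
    "More discussion text.",
    "Item 7A. Quantitative and Qualitative Disclosures",
])

section = extract_section(sample, r"^Item\s+7\.", r"^Item\s+7A\.")
print(section)
```

Real filings vary in how they write these headings (for example "ITEM 7" without a period), so the patterns would likely need adjusting per document.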
I tried the following code:
substr=re.search(r'The following is a discussion of our consolidated financial condition(\s+|\w+|[#!\"#$%&\'()*+,-./:;<=>?@^_`{|}]){1,}',text)
substr.group(0)
But this only gave me the beginning of the paragraph:
u'The following is a discussion of our consolidated financial condition, results of operations, liquidity and capital resources. This discussion excludes the operations of Blue Eagle, except our equity share of Blue Eagle'
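A pattern of this shape stops at the first character that is not whitespace, not a word character, and not listed in the character class. One possible explanation for the short match, which is an unverified assumption about the filing, is that "Blue Eagle" is followed by a typographic apostrophe (U+2019) rather than a plain `'`. The sample string below is hypothetical and only reproduces that behavior:

```python
import re

# Same pattern as in the attempt above
pattern = re.compile(
    r'The following is a discussion of our consolidated financial condition'
    r'(\s+|\w+|[#!\"#$%&\'()*+,-./:;<=>?@^_`{|}]){1,}'
)

# Hypothetical sample: a curly apostrophe (U+2019) follows "Blue Eagle"
sample = (
    "The following is a discussion of our consolidated financial condition, "
    "results of operations. This discussion excludes the operations of "
    "Blue Eagle, except our equity share of Blue Eagle\u2019s earnings."
)

match = pattern.search(sample)
matched = match.group(0)
# The character right after the match is the one the pattern could not consume
next_char = sample[match.end()]
print(repr(matched[-10:]), repr(next_char))
```

Printing `repr(text[substr.end()])` on the real text would show which character actually ends the match there.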
Any help would be greatly appreciated.