Question

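I'm trying to scrape a large number of files from the sec.gov website, and so far it has gone well. The problem is that the older filings are in .txt format, without any real HTML markup. Is there a way to extract information from these files using Python?

I have about 30,000 of these to do, and the old filings are what my boss really wants... I'm currently using BeautifulSoup4 for the other, properly formatted scrapes.

Thanks in advance!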

Answer 1

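If you are able to fetch the text files, then all you need to do is parse plain text.

Something like this should be fine for your purposes: http://www.pythonforbeginners.com/files/reading-and-writing-files-in-python

Specifically, to open a local file you can use the following: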

file = open("newfile.txt", "r")

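The first argument is the name of the file, and the second is the mode in which to open it ("r" stands for read). You can then use methods such as file.read(), file.readline(), or file.readlines() to get the contents of the text file.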
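As a rough illustration of how those three methods differ, here is a minimal sketch. The sample file and its contents are made up so that the snippet can run on its own; newfile.txt is simply the name used above.

```python
import os
import tempfile

# Create a small sample file so the example is self-contained
# (the three lines of content are made up for illustration).
path = os.path.join(tempfile.mkdtemp(), "newfile.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

with open(path, "r") as f:
    whole = f.read()       # the entire file as a single string

with open(path, "r") as f:
    first = f.readline()   # one line, trailing newline included
    rest = f.readlines()   # the remaining lines, as a list of strings

print(repr(first))
print(rest)
```

For very large files, iterating over the file object line by line usually avoids holding the whole file in memory.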

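If you specifically want to read the words in a text file, see "Reading a text file and splitting it into single words in python". The answer there shows how to iterate over every word in a text file that sits in the same directory as the Python script.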

with open('words.txt','r') as f:
    for line in f:
        for word in line.split():
            print(word)

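If you have not downloaded the files locally but you do have the URLs, this may also help: "In Python, given a URL to a text file, what is the simplest way to read the contents of the text file?"

The specific part of that link you are looking for is: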

import urllib.request  # Python 3; the linked answer used Python 2's urllib2

# target_url is the URL of the text file you want to read
data = urllib.request.urlopen(target_url)  # a file-like object; in Python 3 it yields bytes
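In Python 3 the equivalent call is urllib.request.urlopen, and the object it returns yields bytes rather than text, so each line needs decoding before it can be treated as a string. A minimal sketch of that step follows; read_text_lines is a hypothetical helper, and io.BytesIO with made-up content stands in for the real response so the snippet runs without network access.

```python
import io


def read_text_lines(response, encoding="utf-8"):
    """Decode a bytes file-like object into a list of text lines.

    Hypothetical helper: 'response' can be anything that yields bytes
    lines when iterated, such as the object returned by urlopen().
    """
    return [
        raw.decode(encoding, errors="replace").rstrip("\r\n")
        for raw in response
    ]


# io.BytesIO stands in for urllib.request.urlopen(target_url);
# the content is made up for illustration.
fake_response = io.BytesIO(b"line one\r\nline two\n")
lines = read_text_lines(fake_response)
print(lines)
```

The utf-8 default is an assumption; older filings may use a different encoding, which is why errors="replace" is used here instead of letting a bad byte raise an exception.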

Answer 2

In this specific example, urllib.request fetches the file and lxml parses it:

import urllib.request
broken_xml = urllib.request.urlopen('https://www.sec.gov/Archives/edgar/data/20/000089322004000596/w93059exv31w1.txt').read().decode('utf-8')
from lxml import etree
from io import StringIO
tree = etree.parse(StringIO(broken_xml), parser = etree.XMLParser(encoding='utf-8', recover=True))
tree.xpath('//SEQUENCE/text()')
# ['7\n']
tree.xpath('//FILENAME/text()')
# ['w93059exv31w1.txt\n']
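If installing lxml is not an option, a plain regular expression may be enough for simple header fields. The sketch below is not the method from the answer above; it assumes each tag sits at the start of a line with its value on the same line, which is what the trailing newlines in the output above suggest. header_values is a hypothetical helper, and the sample string is made up apart from the two values shown above.

```python
import re

# Made-up sample in the assumed layout: <TAG>value on a single line.
sample = (
    "<SEQUENCE>7\n"
    "<FILENAME>w93059exv31w1.txt\n"
    "<TEXT>\n"
    "some body text\n"
)


def header_values(text, tag):
    """Return the values that follow <tag> on the same line (hypothetical helper)."""
    pattern = re.compile(r"^<%s>(.+)$" % re.escape(tag), re.MULTILINE)
    return [value.strip() for value in pattern.findall(text)]


print(header_values(sample, "SEQUENCE"))
print(header_values(sample, "FILENAME"))
```

A tag with nothing after it on its line, such as the bare <TEXT> in the sample, produces no match, so this only suits single-line fields.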
