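Using the requests library, I am scraping a text file from SEC.gov line by line for a personal project. I get an error because I try to unpack each line into variables before I have reached the first data line. I have seen a nearly identical question: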
How to make python disregard first couple of lines of a text file
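However, I would like the program to work out how many lines to skip rather than hard-coding the number.
I did hard-code it, but I believe the number of header lines may change. Alternatively, I thought I could check each line for the delimiter (|) and discard the line if it has none, but that would mean checking a lot of characters. Code: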
try:
    for year in range(start_year, current_year + 1):
        url = 'https://www.sec.gov/Archives/edgar/full-index/%s/%s/master.idx' % (year, quarter)
        r = requests.get(url)
        lines = r.text.splitlines(True)
        for line in lines[12:]:
            # cik, company_name, filling_type, filling_date, edgar_url = line.split('|')
            # if cik == 729986:
            #     print(line)
Is there any way to have Python attempt the assignment and, if it raises an error, simply discard that line? Say,
try:
    cik, company_name, filling_type, filling_date, edgar_url = line.split('|')
except Exception as e:
    continue
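Unpacking a list with the wrong number of elements raises `ValueError`, so the `except` clause can be narrowed to that exception instead of a blanket `Exception`. A minimal sketch, assuming a hypothetical helper name `parse_lines` and a small inline sample standing in for the downloaded file:

```python
def parse_lines(lines):
    """Yield 5-field tuples, skipping lines that do not unpack cleanly."""
    for line in lines:
        try:
            cik, company_name, filing_type, filing_date, edgar_url = line.rstrip('\n').split('|')
        except ValueError:
            # Wrong number of fields (description text, blank line, dashes): skip it.
            continue
        yield cik, company_name, filing_type, filing_date, edgar_url


# Inline sample standing in for r.text.splitlines(True); not real download output.
sample = [
    'Retrieved from: SEC.gov, Tuesday April 9th, 2019\n',
    '\n',
    '729986|UNITED BANKSHARES INC/WV|10-K|2019-03-01|edgar/data/729986/0001193125-19-060795.txt\n',
]
parsed = list(parse_lines(sample))
```

Note that any line that happens to contain exactly four `|` characters unpacks without error, so a column-header line with five names would still pass through and would need to be skipped separately.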
Here are two of the lines that come back when the header is skipped correctly, as with the hard-coded version:
72971|WELLS FARGO & COMPANY/MN|SC 13G|2019-02-14|edgar/data/72971/0000072971-19-000222.txt
729986|UNITED BANKSHARES INC/WV|10-K|2019-03-01|edgar/data/729986/0001193125-19-060795.txt
However, I believe the first 14 lines describe the data that follows:
Retrieved from: SEC.gov, Tuesday April 9th, 2019
Email: ########.gov
These lines cause the following statement to fail:
cik, company_name, filling_type, filling_date, edgar_url = line.split('|')
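The solution I ended up coding is based on the answer marked as correct here, since I think it is closest to my original idea. All of the answers gave me something to think about as I continue developing this project, and I think each solution has its merits. Here is my final code; feel free to comment on it: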
try:
    for year in range(start_year, current_year + 1):
        url = 'https://www.sec.gov/Archives/edgar/full-index/%s/%s/master.idx' % (year, quarter)
        r = requests.get(url)
        r.raise_for_status()  # without this call, requests.get does not raise HTTPError
        lines = r.text.splitlines(True)
        for line in lines:
            row = line.split('|')
            # Only data rows (and the column header) split into exactly 5 fields
            if len(row) == 5:
                cik, company_name, filling_type, filling_date, edgar_url = row
except requests.exceptions.HTTPError as e:
    print(e)
Super edit: is there a way to do this in one line?
df = pd.DataFrame([line.split('|') for line in lines if len(line.split('|')) == 5])
# I think this calls the split function twice though which might be finicky.
I will ask this as a new question.
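One way to avoid the second `split` is to split inside an inner generator expression and filter on its result. A sketch, assuming `lines` is the list produced by `splitlines()` (a small inline sample is used here instead of the download); the resulting list could then be passed to `pd.DataFrame`:

```python
# Inline sample; in the question, `lines` comes from r.text.splitlines().
lines = [
    'Email: ########.gov',
    '72971|WELLS FARGO & COMPANY/MN|SC 13G|2019-02-14|edgar/data/72971/0000072971-19-000222.txt',
    '729986|UNITED BANKSHARES INC/WV|10-K|2019-03-01|edgar/data/729986/0001193125-19-060795.txt',
]

# Each line is split exactly once; only rows with 5 fields are kept.
rows = [row for row in (line.split('|') for line in lines) if len(row) == 5]
```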
Answer 0 (score: 1)
import re
import requests
import pandas as pd

def get_data(url):
    r = requests.get(url)
    r.raise_for_status()
    # Find the csv header
    m1 = re.search(r"\n(\w\s*\|?)+\n", r.text)
    # Find end of dash line starting from end of header
    start = r.text.find("\n", m1.end()) + 1
    # r.text[start:] is the part of the text after the initial header
    # Get individual lines
    lines = r.text[start:].splitlines()
    # If you have Pandas, you can pack everything into a nice DataFrame
    cols = m1.group().strip().split('|')
    df = pd.DataFrame([line.split('|') for line in lines], columns=cols)
    return df

url = 'https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/master.idx'
df = get_data(url)
df.head()
This gives:
       CIK            Company Name  Form Type  Date Filed                                     Filename
0  1000045  NICHOLAS FINANCIAL INC       10-Q  2019-02-14  edgar/data/1000045/0001193125-19-039489.txt
1  1000045  NICHOLAS FINANCIAL INC          4  2019-01-15  edgar/data/1000045/0001357521-19-000001.txt
2  1000045  NICHOLAS FINANCIAL INC          4  2019-02-19  edgar/data/1000045/0001357521-19-000002.txt
3  1000045  NICHOLAS FINANCIAL INC          4  2019-03-15  edgar/data/1000045/0001357521-19-000003.txt
4  1000045  NICHOLAS FINANCIAL INC        8-K  2019-02-01  edgar/data/1000045/0001193125-19-024617.txt
Answer 1 (score: 1)
You expect exactly 5 columns in total, so ignore every line that does not split into 5 fields.
import requests

def get_index(year, quarter):
    url = 'https://www.sec.gov/Archives/edgar/full-index/%s/%s/master.idx' % (year, quarter)
    r = requests.get(url)
    for line in r.text.splitlines():
        row = line.split('|')
        if len(row) == 5:
            yield row

rows = get_index(2018, 'QTR1')
next(rows)  # skip header row
for i, row in enumerate(rows):
    print(row)
    if i > 10:
        break
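Each row yielded above is a list of strings, so a comparison against an integer, like the `cik == 729986` check in the question, would never be true. A minimal sketch of converting the CIK before comparing; the `Filing` type and the `to_filing` helper are hypothetical names introduced here, not part of the answer's code:

```python
from typing import NamedTuple


class Filing(NamedTuple):
    cik: int
    company_name: str
    form_type: str
    date_filed: str
    filename: str


def to_filing(row):
    """Convert a 5-element list of strings into a Filing with an integer CIK."""
    cik, company_name, form_type, date_filed, filename = row
    return Filing(int(cik), company_name, form_type, date_filed, filename)


# Sample row taken from the question, split the same way get_index does.
row = '729986|UNITED BANKSHARES INC/WV|10-K|2019-03-01|edgar/data/729986/0001193125-19-060795.txt'.split('|')
filing = to_filing(row)
matches = filing.cik == 729986  # int-to-int comparison
```

The column-header row has to be skipped before calling `to_filing`, because `int('CIK')` raises `ValueError`.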
Answer 2 (score: 0)
You can simply look for the line made up of '-' characters and take the rows that come after it.
import requests
import pandas as pd

url = 'https://www.sec.gov/Archives/edgar/full-index/2018/QTR1/master.idx'
r = requests.get(url).text
records = r.splitlines()
results = []
header = 'CIK|Company Name|Form Type|Date Filed|Filename'
found = False
for row in records:
    if found:
        results.append(row.split('|'))
    if not found and set(row.strip()) == {'-'}:
        found = True
df = pd.DataFrame(results, columns=header.split('|'))
print(df.head())
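The same idea can also be written with `itertools.dropwhile`, which discards lines until the dash separator is reached. A sketch on an abbreviated inline sample (illustrative of the layout only); `is_not_separator` and `rows_after_separator` are hypothetical helper names:

```python
from itertools import dropwhile


def is_not_separator(line):
    """True for every line that is not made up solely of '-' characters."""
    return set(line.strip()) != {'-'}


def rows_after_separator(lines):
    """Return the split rows that follow the first all-dash line."""
    remaining = dropwhile(is_not_separator, lines)
    next(remaining, None)  # discard the separator line itself
    return [line.split('|') for line in remaining]


# Abbreviated sample standing in for requests.get(url).text.splitlines().
sample = [
    'Email: ########.gov',
    '',
    'CIK|Company Name|Form Type|Date Filed|Filename',
    '--------------------',
    '72971|WELLS FARGO & COMPANY/MN|SC 13G|2019-02-14|edgar/data/72971/0000072971-19-000222.txt',
]
records = rows_after_separator(sample)
```

If no separator line is present, the function returns an empty list rather than raising.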