I am trying to dump all of the reports from this website:
https://www.treasurydirect.gov/govt/reports/tfmp/tfmp_utf.htm
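State: all states (excluding the Reed Act Benefit and Reed Act Admin entries)
Report: Transaction Statement
Month: all months
Year: all years
Looking at the page source, I can see the values used for the states: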
<form action="get" name="UtfReport">
<fieldset>
<table>
<tr>
<td>
<label for="states">State</label><br />
<select name="states" id="states" size="01">
<option value="al" selected>Alabama</option>
<option value="b2">Alabama Reed Act Benefit</option>
<option value="b3">Alabama Reed Act Admin</option>
<option value="ak">Alaska</option>
<option value="a2">Alaska Reed Act Benefit</option>
So I know I need to build a list of URL strings like these:
https://www.treasurydirect.gov/govt/reports/tfmp/utf/[a1]/dfiw00[116]tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/[a1]/dfiw00[216]tsar.txt
...
This is my approach so far:
import requests, bs4

result1 = []
for i in range(1, 13):
    print('https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw0'+str(i).zfill(2),'16tsar.txt')
    res = requests.get('https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw00216tsar.txt')
    res.raise_for_status()
    states = bs4.BeautifulSoup(res.text, 'lxml')
    result1.append(res.text)
My attempt to build the URL string also has a problem. This is the output of the code above (there is a space between dfiw00X and 16tsar.txt, and I don't know why):
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw001 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw002 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw003 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw004 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw005 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw006 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw007 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw008 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw009 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw010 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw011 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw012 16tsar.txt
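So my question is: there must be a better way to do this than what I am currently attempting. If someone could show me how, I would really appreciate it.
Thanks for your time,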
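Answer 0 (score: 3)
You will need to do some hardcoding. The request URL is assembled by the code in utfnav.js, and the main part we are interested in is the following: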
//assembles path to reports
ReportPath = "/govt/reports/tfmp/utf/"+StateName+"/dfi";
LinkData = (ReportPath+WeekName+MonthName+YearName+ReportName+StateName+".txt");
return true;
}
}
else
{
//displays when dates are not valid for report type selection
alert ("The requested report is not available at this time.");
return false;
}
}
function create(form){
//state selection
var index;
index = document.UtfReport.states.selectedIndex;
StateName = document.UtfReport.states.options[index].value;
//report selection
index = document.UtfReport.report.selectedIndex;
ReportName = document.UtfReport.report.options[index].value;
//Month selection
index = document.UtfReport.month.selectedIndex;
MonthName = document.UtfReport.month.options[index].value;
//Year selection
index = document.UtfReport.year.selectedIndex;
YearName = document.UtfReport.year.options[index].value;
//Week selection
WeekName = "w0"; // this is hardcoded even in the JS
So we need to recreate this logic:
import requests
from bs4 import BeautifulSoup

# ReportPath = .. + LinkData = ...
temp = "https://www.treasurydirect.gov/govt/reports/tfmp/utf/{state}/dfiw0{mn:0>2}{yr}{rep_name}{state}.txt"

with requests.Session() as s:
    soup = BeautifulSoup(s.get("https://www.treasurydirect.gov/govt/reports/tfmp/tfmp_utf.htm").content, "html.parser")
    # StateName = document.UtfReport.states.options[index].value;
    # skip the "Reed Act" entries, keeping only the plain state codes
    states = [opt["value"] for opt in soup.select("#states option") if " Reed " not in opt.text]
    # YearName = document.UtfReport.year.options[index].value;
    available_years = [opt["value"] for opt in soup.select("#year option")]
    # ReportName = document.UtfReport.report.options[index].value;
    report_name = soup.find(id="report").find("option", string="Transaction Statement")["value"]
    for state in states:
        for year in available_years:
            # could do [opt["value"] for opt in soup.select("#month option")]
            # but always 12 months in a year
            for mnth in range(1, 13):
                url = temp.format(state=state, rep_name=report_name, yr=year, mn=mnth)
                print(s.get(url).text)
If you run it, you will see output like this:
Final Report
Transaction Location
Effective Date Shares/Par Description Code Memo Number Code Account Number
--------------- ------------------------ ------------------------- ------------- -------- -------------------------
11-10 STATE DEPOSITS
01/04/2016 17,000.0000 11-10 STATE DEPOSITS 3308616 AL
01/05/2016 57,000.0000 11-10 STATE DEPOSITS 3308619 AL
01/06/2016 118,000.0000 11-10 STATE DEPOSITS 3308638 AL
01/07/2016 129,000.0000 11-10 STATE DEPOSITS 3308657 AL
01/08/2016 145,000.0000 11-10 STATE DEPOSITS 3308675 AL
01/11/2016 260,000.0000 11-10 STATE DEPOSITS 3308720 AL
01/12/2016 566,000.0000 11-10 STATE DEPOSITS 3308743 AL
01/13/2016 307,000.0000 11-10 STATE DEPOSITS 3308764 AL
01/14/2016 240,000.0000 11-10 STATE DEPOSITS 3308783 AL
01/15/2016 340,000.0000 11-10 STATE DEPOSITS 3308802 AL
01/19/2016 345,000.0000 11-10 STATE DEPOSITS 3308832 AL
01/20/2016 510,000.0000 11-10 STATE DEPOSITS 3308859 AL
01/21/2016 533,000.0000 11-10 STATE DEPOSITS 3308889 AL
01/22/2016 262,000.0000 11-10 STATE DEPOSITS 3308916 AL
01/25/2016 377,000.0000 11-10 STATE DEPOSITS 3308942 AL
01/26/2016 778,000.0000 11-10 STATE DEPOSITS 3308968 AL
01/27/2016 873,000.0000 11-10 STATE DEPOSITS 3308997 AL
01/28/2016 850,000.0000 11-10 STATE DEPOSITS 3309019 AL
01/29/2016 1,388,000.0000 11-10 STATE DEPOSITS 3309045 AL
01/29/2016 -6,997.0000 11-10 STATE DEPOSITS 3309069 AL AL
------------------------
8,088,003.0000
21-10 STATE UI WITHDRAWAL
01/04/2016 -183,550.0000 21-10 STATE UI WITHDRAWAL 3308617 AL AL
01/05/2016 -3,528,550.0000 21-10 STATE UI WITHDRAWAL 3308636 AL AL
01/06/2016 -333,800.0000 21-10 STATE UI WITHDRAWAL 3308655 AL AL
01/07/2016 -404,700.0000 21-10 STATE UI WITHDRAWAL 3308674 AL AL
01/08/2016 -276,600.0000 21-10 STATE UI WITHDRAWAL 3308717 AL AL
01/11/2016 -177,600.0000 21-10 STATE UI WITHDRAWAL 3308741 AL AL
01/12/2016 -3,207,250.0000 21-10 STATE UI WITHDRAWAL 3308760 AL AL
01/13/2016 -288,450.0000 21-10 STATE UI WITHDRAWAL 3308781 AL AL
01/14/2016 -192,050.0000 21-10 STATE UI WITHDRAWAL 3308800 AL AL
01/15/2016 -184,650.0000 21-10 STATE UI WITHDRAWAL 3308825 AL AL
01/19/2016 -3,115,900.0000 21-10 STATE UI WITHDRAWAL 3308855 AL AL
01/20/2016 -343,100.0000 21-10 STATE UI WITHDRAWAL 3308876 AL AL
01/21/2016 -187,750.0000 21-10 STATE UI WITHDRAWAL 3308906 AL AL
01/22/2016 -135,950.0000 21-10 STATE UI WITHDRAWAL 3308937 AL AL
01/25/2016 -136,000.0000 21-10 STATE UI WITHDRAWAL 3308963 AL AL
01/26/2016 -3,186,100.0000 21-10 STATE UI WITHDRAWAL 3308985 AL AL
01/27/2016 -310,500.0000 21-10 STATE UI WITHDRAWAL 3309014 AL AL
01/28/2016 -250,500.0000 21-10 STATE UI WITHDRAWAL 3309036 AL AL
01/29/2016 -147,300.0000 21-10 STATE UI WITHDRAWAL 3309066 AL AL
------------------------
-16,590,300.0000
34-10 BT FROM UI
01/22/2016 -63,394.0000 34-10 BT FROM UI 3308938 AL AL
01/29/2016 -19,169.0000 34-10 BT FROM UI 3309067 AL AL
------------------------
-82,563.0000
34-60 CWC OUT
01/08/2016 -2,577.9500 34-60 CWC OUT 3308718 HI AL
01/12/2016 -29,354.7300 34-60 CWC OUT 3308761 WY AL
01/12/2016 -4,186.2000 34-60 CWC OUT 3308762 NH AL
01/15/2016 -7,390.5700 34-60 CWC OUT 3308826 MT AL
01/15/2016 -34,003.1200 34-60 CWC OUT 3308827 WV AL
01/15/2016 -2,674.2900 34-60 CWC OUT 3308828 RI AL
01/15/2016 -12,695.3300 34-60 CWC OUT 3308829 NE AL
01/15/2016 -30,307.5600 34-60 CWC OUT 3308830 IN AL
01/20/2016 -115,833.7900 34-60 CWC OUT 3308879 VA AL
01/20/2016 -6,549.9200 34-60 CWC OUT 3308880 AK AL
01/20/2016 -10,316.4900 34-60 CWC OUT 3308881 ME AL
01/20/2016 -89,399.3900 34-60 CWC OUT 3308882 CA AL
01/25/2016 -10,015.5900 34-60 CWC OUT 3308966 MO AL
01/26/2016 -117.6100 34-60 CWC OUT 3308988 VT AL
01/26/2016 -17,058.7500 34-60 CWC OUT 3308989 NV AL
01/26/2016 -23,359.8400 34-60 CWC OUT 3308990 UT AL
01/26/2016 -21,240.3200 34-60 CWC OUT 3308991 OK AL
01/26/2016 -110,025.5800 34-60 CWC OUT 3308992 OH AL
01/26/2016 -87,745.5400 34-60 CWC OUT 3308993 MN AL
01/26/2016 -1,747.0500 34-60 CWC OUT 3308994 DE AL
01/28/2016 -439,500.8500 34-60 CWC OUT 3309039 TX AL
01/28/2016 -22,375.9600 34-60 CWC OUT 3309040 NC AL
01/28/2016 -49,726.7300 34-60 CWC OUT 3309041 MS AL
01/28/2016 -54,329.9400 34-60 CWC OUT 3309042 MA AL
01/28/2016 -221,805.0100 34-60 CWC OUT 3309043 GA AL
------------------------
-1,404,338.1100
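A side note on the stray space in the question's output: when `print` is given several arguments it joins them with its `sep` parameter, which defaults to a single space, so the comma between the two string pieces is what puts the space into the printed URL. Below is a minimal sketch of building each URL as a single string instead. It assumes the `ar` state code, the two-digit year `16`, and the `ts` report code, all read off the question's example URL; `month_urls` is a hypothetical helper name.

```python
# Template mirrors the pieces of the question's example URL:
# "dfi" + "w0" + two-digit month + two-digit year + report code + state code.
BASE = (
    "https://www.treasurydirect.gov/govt/reports/tfmp/utf/"
    "{state}/dfiw0{month:02d}{year}ts{state}.txt"
)


def month_urls(state="ar", year="16"):
    """Return the twelve monthly URLs for one state and one year."""
    return [BASE.format(state=state, month=m, year=year) for m in range(1, 13)]


urls = month_urls()
print(urls[0])
print(urls[-1])
```

Because each URL is formatted as one string, nothing is left for `print` to join, and the same list can be passed straight to `requests.get`.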
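Answer 1 (score: -1)
What you want to do, at a high level, is this:
A. Notice that there is a pattern in how the reports are named. If a pattern exists, we can assume it can be expressed in code.
B. The part to focus on is the end of the URL, '/ar/dfiw00216tsar.txt'.
From here, we can build a dictionary of all possible states and a dictionary of all possible report types, then loop over every combination of these together with the dates. On each iteration, build the URL, fetch it, and save or otherwise process the result.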
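The combination loop described above could be sketched with `itertools.product`. The state codes, report code, and year below are placeholder values taken from the question's HTML snippet and example URL, not a complete list, and `build_urls` is a hypothetical helper name. The full sets would have to come from the form's `<select>` options.

```python
from itertools import product

URL_TEMPLATE = (
    "https://www.treasurydirect.gov/govt/reports/tfmp/utf/"
    "{state}/dfiw0{month:02d}{year}{report}{state}.txt"
)

# Placeholder values (assumptions): only codes visible in the question.
STATES = {"al": "Alabama", "ak": "Alaska"}
REPORTS = {"ts": "Transaction Statement"}
YEARS = ["16"]


def build_urls(states=STATES, reports=REPORTS, years=YEARS):
    """Yield (state, report, year, month, url) for every combination."""
    # Iterating a dict yields its keys, i.e. the short codes used in the URL.
    for state, report, year, month in product(states, reports, years, range(1, 13)):
        url = URL_TEMPLATE.format(state=state, report=report, year=year, month=month)
        yield state, report, year, month, url


for state, report, year, month, url in build_urls():
    print(state, report, year, month, url)
```

Each yielded URL can then be fetched with `requests.get` and the response text written to a file named after the state, year, and month.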