Python -> BeautifulSoup -> Web scraping -> Drop-down menu

Time: 2016-09-27 11:09:55

Tags: python drop-down-menu web-scraping beautifulsoup

So I am trying to dump all of the reports from this site:

https://www.treasurydirect.gov/govt/reports/tfmp/tfmp_utf.htm

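State: all states (not the Reed Act Benefit or Reed Act Admin entries)

Report: Transaction Statement

Month: all months

Year: all years

Looking at the page source, I can see the values for the state menu: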

<form action="get" name="UtfReport">
<fieldset>
 <table>
<tr>

    <td>
    <label for="states">State</label><br />
        <select name="states" id="states" size="01">
        <option value="al" selected>Alabama</option>
        <option value="b2">Alabama Reed Act Benefit</option>
        <option value="b3">Alabama Reed Act Admin</option>
        <option value="ak">Alaska</option>
        <option value="a2">Alaska Reed Act Benefit</option> 

So I know I need to build a list of strings like this:

https://www.treasurydirect.gov/govt/reports/tfmp/utf/[a1]/dfiw00[116]tsar.txt

https://www.treasurydirect.gov/govt/reports/tfmp/utf/[a1]/dfiw00[216]tsar.txt

...

This is what I have so far:

import requests, bs4

result1 = []  # collects the text of each downloaded report

for i in range(1,13):
    print('https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw0'+str(i).zfill(2),'16tsar.txt')


res = requests.get('https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw00216tsar.txt')

res.raise_for_status()
states = bs4.BeautifulSoup(res.text, 'lxml')

result1.append(res.text)

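My attempt at building the URL strings also ran into a problem. This is the output of the code above (there is a space between dfiw00X and 16tsar.txt, and I don't know why):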

https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw001 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw002 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw003 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw004 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw005 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw006 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw007 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw008 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw009 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw010 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw011 16tsar.txt
https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw012 16tsar.txt

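So my question is: there must be a better way than what I am currently trying, and I would be very grateful if someone could show me how.

Thanks for your time,

2 answers:

Answer 0: (score: 3)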

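You need to do a bit of hard-coding. The request URL is put together by the code in utfnav.js; the main part we are interested in is the following: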

//assembles path to reports

          ReportPath = "/govt/reports/tfmp/utf/"+StateName+"/dfi";
          LinkData = (ReportPath+WeekName+MonthName+YearName+ReportName+StateName+".txt");

          return true;
         }
        }
        else
        {
//displays when dates are not valid for report type selection
         alert ("The requested report is not available at this time.");
         return false;
        }
        }


    function create(form){

    //state selection
    var index;
    index = document.UtfReport.states.selectedIndex;
    StateName = document.UtfReport.states.options[index].value;

    //report selection
    index = document.UtfReport.report.selectedIndex;
    ReportName = document.UtfReport.report.options[index].value;

    //Month selection
    index = document.UtfReport.month.selectedIndex;
    MonthName = document.UtfReport.month.options[index].value;


    //Year selection
    index = document.UtfReport.year.selectedIndex;
    YearName = document.UtfReport.year.options[index].value;



    //Week selection
    WeekName = "w0"; // this is hardcoded even in the JS

So we need to recreate this logic:

import requests
from bs4 import BeautifulSoup

# Mirrors the JS: ReportPath + WeekName ("w0") + MonthName + YearName + ReportName + StateName + ".txt"
temp = "https://www.treasurydirect.gov/govt/reports/tfmp/utf/{state}/dfiw0{mn:0>2}{yr}{rep_name}{state}.txt"
with requests.Session() as s:
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(s.get("https://www.treasurydirect.gov/govt/reports/tfmp/tfmp_utf.htm").content, "html.parser")

    # StateName = document.UtfReport.states.options[index].value;
    # Skip the "Reed Act Benefit" / "Reed Act Admin" entries.
    states = [opt["value"] for opt in soup.select("#states option") if " Reed " not in opt.text]

    # YearName = document.UtfReport.year.options[index].value;
    available_years = [opt["value"] for opt in soup.select("#year option")]
    # ReportName = document.UtfReport.report.options[index].value;
    report_name = soup.find(id="report").find("option", string="Transaction Statement")["value"]

    for state in states:
        for year in available_years:
            # could do [opt["value"] for opt in soup.select("#month option")]
            # but always 12 months in a year
            for mnth in range(1, 13):
                url = temp.format(state=state, rep_name=report_name, yr=year, mn=mnth)
                print(s.get(url).text)

If you run it, you will see output like this:

Final Report

                                               Transaction                               Location
    Effective Date                 Shares/Par  Description Code           Memo Number    Code      Account Number
    ---------------  ------------------------  -------------------------  -------------  --------  -------------------------
11-10 STATE DEPOSITS     
    01/04/2016                    17,000.0000  11-10 STATE DEPOSITS        3308616                 AL                       
    01/05/2016                    57,000.0000  11-10 STATE DEPOSITS        3308619                 AL                       
    01/06/2016                   118,000.0000  11-10 STATE DEPOSITS        3308638                 AL                       
    01/07/2016                   129,000.0000  11-10 STATE DEPOSITS        3308657                 AL                       
    01/08/2016                   145,000.0000  11-10 STATE DEPOSITS        3308675                 AL                       
    01/11/2016                   260,000.0000  11-10 STATE DEPOSITS        3308720                 AL                       
    01/12/2016                   566,000.0000  11-10 STATE DEPOSITS        3308743                 AL                       
    01/13/2016                   307,000.0000  11-10 STATE DEPOSITS        3308764                 AL                       
    01/14/2016                   240,000.0000  11-10 STATE DEPOSITS        3308783                 AL                       
    01/15/2016                   340,000.0000  11-10 STATE DEPOSITS        3308802                 AL                       
    01/19/2016                   345,000.0000  11-10 STATE DEPOSITS        3308832                 AL                       
    01/20/2016                   510,000.0000  11-10 STATE DEPOSITS        3308859                 AL                       
    01/21/2016                   533,000.0000  11-10 STATE DEPOSITS        3308889                 AL                       
    01/22/2016                   262,000.0000  11-10 STATE DEPOSITS        3308916                 AL                       
    01/25/2016                   377,000.0000  11-10 STATE DEPOSITS        3308942                 AL                       
    01/26/2016                   778,000.0000  11-10 STATE DEPOSITS        3308968                 AL                       
    01/27/2016                   873,000.0000  11-10 STATE DEPOSITS        3308997                 AL                       
    01/28/2016                   850,000.0000  11-10 STATE DEPOSITS        3309019                 AL                       
    01/29/2016                 1,388,000.0000  11-10 STATE DEPOSITS        3309045                 AL                       
    01/29/2016                    -6,997.0000  11-10 STATE DEPOSITS        3309069       AL        AL                       
                     ------------------------
                               8,088,003.0000

21-10 STATE UI WITHDRAWAL
    01/04/2016                  -183,550.0000  21-10 STATE UI WITHDRAWAL   3308617       AL        AL                       
    01/05/2016                -3,528,550.0000  21-10 STATE UI WITHDRAWAL   3308636       AL        AL                       
    01/06/2016                  -333,800.0000  21-10 STATE UI WITHDRAWAL   3308655       AL        AL                       
    01/07/2016                  -404,700.0000  21-10 STATE UI WITHDRAWAL   3308674       AL        AL                       
    01/08/2016                  -276,600.0000  21-10 STATE UI WITHDRAWAL   3308717       AL        AL                       
    01/11/2016                  -177,600.0000  21-10 STATE UI WITHDRAWAL   3308741       AL        AL                       
    01/12/2016                -3,207,250.0000  21-10 STATE UI WITHDRAWAL   3308760       AL        AL                       
    01/13/2016                  -288,450.0000  21-10 STATE UI WITHDRAWAL   3308781       AL        AL                       
    01/14/2016                  -192,050.0000  21-10 STATE UI WITHDRAWAL   3308800       AL        AL                       
    01/15/2016                  -184,650.0000  21-10 STATE UI WITHDRAWAL   3308825       AL        AL                       
    01/19/2016                -3,115,900.0000  21-10 STATE UI WITHDRAWAL   3308855       AL        AL                       
    01/20/2016                  -343,100.0000  21-10 STATE UI WITHDRAWAL   3308876       AL        AL                       
    01/21/2016                  -187,750.0000  21-10 STATE UI WITHDRAWAL   3308906       AL        AL                       
    01/22/2016                  -135,950.0000  21-10 STATE UI WITHDRAWAL   3308937       AL        AL                       
    01/25/2016                  -136,000.0000  21-10 STATE UI WITHDRAWAL   3308963       AL        AL                       
    01/26/2016                -3,186,100.0000  21-10 STATE UI WITHDRAWAL   3308985       AL        AL                       
    01/27/2016                  -310,500.0000  21-10 STATE UI WITHDRAWAL   3309014       AL        AL                       
    01/28/2016                  -250,500.0000  21-10 STATE UI WITHDRAWAL   3309036       AL        AL                       
    01/29/2016                  -147,300.0000  21-10 STATE UI WITHDRAWAL   3309066       AL        AL                       
                     ------------------------
                             -16,590,300.0000

34-10 BT FROM UI         
    01/22/2016                   -63,394.0000  34-10 BT FROM UI            3308938       AL        AL                       
    01/29/2016                   -19,169.0000  34-10 BT FROM UI            3309067       AL        AL                       
                     ------------------------
                                 -82,563.0000

34-60 CWC OUT            
    01/08/2016                    -2,577.9500  34-60 CWC OUT               3308718       HI        AL                       
    01/12/2016                   -29,354.7300  34-60 CWC OUT               3308761       WY        AL                       
    01/12/2016                    -4,186.2000  34-60 CWC OUT               3308762       NH        AL                       
    01/15/2016                    -7,390.5700  34-60 CWC OUT               3308826       MT        AL                       
    01/15/2016                   -34,003.1200  34-60 CWC OUT               3308827       WV        AL                       
    01/15/2016                    -2,674.2900  34-60 CWC OUT               3308828       RI        AL                       
    01/15/2016                   -12,695.3300  34-60 CWC OUT               3308829       NE        AL                       
    01/15/2016                   -30,307.5600  34-60 CWC OUT               3308830       IN        AL                       
    01/20/2016                  -115,833.7900  34-60 CWC OUT               3308879       VA        AL                       
    01/20/2016                    -6,549.9200  34-60 CWC OUT               3308880       AK        AL                       
    01/20/2016                   -10,316.4900  34-60 CWC OUT               3308881       ME        AL                       
    01/20/2016                   -89,399.3900  34-60 CWC OUT               3308882       CA        AL                       
    01/25/2016                   -10,015.5900  34-60 CWC OUT               3308966       MO        AL                       
    01/26/2016                      -117.6100  34-60 CWC OUT               3308988       VT        AL                       
    01/26/2016                   -17,058.7500  34-60 CWC OUT               3308989       NV        AL                       
    01/26/2016                   -23,359.8400  34-60 CWC OUT               3308990       UT        AL                       
    01/26/2016                   -21,240.3200  34-60 CWC OUT               3308991       OK        AL                       
    01/26/2016                  -110,025.5800  34-60 CWC OUT               3308992       OH        AL                       
    01/26/2016                   -87,745.5400  34-60 CWC OUT               3308993       MN        AL                       
    01/26/2016                    -1,747.0500  34-60 CWC OUT               3308994       DE        AL                       
    01/28/2016                  -439,500.8500  34-60 CWC OUT               3309039       TX        AL                       
    01/28/2016                   -22,375.9600  34-60 CWC OUT               3309040       NC        AL                       
    01/28/2016                   -49,726.7300  34-60 CWC OUT               3309041       MS        AL                       
    01/28/2016                   -54,329.9400  34-60 CWC OUT               3309042       MA        AL                       
    01/28/2016                  -221,805.0100  34-60 CWC OUT               3309043       GA        AL                       
                     ------------------------
                              -1,404,338.1100
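
On the stray space in the question's output: `print` separates its positional arguments with a single space by default (`sep=' '`), and the question's call passes the URL as two comma-separated pieces. A minimal sketch of building the month URLs as single strings instead, reusing the `ar` state code and the `16tsar.txt` suffix from the question (the helper name `month_urls` is made up for illustration):

```python
BASE = "https://www.treasurydirect.gov/govt/reports/tfmp/utf/ar/dfiw0"


def month_urls(suffix="16tsar.txt"):
    """Return one URL per month, each built as a single string."""
    # {:02d} zero-pads the month, like str(i).zfill(2) in the question.
    return ["{}{:02d}{}".format(BASE, month, suffix) for month in range(1, 13)]


urls = month_urls()
for url in urls:
    print(url)

# If two arguments are kept, passing sep="" to print also removes the space:
print(BASE + "01", "16tsar.txt", sep="")
```

Either way the URL should be assembled into one string before it is handed to `requests.get`, since the comma only affects what `print` displays.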

Answer 1: (score: -1)

A good first approach to scraping: a pattern-matching heuristic

What you want to do, at a high level, is this:

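  1. Identify the pattern in the source.
  2. Represent the nature of that pattern in code.
  3. Scrape based on that code.

I won't write the whole code here, but I will outline the general approach I would take.

    A. Notice that there is a pattern in how the reports are named. If a pattern exists, we can assume it can be represented in code.

    B. The main thing to look at is the last part of the URL, '/ar/dfiw00216tsar.txt'.

    1. /ar/ refers to the state
    2. dfiw looks constant, at first glance
    3. 00216 refers to the date
    4. tsar refers to the report type
    5. .txt looks constant, at first glance

From here, we can build a dictionary of all possible states and a dictionary of all possible report types, then iterate over every combination of these, including the dates, in a for loop: on each iteration, get the URL, then save it or otherwise process it.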
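
The outline above could be sketched roughly as follows. This is only an illustration under assumptions: the state codes are the two plain-state entries visible in the question's HTML, the report code `ts` is inferred from the `tsar` suffix of the example URL (reading it as report code plus state code, as the `utfnav.js` excerpt in the other answer suggests), and `build_url` and `all_urls` are made-up helper names. Nothing is downloaded here; each URL would still need to be requested and checked, since not every combination necessarily exists.

```python
from itertools import product

BASE = "https://www.treasurydirect.gov/govt/reports/tfmp/utf"

# Assumed subsets for illustration; the full lists come from the page's <select> menus.
STATES = {"al": "Alabama", "ak": "Alaska"}
REPORTS = {"ts": "Transaction Statement"}  # "ts" is inferred from ".../dfiw00216tsar.txt"
YEARS = ["16"]  # two-digit year, as in the example URL


def build_url(state, report, month, year):
    """Assemble one report URL from its parts (month is an int from 1 to 12)."""
    return "{base}/{state}/dfiw0{month:02d}{year}{report}{state}.txt".format(
        base=BASE, state=state, report=report, month=month, year=year
    )


def all_urls(states=STATES, reports=REPORTS, years=YEARS):
    """Yield a URL for every state/report/year/month combination."""
    # Iterating over a dict yields its keys, i.e. the codes used in the URL.
    for state, report, year, month in product(states, reports, years, range(1, 13)):
        yield build_url(state, report, month, year)


urls = list(all_urls())
print(len(urls))
print(urls[0])
```

Keeping the URL assembly in one small function makes it easy to check against a known-good URL such as the `ar` example from the question before looping over everything.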