How to add an id column to identify read_html() tables?

Time: 2016-11-16 19:56:51

Tags: python python-3.x pandas iteration

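Consider the following websites (site1, site2, site3), each of which contains several different tables.

I am using read_html to scrape those tables into a single table, as follows: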

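import multiprocessing

import pandas as pd

links = ['site1.com', 'site2.com', 'site3.com']  # placeholder URLs

def process_url(url):
    # read_html returns a list of DataFrames, one per table found on the page
    return pd.concat(pd.read_html(url), ignore_index=False)

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=2)
    # Fetch the pages in parallel, then stack everything into one table
    df = pd.concat(pool.map(process_url, links), ignore_index=True)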

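With the program above I get a single table. That is what I expected, but it would help to add a flag or "table counter" so that I do not lose track of which table each row came from. So, how can I add the table number to each row?

Something like this: the same single table, but with a table_num column: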

    Bank Name   City    ST  CERT    Acquiring Institution   Closing Date    Updated Date        table_num
1   Allied Bank     Mulberry    AR  91.0    Today's Bank    September 23, 2016  October 17, 2016        1
2   The Woodbury Banking Company    Woodbury    GA  11297.0     United Bank     August 19, 2016     October 17, 2016    1
3   First CornerStone Bank  King of Prussia     PA  35312.0     First-Citizens Bank & Trust Company     May 6, 2016     September 6, 2016   1
4   Trust Company Bank  Memphis     TN  9956.0  The Bank of Fayette County  April 29, 2016  September 6, 2016   2
5   North Milwaukee State Bank  Milwaukee   WI  20364.0     First-Citizens Bank & Trust Company     March 11, 2016  June 16, 2016   2
6   Hometown National Bank  Longview    WA  35156.0     Twin City Bank  October 2, 2015     April 13, 2016  3
7   The Bank of Georgia     Peachtree City  GA  35259.0     Fidelity Bank   October 2, 2015     October 24, 2016        3
8   Premier Bank    Denver  CO  34112.0     United Fidelity Bank, fsb   July 10, 2015   August 17, 2016     3
9   Edgebrook Bank  Chicago     IL  57772.0     Republic Bank of Chicago    May 8, 2015     July 12, 2016   3
10  Doral Bank  NaN     NaN     NaN     NaN     NaN     NaN     4
11  En Espanol  San Juan    PR  32102.0     Banco Popular de Puerto Rico    February 27, 2015   May 13, 2015        4
12  Capitol City Bank & Trust Company   Atlanta     GA  33938.0     First-Citizens Bank & Trust Company     February 13, 2015   April 21, 2015  4
13  Valley Bank     Fort Lauderdale     FL  21793.0     Landmark Bank, National Association     June 20, 2014   June 29, 2015   5
14  Valley Bank     Moline  IL  10450.0     Great Southern Bank     June 20, 2014   June 26, 2015   5
15  Slavie Federal Savings Bank     Bel Air     MD  32368.0     Bay Bank, FSB   May 3, 2014     June 15, 2015   5
16  Columbia Savings Bank   Cincinnati  OH  32284.0     United Fidelity Bank, fsb   May 23, 2014    November 10, 2016   6
17  AztecAmerica Bank   NaN     NaN     NaN     NaN     NaN     NaN 6
18  En Espanol  Berwyn  IL  57866.0     Republic Bank of Chicago    May 16, 2014    October 20, 2016    6

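For example, if site1 has two tables, the function must assign 0 to all rows of table1 and 1 to all rows of table2 in site1.

On the other hand, if site2 has two tables, the function must assign 3 to all rows of table1 and 4 to all rows of table2 in site2, and so on for all the tables of that site.

Also, is it possible to use assign() or some other method to keep a reference for each row (for example, the table it came from)?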

1 Answer:

Answer 0: (score: 1)

Try changing the process_url() function as follows:

def process_url(url):
    return pd.concat([x.assign(table_num=i)
                      for i,x in enumerate(pd.read_html(url))],
                     ignore_index=False)
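
Note that enumerate() starts again at 0 for every URL, so with this version table_num only identifies a table within its own page. If a counter that keeps running across sites is wanted, as the question seems to ask, one possible sketch follows. The helper name number_tables, the extra site_num column, and the small hand-made DataFrames are illustrative assumptions; the DataFrames stand in for the lists that pd.read_html(url) would return.

```python
import pandas as pd

def number_tables(tables_per_site, start=0):
    """Stack tables from several sites into one DataFrame.

    tables_per_site is a list with one entry per site; each entry is a
    list of DataFrames (the shape pd.read_html returns for one page).
    Every row gets a running table_num and the index of its site.
    """
    tagged = []
    counter = start
    for site_idx, tables in enumerate(tables_per_site):
        for table in tables:
            # assign() returns a copy, so the input tables are not modified
            tagged.append(table.assign(table_num=counter, site_num=site_idx))
            counter += 1  # keeps counting across sites
    return pd.concat(tagged, ignore_index=True)

# Hypothetical stand-ins for the tables scraped from two sites
site1 = [pd.DataFrame({"Bank Name": ["A", "B"]}),
         pd.DataFrame({"Bank Name": ["C"]})]
site2 = [pd.DataFrame({"Bank Name": ["D"]}),
         pd.DataFrame({"Bank Name": ["E", "F"]})]

df = number_tables([site1, site2])
print(df)
```

With real pages, the nested list could presumably be built with pool.map(pd.read_html, links), since pool.map returns its results in the same order as links. Pass start=1 if the numbering should begin at 1, as in the sample table in the question.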