考虑以下具有多个不同表格的网站(site1,site2,site3)。
我正在使用read_html将表格废弃到一个表格中,如下所示:
import multiprocessing
links = ['site1.com','site2.com','site3.com']
def process_url(url):
return pd.concat(pd.read_html(url), ignore_index=False)
pool = multiprocessing.Pool(processes=2)
df = pd.concat(pool.map(process_url, links), ignore_index=True)
通过上述程序,我得到一张桌子。虽然是我的预期,但添加一个标志或“表计数器”会有所帮助,只是为了不丢失表的引用(例如哪一行属于或对应于哪个表)。那么,如何将表的数量添加到一行?
像这样的东西,相同的单个表,但有table_num
列:
Bank Name City ST CERT Acquiring Institution Closing Date Updated Date table_num
1 Allied Bank Mulberry AR 91.0 Today's Bank September 23, 2016 October 17, 2016 1
2 The Woodbury Banking Company Woodbury GA 11297.0 United Bank August 19, 2016 October 17, 2016 1
3 First CornerStone Bank King of Prussia PA 35312.0 First-Citizens Bank & Trust Company May 6, 2016 September 6, 2016 1
4 Trust Company Bank Memphis TN 9956.0 The Bank of Fayette County April 29, 2016 September 6, 2016 2
5 North Milwaukee State Bank Milwaukee WI 20364.0 First-Citizens Bank & Trust Company March 11, 2016 June 16, 2016 2
6 Hometown National Bank Longview WA 35156.0 Twin City Bank October 2, 2015 April 13, 2016 3
7 The Bank of Georgia Peachtree City GA 35259.0 Fidelity Bank October 2, 2015 October 24, 2016 3
8 Premier Bank Denver CO 34112.0 United Fidelity Bank, fsb July 10, 2015 August 17, 2016 3
9 Edgebrook Bank Chicago IL 57772.0 Republic Bank of Chicago May 8, 2015 July 12, 2016 3
10 Doral Bank NaN NaN NaN NaN NaN NaN 4
11 En Espanol San Juan PR 32102.0 Banco Popular de Puerto Rico February 27, 2015 May 13, 2015 4
12 Capitol City Bank & Trust Company Atlanta GA 33938.0 First-Citizens Bank & Trust Company February 13, 2015 April 21, 2015 4
13 Valley Bank Fort Lauderdale FL 21793.0 Landmark Bank, National Association June 20, 2014 June 29, 2015 5
14 Valley Bank Moline IL 10450.0 Great Southern Bank June 20, 2014 June 26, 2015 5
15 Slavie Federal Savings Bank Bel Air MD 32368.0 Bay Bank, FSB May 3, 2014 June 15, 2015 5
16 Columbia Savings Bank Cincinnati OH 32284.0 United Fidelity Bank, fsb May 23, 2014 November 10, 2016 6
17 AztecAmerica Bank NaN NaN NaN NaN NaN NaN 6
18 En Espanol Berwyn IL 57866.0 Republic Bank of Chicago May 16, 2014 October 20, 2016 6
例如,如果site1中有两个表,则该函数必须将0
分配给table1
的所有行,并关于table2
中的site1
函数必须将1
分配给table2
的所有行。
另一方面,如果site2
有两个表,则该函数必须将3
分配给table1
和4
到table2
的所有行生活在site2
的所有表格。
此外,是否可以使用assign()或其他方法来获取每行的引用(例如来源表)?
答案 0 :(得分:1)
尝试更改process_url()
功能,如下所示:
def process_url(url):
return pd.concat([x.assign(table_num=i)
for i,x in enumerate(pd.read_html(url))],
ignore_index=False)