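I'm scraping a table with Python and trying to capture each field so that I can control which fields are displayed. I have used a setup like this before, but this time I get an "index out of range" error.
I can pull the whole table without any problem, but as I said, I want to display only specific fields. I would also like the heading for each section (e.g. New Bank, etc.).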
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
print('Scraping NH Dept of Banking...')
print()
NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
NHr = requests.get(NHurl, headers = headers)
NHsoup = BeautifulSoup(NHr.text, 'html.parser')
NHlist = []
for tr in NHsoup.find_all('tr'):
    tds = tr.find_all('td')
    print("Test: %s, Test: %s, Test: %s\n" % \
        (tds[0].text, tds[1].text, tds[2].text))
Answer 0 (score: 1)
Pandas' .read_html() function uses an HTML parser under the hood (lxml, or bs4 as a fallback). If you see <table>, <tr>, <td> tags, let pandas do the heavy lifting for you:
import pandas as pd
NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
df = pd.read_html(NHurl)[0]
print(df.to_string())
Output:
Date Requested Financial Institution Name Location Determination Date
0 NaN NaN NaN NaN
1 New Bank New Bank New Bank New Bank
2 12/11/18 The Millyard Bank NaN NaN
3 Interstate Bank Combination Interstate Bank Combination Interstate Bank Combination Interstate Bank Combination
4 01/16/19 Optima Bank & Trust Company with and into Camb... Portsmouth, NH 03/29/19
5 Acquisitions Acquisitions Acquisitions Acquisitions
6 NaN NaN NaN NaN
7 Conversions Conversions Conversions Conversions
8 NaN NaN NaN NaN
9 Change in Control Change in Control Change in Control Change in Control
10 NaN NaN NaN NaN
11 Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor... Amendment to Articles of Agreement or Incorpor...
12 11/26/18 John Hancock Trust Company Boston, MA 01/14/19
13 12/04/18 Franklin Savings Bank Franklin, NH 01/28/19
14 12/12/18 MFS Heritage Trust Company Boston, MA 01/28/19
15 02/25/19 Ankura Trust Company, LLC Fairfield, CT 03/22/19
16 4/25/19 Woodsville Guaranty Savings Bank Woodsville, NH 06/04/19
17 5/10/19 AB Trust Company New York, NY 06/04/19
18 Reduction in Capital Reduction in Capital Reduction in Capital Reduction in Capital
19 03/07/19 Primary Bank Bedford, NH 04/10/19
20 Amendment to Bylaws Amendment to Bylaws Amendment to Bylaws Amendment to Bylaws
21 12/10/18 Northeast Credit Union Porstmouth, NH 02/25/19
22 2/25/19 Members First Credit Union Manchester, NH 04/05/19
23 4/24/19 St. Mary's Bank Manchester, NH 05/30/19
24 NaN NaN NaN NaN
25 Interstate Branch Office Interstate Branch Office Interstate Branch Office Interstate Branch Office
26 01/23/19 Newburyport Five Cents Savings Bank 141 Portsmouth Ave Exeter, NH 02/01/19
27 03/08/19 One Credit Union Newport, NH 03/29/19
28 03/01/19 JPMorgan Chase Bank, NA Nashua, NH 04/04/19
29 03/26/19 Mascoma Bank Lebanon, NH 04/09/19
30 04/24/19 Newburyport Five Cents Savings Bank 321 Lafayette Rd Hampton NH 05/08/19
31 Interstate Branch Office Closure Interstate Branch Office Closure Interstate Branch Office Closure Interstate Branch Office Closure
32 02/15/19 The Provident Bank 321 Lafayette Rd Hampton, NH 02/25/19
33 New Branch Office New Branch Office New Branch Office New Branch Office
34 12/07/18 Bank of New Hampshire 16-18 South Main Street Concord NH 01/02/19
35 3/4/19 Triangle Credit Union 360 Daniel Webster Highway, Merrimack, NH 03/11/19
36 04/03/19 Bellwether Community Credit Union 425-453 Commercial Street Manchester, NH 04/17/19
37 06/11/19 Primary Bank 23 Crystal Avenue Derry NH 06/11/19
38 Branch Office Closure Branch Office Closure Branch Office Closure Branch Office Closure
39 5/15/19 Northeast Credit Union Merrimack, NH 05/21/19
40 New Loan Production Office New Loan Production Office New Loan Production Office New Loan Production Office
41 04/08/19 Community National Bank 367 Route 120, Unit B-5 Lebanon, NH 03766-1430 04/15/19
42 Loan Production Office Closure Loan Production Office Closure Loan Production Office Closure Loan Production Office Closure
43 NaN NaN NaN NaN
44 Loan Production Office Relocations Loan Production Office Relocations Loan Production Office Relocations Loan Production Office Relocations
45 NaN NaN NaN NaN
46 Branch Office Relocations Branch Office Relocations Branch Office Relocations Branch Office Relocations
47 NaN NaN NaN NaN
48 Trade Name Requests Trade Name Requests Trade Name Requests Trade Name Requests
49 04/16/19 John Hancock Trust Company To use trade name "Manulife Investment Managem... 04/24/19
50 New Trust Company New Trust Company New Trust Company New Trust Company
51 02/19/19 Janney Trust Co., LLC NaN NaN
52 02/25/19 Darwin Trust Company of New Hampshire, LLC NaN NaN
53 Dissolution of Trust Company Dissolution of Trust Company Dissolution of Trust Company Dissolution of Trust Company
54 09/19/17 Cambridge Associates Fiduciary Trust, LLC Boston, MA 02/05/19
55 Trust Office Closure Trust Office Closure Trust Office Closure Trust Office Closure
56 5/10/19 Charter Trust Company Rochester, NH 05/20/19
57 New Trust Office New Trust Office New Trust Office New Trust Office
58 02/25/19 Ankura Trust Company, LLC 140 Sherman Street, 4th Floor Fairfield, CT 0... 03/22/19
59 Relocation of Trust Office Relocation of Trust Office Relocation of Trust Office Relocation of Trust Office
60 01/23/19 Geode Capital Management Trust Company, LLC Relocate from: One Post Office Square, 20th Fl... 02/01/19
61 03/15/19 Drivetrain Trust Company LLC Relocate from: 630 3rd Avenue, 21st Flr New Y... 03/29/19
62 04/14/19 Boston Partners Trust Company Relocate from: 909 Third Avenue New York, NY ... 04/23/19
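The asker also wanted only certain fields plus the heading of each section. In the output above, a section heading shows up as a row whose four columns all hold the same text. A minimal sketch of turning that into a separate column follows. It assumes that pattern holds for every heading row and that no real data row repeats one value across all columns. The helper name add_section_column and the small inline sample frame are hypothetical; on the real page you would pass in the df returned by read_html.

```python
import pandas as pd

# Hypothetical sample that mimics the shape of the read_html() output above.
columns = ["Date Requested", "Financial Institution Name",
           "Location", "Determination Date"]
df = pd.DataFrame(
    [
        [None, None, None, None],
        ["New Bank"] * 4,
        ["12/11/18", "The Millyard Bank", None, None],
        ["Amendment to Bylaws"] * 4,
        ["12/10/18", "Northeast Credit Union", "Porstmouth, NH", "02/25/19"],
    ],
    columns=columns,
)


def add_section_column(frame):
    """Move section-heading rows into a 'Section' column (assumed layout)."""
    original_cols = list(frame.columns)
    # Assumption: a heading row has the same non-null text in every column.
    is_heading = frame.notna().all(axis=1) & (frame.nunique(axis=1) == 1)
    out = frame.copy()
    # Keep the heading text on heading rows only, then carry it downward.
    out["Section"] = out[original_cols[0]].where(is_heading).ffill()
    # Drop the heading rows themselves and the completely empty rows.
    out = out[~is_heading].dropna(how="all", subset=original_cols)
    return out.reset_index(drop=True)


result = add_section_column(df)
# Pick only the fields you want to display.
print(result[["Section", "Financial Institution Name", "Determination Date"]])
```

Carrying the heading forward with ffill() keeps every data row self-describing, so ordinary column selection or filtering by section works afterwards.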
Answer 1 (score: 0)
Your code assumes len(tds) >= 3 for every row, but that does not appear to be the case.
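One way to act on this is to check the number of cells before indexing. The sketch below works on plain lists of cell text so it runs without network access. It assumes each list came from something like [td.text for td in tr.find_all('td')], that a row with a single cell is a section heading, and that the column header row yields no td cells at all. Those assumptions are inferred from the outputs in this thread, not confirmed against the page. The function name group_by_section and the sample rows are hypothetical.

```python
def group_by_section(rows):
    """Group data rows under the most recent single-cell heading row."""
    sections = {}
    current = None
    for cells in rows:
        cells = [c.strip() for c in cells]
        if not any(cells):
            # No <td> cells (e.g. a <th>-only header row) or a blank row:
            # indexing tds[0] here is what would raise IndexError.
            continue
        if len(cells) == 1:
            # Assumption: one cell means a section heading such as "New Bank".
            current = cells[0]
            sections.setdefault(current, [])
        elif len(cells) >= 3 and current is not None:
            # Safe to index now that the length has been checked.
            sections[current].append((cells[0], cells[1], cells[2]))
    return sections


# Hypothetical sample shaped like the rows on the page.
rows = [
    [],                                   # header row: no <td> cells
    ["", "", "", ""],                     # blank spacer row
    ["New Bank"],
    ["12/11/18", "The Millyard Bank", "", ""],
    ["Reduction in Capital"],
    ["03/07/19", "Primary Bank", "Bedford, NH", "04/10/19"],
]

for section, entries in group_by_section(rows).items():
    print(section)
    for date, name, location in entries:
        print("  %s, %s, %s" % (date, name, location))
```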
Answer 2 (score: 0)
To grab the headers as well as the data, you can use the selector tr.select('td, th'):
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
print('Scraping NH Dept of Banking...')
print()
NHurl = 'https://www.nh.gov/banking/corporate-activities/index.htm'
NHr = requests.get(NHurl, headers = headers)
soup = BeautifulSoup(NHr.text, 'lxml')
rows = [[td.text.strip() for td in tr.select('td, th')] for tr in soup.select('tr') if tr.select('td, th')]
import textwrap
from itertools import zip_longest
rows = [*zip(*zip_longest(*rows))]
for row in rows:
    for data in row:
        if data is None:
            data = "-"
        print('{: ^30}'.format(textwrap.shorten(data, 30, placeholder='...')), end='║')
    print()
Prints:
Date Requested ║ Financial Institution Name ║ Location ║ Determination Date ║
║ ║ ║ ║
New Bank ║ - ║ - ║ - ║
12/11/18 ║ The Millyard Bank ║ ║ ║
Interstate Bank Combination ║ - ║ - ║ - ║
01/16/19 ║Optima Bank & Trust Company...║ Portsmouth, NH ║ 03/29/19 ║
Acquisitions ║ - ║ - ║ - ║
║ ║ ║ ║
Conversions ║ - ║ - ║ - ║
║ ║ ║ ║
Change in Control ║ - ║ - ║ - ║
║ ║ ║ ║
Amendment to Articles of... ║ - ║ - ║ - ║
11/26/18 ║ John Hancock Trust Company ║ Boston, MA ║ 01/14/19 ║
12/04/18 ║ Franklin Savings Bank ║ Franklin, NH ║ 01/28/19 ║
12/12/18 ║ MFS Heritage Trust Company ║ Boston, MA ║ 01/28/19 ║
02/25/19 ║ Ankura Trust Company, LLC ║ Fairfield, CT ║ 03/22/19 ║
4/25/19 ║Woodsville Guaranty Savings...║ Woodsville, NH ║ 06/04/19 ║
5/10/19 ║ AB Trust Company ║ New York, NY ║ 06/04/19 ║
Reduction in Capital ║ - ║ - ║ - ║
03/07/19 ║ Primary Bank ║ Bedford, NH ║ 04/10/19 ║
Amendment to Bylaws ║ - ║ - ║ - ║
12/10/18 ║ Northeast Credit Union ║ Porstmouth, NH ║ 02/25/19 ║
2/25/19 ║ Members First Credit Union ║ Manchester, NH ║ 04/05/19 ║
4/24/19 ║ St. Mary's Bank ║ Manchester, NH ║ 05/30/19 ║
║ ║ ║ ║
Interstate Branch Office ║ - ║ - ║ - ║
01/23/19 ║ Newburyport Five Cents... ║141 Portsmouth Ave Exeter, NH ║ 02/01/19 ║
03/08/19 ║ One Credit Union ║ Newport, NH ║ 03/29/19 ║
03/01/19 ║ JPMorgan Chase Bank, NA ║ Nashua, NH ║ 04/04/19 ║
03/26/19 ║ Mascoma Bank ║ Lebanon, NH ║ 04/09/19 ║
04/24/19 ║ Newburyport Five Cents... ║ 321 Lafayette Rd Hampton NH ║ 05/08/19 ║
Interstate Branch Office... ║ - ║ - ║ - ║
02/15/19 ║ The Provident Bank ║ 321 Lafayette Rd Hampton, NH ║ 02/25/19 ║
New Branch Office ║ - ║ - ║ - ║
12/07/18 ║ Bank of New Hampshire ║ 16-18 South Main Street... ║ 01/02/19 ║
3/4/19 ║ Triangle Credit Union ║360 Daniel Webster Highway,...║ 03/11/19 ║
04/03/19 ║Bellwether Community Credit...║ 425-453 Commercial Street... ║ 04/17/19 ║
06/11/19 ║ Primary Bank ║ 23 Crystal Avenue Derry NH ║ 06/11/19 ║
Branch Office Closure ║ - ║ - ║ - ║
5/15/19 ║ Northeast Credit Union ║ Merrimack, NH ║ 05/21/19 ║
New Loan Production Office ║ - ║ - ║ - ║
04/08/19 ║ Community National Bank ║ 367 Route 120, Unit B-5... ║ 04/15/19 ║
Loan Production Office Closure║ - ║ - ║ - ║
║ ║ ║ ║
Loan Production Office... ║ - ║ - ║ - ║
║ ║ ║ ║
Branch Office Relocations ║ - ║ - ║ - ║
║ ║ ║ ║
Trade Name Requests ║ - ║ - ║ - ║
04/16/19 ║ John Hancock Trust Company ║To use trade name "Manulife...║ 04/24/19 ║
New Trust Company ║ - ║ - ║ - ║
02/19/19 ║ Janney Trust Co., LLC ║ ║ ║
02/25/19 ║Darwin Trust Company of New...║ ║ ║
Dissolution of Trust Company ║ - ║ - ║ - ║
09/19/17 ║ Cambridge Associates... ║ Boston, MA ║ 02/05/19 ║
Trust Office Closure ║ - ║ - ║ - ║
5/10/19 ║ Charter Trust Company ║ Rochester, NH ║ 05/20/19 ║
New Trust Office ║ - ║ - ║ - ║
02/25/19 ║ Ankura Trust Company, LLC ║ 140 Sherman Street, 4th... ║ 03/22/19 ║
Relocation of Trust Office ║ - ║ - ║ - ║
01/23/19 ║ Geode Capital Management... ║ Relocate from: One Post... ║ 02/01/19 ║
03/15/19 ║ Drivetrain Trust Company LLC ║ Relocate from: 630 3rd... ║ 03/29/19 ║
04/14/19 ║Boston Partners Trust Company ║ Relocate from: 909 Third... ║ 04/23/19 ║
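The line rows = [*zip(*zip_longest(*rows))] in the code above is what makes every row the same length: it transposes the ragged rows with zip_longest, which pads short ones with None, and then transposes back with zip. Here is a small standalone sketch of that step on hypothetical sample data, plus a variant using the fillvalue argument so the later "if data is None" check is not needed.

```python
from itertools import zip_longest

# Hypothetical sample: a four-cell header, a one-cell heading, a data row.
rows = [
    ["Date Requested", "Financial Institution Name",
     "Location", "Determination Date"],
    ["New Bank"],
    ["12/11/18", "The Millyard Bank", "", ""],
]

# Transpose with padding (short rows are filled with None), then transpose back.
padded = [*zip(*zip_longest(*rows))]

# Same idea, but pad with "-" directly instead of None.
dashed = [*zip(*zip_longest(*rows, fillvalue="-"))]

for row in dashed:
    print(" | ".join(row))
```

After padding, every row is a tuple of four items, so indexing by column position no longer fails on the heading rows.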