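I want to extract a table from a web page, but the site has a login page. So I am thinking I could download the page and then scrape the saved copy. The site also keeps the same URL.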
Answer 0 (score: 0)
Here is an example:
from bs4 import BeautifulSoup
html = '''<html><table>
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
<tr>
<td>Ernst Handel</td>
<td>Roland Mendel</td>
<td>Austria</td>
</tr>
<tr>
<td>Island Trading</td>
<td>Helen Bennett</td>
<td>UK</td>
</tr>
<tr>
<td>Laughing Bacchus Winecellars</td>
<td>Yoshi Tannamuri</td>
<td>Canada</td>
</tr>
<tr>
<td>Magazzini Alimentari Riuniti</td>
<td>Giovanni Rovelli</td>
<td>Italy</td>
</tr>
</table></html>'''
soup = BeautifulSoup(html, 'html.parser')
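# Print the text of every header cell.
for table_header in soup.find_all('th'):
    print('Header: ' + table_header.text)
# Print the data cells of each row; the header row has no <td> cells, so it is skipped.
for row in soup.find_all('tr'):
    cells = row.find_all('td')
    if cells:
        print('row:')
        for cell in cells:
            print('\t' + cell.text)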
Output:
Header: Company
Header: Contact
Header: Country
row:
	Alfreds Futterkiste
	Maria Anders
	Germany
row:
	Centro comercial Moctezuma
	Francisco Chang
	Mexico
row:
	Ernst Handel
	Roland Mendel
	Austria
row:
	Island Trading
	Helen Bennett
	UK
row:
	Laughing Bacchus Winecellars
	Yoshi Tannamuri
	Canada
row:
	Magazzini Alimentari Riuniti
	Giovanni Rovelli
	Italy
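
Since the question is about a page saved after logging in, the same parsing can be pointed at a local file instead of an inline string. The following is only a sketch under some assumptions: the page is saved as UTF-8, the table of interest is the first `<table>` in the file, and its header cells are `<th>` elements. The helper names `table_to_records` and `records_from_file` and the file name `saved_page.html` are made up for illustration, and the temporary file just stands in for the page you would save from the browser.

```python
import os
import tempfile

from bs4 import BeautifulSoup


def table_to_records(html):
    """Return the first table in `html` as a list of dicts keyed by header text."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')
    if table is None:
        return []
    headers = [th.get_text(strip=True) for th in table.find_all('th')]
    records = []
    for row in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        if cells:  # the header row has no <td> cells, so it is skipped
            records.append(dict(zip(headers, cells)))
    return records


def records_from_file(path, encoding='utf-8'):
    """Read a saved HTML page from disk and parse its first table."""
    with open(path, encoding=encoding) as f:
        return table_to_records(f.read())


# Stand-in for a page saved from the browser (hypothetical content).
sample = '''<html><table>
<tr><th>Company</th><th>Contact</th><th>Country</th></tr>
<tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>
<tr><td>Island Trading</td><td>Helen Bennett</td><td>UK</td></tr>
</table></html>'''

with tempfile.TemporaryDirectory() as tmp:
    saved_page = os.path.join(tmp, 'saved_page.html')
    with open(saved_page, 'w', encoding='utf-8') as f:
        f.write(sample)
    records = records_from_file(saved_page)

for record in records:
    print(record)
```

With a real saved page, you would call `records_from_file` on its path directly. If the page contains several tables, `soup.find('table')` would need to be narrowed, for example by an `id` or `class` attribute that the actual page happens to use.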