Question

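I am trying to scrape the page https://www.bolsadeproductos.cl/pagador/20 to get the table at the bottom. With the code below I cannot get all the results, only the first 10 rows. How can I iterate over all the different tabs?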

from selenium import webdriver
from bs4 import BeautifulSoup
import time
import pandas as pd

driver = webdriver.Edge('C:\\Users\\facun\\Documents\\msedgedriver.exe')
driver.get('https://www.bolsadeproductos.cl/pagador/20')

df = pd.read_html(driver.page_source, attrs = {'id': 'tbl_export'})

Thanks.

Answer 1

This is simple: just replace the following line in your code:

driver.get('https://www.bolsadeproductos.cl/pagador/20')

with

driver.get('https://www.bolsadeproductos.cl/pagador/tablePagador/20/undefined/0')

df = pd.read_html(driver.page_source, attrs = {'id': 'tbl_export'})
print(df)

Output:

[    Fecha Operacion Nemotecnico Vendedor Comprador            Monto   Tasa  Plazo(Dias)
0        30-06-2015     FANGLOS       LV        LV     $ 26.586.879  0,34%           30
1        26-06-2015     FANGLOS       LV        LV     $ 26.574.872  0,34%           34
2        27-05-2015     FANGLOS       LV        LV  $ 1.059.184.359  0,34%           16
3        16-06-2015     FANGLOS       LV        LV    $ 996.461.527  0,34%           37
4        16-06-2015     FANGLOS       LV        LV    $ 996.461.527  0,34%           37
..              ...         ...      ...       ...              ...    ...          ...
309      03-03-2020     FANGLOS       LV        LV      $ 8.558.358  0,26%           13
310      06-03-2020     FANGLOS       LV        LV      $ 8.560.581  0,26%           10
311      06-03-2020     AANGLOS       LV        LV     $ 63.596.531  0,26%           59
312      06-03-2020     FANGLOS       LV       BCI     $ 45.678.549  0,26%           31
313      19-05-2020     FANGLOS      BCI       BCI    $ 849.422.583  0,22%           17

Answer 2

The data is loaded dynamically via JavaScript, but you can fetch the results with the requests module:

import requests
from bs4 import BeautifulSoup


url = 'https://www.bolsadeproductos.cl/pagador/tablePagador/20/undefined/0'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for i, row in enumerate(soup.select('tr:has(td)'), 1):
    row = [td.get_text(strip=True) for td in row.select('td')]
    print('{:<5}{:<15}{:<15}{:<10}{:<10}{:<20}{:<15}{:<15}'.format(i, *row))

Prints:

1    30-06-2015     FANGLOS        LV        LV        $ 26.586.879        0,34%          30             
2    26-06-2015     FANGLOS        LV        LV        $ 26.574.872        0,34%          34             
3    27-05-2015     FANGLOS        LV        LV        $ 1.059.184.359     0,34%          16             
4    16-06-2015     FANGLOS        LV        LV        $ 996.461.527       0,34%          37             
5    16-06-2015     FANGLOS        LV        LV        $ 996.461.527       0,34%          37             
6    27-05-2015     FANGLOS        LV        LV        $ 1.059.184.359     0,34%          16             

    
... all the way to:

309  23-12-2019     FANGLOS        LV        BCI       $ 193.475.303       0,26%          56             
310  03-03-2020     FANGLOS        LV        LV        $ 8.558.358         0,26%          13             
311  06-03-2020     FANGLOS        LV        LV        $ 8.560.581         0,26%          10             
312  06-03-2020     AANGLOS        LV        LV        $ 63.596.531        0,26%          59             
313  06-03-2020     FANGLOS        LV        BCI       $ 45.678.549        0,26%          31             
314  19-05-2020     FANGLOS        BCI       BCI       $ 849.422.583       0,22%          17
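
If you want the rows as data rather than printed text, the same selector can feed a list of dicts. This is only a sketch: `rows_to_records`, `COLUMNS` and `SAMPLE_HTML` are names made up for illustration, the column names are copied from the DataFrame output in Answer 1, and the sample markup is hand-written rather than taken from the site.

```python
from bs4 import BeautifulSoup

# Column names as they appear in the DataFrame output of Answer 1.
COLUMNS = ["Fecha Operacion", "Nemotecnico", "Vendedor", "Comprador",
           "Monto", "Tasa", "Plazo(Dias)"]


def rows_to_records(html, columns=COLUMNS):
    """Return one dict per table row that contains <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    # 'tr:has(td)' skips header rows, which only contain <th> cells.
    for tr in soup.select("tr:has(td)"):
        cells = [td.get_text(strip=True) for td in tr.select("td")]
        records.append(dict(zip(columns, cells)))
    return records


# Hand-written sample (hypothetical markup) so the sketch runs offline.
SAMPLE_HTML = """
<table id="tbl_export">
  <tr><th>Fecha Operacion</th><th>Nemotecnico</th><th>Vendedor</th>
      <th>Comprador</th><th>Monto</th><th>Tasa</th><th>Plazo(Dias)</th></tr>
  <tr><td>30-06-2015</td><td>FANGLOS</td><td>LV</td><td>LV</td>
      <td>$ 26.586.879</td><td>0,34%</td><td>30</td></tr>
  <tr><td>26-06-2015</td><td>FANGLOS</td><td>LV</td><td>LV</td>
      <td>$ 26.574.872</td><td>0,34%</td><td>34</td></tr>
</table>
"""

records = rows_to_records(SAMPLE_HTML)
print(records[0]["Monto"])
```

Against the real endpoint you would pass `requests.get(url).content` instead of `SAMPLE_HTML`. The resulting list can then be handed to `pd.DataFrame(records)` if a DataFrame is what you are after.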

Answer 3

The site appears to load its content with JavaScript. Look into Selenium Waits, which let you wait until the page has fully loaded.

Scraping multiple HTML tables that share the same URL but use different tabs
