Scraping a table spread across different web pages

Time: 2019-08-17 07:44:57

Tags: python web-scraping

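How can I use Python web scraping to scrape the same table when it is spread across several pages? I can scrape it, but my code stops at the first page. Here is an example: https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en&page=1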

Here is my code:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq

my_link = "https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en"
webpage = ureq(my_link).read()
htmlpage = soup(webpage , 'html.parser')
containers = htmlpage.findAll("td", {"class":"u-hidden -xs"})

filename = "Dati odierni listino FTSEMIB.csv"
f = open(filename, 'w')
headers = "Stock, price, %, time, opening\n"
f.write(headers)

for i in range(1, len(containers), 6):
    stock = containers[i-1].text.strip()
    price = containers[i].text.strip()
    percentage = containers[i+1].text.strip()
    time = containers[i+2].text.strip()
    opening = containers[i+3].text.strip()

    f.write(stock + "," + price + "," + percentage + "," + time + "," + opening + "\n")

f.close()

(The site cannot display all of the data on a single page.)

Edit:

I also got it working this way:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq

my_link = "https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en"
my_link2 = "https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en&page=2"
webpage = ureq(my_link).read()
webpage2 = ureq(my_link2).read()
htmlpage = soup(webpage , 'html.parser')
htmlpage2 = soup(webpage2, 'html.parser')
containers = htmlpage.findAll("td", {"class":"u-hidden -xs"}) + htmlpage2.findAll("td", {"class":"u-hidden -xs"})

filename = "Dati odierni listino FTSEMIB.csv"
f = open(filename, 'w')
headers = "Stock, price, %, time, opening\n"
f.write(headers)

for i in range(1, len(containers), 6):
    stock = containers[i-1].text.strip()
    price = containers[i].text.strip()
    percentage = containers[i+1].text.strip()
    time = containers[i+2].text.strip()
    opening = containers[i+3].text.strip()

    f.write(stock + "," + price + "," + percentage + "," + time + "," + opening + "\n")

f.close()

But if the table were 20 pages long, I can't imagine doing it this way, which is why I'm looking for something "smarter".

2 answers:

Answer 0 (score: 1)

One possibility is to look for the link to the next page, a[title="Next"]. If that link does not exist, you are on the last page:

from textwrap import shorten

import requests
from bs4 import BeautifulSoup

url = 'https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en&page=1'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

page = 1
while True:
    print()
    print('Page no. {}'.format(page))
    print('-' * 80)

    for tr in soup.select('tr'):
        for td in tr.select('td')[1:]:
            txt = td.get_text(strip=True, separator=' ')
            print('{: >25}'.format(shorten(txt, 25)), end='')
        print()

    m = soup.select_one('a[title="Next"][href]')
    if m:
        url = 'https://www.borsaitaliana.it' + m['href']
        soup = BeautifulSoup(requests.get(url).text, 'lxml')
        page += 1
    else:
        break

Prints:

Page no. 1
--------------------------------------------------------------------------------

                      A2a                   1.5675                    +1.33                 17:35:32                    1.555                    Close
                 Amplifon                    22.30                    +1.27                 17:35:39                    22.00                    Close
                 Atlantia                    22.92                    +0.26                 17:41:55                    22.94                    Close
           Azimut Holding                   15.595                    +1.93                 17:35:48                   15.285                    Close
                Banco Bpm                    1.685                    +4.04                 17:35:58                     1.63                    Close
               Bper Banca                    3.078                    +2.19                 17:35:03                    3.022                    Close
             Buzzi Unicem                    18.41                    +0.60                 17:35:13                   18.445                    Close
                  Campari                     7.84                    +0.71                 17:35:03                     7.85                    Close
           Cnh Industrial                    7.956                    +1.69                 17:35:29                     7.80                    Close
                 Diasorin                   106.00                    +1.83                 17:35:53                   104.10                    Close
                     Enel                    6.285                    +4.59                 17:35:58                    6.064                    Close
                      Eni                    13.04                    -0.47                 17:39:49                   12.972                    Close
                     Exor                    57.16                    -1.21                 17:35:00                    58.02                    Close
                  Ferrari                   140.05                    -0.11                 17:37:09                   141.20                    Close
Fiat Chrysler Automobiles                   11.054                    -2.71                 17:37:07                   11.232                    Close
               Finecobank                    8.656                    +1.67                 17:35:49                     8.67                    Close
                 Generali                    15.98                    +0.38                 17:40:02                    15.93                    Close
                     Hera                    3.466                    +2.79                 17:35:06                    3.396                    Close
          Intesa Sanpaolo                    1.882                    +1.97                 17:41:21                    1.856                    Close
                  Italgas                    5.674                    +0.32                 17:35:41                     5.70                    Close

Page no. 2
--------------------------------------------------------------------------------

   Juventus Football Club                     1.46                    +2.21                 17:35:42                     1.43                    Close
                 Leonardo                   10.095                    +2.91                 17:35:59                     9.81                    Close
               Mediobanca                    8.508                    +2.14                 17:35:33                    8.332                    Close
                  Moncler                    33.86                    -0.85                 17:35:25                    33.86                    Close
                     Nexi                     9.80                    +0.00                 17:35:04                     9.79                    Close
              Pirelli & C                    4.516                    -1.07                 17:35:24                     4.50                    Close
           Poste Italiane                    9.234                    +0.98                 17:35:24                     9.18                    Close
                 Prysmian                   17.725                    +0.25                 17:35:59                    17.70                    Close
                Recordati                    38.80                    +1.57                 17:35:02                    38.74                    Close
                   Saipem                    4.022                    +2.55                 17:35:04                    3.932                    Close
      Salvatore Ferragamo                   17.145                    -1.89                 17:35:19                   17.425                    Close
                     Snam                    4.487                    +2.30                 17:35:53                    4.391                    Close
       Stmicroelectronics                   15.805                    +2.10                 17:35:48                    15.62                    Close
           Telecom Italia                   0.4451                    +0.75                 17:35:31                   0.4438                    Close
                  Tenaris                    9.484                    +0.51                 17:35:49                     9.40                    Close
       Terna - Rete [...]                    5.432                    +2.22                 17:35:55                    5.362                    Close
                Ubi Banca                    2.217                    +5.62                 17:38:45                    2.105                    Close
                Unicredit                    9.531                    +3.71                 17:39:39                     9.27                    Close
                   Unipol                    4.313                    +1.67                 17:35:41                    4.277                    Close
                Unipolsai                    2.208                    -0.32                 17:35:03                    2.221                    Close
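
Since the question writes a CSV file rather than printing, the same "follow the Next link" idea could be adapted roughly as below. This is only a sketch under some assumptions: the helper names load_soup, iter_rows and save_csv are made up for illustration, the five columns kept are guessed from the printed output above (the trailing "Close" cell is dropped), and urljoin is swapped in for the string concatenation used in the answer.

```python
import csv
from urllib.parse import urljoin


def load_soup(url):
    """Download one page and parse it (needs the requests and bs4 packages)."""
    import requests
    from bs4 import BeautifulSoup
    return BeautifulSoup(requests.get(url).text, "html.parser")


def iter_rows(start_url, load=load_soup):
    """Yield the cell texts of every table row, following the "Next" link."""
    url = start_url
    while url:
        soup = load(url)
        for tr in soup.select("tr"):
            # Skip the first cell of each row, as the loop above does.
            cells = [td.get_text(strip=True, separator=" ")
                     for td in tr.select("td")[1:]]
            if cells:  # rows without <td> cells (e.g. header rows) are ignored
                yield cells
        nxt = soup.select_one('a[title="Next"][href]')
        # No "Next" link means this was the last page.
        url = urljoin(url, nxt["href"]) if nxt else None


def save_csv(rows, fileobj):
    """Write a header plus the first five cells of each row (assumed layout)."""
    writer = csv.writer(fileobj)
    writer.writerow(["Stock", "price", "%", "time", "opening"])
    for cells in rows:
        writer.writerow(cells[:5])


# Possible usage (performs real HTTP requests):
# start = "https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en&page=1"
# with open("Dati odierni listino FTSEMIB.csv", "w", newline="") as f:
#     save_csv(iter_rows(start), f)
```

Passing the loader in as a parameter keeps the crawling logic separate from the download, so the loop can be tried out on canned pages before hitting the real site.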

Answer 1 (score: 0)

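After iterating over every <tr> tag on the page, you need to use the href to go to the next page. It looks like it is "/borsa/azioni/ftse-mib/lista.html?lang=en&page=2", in which case you can simply iterate over the page= value to move to the next page.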
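
Iterating over page= could look something like the sketch below. The function names (page_url, parse_rows, fetch_html, scrape_all) and the max_pages cap are hypothetical, and the stop condition is an assumption: I have not checked what the site returns past the last page, so the loop stops either when a page has no rows or when a page repeats the previous one.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html"


def page_url(page, base=BASE):
    """Build the URL for one page number (numbering assumed to start at 1)."""
    return base + "?" + urlencode({"lang": "en", "page": page})


def fetch_html(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse_rows(html):
    """Return one list of cell texts per table row, skipping the first cell."""
    from bs4 import BeautifulSoup  # imported here so the URL helper works without bs4
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("tr"):
        cells = [td.get_text(strip=True, separator=" ")
                 for td in tr.select("td")[1:]]
        if cells:
            rows.append(cells)
    return rows


def scrape_all(fetch=fetch_html, parse=parse_rows, max_pages=50):
    """Collect rows from page 1, 2, ... until a page is empty or repeats."""
    all_rows = []
    previous = None
    for page in range(1, max_pages + 1):
        rows = parse(fetch(page_url(page)))
        # Assumed end-of-table signals: no rows, or the same rows again.
        if not rows or rows == previous:
            break
        all_rows.extend(rows)
        previous = rows
    return all_rows
```

Calling scrape_all() with no arguments would request the real pages one after another; max_pages is only a safety cap in case neither stop condition ever triggers.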

If you post some code, we can help you more :)