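How can I use Python web scraping to extract the same table when it is spread across several pages? My code works, but it stops at the first page. Here is an example: https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en&page=1
This is my code: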
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq

my_link = "https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en"
webpage = ureq(my_link).read()
htmlpage = soup(webpage, 'html.parser')
containers = htmlpage.findAll("td", {"class": "u-hidden -xs"})

filename = "Dati odierni listino FTSEMIB.csv"
f = open(filename, 'w')
headers = "Stock, price, %, time, opening\n"
f.write(headers)
for i in range(1, len(containers), 6):
    stock = containers[i-1].text.strip()
    price = containers[i].text.strip()
    percentage = containers[i+1].text.strip()
    time = containers[i+2].text.strip()
    opening = containers[i+3].text.strip()
    f.write(stock + "," + price + "," + percentage + "," + time + "," + opening + "\n")
f.close()
(The site cannot show all of the data on a single page.)
Edit:
I also got it working this way:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq

my_link = "https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en"
my_link2 = "https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en&page=2"
webpage = ureq(my_link).read()
webpage2 = ureq(my_link2).read()
htmlpage = soup(webpage, 'html.parser')
htmlpage2 = soup(webpage2, 'html.parser')
containers = htmlpage.findAll("td", {"class": "u-hidden -xs"}) + htmlpage2.findAll("td", {"class": "u-hidden -xs"})

filename = "Dati odierni listino FTSEMIB.csv"
f = open(filename, 'w')
headers = "Stock, price, %, time, opening\n"
f.write(headers)
for i in range(1, len(containers), 6):
    stock = containers[i-1].text.strip()
    price = containers[i].text.strip()
    percentage = containers[i+1].text.strip()
    time = containers[i+2].text.strip()
    opening = containers[i+3].text.strip()
    f.write(stock + "," + price + "," + percentage + "," + time + "," + opening + "\n")
f.close()
But if the table were 20 pages long, I can't imagine doing it this way, which is why I'm looking for something "smarter".
Answer 0 (score: 1)
One possibility is to look for the link to the next page, a[title="Next"]. If that link does not exist, you are on the last page:
import requests
from bs4 import BeautifulSoup
from textwrap import shorten

url = 'https://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en&page=1'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

page = 1
while True:
    print()
    print('Page no. {}'.format(page))
    print('-' * 80)
    # Print every table row, skipping the first <td> of each row
    for tr in soup.select('tr'):
        for td in tr.select('td')[1:]:
            txt = td.get_text(strip=True, separator=' ')
            print('{: >25}'.format(shorten(txt, 25)), end='')
        print()

    # Follow the "Next" link; if there is none, this is the last page
    m = soup.select_one('a[title="Next"][href]')
    if m:
        url = 'https://www.borsaitaliana.it' + m['href']
        soup = BeautifulSoup(requests.get(url).text, 'lxml')
        page += 1
    else:
        break
Prints:
Page no. 1
--------------------------------------------------------------------------------
A2a 1.5675 +1.33 17:35:32 1.555 Close
Amplifon 22.30 +1.27 17:35:39 22.00 Close
Atlantia 22.92 +0.26 17:41:55 22.94 Close
Azimut Holding 15.595 +1.93 17:35:48 15.285 Close
Banco Bpm 1.685 +4.04 17:35:58 1.63 Close
Bper Banca 3.078 +2.19 17:35:03 3.022 Close
Buzzi Unicem 18.41 +0.60 17:35:13 18.445 Close
Campari 7.84 +0.71 17:35:03 7.85 Close
Cnh Industrial 7.956 +1.69 17:35:29 7.80 Close
Diasorin 106.00 +1.83 17:35:53 104.10 Close
Enel 6.285 +4.59 17:35:58 6.064 Close
Eni 13.04 -0.47 17:39:49 12.972 Close
Exor 57.16 -1.21 17:35:00 58.02 Close
Ferrari 140.05 -0.11 17:37:09 141.20 Close
Fiat Chrysler Automobiles 11.054 -2.71 17:37:07 11.232 Close
Finecobank 8.656 +1.67 17:35:49 8.67 Close
Generali 15.98 +0.38 17:40:02 15.93 Close
Hera 3.466 +2.79 17:35:06 3.396 Close
Intesa Sanpaolo 1.882 +1.97 17:41:21 1.856 Close
Italgas 5.674 +0.32 17:35:41 5.70 Close
Page no. 2
--------------------------------------------------------------------------------
Juventus Football Club 1.46 +2.21 17:35:42 1.43 Close
Leonardo 10.095 +2.91 17:35:59 9.81 Close
Mediobanca 8.508 +2.14 17:35:33 8.332 Close
Moncler 33.86 -0.85 17:35:25 33.86 Close
Nexi 9.80 +0.00 17:35:04 9.79 Close
Pirelli & C 4.516 -1.07 17:35:24 4.50 Close
Poste Italiane 9.234 +0.98 17:35:24 9.18 Close
Prysmian 17.725 +0.25 17:35:59 17.70 Close
Recordati 38.80 +1.57 17:35:02 38.74 Close
Saipem 4.022 +2.55 17:35:04 3.932 Close
Salvatore Ferragamo 17.145 -1.89 17:35:19 17.425 Close
Snam 4.487 +2.30 17:35:53 4.391 Close
Stmicroelectronics 15.805 +2.10 17:35:48 15.62 Close
Telecom Italia 0.4451 +0.75 17:35:31 0.4438 Close
Tenaris 9.484 +0.51 17:35:49 9.40 Close
Terna - Rete [...] 5.432 +2.22 17:35:55 5.362 Close
Ubi Banca 2.217 +5.62 17:38:45 2.105 Close
Unicredit 9.531 +3.71 17:39:39 9.27 Close
Unipol 4.313 +1.67 17:35:41 4.277 Close
Unipolsai 2.208 -0.32 17:35:03 2.221 Close
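The same "follow the Next link" idea can also collect the rows instead of printing them, which is closer to what the question needs for its CSV file. The following is only a sketch under stated assumptions: it uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra packages, and the names TablePageParser, scrape_all_pages, fetch and max_pages are made up for illustration. It assumes, as above, that the next-page link is an <a> element with title="Next".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TablePageParser(HTMLParser):
    """Collects the <td> texts of each <tr> and the href of a[title="Next"]."""

    def __init__(self):
        super().__init__()
        self.rows = []         # one list of cell strings per row that has <td> cells
        self.next_href = None  # href of the "Next" link, if the page has one
        self._row = None       # cells of the <tr> currently being read
        self._cell = None      # text fragments of the <td> currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []
        elif tag == 'a' and attrs.get('title') == 'Next' and attrs.get('href'):
            self.next_href = attrs['href']

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            # Join the fragments and collapse runs of whitespace
            self._row.append(' '.join(''.join(self._cell).split()))
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            if self._row:  # header rows made only of <th> are skipped
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_all_pages(start_url, fetch, max_pages=100):
    """Follow "Next" links from start_url; fetch(url) must return HTML as str."""
    rows, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = TablePageParser()
        parser.feed(fetch(url))
        rows.extend(parser.rows)
        # urljoin resolves a relative href against the current page URL
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return rows
```

To try it against the live site, fetch could be something like lambda u: requests.get(u).text, and the collected rows can be written out with csv.writer(f).writerows(rows). Unlike the loop above, this sketch keeps the first <td> of every row, so drop it with row[1:] if it is not needed.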
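Answer 1 (score: 0)
After iterating over every <tr> tag on the page, you need to use the href to go to the next page. It looks like it is "/borsa/azioni/ftse-mib/lista.html?lang=en&page=2", in which case you can simply loop over the page= value to switch to the next page.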
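Looping over page= could look roughly like the sketch below. It only builds the URLs and drives the loop; the helper names with_page and collect_rows, the get_rows callback and the max_pages limit are hypothetical. It also assumes that a page past the end yields no rows, which has not been checked against the real site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def with_page(url, page):
    """Return url with its `page` query parameter set to the given number."""
    parts = urlsplit(url)
    # Keep every existing parameter except an old `page` value
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != 'page']
    query.append(('page', str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def collect_rows(base_url, get_rows, max_pages=50):
    """Call get_rows(url) for page 1, 2, ... until a page returns no rows.

    get_rows is a caller-supplied function (hypothetical here) that downloads
    one page and returns its table rows as a list.
    """
    rows = []
    for page in range(1, max_pages + 1):
        page_rows = get_rows(with_page(base_url, page))
        if not page_rows:
            break  # assumed signal that the last page has been passed
        rows.extend(page_rows)
    return rows
```

If the number of pages is known in advance, the loop can simply run over range(1, last_page + 1) and skip the empty-page check.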
If you post some code, we can help you more :)