How do I scrape this page with BeautifulSoup?

Time: 2016-05-29 13:26:48

Tags: web-scraping beautifulsoup

I am trying to scrape the page below with BeautifulSoup, using the following code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://remittanceprices.worldbank.org/en/corridor/Australia/China'
page = urlopen(url)
bs = BeautifulSoup(page, "lxml")  # the "lxml" parser requires the lxml package

# Dump all of the text on the page
print(bs.get_text())

# Print the text of every title field
all_links = bs.find_all("div", {"class": "views-field views-field-title"})
for link in all_links:
    content = link.get_text()
    print(content)

# Print the text of every mobile header
all_links = bs.find_all("div", {"class": "mobile-header"})
for link in all_links:
    content = link.get_text()
    print(content)

Could you give me some guidance on how to print/extract the data for all firms in the following format?

Firm|product|Fee|Exchange rate margin(%)|Total Cost Percent(%)|Total Cost(AUD)
Bank of China|28.00|5.77|19.77|39.54
ANZ Bank|32.00|4.39|20.39|40.78

Regards, Abacus

1 answer:

Answer 0 (score: 1)

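import requests
from bs4 import BeautifulSoup


url = 'https://remittanceprices.worldbank.org/en/corridor/Australia/China'
# verify=False skips TLS certificate verification; drop it if the certificate validates
r = requests.get(url, verify=False)
soup = BeautifulSoup(r.text, 'lxml')
# Join each row's text nodes with "|", then split them back into a list of cells
rows = [i.get_text("|").split("|") for i in soup.select('#tab-1 .corridor-row')]
for row in rows:
    # Indices 2, 15, 18, 21 and 25 pick the five fields shown in the output below
    #a,b,c,d,e = row[2],row[15],row[18],row[21],row[25]
    #print(a,b,c,d,e,sep='|')
    print('{0[2]}|{0[15]}|{0[18]}|{0[21]}|{0[25]}'.format(row))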

Citibank|0.00|1.53|1.53|3.06
Transferwise|5.05|-0.04|2.48|4.96
Western Union|5.00|1.19|3.69|7.38
MoneyGram|8.00|1.06|5.06|10.12
WorldRemit|7.99|1.30|5.30|10.60
Ria|10.00|0.84|5.84|11.68
Ceylon Exchange|10.00|1.37|6.37|12.74
Western Union|9.95|1.69|6.66|13.32
Orbit Remit|13.00|0.78|7.28|14.56
Money2anywhere|12.00|1.71|7.71|15.42
SUPAY|18.00|-1.24|7.76|15.52
Money Chain Foreign Exchange|18.00|-1.12|7.88|15.76
MoneyGram|15.00|1.30|8.80|17.60
Commonwealth Bank|22.00|3.43|14.43|28.86
Bank of China|28.00|1.50|15.50|31.00
ANZ Bank|24.00|4.51|16.51|33.02
National Australia Bank (NAB)|22.00|5.74|16.74|33.48
Bank of China|32.00|1.50|17.50|35.00
Commonwealth Bank|30.00|3.43|18.43|36.86
ANZ Bank|32.00|4.51|20.51|41.02
National Australia Bank (NAB)|30.00|5.74|20.74|41.48
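
The fixed indices above only work while every row splits into at least 26 cells, and the output has no header line. As a rough sketch (not part of the original answer), the formatting step could be pulled into a small helper that prints a header and skips rows that are too short. The names `HEADER`, `FIELD_INDICES` and `format_rows` are made up for illustration, the header labels are copied from the question minus "product" (which the answer does not extract), and the sample row is invented data rather than anything scraped from the page.

```python
# Hypothetical helper: names and sample data are illustrative assumptions.
# The indices are the ones used in the answer's print() call.
HEADER = ["Firm", "Fee", "Exchange rate margin(%)",
          "Total Cost Percent(%)", "Total Cost(AUD)"]
FIELD_INDICES = (2, 15, 18, 21, 25)


def format_rows(rows, indices=FIELD_INDICES, header=HEADER):
    """Return pipe-delimited lines: the header first, then one line per usable row."""
    lines = ["|".join(header)]
    for row in rows:
        # Skip rows too short for the largest index instead of raising
        # IndexError, which could happen if the page layout changes.
        if len(row) <= max(indices):
            continue
        lines.append("|".join(row[i].strip() for i in indices))
    return lines


if __name__ == "__main__":
    # Made-up row with 26 cells, used only to exercise the helper.
    sample = [""] * 26
    for index, value in zip(FIELD_INDICES,
                            ["Example Firm", "1.00", "2.00", "3.00", "6.00"]):
        sample[index] = value
    for line in format_rows([sample, ["too", "short"]]):
        print(line)
```

With the scraping code from the answer, this would be called as `format_rows(rows)` in place of the `for row in rows:` loop, assuming `rows` is still the list of split cell lists built there.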