https://bankchart.kz/spravochniki/reytingi_cbr/2/2019/7
How do I get the text from the last three blocks (i.e., the <div class="col-currency-rate"> elements) of each row with the class <div class="row">? I have the table, but what should I do next?
>>> tree.xpath('//div[@class="table-currency"]/div[@class="row"]')
[<Element div at 0x7fcac2a47ba8>, <Element div at 0x7fcac2a47c00>, <Element div at 0x7fcac2a47c58>, <Element div at 0x7fcac2a47cb0>, <Element div at 0x7fcac2a47d08>, <Element div at 0x7fcac2a47d60>, <Element div at 0x7fcac2a47db8>, <Element div at 0x7fcac2a47e10>, <Element div at 0x7fcac2a47e68>, <Element div at 0x7fcac2a47ec0>, <Element div at 0x7fcac2a47f18>, <Element div at 0x7fcac2a47f70>, <Element div at 0x7fcac2a47fc8>, <Element div at 0x7fcac2a4e050>, <Element div at 0x7fcac2a4e0a8>, <Element div at 0x7fcac2a4e100>, <Element div at 0x7fcac2a4e158>, <Element div at 0x7fcac2a4e1b0>, <Element div at 0x7fcac2a4e208>, <Element div at 0x7fcac2a4e260>, <Element div at 0x7fcac2a4e2b8>, <Element div at 0x7fcac2a4e310>, <Element div at 0x7fcac2a4e368>, <Element div at 0x7fcac2a4e3c0>, <Element div at 0x7fcac2a4e418>, <Element div at 0x7fcac2a4e470>, <Element div at 0x7fcac2a4e4c8>, <Element div at 0x7fcac2a4e520>]
>>> len(tree.xpath('//div[@class="table-currency"]/div[@class="row"]'))
28
HTML:
<div class="table-currency">
<div class="row"><div class="col col-currency">
2.
<img rel="nofollow" src="https://st6.prosto.im/cache/st6/1/0/5/5/1055/1055.jpg" width="16" height="16" alt="">
<a target="_blank" href="/spravochniki/reytingi_banka/2/1057">
ForteBank
</a></div><div class="col col-headery col-currency-rate"><p>Активы банков, тыс. тенге</p></div><div class="col col-headery col-currency-rate"><p>Прирост за июль 2019 года, тыс. тенге</p></div><div class="col col-headery col-currency-rate"><p>Прирост с начала 2019 года, тыс. тенге</p></div><div class="col col-currency-rate"><p>1 985 956 865</p></div><div class="col col-currency-rate"><p></p><p class="arrow-up">+89 298 547</p><p></p></div><div class="col col-currency-rate"><p></p><p class="arrow-up">+390 999 868</p><p></p></div></div>
<div class="row"><div class="col col-currency">
3.
<img rel="nofollow" src="https://st6.prosto.im/cache/st6/1/0/9/5/1095/1095.png" width="16" height="16" alt="">
<a target="_blank" href="/spravochniki/reytingi_banka/2/1076">
Сбербанк России
</a></div><div class="col col-headery col-currency-rate"><p>Активы банков, тыс. тенге</p></div><div class="col col-headery col-currency-rate"><p>Прирост за июль 2019 года, тыс. тенге</p></div><div class="col col-headery col-currency-rate"><p>Прирост с начала 2019 года, тыс. тенге</p></div><div class="col col-currency-rate"><p>1 983 840 092</p></div><div class="col col-currency-rate"><p></p><p class="arrow-up">+88 853 745</p><p></p></div><div class="col col-currency-rate"><p></p><p class="arrow-up">+119 145 827</p><p></p></div></div>
</div>
Answer 0 (score: 1)
A solution using specific XPath expressions:
from lxml import html
import requests

url = 'https://bankchart.kz/spravochniki/reytingi_cbr/2/2019/7'
doc = html.document_fromstring(requests.get(url).content)

for row in doc.xpath('//div[@class="table-currency"]/div[@class="row"]'):
    bank_name = row.xpath('descendant::a/text()')[0].strip()
    print(bank_name)
    for cur_rate in row.xpath('div[contains(@class, "col-currency-rate")][position() > last() - 3]'):
        print('-', cur_rate.text_content())
    print()
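Details:
descendant::a/text() - XPath that extracts the text nodes of the a element that is a child/descendant of the current row.
div[contains(@class, "col-currency-rate")][position() > last() - 3] - XPath that selects the div elements whose class attribute contains the given value and that sit in the last three positions (last() is the position of the last element, so position() > last() - 3 keeps only the final three).
Output: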
Народный банк Казахстана
- 8 729 518 087
- +101 401 107
- -190 957 466
ForteBank
- 1 985 956 865
- +89 298 547
- +390 999 868
Сбербанк России
- 1 983 840 092
- +88 853 745
- +119 145 827
Kaspi Bank
- 1 907 391 103
- +12 378 770
- +233 318 909
Банк ЦентрКредит
- 1 495 599 542
- +34 795 443
- -14 202 851
АТФБанк
- 1 314 405 536
- +1 661 967
- -19 558 254
First Heartland Jýsan Bank
- 1 217 617 065
- +52 641 777
- -553 564 176
Жилстройсбербанк Казахстана
- 1 148 974 349
- +7 721 823
- +261 041 394
Евразийский банк
- 1 040 820 999
- -25 910 447
- -25 911 373
Ситибанк Казахстан
- 758 117 020
- +48 724 924
- +82 877 576
Банк "Bank RBK"
- 618 310 738
- +21 856 874
- +62 626 834
Альфа-Банк
- 504 777 556
- +17 401 839
- +51 157 130
Altyn Bank («Народный банк Казахстана»)
- 421 018 633
- -20 058 555
- +33 720 048
Нурбанк
- 408 442 557
- +7 065 511
- -18 282 545
Хоум Кредит энд Финанс Банк
- 372 901 871
- -2 127 105
- +33 983 288
Банк Китая в Казахстане
- 324 386 349
- +11 609 880
- +4 997 316
Банк ВТБ
- 184 247 490
- +5 800 194
- +40 725 927
First Heartland Bank (Банк ЭкспоКреди)
- 173 058 018
- -17 261 535
- +16 047 168
Торгово-промышленный Банк Китая в Алматы
- 140 792 847
- +6 365 348
- -26 137 736
Банк Kassa Nova
- 133 910 512
- +954 985
- +4 039 523
Tengri Bank (Punjab National Bank)
- 133 721 602
- +1 136 896
- -485 570
Азия Кредит Банк
- 99 659 306
- -3 790 116
- -21 420 844
Capital Bank Kazakhstan
- 85 702 895
- -3 165 322
- +4 469 187
KZI Bank (Казахстан Зират Интернешнл)
- 65 240 704
- -3 412 060
- -126 750
Шинхан Банк Казахстан
- 43 323 406
- -7 588 366
- +722 399
Исламский Банк "Al-Hilal"
- 30 562 279
- +2 411 098
- -1 430 198
Заман-Банк
- 22 969 984
- -168 105
- +5 544 675
Национальный Банк Пакистана
- 4 705 084
- -20 113
- -131 233
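To show what the [position() > last() - 3] predicate does without fetching the page, here is a minimal offline sketch. It makes several assumptions: it uses the standard-library xml.etree.ElementTree instead of lxml, it works on a reduced copy of one row from the question's HTML (the <img> tag is left out so the snippet is well-formed XML), and last_three_rates is a hypothetical helper name. The list slice cells[-3:] stands in for the XPath predicate.

```python
import xml.etree.ElementTree as ET

# Reduced, well-formed copy of one row from the question's HTML
# (assumption: the <img> tag is omitted so the XML parser accepts it).
ROW = """
<div class="row">
  <div class="col col-currency"><a href="/spravochniki/reytingi_banka/2/1057">
ForteBank
</a></div>
  <div class="col col-headery col-currency-rate"><p>Активы банков, тыс. тенге</p></div>
  <div class="col col-headery col-currency-rate"><p>Прирост за июль 2019 года, тыс. тенге</p></div>
  <div class="col col-headery col-currency-rate"><p>Прирост с начала 2019 года, тыс. тенге</p></div>
  <div class="col col-currency-rate"><p>1 985 956 865</p></div>
  <div class="col col-currency-rate"><p></p><p class="arrow-up">+89 298 547</p><p></p></div>
  <div class="col col-currency-rate"><p></p><p class="arrow-up">+390 999 868</p><p></p></div>
</div>
"""


def last_three_rates(row):
    """Return the text of the last three col-currency-rate cells of a row.

    Hypothetical helper: mirrors contains(@class, ...) with a substring
    test and [position() > last() - 3] with the slice [-3:].
    """
    cells = [
        div for div in row.findall("div")
        if "col-currency-rate" in div.get("class", "")
    ]
    return ["".join(cell.itertext()).strip() for cell in cells[-3:]]


row = ET.fromstring(ROW.strip())
bank_name = row.find(".//a").text.strip()
rates = last_three_rates(row)

print(bank_name)
for rate in rates:
    print("-", rate)
```

Six cells match the class test here (three header cells, then three value cells), so both the slice and the XPath predicate keep only the value cells.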
Answer 1 (score: 0)
Try this:
import requests
import bs4 as bs

base_url = 'https://bankchart.kz/spravochniki/reytingi_cbr/2/2019/7'
soup = bs.BeautifulSoup(requests.get(base_url).text, 'lxml')
rows = soup.find_all('div', {'class': 'row'})
final = list()
# rows[1:] to skip the header of the columns
for bank in rows[1:]:
    bank_data = list()
    # Bank name
    bank_data.append(bank.find('a').text.strip())
    # Image
    bank_data.append(bank.find('img')['src'])
    # Cells holding the values (a separate name, so `rows` is not overwritten)
    rate_cells = bank.find_all('div', {'class': 'col col-currency-rate'})
    for values in rate_cells:
        data = values.find_all('p')
        for x in data:
            if x.text:
                # All the three values
                bank_data.append(x.text)
    final.append(bank_data)

for x in final:
    print(x)
See whether this works for you.
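Both approaches return the values as strings such as '+89 298 547'. If numbers are needed, something along these lines may help. This is only a sketch: parse_amount is a hypothetical helper, and it assumes the digit groups are separated by whitespace (ordinary or non-breaking spaces) as in the output shown above.

```python
def parse_amount(text):
    """Convert a string such as '+89 298 547' to an integer.

    Hypothetical helper; assumes the only non-digit characters are an
    optional leading sign and whitespace between digit groups.
    """
    # str.split() with no arguments splits on any whitespace, including
    # non-breaking spaces, so joining the pieces removes the separators.
    return int("".join(text.split()))


# Sample values taken from the output above (thousand tenge).
values = ["1 985 956 865", "+89 298 547", "-190 957 466"]
amounts = [parse_amount(v) for v in values]
print(amounts)
```

int() accepts the leading '+' or '-' sign, so no separate sign handling is needed.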