How do I get elements using lxml?

Time: 2019-08-01 09:12:48

Tags: python parsing xpath lxml

https://bankchart.kz/spravochniki/reytingi_cbr/2/2019/7

How can I get the text from each column, that is, from the last three <div class="col-currency-rate"> blocks inside each <div class="row">? I have the table rows, but what should I do next?

>>> tree.xpath('//div[@class="table-currency"]/div[@class="row"]')
[<Element div at 0x7fcac2a47ba8>, <Element div at 0x7fcac2a47c00>, <Element div at 0x7fcac2a47c58>, <Element div at 0x7fcac2a47cb0>, <Element div at 0x7fcac2a47d08>, <Element div at 0x7fcac2a47d60>, <Element div at 0x7fcac2a47db8>, <Element div at 0x7fcac2a47e10>, <Element div at 0x7fcac2a47e68>, <Element div at 0x7fcac2a47ec0>, <Element div at 0x7fcac2a47f18>, <Element div at 0x7fcac2a47f70>, <Element div at 0x7fcac2a47fc8>, <Element div at 0x7fcac2a4e050>, <Element div at 0x7fcac2a4e0a8>, <Element div at 0x7fcac2a4e100>, <Element div at 0x7fcac2a4e158>, <Element div at 0x7fcac2a4e1b0>, <Element div at 0x7fcac2a4e208>, <Element div at 0x7fcac2a4e260>, <Element div at 0x7fcac2a4e2b8>, <Element div at 0x7fcac2a4e310>, <Element div at 0x7fcac2a4e368>, <Element div at 0x7fcac2a4e3c0>, <Element div at 0x7fcac2a4e418>, <Element div at 0x7fcac2a4e470>, <Element div at 0x7fcac2a4e4c8>, <Element div at 0x7fcac2a4e520>]
>>> len(tree.xpath('//div[@class="table-currency"]/div[@class="row"]'))
28

HTML:

<div class="table-currency">
    <div class="row"><div class="col col-currency">
    2.&nbsp; &nbsp;
    <img rel="nofollow" src="https://st6.prosto.im/cache/st6/1/0/5/5/1055/1055.jpg" width="16" height="16" alt="">
    <a target="_blank" href="/spravochniki/reytingi_banka/2/1057">
    ForteBank
    </a></div><div class="col col-headery col-currency-rate"><p>Активы банков, тыс. тенге</p></div><div class="col col-headery col-currency-rate"><p>Прирост за июль 2019 года,  тыс. тенге</p></div><div class="col col-headery col-currency-rate"><p>Прирост с начала 2019 года,  тыс. тенге</p></div><div class="col col-currency-rate"><p>1 985 956 865</p></div><div class="col col-currency-rate"><p></p><p class="arrow-up">+89 298 547</p><p></p></div><div class="col col-currency-rate"><p></p><p class="arrow-up">+390 999 868</p><p></p></div></div>

    <div class="row"><div class="col col-currency">
    3.&nbsp; &nbsp;
    <img rel="nofollow" src="https://st6.prosto.im/cache/st6/1/0/9/5/1095/1095.png" width="16" height="16" alt="">
    <a target="_blank" href="/spravochniki/reytingi_banka/2/1076">
    Сбербанк России
    </a></div><div class="col col-headery col-currency-rate"><p>Активы банков, тыс. тенге</p></div><div class="col col-headery col-currency-rate"><p>Прирост за июль 2019 года,  тыс. тенге</p></div><div class="col col-headery col-currency-rate"><p>Прирост с начала 2019 года,  тыс. тенге</p></div><div class="col col-currency-rate"><p>1 983 840 092</p></div><div class="col col-currency-rate"><p></p><p class="arrow-up">+88 853 745</p><p></p></div><div class="col col-currency-rate"><p></p><p class="arrow-up">+119 145 827</p><p></p></div></div>
</div>

2 answers:

Answer 0 (score: 1)

A complete solution using specific XPath expressions:

from lxml import html
import requests

url  = 'https://bankchart.kz/spravochniki/reytingi_cbr/2/2019/7'
doc = html.document_fromstring(requests.get(url).content)

for row in doc.xpath('//div[@class="table-currency"]/div[@class="row"]'):
    bank_name = row.xpath('descendant::a/text()')[0].strip()
    print(bank_name)
    for cur_rate in row.xpath('div[contains(@class, "col-currency-rate")][position() > last() - 3]'):
        print('-', cur_rate.text_content())
    print()

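Details:

  • descendant::a/text() - XPath that extracts the text nodes of the a element that is a descendant of the current row
  • div[contains(@class, "col-currency-rate")][position() > last() - 3] - XPath that selects the child div elements whose class attribute contains "col-currency-rate" and whose position is among the last three (last() is the position of the last matching element, so position() > last() - 3 keeps only the final three)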
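
To see the two expressions at work without fetching the page, here is a minimal offline sketch. It assumes a trimmed copy of one row from the question's HTML (the header texts are replaced by placeholders, and the names SAMPLE, all_cells and last_three are only illustrative):

```python
from lxml import html

# Trimmed copy of one row from the question's HTML; header texts are
# placeholders, the three value cells are taken from the question.
SAMPLE = """
<div class="table-currency">
  <div class="row">
    <div class="col col-currency"><a href="/spravochniki/reytingi_banka/2/1057">
    ForteBank
    </a></div>
    <div class="col col-headery col-currency-rate"><p>Header 1</p></div>
    <div class="col col-headery col-currency-rate"><p>Header 2</p></div>
    <div class="col col-headery col-currency-rate"><p>Header 3</p></div>
    <div class="col col-currency-rate"><p>1 985 956 865</p></div>
    <div class="col col-currency-rate"><p></p><p class="arrow-up">+89 298 547</p><p></p></div>
    <div class="col col-currency-rate"><p></p><p class="arrow-up">+390 999 868</p><p></p></div>
  </div>
</div>
"""

doc = html.document_fromstring(SAMPLE)
row = doc.xpath('//div[@class="table-currency"]/div[@class="row"]')[0]

# Text node of the <a> descendant, with surrounding whitespace removed.
bank_name = row.xpath('descendant::a/text()')[0].strip()

# All six cells whose class contains "col-currency-rate" (3 headers + 3 values).
all_cells = row.xpath('div[contains(@class, "col-currency-rate")]')

# The second predicate is evaluated against the already filtered set,
# so it keeps positions 4, 5 and 6 of those six cells.
last_three = row.xpath(
    'div[contains(@class, "col-currency-rate")][position() > last() - 3]'
)
values = [cell.text_content().strip() for cell in last_three]

print(bank_name)
print(len(all_cells), values)
```

Slicing the unfiltered result in Python, all_cells[-3:], should pick the same three cells; the XPath predicate simply moves that selection into the query.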

Output:

Народный банк Казахстана
- 8 729 518 087
- +101 401 107
- -190 957 466

ForteBank
- 1 985 956 865
- +89 298 547
- +390 999 868

Сбербанк России
- 1 983 840 092
- +88 853 745
- +119 145 827

Kaspi Bank
- 1 907 391 103
- +12 378 770
- +233 318 909

Банк ЦентрКредит
- 1 495 599 542
- +34 795 443
- -14 202 851

АТФБанк
- 1 314 405 536
- +1 661 967
- -19 558 254

First Heartland Jýsan Bank
- 1 217 617 065
- +52 641 777
- -553 564 176

Жилстройсбербанк Казахстана
- 1 148 974 349
- +7 721 823
- +261 041 394

Евразийский банк
- 1 040 820 999
- -25 910 447
- -25 911 373

Ситибанк Казахстан
- 758 117 020
- +48 724 924
- +82 877 576

Банк "Bank RBK"
- 618 310 738
- +21 856 874
- +62 626 834

Альфа-Банк
- 504 777 556
- +17 401 839
- +51 157 130

Altyn Bank («Народный банк Казахстана»)
- 421 018 633
- -20 058 555
- +33 720 048

Нурбанк
- 408 442 557
- +7 065 511
- -18 282 545

Хоум Кредит энд Финанс Банк
- 372 901 871
- -2 127 105
- +33 983 288

Банк Китая в Казахстане
- 324 386 349
- +11 609 880
- +4 997 316

Банк ВТБ
- 184 247 490
- +5 800 194
- +40 725 927

First Heartland Bank (Банк ЭкспоКреди)
- 173 058 018
- -17 261 535
- +16 047 168

Торгово-промышленный Банк Китая в Алматы
- 140 792 847
- +6 365 348
- -26 137 736

Банк Kassa Nova
- 133 910 512
- +954 985
- +4 039 523

Tengri Bank (Punjab National Bank)
- 133 721 602
- +1 136 896
- -485 570

Азия Кредит Банк
- 99 659 306
- -3 790 116
- -21 420 844

Capital Bank Kazakhstan
- 85 702 895
- -3 165 322
- +4 469 187

KZI Bank (Казахстан Зират Интернешнл)
- 65 240 704
- -3 412 060
- -126 750

Шинхан Банк Казахстан
- 43 323 406
- -7 588 366
- +722 399

Исламский Банк "Al-Hilal"
- 30 562 279
- +2 411 098
- -1 430 198

Заман-Банк
- 22 969 984
- -168 105
- +5 544 675

Национальный Банк Пакистана
- 4 705 084
- -20 113
- -131 233
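
If the values are needed as numbers rather than strings, a small helper along these lines could convert them. This is only a sketch: parse_amount is a hypothetical name, and it assumes every value looks like the ones printed above (an optional sign followed by space-grouped digits); an empty string would raise ValueError.

```python
def parse_amount(text: str) -> int:
    """Convert a string such as '+89 298 547' or '-131 233' to an int."""
    # str.split() with no arguments splits on any whitespace, including
    # non-breaking spaces, so joining the pieces removes the digit grouping.
    compact = "".join(text.split())
    # int() accepts an optional leading '+' or '-' sign.
    return int(compact)


# Values copied from the output above.
samples = ["1 985 956 865", "+89 298 547", "-131 233"]
amounts = [parse_amount(s) for s in samples]
print(amounts)
```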

Answer 1 (score: 0)

Try using

import requests
import bs4 as bs
base_url = 'https://bankchart.kz/spravochniki/reytingi_cbr/2/2019/7'
soup = bs.BeautifulSoup(requests.get(base_url).text, 'lxml')
res = soup.find_all('div', {'class': 'row'})

final = list()
# res[1:] to skip the header of the columns
for bank in res[1:]:
    bank_data = list()
    # Bank name
    # strip() removes the surrounding spaces as well as the newlines
    bank_data.append(bank.find('a').text.strip())
    # Image
    bank_data.append(bank.find('img')['src'])
    # Use a separate name so the outer result list `res` is not shadowed
    rate_cells = bank.find_all('div', {'class': 'col col-currency-rate'})
    for values in rate_cells:
        data = values.find_all('p')
        for x in data:
            if x.text:
                # All the three values
                bank_data.append(x.text)
    final.append(bank_data)
for x in final:
    print(x)

See whether this works for you.