How do I scrape data from individual tags using BeautifulSoup?

Asked: 2019-05-27 12:45:24

Tags: python python-3.x web-scraping beautifulsoup

I am trying to scrape data from elections.in. Three tables on the page share the same class. Here is the HTML from the site:

<h3 class="blmap">17th General (Lok Sabha) Election Results 2019 – State Wise</h3>

<table class="tableizer-table">

<thead><tr class="tableizer-firstrow"><th>State</th><th>Party</th><th>Number of Seats</th></tr></thead><tbody>

 <tr><td>Andaman & Nicobar Islands</td><td>Indian National Congress</td><td>1</td></tr>

 <tr><td>Andhra Pradesh</td><td>Yuvajana Sramika Rythu Congress Party</td><td>22</td></tr>

 <tr><td>Andhra Pradesh</td><td>Telugu Desam</td><td>3</td></tr>

 <tr><td>Arunachal Pradesh</td><td>Bharatiya Janata Party</td><td>2</td></tr>

 <tr><td>Assam</td><td>Bharatiya Janata Party</td><td>9</td></tr>

 <tr><td>Assam</td><td>Indian National Congress</td><td>3</td></tr>

 <tr><td>Assam</td><td>All India United Democratic Front</td><td>1</td></tr>

I am able to get the data, but the cell values run together like this:

    StatePartyNumber of Seats
    Andaman & Nicobar IslandsIndian National Congress1
    Andhra PradeshYuvajana Sramika Rythu Congress Party22
    Andhra PradeshTelugu Desam3
    Arunachal PradeshBharatiya Janata Party2
    AssamBharatiya Janata Party9
    AssamIndian National Congress3
    AssamAll India United Democratic Front1
    AssamIndependent1
    BiharBharatiya Janata Party17

I want the output below,

    State,Party,Number of Seats
    Andaman & Nicobar Islands, Indian National Congress,1
    Andhra Pradesh,Yuvajana Sramika Rythu Congress Party,22

or the same data as a list.

This line of code gives me the run-together output shown first:

soup.find_all('table')[1].get_text()

My code is on GitHub.

Please suggest how to achieve this.

Thanks.

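2 answers:

Answer 0 (score: 2)

If you are trying to parse <table> tags, go with pandas' .read_html(). It does most of the heavy lifting for you and returns a list of DataFrames. The table you want is the third one (index position 2).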

import pandas as pd

url="http://www.elections.in/"
tables = pd.read_html(url)

Output:

print (tables[2].to_string())
                        State                                     Party  Number of Seats
0   Andaman & Nicobar Islands                  Indian National Congress                1
1              Andhra Pradesh     Yuvajana Sramika Rythu Congress Party               22
2              Andhra Pradesh                              Telugu Desam                3
3           Arunachal Pradesh                    Bharatiya Janata Party                2
4                       Assam                    Bharatiya Janata Party                9
5                       Assam                  Indian National Congress                3
6                       Assam         All India United Democratic Front                1
7                       Assam                               Independent                1
8                       Bihar                    Bharatiya Janata Party               17
9                       Bihar                       Janata Dal (United)               16
10                      Bihar                      Lok Jan Shakti Party                6
11                      Bihar                  Indian National Congress                1
12                 Chandigarh                    Bharatiya Janata Party                1
13               Chhattisgarh                    Bharatiya Janata Party                9
14               Chhattisgarh                  Indian National Congress                2
15       Dadra & Nagar Haveli                               Independent                1
16                Daman & Diu                    Bharatiya Janata Party                1
17                        Goa                    Bharatiya Janata Party                1
18                        Goa                  Indian National Congress                1
19                    Gujarat                    Bharatiya Janata Party               26
20                    Haryana                    Bharatiya Janata Party               10
21           Himachal Pradesh                    Bharatiya Janata Party                4
22            Jammu & Kashmir                    Bharatiya Janata Party                3
23            Jammu & Kashmir       Jammu & Kashmir National Conference                3
24                  Jharkhand                    Bharatiya Janata Party               11
25                  Jharkhand                                Ajsu Party                1
26                  Jharkhand                  Indian National Congress                1
27                  Jharkhand                    Jharkhand Mukti Morcha                1
28                  Karnataka                    Bharatiya Janata Party               25
29                  Karnataka                               Independent                1
30                  Karnataka                  Indian National Congress                1
31                  Karnataka                      Janata Dal (Secular)                1
32                     Kerala                  Indian National Congress               15
33                     Kerala                Indian Union Muslim League                2
34                     Kerala        Communist Party Of India (Marxist)                1
35                     Kerala                       Kerala Congress (M)                1
36                     Kerala             Revolutionary Socialist Party                1
37                Lakshadweep                Nationalist Congress Party                1
38             Madhya Pradesh                    Bharatiya Janata Party               28
39             Madhya Pradesh                  Indian National Congress                1
40                Maharashtra                    Bharatiya Janata Party               23
41                Maharashtra                                  Shivsena               18
42                Maharashtra                Nationalist Congress Party                4
43                Maharashtra    All India Majlis-E-Ittehadul Muslimeen                1
44                Maharashtra                               Independent                1
45                Maharashtra                  Indian National Congress                1
46                    Manipur                    Bharatiya Janata Party                1
47                    Manipur                        Naga Peoples Front                1
48                  Meghalaya                  Indian National Congress                1
49                  Meghalaya                   National People'S Party                1
50                    Mizoram                       Mizo National Front                1
51                   Nagaland  Nationalist Democratic Progressive Party                1
52               NCT OF Delhi                    Bharatiya Janata Party                7
53                     Odisha                           Biju Janata Dal               12
54                     Odisha                    Bharatiya Janata Party                8
55                     Odisha                  Indian National Congress                1
56                 Puducherry                  Indian National Congress                1
57                     Punjab                  Indian National Congress                8
58                     Punjab                    Bharatiya Janata Party                2
59                     Punjab                       Shiromani Akali Dal                2
60                     Punjab                           Aam Aadmi Party                1
61                  Rajasthan                    Bharatiya Janata Party               24
62                  Rajasthan                Rashtriya Loktantrik Party                1
63                     Sikkim                  Sikkim Krantikari Morcha                1
64                 Tamil Nadu                 Dravida Munnetra Kazhagam               23
65                 Tamil Nadu                  Indian National Congress                8
66                 Tamil Nadu                  Communist Party Of India                2
67                 Tamil Nadu        Communist Party Of India (Marxist)                2
68                 Tamil Nadu  All India Anna Dravida Munnetra Kazhagam                1
69                 Tamil Nadu                Indian Union Muslim League                1
70                 Tamil Nadu            Viduthalai Chiruthaigal Katchi                1
71                  Telangana                 Telangana Rashtra Samithi                9
72                  Telangana                    Bharatiya Janata Party                4
73                  Telangana                  Indian National Congress                3
74                  Telangana    All India Majlis-E-Ittehadul Muslimeen                1
75                    Tripura                    Bharatiya Janata Party                2
76              Uttar Pradesh                    Bharatiya Janata Party               62
77              Uttar Pradesh                       Bahujan Samaj Party               10
78              Uttar Pradesh                           Samajwadi Party                5
79              Uttar Pradesh                       Apna Dal (Soneylal)                2
80              Uttar Pradesh                  Indian National Congress                1
81                Uttarakhand                    Bharatiya Janata Party                5
82                West Bengal              All India Trinamool Congress               22
83                West Bengal                    Bharatiya Janata Party               18
84                West Bengal                  Indian National Congress                2

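To do this with BeautifulSoup, you have to iterate over each row (the <tr> tags), then over each data cell (the <td> tags) in that row, and append the values to a list, a DataFrame, or whatever structure you want to store them in.

Something like this: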

import requests
from bs4 import BeautifulSoup

url="http://www.elections.in/"

r=requests.get(url).content
htmlDoc=r.decode("utf-8")

soup = BeautifulSoup(htmlDoc, 'html.parser')

table = soup.find_all('table')[2]
rows = table.find_all('tr')

headers = table.find_all('th')
headers = [ each.text for each in headers ]

list_of_rows = []
for row in rows:
    data = row.find_all('td')
    if data != []:
        data = [ each.text for each in data ]
        list_of_rows.append(data)

Output:

print (headers)
['State', 'Party', 'Number of Seats']

print (list_of_rows)
[['Andaman & Nicobar Islands', 'Indian National Congress', '1'], ['Andhra Pradesh', 'Yuvajana Sramika Rythu Congress Party', '22'], ['Andhra Pradesh', 'Telugu Desam', '3'], ['Arunachal Pradesh', 'Bharatiya Janata Party', '2'], ['Assam', 'Bharatiya Janata Party', '9'], ['Assam', 'Indian National Congress', '3'], ['Assam', 'All India United Democratic Front', '1'], ['Assam', 'Independent', '1'], ['Bihar', 'Bharatiya Janata Party', '17'], ['Bihar', 'Janata Dal (United)', '16'], ['Bihar', 'Lok Jan Shakti Party', '6'], ['Bihar', 'Indian National Congress', '1'], ['Chandigarh', 'Bharatiya Janata Party', '1'], ['Chhattisgarh', 'Bharatiya Janata Party', '9'], ['Chhattisgarh', 'Indian National Congress', '2'], ['Dadra & Nagar Haveli', 'Independent', '1'], ['Daman & Diu', 'Bharatiya Janata Party', '1'], ['Goa', 'Bharatiya Janata Party', '1'], ['Goa', 'Indian National Congress', '1'], ['Gujarat', 'Bharatiya Janata Party', '26'], ['Haryana', 'Bharatiya Janata Party', '10'], ['Himachal Pradesh', 'Bharatiya Janata Party', '4'], ['Jammu & Kashmir', 'Bharatiya Janata Party', '3'], ['Jammu & Kashmir', 'Jammu & Kashmir National Conference', '3'], ['Jharkhand', 'Bharatiya Janata Party', '11'], ['Jharkhand', 'Ajsu Party', '1'], ['Jharkhand', 'Indian National Congress', '1'], ['Jharkhand', 'Jharkhand Mukti Morcha', '1'], ['Karnataka', 'Bharatiya Janata Party', '25'], ['Karnataka', 'Independent', '1'], ['Karnataka', 'Indian National Congress', '1'], ['Karnataka', 'Janata Dal (Secular)', '1'], ['Kerala', 'Indian National Congress', '15'], ['Kerala', 'Indian Union Muslim League', '2'], ['Kerala', 'Communist Party Of India (Marxist)', '1'], ['Kerala', 'Kerala Congress (M)', '1'], ['Kerala', 'Revolutionary Socialist Party', '1'], ['Lakshadweep', 'Nationalist Congress Party', '1'], ['Madhya Pradesh', 'Bharatiya Janata Party', '28'], ['Madhya Pradesh', 'Indian National Congress', '1'], ['Maharashtra', 'Bharatiya Janata Party', '23'], ['Maharashtra', 'Shivsena', '18'], ['Maharashtra', 'Nationalist Congress Party', '4'], ['Maharashtra', 'All India Majlis-E-Ittehadul Muslimeen', '1'], ['Maharashtra', 'Independent', '1'], ['Maharashtra', 'Indian National Congress', '1'], ['Manipur', 'Bharatiya Janata Party', '1'], ['Manipur', 'Naga Peoples Front', '1'], ['Meghalaya', 'Indian National Congress', '1'], ['Meghalaya', "National People'S Party", '1'], ['Mizoram', 'Mizo National Front', '1'], ['Nagaland', 'Nationalist Democratic Progressive Party', '1'], ['NCT OF Delhi', 'Bharatiya Janata Party', '7'], ['Odisha', 'Biju Janata Dal', '12'], ['Odisha', 'Bharatiya Janata Party', '8'], ['Odisha', 'Indian National Congress', '1'], ['Puducherry', 'Indian National Congress', '1'], ['Punjab', 'Indian National Congress', '8'], ['Punjab', 'Bharatiya Janata Party', '2'], ['Punjab', 'Shiromani Akali Dal', '2'], ['Punjab', 'Aam Aadmi Party', '1'], ['Rajasthan', 'Bharatiya Janata Party', '24'], ['Rajasthan', 'Rashtriya Loktantrik Party', '1'], ['Sikkim', 'Sikkim Krantikari Morcha', '1'], ['Tamil Nadu', 'Dravida Munnetra Kazhagam', '23'], ['Tamil Nadu', 'Indian National Congress', '8'], ['Tamil Nadu', 'Communist Party Of India', '2'], ['Tamil Nadu', 'Communist Party Of India (Marxist)', '2'], ['Tamil Nadu', 'All India Anna Dravida Munnetra Kazhagam', '1'], ['Tamil Nadu', 'Indian Union Muslim League', '1'], ['Tamil Nadu', 'Viduthalai Chiruthaigal Katchi', '1'], ['Telangana', 'Telangana Rashtra Samithi', '9'], ['Telangana', 'Bharatiya Janata Party', '4'], ['Telangana', 'Indian National Congress', '3'], ['Telangana', 'All India Majlis-E-Ittehadul Muslimeen', '1'], ['Tripura', 'Bharatiya Janata Party', '2'], ['Uttar Pradesh', 'Bharatiya Janata Party', '62'], ['Uttar Pradesh', 'Bahujan Samaj Party', '10'], ['Uttar Pradesh', 'Samajwadi Party', '5'], ['Uttar Pradesh', 'Apna Dal (Soneylal)', '2'], ['Uttar Pradesh', 'Indian National Congress', '1'], ['Uttarakhand', 'Bharatiya Janata Party', '5'], ['West Bengal', 'All India Trinamool Congress', '22'], ['West Bengal', 'Bharatiya Janata Party', '18'], ['West Bengal', 'Indian National Congress', '2']]
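
The row-and-cell loop can also be tried without touching the network. What follows is only a minimal sketch: `SAMPLE_HTML` is a shortened copy of the table from the question, and `table_to_rows` is a hypothetical helper name, not something from the original code. It assumes the live page uses the same <tr>/<th>/<td> structure, and it joins each row with commas to produce the output the question asks for.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (assumption: the live
# page has the same <tr>/<th>/<td> structure).
SAMPLE_HTML = """
<table class="tableizer-table">
<thead><tr class="tableizer-firstrow"><th>State</th><th>Party</th><th>Number of Seats</th></tr></thead><tbody>
 <tr><td>Andaman &amp; Nicobar Islands</td><td>Indian National Congress</td><td>1</td></tr>
 <tr><td>Andhra Pradesh</td><td>Yuvajana Sramika Rythu Congress Party</td><td>22</td></tr>
</tbody></table>
"""


def table_to_rows(table):
    """Hypothetical helper: return one list of cell texts per <tr>."""
    rows = []
    for tr in table.find_all("tr"):
        # Header rows use <th>, data rows use <td>; accept both.
        cells = tr.find_all(["th", "td"])
        if cells:
            rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = table_to_rows(soup.find("table"))

# One comma-separated line per row, header first.
lines = [",".join(row) for row in rows]
print("\n".join(lines))
```

Joining by hand like this is fine as long as no cell contains a comma; if one might, the csv module is the safer choice because it quotes such fields.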

But like I said, pandas will do this for you with .read_html().
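
Because the three tables share one class, picking a table by position (index 1 or 2) can break if the page layout changes. One possible alternative, sketched here under the assumption that each table directly follows its own <h3> heading as in the question's snippet, is to locate the heading by its text and take the next <table>. The first heading and table in `SAMPLE_HTML` are invented for illustration.

```python
import re

from bs4 import BeautifulSoup

# Assumption: each table directly follows its own <h3> heading, as in the
# snippet from the question. The first heading/table pair is made up.
SAMPLE_HTML = """
<h3 class="blmap">Some Other Table</h3>
<table class="tableizer-table"><tbody>
 <tr><td>ignored</td></tr>
</tbody></table>
<h3 class="blmap">17th General (Lok Sabha) Election Results 2019 – State Wise</h3>
<table class="tableizer-table"><tbody>
 <tr><td>Assam</td><td>Bharatiya Janata Party</td><td>9</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Find the heading whose text mentions "State Wise" ...
heading = soup.find("h3", string=re.compile("State Wise"))
# ... and take the first <table> that appears after it in the document.
table = heading.find_next("table")

first_row = [td.get_text(strip=True) for td in table.find("tr").find_all("td")]
print(first_row)
```

On the real page, `heading` would be `None` if the heading text ever changed, so a check for that before calling `find_next` would be sensible.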

Answer 1 (score: 1)

A slightly shorter BeautifulSoup solution:

from bs4 import BeautifulSoup as soup
# `content` is assumed to hold the table's HTML as a string (e.g. the snippet from the question)
d = soup(content, 'html.parser')
headers, data = [i.text for i in d.find_all('th')], [[i.text for i in b.find_all('td')] for b in d.find_all('tr')[1:]]

Output:

['State', 'Party', 'Number of Seats']
[['Andaman & Nicobar Islands', 'Indian National Congress', '1'], ['Andhra Pradesh', 'Yuvajana Sramika Rythu Congress Party', '22'], ['Andhra Pradesh', 'Telugu Desam', '3'], ['Arunachal Pradesh', 'Bharatiya Janata Party', '2'], ['Assam', 'Bharatiya Janata Party', '9'], ['Assam', 'Indian National Congress', '3'], ['Assam', 'All India United Democratic Front', '1']]

To write to a CSV file:

import csv
# newline='' stops the csv module from writing blank lines between rows on Windows
with open('election_results.csv', 'w', newline='') as f:
  write = csv.writer(f)
  write.writerows([headers, *data])

Output:

State,Party,Number of Seats
Andaman & Nicobar Islands,Indian National Congress,1
Andhra Pradesh,Yuvajana Sramika Rythu Congress Party,22
Andhra Pradesh,Telugu Desam,3
Arunachal Pradesh,Bharatiya Janata Party,2
Assam,Bharatiya Janata Party,9
Assam,Indian National Congress,3
Assam,All India United Democratic Front,1
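
One reason to use `csv.writer` rather than joining cells with commas by hand is that it quotes any field that itself contains a comma. The following is a small sketch that writes to an in-memory buffer instead of a file; the second data row is invented purely to show the quoting and is not from the site.

```python
import csv
import io

headers = ["State", "Party", "Number of Seats"]
data = [
    ["Assam", "Bharatiya Janata Party", "9"],
    # Invented row: the party name contains a comma on purpose.
    ["Example State", "Party A, Party B", "1"],
]

# newline="" lets the csv module control line endings itself.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(headers)
writer.writerows(data)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back yields the original cells, comma included.
parsed = list(csv.reader(io.StringIO(csv_text, newline="")))
```

The field with the comma is written wrapped in double quotes, and `csv.reader` restores it as a single cell when the data is read back.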