How do I use BeautifulSoup to get the information in a table?

Time: 2016-06-15 18:44:43

Tags: python beautifulsoup

I am trying to get the information in a table from this website: http://indiawater.gov.in/IMISReports/Reports/WaterQuality/rpt_WQM_LaboratoryInformation_S.aspx?Rep=0&RP=Y

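When I inspect the page, I can find the data in td elements under the classes oddrowcolor and evenrowcolor. However, when I try to fetch that information, nothing is output. How can I get the information in the table using BeautifulSoup for Python?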

Here is my code:

import requests
from bs4 import BeautifulSoup
url = "http://indiawater.gov.in/IMISReports/Reports/WaterQuality/rpt_WQM_LaboratoryInformation_S.aspx?Rep=0&RP=Y"
r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

# Look for rows carrying the class seen in the browser inspector
for tr in soup.find_all('tr', {'class': 'oddrowcolor'}):
    print(tr)

I tried oddrowcolor, but got no output.

1 answer:

Answer 0 (score: 2)

You can get the table using its id, but oddrowcolor and the like are added dynamically, so they are not in the page source that requests receives:

import requests
from bs4 import BeautifulSoup
url = "http://indiawater.gov.in/IMISReports/Reports/WaterQuality/rpt_WQM_LaboratoryInformation_S.aspx?Rep=0&RP=Y"
r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")
table = soup.select_one("#tableReportTable")

for tr in table.find_all("tr"):
    print(tr)
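
To see the difference without hitting the network, here is a minimal sketch on a small inline document. RAW_HTML is a hypothetical, simplified stand-in for the raw page source (only the table id comes from the answer above); the variable names are illustrative too. It assumes, as the answer says, that the rows have no class attribute until a script adds one in the browser:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the raw HTML the server returns:
# the rows carry no class attribute until a script adds one in the browser.
RAW_HTML = """
<table id="tableReportTable">
  <tr><th>S.No.</th><th>State</th></tr>
  <tr><td>1</td><td><a href="#">GOA</a></td></tr>
  <tr><td>2</td><td><a href="#">ASSAM</a></td></tr>
</table>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# Searching by the class seen in the browser inspector finds nothing...
rows_by_class = soup.find_all("tr", {"class": "oddrowcolor"})

# ...while going through the table id finds every row.
table = soup.select_one("#tableReportTable")
rows_by_id = table.find_all("tr")

print(len(rows_by_class))  # 0
print(len(rows_by_id))     # 3
```

A quick way to run the same check on the real page is to test whether the string "oddrowcolor" occurs in r.text at all.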

To extract the table data, you can do the following:

soup = BeautifulSoup(r.content, "html.parser")

# gets the table using the table id
table = soup.select_one("#tableReportTable")
# column names
print(", ".join([th.text.strip() for th in table.select_one("tr").find_all("th")]))

# "tr + tr" matches every tr that directly follows another tr, i.e. all rows after the first
for tr in table.select("tr + tr"):
    # tr.select("td a") -> get all the anchor tags inside the row tds
    # then get the text from each anchor.
    print(",".join([a.text for a in tr.select("td a")]))

This gives you:

S.No., State, State Labs (without mobile labs), District Labs (without mobile labs), Block Labs/Total Blocks (without mobile labs), SubDivision Labs (without mobile labs), Mobile Labs (State/ District/ Block/ Sub-division Level), Total Labs   (State/ District/ Block/ Sub-division Level)

ANDAMAN and NICOBAR,1,0,NA / 9,0,2,3
ANDHRA PRADESH,1,32,NA / 662,73,0,106
ARUNACHAL PRADESH,1,17,NA / 100,31,0,49
ASSAM,1,29,NA / 242,53,20,103
BIHAR,1,41,NA / 536,0,0,42
CHANDIGARH,0,0,NA / 1,0,0,0
CHATTISGARH,1,27,NA / 146,20,5,53
DADRA & NAGAR HAVELI,0,0,NA / 10,0,0,0
DAMAN & DIU,0,0,NA / 1,0,0,0
DELHI,0,0,NA / 0,0,0,0
GOA,1,0,1 / 11,9,0,11
GUJARAT,1,34,50 / 246,0,6,91
HARYANA,0,21,NA / 126,21,0,42
HIMACHAL PRADESH,1,14,NA / 77,28,0,43
JAMMU AND KASHMIR,0,22,2 / 148,74,0,98
JHARKHAND,1,24,NA / 259,3,5,33
KARNATAKA,1,44,39 / 176,106,46,236
KERALA,1,14,NA / 148,33,0,48
LAKSHADWEEP,0,9,NA / 9,0,0,9
MADHYA PRADESH,1,51,3 / 313,106,0,161
MAHARASHTRA,1,44,2 / 351,139,0,186
MANIPUR,1,9,NA / 38,2,0,12
MEGHALAYA,1,7,NA / 42,22,0,30
MIZORAM,1,8,NA / 26,18,0,27
NAGALAND,0,11,NA / 74,1,2,14
ODISHA,1,32,NA / 314,42,0,75
PUDUCHERRY,0,2,NA / 3,0,0,2
PUNJAB,3,22,8 / 145,0,1,34
RAJASTHAN,1,33,163 / 295,0,0,197
SIKKIM,0,2,NA / 9,0,0,2
TAMIL NADU,1,34,NA / 385,49,0,84
TELANGANA,1,19,NA / 438,56,0,76
TRIPURA,1,8,7 / 58,6,0,22
UTTAR PRADESH,1,76,3 / 820,2,0,82
UTTARAKHAND,0,28,1 / 95,14,0,43
WEST BENGAL,1,18,NA / 341,201,0,220

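This appears to match what I see in the browser. The Total values etc. are in th tags inside the last tr, so add the following outside the loop (after the loop ends, tr still refers to that last row):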

print(",".join([a.text.strip() for a in tr.select("th")])) 

Which gives you:

Total,27,732,279,1109,87,2234
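
If you would rather collect the rows than print them, the header row, the data rows and the Total row can be handled in one pass. The following is only a sketch under stated assumptions: table_to_rows is a hypothetical helper, not part of BeautifulSoup, and SAMPLE_HTML is a made-up three-column fragment that imitates the structure described above (anchors inside td for data rows, th cells for the header and Total rows). Its Total row is the sum of the two sample rows only, not the real figure.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up fragment imitating the structure described above; the Total row
# sums only these two sample rows.
SAMPLE_HTML = """
<table id="tableReportTable">
  <tr><th>State</th><th>State Labs</th><th>Total Labs</th></tr>
  <tr><td><a href="#">GOA</a></td><td><a href="#">1</a></td><td><a href="#">11</a></td></tr>
  <tr><td><a href="#">PUNJAB</a></td><td><a href="#">3</a></td><td><a href="#">34</a></td></tr>
  <tr><th>Total</th><th>4</th><th>45</th></tr>
</table>
"""


def table_to_rows(html, table_id="tableReportTable"):
    """Hypothetical helper: return the table as a list of rows of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("#" + table_id)
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Data rows keep their values in anchors inside the td cells.
        cells = [a.get_text(strip=True) for a in tr.select("td a")]
        if not cells:
            # Header and Total rows use th cells instead.
            cells = [th.get_text(strip=True) for th in tr.find_all("th")]
        if cells:
            rows.append(cells)
    return rows


rows = table_to_rows(SAMPLE_HTML)

# Write the rows as CSV into an in-memory buffer; swap in open(...) for a file.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Using the csv module instead of ",".join also quotes any cell that itself contains a comma. On the live page you would pass r.content in place of SAMPLE_HTML.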