How to resolve an index error and scrape data from a webpage

Date: 2019-05-10 10:51:06

Tags: python-3.x pandas web-scraping beautifulsoup

I want to scrape data from a webpage on the Wayback Machine using pandas. I use string splitting to split certain strings, if they are present.

The URL of the webpage is this

Here is my code:

import pandas as pd

url =  "https://web.archive.org/web/20140528015357/http://eciresults.nic.in/statewiseS26.htm"
dfs = pd.read_html(url)

df = dfs[0]

idx = df[df[0] == '\xa0Next >>'].index[0]
# Error mentioned in comment happens on the above line.


cols = list(df.iloc[idx-1,:])
df.columns = cols

df = df[df['Const. No.'].notnull()]
df = df.loc[df['Const. No.'].str.isdigit()].reset_index(drop=True)
df = df.dropna(axis=1,how='all')

df['Leading Candidate'] = df['Leading Candidate'].str.split('i',expand=True)[0]
df['Leading Party'] = df['Leading Party'].str.split('iCurrent',expand=True)[0]
df['Trailing Party'] = df['Trailing Party'].str.split('iCurrent',expand=True)[0]
df['Trailing Candidate'] = df['Trailing Candidate'].str.split('iAssembly',expand=True)[0]


df.to_csv('Chhattisgarh_cand.csv', index=False)
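The IndexError on the marked line happens when no row equals `'\xa0Next >>'`, so `.index[0]` is asked for an element that does not exist. A minimal sketch of a guarded lookup, using a toy frame in place of `dfs[0]` (the real page layout is an assumption here):

```python
import pandas as pd

# Toy frame standing in for dfs[0]; the real table scraped from the
# archived page may or may not contain the marker row at all.
df = pd.DataFrame({0: ['header', 'row1', '\xa0Next >>']})

matches = df.index[df[0] == '\xa0Next >>']
if len(matches) == 0:
    raise ValueError("marker row not found -- the page layout may differ")
idx = matches[0]
print(idx)  # 2
```

Checking the match count before indexing turns the opaque IndexError into a clear message about the page layout.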

The expected output for the webpage must be in csv format, for example Output

2 Answers:

Answer 0 (score: 1)

You can extract the data with BeautifulSoup. Pandas will help you process the data efficiently, but it is not meant for data extraction itself.

import pandas as pd
from bs4 import BeautifulSoup
import requests
response = requests.get('https://web.archive.org/web/20140528015357/http://eciresults.nic.in/statewiseS26.htm?st=S26')
soup = BeautifulSoup(response.text,'lxml')
table_data = []
required_table = [table for table in soup.find_all('table') if 'Indian National Congress' in str(table)]
if required_table:
    for tr_tags in required_table[0].find_all('tr',{'style':'font-size:12px;'}):
        td_data = []
        for td_tags in tr_tags.find_all('td'):
            td_data.append(td_tags.text.strip())
        table_data.append(td_data)
df = pd.DataFrame(table_data[1:])
# print(df.head())
df.to_csv("DataExport.csv",index=False)

You will get a result like this in the pandas dataframe:

                0   1  ...       6                7
0        BILASPUR   5  ...  176436  Result Declared
1            DURG   7  ...   16848  Result Declared
2  JANJGIR-CHAMPA   3  ...  174961  Result Declared
3          KANKER  11  ...   35158  Result Declared
4           KORBA   4  ...    4265  Result Declared
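The frame above carries only numeric column labels because the header row was skipped with `table_data[1:]`. Assigning names makes later filtering easier; a sketch with toy rows and assumed header names (the real headers come from the page's header row, so these labels are illustrative):

```python
import pandas as pd

# Toy rows mimicking the scraped table_data; the column names below are
# assumptions modelled on the ECI results page, not taken from the source.
rows = [['BILASPUR', '5', '176436', 'Result Declared'],
        ['DURG', '7', '16848', 'Result Declared']]
df = pd.DataFrame(rows, columns=['Constituency', 'Const. No.', 'Margin', 'Status'])

# With named columns, numeric cleanup becomes straightforward.
print(df['Margin'].astype(int).sum())  # 193284
```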

Answer 1 (score: 0)

The code below should get the table from your url link ("Chhattisgarh Result status") using a combination of BS and pandas; it can then be saved as csv:

from bs4 import BeautifulSoup
import urllib.request
import pandas as pd

url =  "https://web.archive.org/web/20140528015357/http://eciresults.nic.in/statewiseS26.htm?st=S26"

response = urllib.request.urlopen(url)
elect = response.read()
soup = BeautifulSoup(elect,"lxml")
res = soup.find_all('table')
df = pd.read_html(str(res[7]))  # read_html returns a list of DataFrames
df[3]
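This answer stops at `df[3]` without writing the csv it mentions. Since `pd.read_html` returns a list of DataFrames, the table itself is reached by indexing into that list before saving. A sketch of the final step, using a small stand-in frame in place of `df[3]`:

```python
import pandas as pd

# Stand-in for df[3]; the real frame holds the constituency-wise results.
table = pd.DataFrame({'Constituency': ['DURG'], 'Margin': [16848]})

# Write the selected table out, matching the filename used in the question.
table.to_csv('Chhattisgarh_cand.csv', index=False)

# Round-trip check: the saved csv reloads with the same shape.
print(pd.read_csv('Chhattisgarh_cand.csv').shape)  # (1, 2)
```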