我正在尝试使用read_html
import requests
import pandas as pd
import numpy as np
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate'
resp = requests.get(url)
tables = pd.read_html(resp.text)
但是我得到这个错误
IndexError:列表索引超出范围
其他Wiki页面工作正常。 page怎么了,我该如何解决以上错误?
答案 0 :(得分:1)
由于jquery表排序器,似乎无法读取该表。 在处理jquery而不是纯html时,很容易将带有硒库的表读入df中。您仍然需要进行一些清理,但这会将表放入df中。
您还需要安装硒库并下载Web浏览器驱动程序。
from selenium import webdriver
driver = r'C:\chromedriver_win32\chromedriver.exe'
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_intentional_homicide_rate'
driver = webdriver.Chrome(driver)
driver.get(url)
the_table = driver.find_element_by_xpath('//*[@id="mw-content-text"]/div/table[2]/tbody/tr/td[2]/table')
data = the_table.text
df = pd.DataFrame([x.split() for x in data.split('\n')])
driver.close()
print(df)
0 1 2 3 4 5 \
0 Country (or dependent territory, None None
1 subnational area, etc.) Region Subregion Rate
2 listed Source None None None None
3 None None None None None None
4 Burundi Africa Eastern Africa 6.02 635
5 Comoros Africa Eastern Africa 7.70 60
6 Djibouti Africa Eastern Africa 6.48 60
7 Eritrea Africa Eastern Africa 8.04 390
8 Ethiopia Africa Eastern Africa 7.56 7,552
9 Kenya Africa Eastern Africa 5.00 2,466
10 Madagascar Africa Eastern Africa 7.69 1,863