I'm trying to get this code to work (a web scraping example using BeautifulSoup):
import urllib2
wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
page = urllib2.urlopen(wiki)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page)
I get this error:
URLError: <urlopen error [Errno 10061] No connection could be made because the target machine actively refused it>
I suspect this is related to a firewall or security setting. Can anyone suggest what to do?
Answer 0 (score: 1)
You can use requests.
Try something like this:
import requests
from bs4 import BeautifulSoup
wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
page = requests.get(wiki).content
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(page, "html.parser")
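If a proxy or firewall is what refuses the connection, as the question suspects, changing libraries may not be enough on its own. Below is a minimal sketch, not a definitive fix, that uses only the Python 3 standard library (urllib.request, the successor to urllib2) rather than requests. The helper name fetch and the proxy address in the note that follows are hypothetical; substitute whatever your network actually requires.

```python
import urllib.error
import urllib.request


def fetch(url, proxy=None, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    `fetch` is a hypothetical helper written for this example.
    """
    # An empty dict tells ProxyHandler to ignore any proxy configured in the
    # environment and connect directly; otherwise route through `proxy`.
    proxies = {} if proxy is None else {"http": proxy, "https": proxy}
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as exc:
        # Covers refused connections, timeouts while connecting, DNS failures,
        # and HTTP error statuses (HTTPError is a subclass of URLError).
        print("Request failed:", exc.reason)
        return None
```

For example, fetch(wiki) tries a direct connection, while fetch(wiki, proxy="http://proxy.example.com:8080") goes through a proxy (that address is a placeholder, not a real server). A None result means the request did not succeed, and the printed reason should indicate whether the connection was refused, timed out, or failed name resolution. The returned bytes can be passed to BeautifulSoup in the same way as requests.get(wiki).content.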
If you want to get the table, you can use pandas like this:
import pandas as pd
wiki = "https://en.wikipedia.org/wiki/List_of_state_and_union_territory_capitals_in_India"
df = pd.read_html(wiki)[1]
df2 = df.copy()
df2.columns = df.iloc[0]
df2.drop(0, inplace=True)
df2.drop('No.', axis=1, inplace=True)
df2.head()
Output: