I am trying to scrape this Wikipedia page.
I have run into a few problems and would greatly appreciate your help:
1. Some rows have multiple names or links, and I want all of them assigned to the correct country/region. Is there any way to do that? (A sketch follows my code below.)
2. I want to skip the "Name (native)" column. How do I do that?
3. If I do scrape the "Name (native)" column, I get gibberish. Is there anything I can do about the encoding?
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_government_gazettes'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

table = soup.find('table', class_='wikitable').tbody
rows = table.find_all('tr')

# In Python 3, col.text is already a decoded str, so calling .encode()
# first (and then str.replace on the resulting bytes) raises a TypeError;
# strip the non-breaking space ('\xa0') and newlines from the str directly.
columns = [col.text.replace('\xa0', '').replace('\n', '') for col in rows[1].find_all('td')]
print(columns)
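
One possible approach to question 1, as a minimal sketch (not from the original post): walk the rows yourself and collect every link in the name cell, so a row with several gazettes still maps back to its country. It continues from the rows variable above and assumes the first cell of each row is the country and the second cell holds the linked name(s):

# Continues from `rows` above; assumes cell 0 = country, cell 1 = name(s).
for row in rows[1:]:                      # rows[0] is the header row
    cells = row.find_all('td')
    if len(cells) < 2:
        continue                          # skip spanning/malformed rows
    country = cells[0].get_text(strip=True)
    for link in cells[1].find_all('a'):   # a cell may hold several links
        print(country, link.get_text(strip=True), link.get('href'))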
Answer (score: 2):
You can use the pandas function read_html and take the second DataFrame from the list it returns:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_government_gazettes'
# read_html returns every table on the page as a list of DataFrames;
# the gazette table is the second one, at index 1.
df = pd.read_html(url)[1]
print(df.head())
Country/region Name \
0 Albania Official Gazette of the Republic of Albania
1 Algeria Official Gazette
2 Andorra Official Bulletin of the Principality of Andorra
3 Antigua and Barbuda Antigua and Barbuda Official Gazette
4 Argentina Official Gazette of the Republic of Argentina
Name (native) Website
0 Fletorja Zyrtare E Republikës Së Shqipërisë qbz.gov.al
1 Journal Officiel d'Algérie joradp.dz/HAR
2 Butlletí Oficial del Principat d'Andorra www.bopa.ad
3 Antigua and Barbuda Official Gazette www.legalaffairs.gov.ag
4 Boletín Oficial de la República Argentina www.boletinoficial.gob.ar
If you check the output, there is a problem at row 26, because the data on the Wiki page itself is wrong there. The solution is to set that value by row label and column name:
import numpy as np  # needed for np.nan

df.loc[26, 'Name (native)'] = np.nan  # blank out the bad native-name entry
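
For question 2, skipping the "Name (native)" column entirely, a minimal sketch along the same pandas lines (the column name is taken from the output above):

import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_government_gazettes'
df = pd.read_html(url)[1]
df = df.drop(columns=['Name (native)'])  # keep Country/region, Name, Website
print(df.head())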