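I scraped a table from a Wikipedia page and now want to clean the data. I have already converted the data to a pandas DataFrame, but I am running into some problems while cleaning it.
Here is the code I ran to scrape the table from the Wikipedia page: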
import requests
import pandas as pd
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')
print(soup.prettify())
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table
PostalCode=[]
for row in My_table.findAll('tr')[1:]:
    PostalCode_cell=row.findAll('td')[0]
    PostalCode.append(PostalCode_cell.text)
print(PostalCode)
Borough=[]
for row in My_table.findAll('tr')[1:] :
    Borough_cell=row.findAll('td')[1]
    Borough.append(Borough_cell.text)
print(Borough)
Neighbourhood=[]
for row in My_table.findAll('tr')[1:]:
    Neighbourhood_cell=row.findAll('td')[2]
    Neighbourhood_cell.text.rstrip('\n')
    Neighbourhood.append(Neighbourhood_cell.text)
print(Neighbourhood)
canada=pd.DataFrame({'PostalCode':PostalCode,'Borough':Borough,'Neighborhood':Neighbourhood})
canada.rename(columns = {'PostalCode':'PostalCode','Borough':'Borough','Neighborhood':'Neighborhood'}, inplace = True)
canada
I tried the groupby function, hoping to get the second expected result below, but it did not work:
canada.groupby(['PostalCode', 'Borough'])
I also tried to remove the "Not assigned" values from Borough:
canada=canada.Borough.drop("Not assigned",axis=0)
But it fails with: "['Not assigned'] not found in axis"
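A likely explanation, sketched on a small made-up frame: `drop` removes rows by index label, not by cell value, so pandas looks for a row labelled "Not assigned" and raises a KeyError. Filtering with a boolean mask works on values and keeps the whole DataFrame. The sample rows below are illustrative only, not the scraped data.

```python
import pandas as pd

# Illustrative rows only; the real frame comes from the scraped table.
canada = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# drop() matches index labels (0, 1, 2 here), so a cell value is never found.
try:
    canada.Borough.drop('Not assigned', axis=0)
    drop_failed = False
except KeyError:
    drop_failed = True

# A boolean mask filters by value and keeps every column.
cleaned = canada[canada['Borough'] != 'Not assigned'].reset_index(drop=True)
print(cleaned)
```

Note that the original line also assigned the result of `canada.Borough.drop(...)` back to `canada`, which would have replaced the DataFrame with a single column even if the call had succeeded.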
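Here is the expected result for my cleaned data:
1. Ignore cells whose Borough is "Not assigned".
2. Neighbourhoods that share the same postal code and borough should appear in one row, separated by commas.
3. If a cell has a borough but a "Not assigned" neighbourhood, the neighbourhood should be the same as the borough.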
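The three requirements can be sketched roughly as follows. This is one possible approach, not the only one. It assumes the column names from the question's DataFrame and uses a few made-up rows. `groupby` on its own only returns a lazy GroupBy object, which is probably why the earlier attempt appeared to do nothing; an aggregation such as `agg(', '.join)` is needed to produce rows.

```python
import pandas as pd

# Illustrative rows only; column names follow the question's DataFrame.
canada = pd.DataFrame({
    'PostalCode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. Ignore rows whose Borough is "Not assigned".
canada = canada[canada['Borough'] != 'Not assigned'].copy()

# 3. Where only the Neighborhood is "Not assigned", fall back to the Borough.
missing = canada['Neighborhood'] == 'Not assigned'
canada.loc[missing, 'Neighborhood'] = canada.loc[missing, 'Borough']

# 2. Join neighborhoods sharing a postal code and borough into one row.
result = (
    canada.groupby(['PostalCode', 'Borough'])['Neighborhood']
          .agg(', '.join)
          .reset_index()
)
print(result)
```

Step 3 is applied before step 2 so that the fallback value, rather than the literal "Not assigned", is what gets joined.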
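Also, I noticed that every value in the "Neighbourhood" column of the scraped table ends with "\n". Should I add some code during scraping to get rid of it?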
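On the trailing "\n": `str.rstrip` returns a new string rather than changing the original, so the stripped value has to be the one that is appended. A minimal sketch, with made-up strings standing in for the `Neighbourhood_cell.text` values:

```python
import pandas as pd

# Made-up cell texts standing in for Neighbourhood_cell.text values.
raw_cells = ['Parkwoods\n', 'Victoria Village\n', 'Not assigned\n']

# rstrip() returns a new string, so append the stripped value itself.
Neighbourhood = []
for text in raw_cells:
    Neighbourhood.append(text.rstrip('\n'))

# Alternatively, strip an existing column after building the DataFrame.
canada = pd.DataFrame({'Neighborhood': raw_cells})
canada['Neighborhood'] = canada['Neighborhood'].str.strip()
print(Neighbourhood)
```

During scraping, BeautifulSoup's `get_text(strip=True)` on the cell should have a similar effect.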
Thank you very much in advance for your help.
Answer 0 (score: 0)
This feels a bit long-winded.
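import pandas as pd
# read_html returns a list of DataFrames, one per table found on the page
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
canada = tables[0]
# promote the first row to the header (only needed if read_html did not already pick up the header row)
canada.columns = canada.iloc[0]
canada = canada.iloc[1:]
# keep only rows with an assigned borough; copy() avoids a SettingWithCopyWarning below
canada = canada[canada.Borough != 'Not assigned'].copy()
# a single .loc call modifies canada itself instead of a temporary copy
canada.loc[canada['Neighbourhood'] == 'Not assigned', 'Neighbourhood'] = canada['Borough']
canada['Location'] = canada.Borough + ', ' + canada.Neighbourhood
canada = canada.drop(['Borough', 'Neighbourhood'], axis=1)
# reset_index returns a new frame, so assign the result
canada = canada.reset_index(drop=True)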
Reference:
https://stackoverflow.com/a/49161313/6241235
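Edit:
I think @bubble's point about a case-insensitive search is a good one (I had not thought of it). They suggested canada = canada[canada.loc[:, 'Borough'].str.contains('Not assigned', case=False)]
Note that, as written, this mask keeps the "Not assigned" rows; negate it with ~ to drop them.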
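A small sketch of the case-insensitive filter, using made-up rows with mixed capitalisation. The `~` operator inverts the mask so that matching rows are removed rather than kept; the column name `Postcode` is an assumption for illustration.

```python
import pandas as pd

# Made-up rows; 'Postcode' is an assumed column name for illustration.
canada = pd.DataFrame({
    'Postcode': ['M1A', 'M2A', 'M3A'],
    'Borough': ['Not assigned', 'not assigned', 'North York'],
})

# True for any capitalisation of "Not assigned".
mask = canada['Borough'].str.contains('Not assigned', case=False)

# Invert the mask to drop the unassigned rows.
canada = canada[~mask]
print(canada)
```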