我正在取消英格兰的联合数据,并且当我一次做一家医院时,我会以正确的格式获得结果。我最终希望迭代所有医院,但首先决定制作一个由三家不同医院组成的阵列并计算出迭代次数。
当我只有一家医院时,下面的代码为我提供了pandas DataFrame中最终结果的正确格式:
import requests
from bs4 import BeautifulSoup
import pandas
import numpy as np
r=requests.get("http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?
hospitalName=Norfolk%20and%20Norwich%20Hospital")
c=r.content
soup=BeautifulSoup(c,"html.parser")
all=soup.find_all(["div"],{"class":"toggle_container"})[1]
i=0
temp = []
for item in all.find_all("td"):
if i%4 ==0:
temp.append(soup.find_all("span")[4].text)
temp.append(soup.find_all("h5")[0].text)
temp.append(all.find_all("td")[i].text.replace(" ",""))
i=i+1
table = np.array(temp).reshape(12,6)
final = pandas.DataFrame(table)
final
在我的迭代版本中,我无法找到将每个结果集附加到最终DataFrame的方法:
hosplist = ["http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Norfolk%20and%20Norwich%20Hospital",
"http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Barnet%20Hospital",
"http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Altnagelvin%20Area%20Hospital"]
temp2 = []
df_final = pandas.DataFrame()
for item in hosplist:
r=requests.get(item)
c=r.content
soup=BeautifulSoup(c,"html.parser")
all=soup.find_all(["div"],{"class":"toggle_container"})[1]
i=0
temp = []
for item in all.find_all("td"):
if i%4 ==0:
temp.append(soup.find_all("span")[4].text)
temp.append(soup.find_all("h5")[0].text)
temp.append(all.find_all("td")[i].text)
i=i+1
table = np.array(temp).reshape((int(len(temp)/6)),6)
temp2.append(table)
#df_final = pandas.DataFrame(df)
最后,'表格'拥有我想要的所有数据,但它不容易操作,所以我想将它放在DataFrame中。但是,我得到一个" ValueError:必须通过2-d输入"错误。
我认为这个错误说我有3个阵列可以使它成为3维。这只是一次实践迭代,有超过400家医院,我计划将数据放入数据框,但现在我被困在这里。
答案 0 :(得分:1)
您问题的简单答案是HERE。
困难的部分是接受你的代码并找到不正确的东西。
使用完整代码,我修改了它,如下所示。请复制并与你的差异。
import requests
from bs4 import BeautifulSoup
import pandas
import numpy as np
hosplist = ["http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Norfolk%20and%20Norwich%20Hospital",
"http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Barnet%20Hospital",
"http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Altnagelvin%20Area%20Hospital"]
temp2 = []
df_final = pandas.DataFrame()
for item in hosplist:
r=requests.get(item)
c=r.content
soup=BeautifulSoup(c,"html.parser")
all=soup.find_all(["div"],{"class":"toggle_container"})[1]
i=0
temp = []
for item in all.find_all("td"):
if i%4 ==0:
temp.append(soup.find_all("span")[4].text)
temp.append(soup.find_all("h5")[0].text)
temp.append(all.find_all("td")[i].text)
i=i+1
table = np.array(temp).reshape((int(len(temp)/6)),6)
for array in table:
newArray = []
for x in array:
try:
x = x.encode("ascii")
except:
x = 'cannot convert'
newArray.append(x)
temp2.append(newArray)
df_final = pandas.DataFrame(temp2, columns=['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
print df_final
我尝试使用列表推导来进行ascii转换,这对于字符串显示在数据框中是绝对必要的,但是理解是抛出一个错误,所以我构建了一个异常,并且异常从未显示过。
答案 1 :(得分:1)
我重新组织了一些代码,并且无需编码即可创建数据帧。
解决方案:
hosplist = ["http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Norfolk%20and%20Norwich%20Hospital",
"http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Barnet%20Hospital",
"http://www.njrsurgeonhospitalprofile.org.uk/HospitalProfile?hospitalName=Altnagelvin%20Area%20Hospital"]
temp = []
temp2 = []
df_final = pandas.DataFrame()
for item in hosplist:
r=requests.get(item)
c=r.content
soup=BeautifulSoup(c,"html.parser")
all=soup.find_all(["div"],{"class":"toggle_container"})[1]
i=0
for item in all.find_all("td"):
if i%4 ==0:
temp.append(soup.find_all("span")[4].text)
temp.append(soup.find_all("h5")[0].text)
temp.append(all.find_all("td")[i].text.replace("-","NaN").replace("+",""))
i=i+1
temp2.append(temp)
table = np.array(temp2).reshape((int(len(temp2[0])/6)),6)
df_final = pandas.DataFrame(table, columns=['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
df_final