I'm trying to parse a table and write its data to a CSV, but BeautifulSoup does not parse the table correctly. This is the page: http://projects.fivethirtyeight.com/2016-election-forecast/arizona/
This is the code I'm using:
date=[]
pollster=[]
grade=[]
sample=[]
weight=[]
clinton=[]
trump=[]
johnson=[]
leader=[]
adjusted=[]
import requests
from bs4 import BeautifulSoup
url='http://projects.fivethirtyeight.com/2016-election-forecast/florida/'
r = requests.get(url)
soup=BeautifulSoup(r.content,"lxml")
the_table=soup.find("table", attrs={"class":"t-desktop t-polls"})
rows = the_table.tbody.find_all('tr')
for row in rows:
    if 'data-created' in row.attrs:
        cols = row.find_all('td')
        text_cols = [ele.text.strip() for ele in cols]
        date.append(text_cols[2])
        pollster.append(text_cols[3])
        grade.append(text_cols[4])
        sample.append(text_cols[5])
        weight.append(text_cols[6])
        clinton.append(text_cols[7])
        trump.append(text_cols[8])
        johnson.append(text_cols[9])
        leader.append(text_cols[10])
        adjusted.append(text_cols[11])
import pandas as pd
df=pd.DataFrame(date,columns=['date'])
df['pollster']=pollster
df['grade']=grade
df['sample']=sample
df['weight']=weight
df['clinton']=clinton
df['trump']=trump
df['johnson']=johnson
df['leader']=leader
df['adjusted']=adjusted
from urllib.parse import urlparse
s=urlparse(url)
import os
f=os.getcwd()+"/"+s.path.split('/')[-2] + '.csv'
df.to_csv(f)
It saves the CSV with the wrong data:
,date ,pollster ,grade,sample ,weight,clinton,trump,johnson,leader ,adjusted
0,Aug. 21-27,USC Dornsife/LA Times, ,"2,545",LV ,44% ,44% , ,Clinton +1 ,Clinton +4
1,Aug. 24-26,Morning Consult , ,"2,007",RV ,39% ,37% ,8% ,Clinton +2 ,Clinton +2
2,Aug. 20-26,USC Dornsife/LA Times, ,"2,460",LV ,45% ,43% , ,Clinton +1 ,Clinton +5
3,Aug. 19-25,Ipsos ,A- ,334 ,LV ,50% ,43% , ,Clinton +7 ,Clinton +7
4,Aug. 19-25,Ipsos ,A- ,500 ,LV ,53% ,31% , ,Clinton +22,Clinton +22
5,Aug. 19-25,Ipsos ,A- ,443 ,LV ,32% ,45% , ,Trump +13 ,Trump +13
6,Aug. 19-25,Ipsos ,A- ,518 ,LV ,61% ,25% , ,Clinton +36,Clinton +36
7,Aug. 19-25,Ipsos ,A- ,392 ,LV ,47% ,41% , ,Clinton +7 ,Clinton +7
8,Aug. 19-25,Ipsos ,A- ,666 ,LV ,49% ,42% , ,Clinton +7 ,Clinton +7
and so on.....
If I change the BeautifulSoup parser, the result is still parsed incorrectly. If instead I manually save the table copied with the Chrome inspector or Firefox Firebug, it works, and the CSV with the correct data is generated:
,date ,pollster,grade ,sample,weight,clinton,trump,johnson,leader ,adjusted
0 ,Ipsos ,A- ,362 ,LV ,0.67 ,43% ,46% , ,Trump +3 ,Trump +3
1 ,CNN/Opinion Research Corp. ,A- ,809 ,LV ,1.40 ,38% ,45% ,12% ,Trump +7 ,Trump +7
2 ,Ipsos ,A- ,438 ,LV ,0.25 ,39% ,47% , ,Trump +8 ,Trump +8
3 ,YouGov ,B ,"1,095",LV ,0.65 ,42% ,44% ,5% ,Trump +2 ,Trump +1
4 ,OH Predictive Insights / MBQF,C+ ,996 ,LV ,0.44 ,45% ,42% ,4% ,Clinton +3,Clinton +2
5 ,Integrated Web Strategy , ,679 ,LV ,0.35 ,41% ,49% ,3% ,Trump +8 ,Trump +5
6 ,Public Policy Polling ,B+ ,691 ,V ,0.49 ,40% ,44% , ,Trump +4 ,Trump +1
7 ,OH Predictive Insights / MBQF,C+ ,"1,060",LV ,0.16 ,47% ,42% , ,Clinton +4,Clinton +4
8 ,Greenberg Quinlan Rosner ,B- ,300 ,LV ,0.23 ,39% ,45% ,10% ,Trump +6 ,Trump +6
9 ,Public Policy Polling ,B+ ,896 ,V ,0.20 ,38% ,40% ,6% ,Trump +2 ,Tie
10,Behavior Research Center ,A ,564 ,RV ,0.16 ,42% ,35% , ,Clinton +7,Clinton +5
11,Merrill Poll ,B ,701 ,LV ,0.11 ,38% ,38% , ,Tie ,Tie
12,Strategies 360 ,B ,504 ,LV ,0.03 ,42% ,44% , ,Trump +2 ,Tie
Why does the full HTML downloaded from the web make BeautifulSoup parse it incorrectly?
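For reference, one way to see what BeautifulSoup is actually working with is to print a raw poll row straight from the downloaded HTML and compare it with the row the browser renders (a sketch that reuses the table lookup from the code above):

import requests
from bs4 import BeautifulSoup

url = 'http://projects.fivethirtyeight.com/2016-election-forecast/arizona/'
soup = BeautifulSoup(requests.get(url).content, "lxml")
the_table = soup.find("table", attrs={"class": "t-desktop t-polls"})
first_row = the_table.tbody.find("tr", attrs={"data-created": True})
# If these cells differ from what the browser shows (e.g. the weight column is
# empty and the other values are shifted), the table is being filled in
# client-side by JavaScript, so the static HTML that requests downloads never
# contains the rendered values.
print([td.get_text(strip=True) for td in first_row.find_all("td")])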
[EDIT: solved]
This code uses a regular expression to extract the JSON object race.stateData from the script tag. From there the data can finally be parsed.
import re
import json
import requests
from bs4 import BeautifulSoup

url = 'http://projects.fivethirtyeight.com/2016-election-forecast/arizona/'
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
# The poll data lives in an inline <script> tag, not in the table cells.
script = soup.body.script.text
script = script.replace("\n", "")
# Capture everything between "race.stateData = " and ";race.path" as JSON.
re_match = re.match(r'.*race\.stateData = (.*);race\.path', script)
str_json = re_match.group(1)
j = json.loads(str_json)
#parsing data code not relevant..
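To get a feel for what the extracted object contains before writing any parsing code, it can help to dump its top level first (just a sketch; the actual key names inside race.stateData are not shown in this post, so nothing below assumes a particular structure):

# Inspect the extracted object without assuming anything about its layout.
print(type(j))
if isinstance(j, dict):
    print(sorted(j.keys()))
# Pretty-print the beginning of the object to locate the poll entries.
print(json.dumps(j, indent=2)[:1000])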
Answer 0 (score: 0)
As you can see in the comments, I solved this by using a regular expression to extract the JSON object race.stateData from the script tag. From there the data can finally be parsed.