Question

I am trying to parse a table and write its data to a CSV file, but BeautifulSoup does not parse the table correctly. This is the page: http://projects.fivethirtyeight.com/2016-election-forecast/arizona/

This is the code I am using:

date=[]
pollster=[]
grade=[]
sample=[]
weight=[]
clinton=[]
trump=[]
johnson=[]
leader=[]
adjusted=[]

import requests
from bs4 import BeautifulSoup
url='http://projects.fivethirtyeight.com/2016-election-forecast/florida/'
r = requests.get(url)
soup=BeautifulSoup(r.content,"lxml")
the_table=soup.find("table", attrs={"class":"t-desktop t-polls"})
rows = the_table.tbody.find_all('tr')
for row in rows:
    if 'data-created' in row.attrs:
        cols = row.find_all('td')
        text_cols = [ele.text.strip() for ele in cols]
        date.append(text_cols[2])
        pollster.append(text_cols[3])
        grade.append(text_cols[4])
        sample.append(text_cols[5])
        weight.append(text_cols[6])
        clinton.append(text_cols[7])
        trump.append(text_cols[8])
        johnson.append(text_cols[9])
        leader.append(text_cols[10])
        adjusted.append(text_cols[11])

import pandas as pd
df=pd.DataFrame(date,columns=['date'])
df['pollster']=pollster
df['grade']=grade
df['sample']=sample
df['weight']=weight
df['clinton']=clinton
df['trump']=trump
df['johnson']=johnson
df['leader']=leader
df['adjusted']=adjusted
from urllib.parse import urlparse
s=urlparse(url)
import os
f=os.getcwd()+"/"+s.path.split('/')[-2] + '.csv'
df.to_csv(f)

It saves a CSV with the wrong data:

,date      ,pollster             ,grade,sample ,weight,clinton,trump,johnson,leader     ,adjusted
0,Aug. 21-27,USC Dornsife/LA Times,     ,"2,545",LV    ,44%    ,44%  ,       ,Clinton +1 ,Clinton +4
1,Aug. 24-26,Morning Consult      ,     ,"2,007",RV    ,39%    ,37%  ,8%     ,Clinton +2 ,Clinton +2
2,Aug. 20-26,USC Dornsife/LA Times,     ,"2,460",LV    ,45%    ,43%  ,       ,Clinton +1 ,Clinton +5
3,Aug. 19-25,Ipsos                ,A-   ,334    ,LV    ,50%    ,43%  ,       ,Clinton +7 ,Clinton +7
4,Aug. 19-25,Ipsos                ,A-   ,500    ,LV    ,53%    ,31%  ,       ,Clinton +22,Clinton +22
5,Aug. 19-25,Ipsos                ,A-   ,443    ,LV    ,32%    ,45%  ,       ,Trump +13  ,Trump +13
6,Aug. 19-25,Ipsos                ,A-   ,518    ,LV    ,61%    ,25%  ,       ,Clinton +36,Clinton +36
7,Aug. 19-25,Ipsos                ,A-   ,392    ,LV    ,47%    ,41%  ,       ,Clinton +7 ,Clinton +7
8,Aug. 19-25,Ipsos                ,A-   ,666    ,LV    ,49%    ,42%  ,       ,Clinton +7 ,Clinton +7
and so on.....

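Changing the BeautifulSoup parser does not help; the result is still wrong. If I instead manually save the table copied from the Chrome inspector or Firefox Firebug and parse that, it works. Here is the CSV with the correct data produced that way: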

  ,date                         ,pollster,grade  ,sample,weight,clinton,trump,johnson,leader    ,adjusted
0 ,Ipsos                        ,A-      ,362    ,LV    ,0.67  ,43%    ,46%  ,       ,Trump +3  ,Trump +3
1 ,CNN/Opinion Research Corp.   ,A-      ,809    ,LV    ,1.40  ,38%    ,45%  ,12%    ,Trump +7  ,Trump +7
2 ,Ipsos                        ,A-      ,438    ,LV    ,0.25  ,39%    ,47%  ,       ,Trump +8  ,Trump +8
3 ,YouGov                       ,B       ,"1,095",LV    ,0.65  ,42%    ,44%  ,5%     ,Trump +2  ,Trump +1
4 ,OH Predictive Insights / MBQF,C+      ,996    ,LV    ,0.44  ,45%    ,42%  ,4%     ,Clinton +3,Clinton +2
5 ,Integrated Web Strategy      ,        ,679    ,LV    ,0.35  ,41%    ,49%  ,3%     ,Trump +8  ,Trump +5
6 ,Public Policy Polling        ,B+      ,691    ,V     ,0.49  ,40%    ,44%  ,       ,Trump +4  ,Trump +1
7 ,OH Predictive Insights / MBQF,C+      ,"1,060",LV    ,0.16  ,47%    ,42%  ,       ,Clinton +4,Clinton +4
8 ,Greenberg Quinlan Rosner     ,B-      ,300    ,LV    ,0.23  ,39%    ,45%  ,10%    ,Trump +6  ,Trump +6
9 ,Public Policy Polling        ,B+      ,896    ,V     ,0.20  ,38%    ,40%  ,6%     ,Trump +2  ,Tie
10,Behavior Research Center     ,A       ,564    ,RV    ,0.16  ,42%    ,35%  ,       ,Clinton +7,Clinton +5
11,Merrill Poll                 ,B       ,701    ,LV    ,0.11  ,38%    ,38%  ,       ,Tie       ,Tie
12,Strategies 360               ,B       ,504    ,LV    ,0.03  ,42%    ,44%  ,       ,Trump +2  ,Tie

Why does BeautifulSoup parse the table incorrectly when it is given the full HTML fetched from the web?

[Edit: solved] This code uses a regular expression to extract the JSON object race.stateData from the script tag. The data can then be parsed.

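import json
import re

r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
# Take the text of the first <script> tag inside <body>
script = soup.body.script.text
# Remove newlines so that '.' in the regex can match across the whole script
script = script.replace("\n", "")
re_match = re.match(r'.*race\.stateData = (.*);race\.path', script)
str_json = re_match.group(1)
j = json.loads(str_json)
# code that processes the parsed data is not relevant here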
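
The extraction step can be tried without network access. The following is only a minimal sketch using the standard library: SAMPLE_HTML is a made-up stand-in for the downloaded page, the keys inside the JSON object are assumptions, and extract_state_data is a hypothetical helper name. Only race.stateData and race.path come from the code above. It uses re.DOTALL with a non-greedy group instead of stripping newlines, and returns None when the pattern is missing rather than failing on .group().

```python
import json
import re

# Made-up stand-in for r.text; the real page is far larger and may be laid out differently.
SAMPLE_HTML = """
<html><body>
<script>
var race = {};
race.stateData = {"state": "AZ", "polls": [{"pollster": "Ipsos", "clinton": 43, "trump": 46}]};
race.path = "arizona";
</script>
</body></html>
"""


def extract_state_data(html):
    """Return the object assigned to race.stateData, or None if it is not found."""
    # DOTALL lets '.' span newlines, so the text does not need to be flattened first.
    # The non-greedy group stops at the first ';' that is followed by race.path.
    match = re.search(r"race\.stateData\s*=\s*(.*?);\s*race\.path", html, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


state_data = extract_state_data(SAMPLE_HTML)
print(state_data["polls"][0]["pollster"])
```

Checking for None before calling json.loads makes it easier to notice when the page layout changes and the pattern no longer matches.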

Answer 1

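As you can see in the comments, I solved this by using a regular expression to extract the JSON object race.stateData from the script tag. The data can then be parsed.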
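
Once the JSON has been loaded, writing it out as CSV might look like the sketch below. The layout of race.stateData is not shown in the post, so the list of dicts and its keys here are purely hypothetical (the values are borrowed from the question's sample output), and polls_to_csv is an illustrative helper, not part of any library.

```python
import csv
import io

# Hypothetical records; the real race.stateData structure is not shown in the post.
polls = [
    {"pollster": "Ipsos", "grade": "A-", "sample": 362, "clinton": "43%", "trump": "46%"},
    {"pollster": "YouGov", "grade": "B", "sample": 1095, "clinton": "42%", "trump": "44%"},
]


def polls_to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text, one row per poll."""
    buffer = io.StringIO(newline="")
    # extrasaction="ignore" skips any keys that are not listed in fieldnames.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = polls_to_csv(polls, ["pollster", "grade", "sample", "clinton", "trump"])
print(csv_text)
```

To write to disk instead, the same DictWriter can be pointed at a file opened with open(path, "w", newline="").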
