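I'm new to Python and am working on a learning project in which I'm trying to scrape some data on college football players. The page source looks like this: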
</thead>
<tbody>
> <tr ><th scope="row" class="right " data-stat="year_id" ><a
> href="/cfb/years/1957.html">1957</a></th><td class="left "
> data-stat="school_name" csk="San Jose State.1957" ><a
> href="/cfb/schools/san-jose-state/1957.html">San Jose
> State</a></td><td class="left " data-stat="conf_abbr" ><a
> href="/cfb/conferences/independent/1957.html">Ind</a></td><td
> class="center " data-stat="class" ></td><td class="center "
> data-stat="pos" >RB</td><td class="right " data-stat="g" >10</td><td
> class="right " data-stat="rec" >1</td><td class="right "
> data-stat="rec_yds" >6</td><td class="right "
> data-stat="rec_yds_per_rec" >6.0</td><td class="right "
> data-stat="rec_td" >0</td><td class="right " data-stat="rush_att"
> >1</td><td class="right " data-stat="rush_yds" >3</td><td class="right " data-stat="rush_yds_per_att" >3.0</td><td class="right "
> data-stat="rush_td" >0</td><td class="right " data-stat="scrim_att"
> >2</td><td class="right " data-stat="scrim_yds" >9</td><td class="right " data-stat="scrim_yds_per_att" >4.5</td><td class="right
> " data-stat="scrim_td" >0</td></tr>
Here is how far I've gotten with the code:
headers = [item["data-stat"] for item in soup.find_all(attrs={"data-stat" : True})]
cellStrings = [cell.find(text = True) for cell in soup.findAll('td')]
print headers, cellStrings
This prints the following:
[u'', u'header_receiving', u'header_rushing', u'header_scrimmage', u'year_id', u'school_name', u'conf_abbr', u'class', u'pos', u'g', u'rec', u'rec_yds', u'rec_yds_per_rec', u'rec_td', u'rush_att', u'rush_yds', u'rush_yds_per_att', u'rush_td', u'scrim_att', u'scrim_yds', u'scrim_yds_per_att', u'scrim_td', u'year_id', u'school_name', u'conf_abbr', u'class', u'pos', u'g', u'rec', u'rec_yds', u'rec_yds_per_rec', u'rec_td', u'rush_att', u'rush_yds', u'rush_yds_per_att', u'rush_td', u'scrim_att', u'scrim_yds', u'scrim_yds_per_att', u'scrim_td', u'year_id', u'school_name', u'conf_abbr', u'class', u'pos', u'g', u'rec', u'rec_yds', u'rec_yds_per_rec', u'rec_td', u'rush_att', u'rush_yds', u'rush_yds_per_att', u'rush_td', u'scrim_att', u'scrim_yds', u'scrim_yds_per_att', u'scrim_td'] [u'San Jose State', u'Ind', None, u'RB', u'10', u'1', u'6', u'6.0', u'0', u'1', u'3', u'3.0', u'0', u'2', u'9', u'4.5', u'0', u'San Jose State', None, None, None, None, u'1', u'6', u'6.0', u'0', u'1', u'3', u'3.0', u'0', u'2', u'9', u'4.5', u'0']
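The problem is that the source also contains header cells that carry a data-stat attribute, so the two lists (headers and data) don't line up.
My question is: how can I extract each data-stat together with its associated value, instead of pulling them separately? Ideally I would end up with a dictionary.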
Answer 0 (score: 0)
If I understand you correctly, you want a dictionary of the form {'data-stat-value': 'value of td'}; you can build it like this:
# soup is the BeautifulSoup object from the question
data_stats = {e['data-stat']: e.get_text().strip()
              for e in soup.find_all(attrs={'data-stat': True})}
This way each key is a data-stat attribute value, and each value is the text of the element that carries that attribute.
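One caveat: because every table row repeats the same data-stat keys, a single dictionary over the whole page keeps only the last row's values, and it also picks up the header cells. A minimal sketch of a per-row variant follows, assuming Python 3 and the bs4 package. The sample_html string is a shortened stand-in for the page in the question, not the real source, and the names sample_html and rows are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the page source in the question.
sample_html = """
<table>
<thead><tr>
<th data-stat="year_id">Year</th>
<th data-stat="school_name">School</th>
<th data-stat="pos">Pos</th>
</tr></thead>
<tbody>
<tr><th scope="row" data-stat="year_id"><a href="/cfb/years/1957.html">1957</a></th>
<td data-stat="school_name"><a href="/cfb/schools/san-jose-state/1957.html">San Jose State</a></td>
<td data-stat="pos">RB</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
# Only look at rows inside <tbody>, so the <thead> header cells are skipped.
for tr in soup.select("tbody tr"):
    # The year sits in a <th scope="row">, the rest in <td>; take both.
    cells = tr.find_all(["th", "td"], attrs={"data-stat": True})
    # One dictionary per row: data-stat name -> cell text.
    rows.append({cell["data-stat"]: cell.get_text(strip=True) for cell in cells})

print(rows)
```

Each entry in rows is then one season, so rows[0]["school_name"] gives the school for the first row. Empty cells such as class come back as empty strings rather than None.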