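I want to scrape data using a Python script

Time: 2016-09-09 12:37:49

Tags: python html5 python-3.x beautifulsoup python-3.4

I wrote a Python script to scrape data from http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen-rankings. The page is a list of 100 players, and I scraped the data successfully. The problem is that when I run the script, it scrapes the same data three times instead of once.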

<div class="cb-col cb-col-100 cb-font-14 cb-lst-itm text-center">
  <div class="cb-col cb-col-16 cb-rank-tbl cb-font-16">1</div>
  <div class="cb-col cb-col-50 cb-lst-itm-sm text-left">
    <div class="cb-col cb-col-33">
      <div class="cb-col cb-col-50">
        <span class=" cb-ico" style="position:absolute;"></span>&nbsp;&nbsp;&nbsp;&nbsp;–
      </div>
      <div class="cb-col cb-col-50">
        <img src="http://i.cricketcb.com/i/stats/fw/50x50/img/faceImages/2250.jpg" class="img-responsive cb-rank-plyr-img">
      </div>
    </div>
    <div class="cb-col cb-col-67 cb-rank-plyr">
      <a class="text-hvr-underline text-bold cb-font-16" href="/profiles/2250/steven-smith" title="Steven Smith's Profile">Steven Smith</a>
      <div class="cb-font-12 text-gray">AUSTRALIA</div>
    </div>
  </div>
  <div class="cb-col cb-col-17 cb-rank-tbl">906</div>
  <div class="cb-col cb-col-17 cb-rank-tbl">1</div>
</div>

Here is the Python script I wrote to scrape each player's data.

import requests
from bs4 import BeautifulSoup

url = "http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen-rankings"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# Matches every div with the class "text-center" anywhere on the page
maindiv = soup.find_all("div", {"class": "text-center"})
for div in maindiv:
    print(div.text)

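But instead of scraping the data once, it scrapes the same data three times.

What single change can I make so that the data is retrieved only once?

2 answers:

Answer 0: (score: 1)

Select the table and look for the divs inside it: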

maindiv = soup.select("#batsmen-tests div.text-center")
for div in maindiv:
    print(div.text)

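Both your original output and the output above return all the text in each div as a single line, which is not really useful. If you just want the player names: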

anchors = soup.select("#batsmen-tests div.cb-rank-plyr a")
for a in anchors:
    print(a.text)

A quick way to get the data in a simple CSV format is to take the text directly owned by each child element:

maindiv = soup.select("#batsmen-tests div.text-center")
for d in maindiv[1:]:
    row_data = u",".join(s.strip() for s in filter(None, (t.find(text=True, recursive=False) for t in d.find_all())))
    if row_data:
        print(row_data)

The output now looks like this:

# rank, up/down, name, country, rating, best rank
1,–,Steven Smith,AUSTRALIA,906,1
2,–,Joe Root,ENGLAND,878,1
3,–,Kane Williamson,NEW ZEALAND,876,1
4,–,Hashim Amla,SOUTH AFRICA,847,1
5,–,Younis Khan,PAKISTAN,845,1
6,–,Adam Voges,AUSTRALIA,802,5
7,–,AB de Villiers,SOUTH AFRICA,802,1
8,–,Ajinkya Rahane,INDIA,785,8
9,2,David Warner,AUSTRALIA,772,3
10,–,Alastair Cook,ENGLAND,770,2
11,1,Misbah-ul-Haq,PAKISTAN,764,6

Instead of:

PositionPlayerRatingBest Rank
Player
1    –Steven SmithAUSTRALIA9061
2    –Joe RootENGLAND8781
3    –Kane WilliamsonNEW ZEALAND8761
4    –Hashim AmlaSOUTH AFRICA8471
5    –Younis KhanPAKISTAN8451
6    –Adam VogesAUSTRALIA8025
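If a field could ever contain a comma, joining with "," would produce a broken row. As a rough sketch, the same idea can be written with explicit selectors and the standard csv module. The markup below is trimmed from the snippet in the question; the wrapping div with id "batsmen-tests" is taken from the selector above and is assumed to surround the rows, and the helper names row_to_fields and rows_to_csv are made up for illustration.

```python
import csv
import io

from bs4 import BeautifulSoup

# One ranking row, trimmed from the snippet in the question. The outer
# "batsmen-tests" wrapper is an assumption based on the selector in this answer.
ROW_HTML = """
<div id="batsmen-tests">
<div class="cb-col cb-col-100 cb-font-14 cb-lst-itm text-center">
  <div class="cb-col cb-col-16 cb-rank-tbl cb-font-16">1</div>
  <div class="cb-col cb-col-50 cb-lst-itm-sm text-left">
    <div class="cb-col cb-col-67 cb-rank-plyr">
      <a class="text-hvr-underline text-bold cb-font-16" href="/profiles/2250/steven-smith">Steven Smith</a>
      <div class="cb-font-12 text-gray">AUSTRALIA</div>
    </div>
  </div>
  <div class="cb-col cb-col-17 cb-rank-tbl">906</div>
  <div class="cb-col cb-col-17 cb-rank-tbl">1</div>
</div>
</div>
"""


def row_to_fields(row):
    """Return [rank, name, country, rating, best_rank] for one ranking row."""
    # The three cb-rank-tbl cells are rank, rating and best rank, in that order.
    rank, rating, best = (c.get_text(strip=True) for c in row.select("div.cb-rank-tbl"))
    name = row.select_one("div.cb-rank-plyr a").get_text(strip=True)
    country = row.select_one("div.cb-rank-plyr div.text-gray").get_text(strip=True)
    return [rank, name, country, rating, best]


def rows_to_csv(html):
    """Build CSV text from every player row inside the #batsmen-tests container."""
    soup = BeautifulSoup(html, "html.parser")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["rank", "name", "country", "rating", "best_rank"])
    for row in soup.select("#batsmen-tests div.text-center"):
        # Assumption: header rows carry no player link, so they are skipped.
        if row.select_one("div.cb-rank-plyr a") is None:
            continue
        writer.writerow(row_to_fields(row))
    return out.getvalue()


csv_text = rows_to_csv(ROW_HTML)
print(csv_text)
```

To run it against the live page, pass r.content from the request to rows_to_csv instead of ROW_HTML. Whether the real header row lacks a player link has not been checked here, so treat that filter as a starting point.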

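Answer 1: (score: -1)

You get the output three times because the site has three categories. You have to select one of them first, and then you can work with it.

The simplest way to do this in code is to add one line: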

import requests
from bs4 import BeautifulSoup

url = "http://www.cricbuzz.com/cricket-stats/icc-rankings/batsmen-rankings"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# Added line: restrict the search to the container for one category
specific_div = soup.find_all("div", {"id": "batsmen-tests"})
maindiv = specific_div[0].find_all("div", {"class": "text-center"})
for div in maindiv:
    print(div.text)

This returns only the Test batsmen, without the repetition. For the other categories, just change the "id" on the specific_div line.
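To see why scoping by id removes the repetition, here is a small offline sketch on a made-up miniature page. Only the id "batsmen-tests" comes from the answers in this thread; the other two ids and the placeholder row text are hypothetical and are not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: three ranking containers, each holding a
# "text-center" row. Only the id "batsmen-tests" appears in the thread; the
# other two ids and the row text are invented for illustration.
PAGE = """
<div id="batsmen-tests"><div class="text-center">1 Steven Smith (Tests)</div></div>
<div id="batsmen-odis"><div class="text-center">1 Player A (ODIs)</div></div>
<div id="batsmen-t20s"><div class="text-center">1 Player B (T20s)</div></div>
"""

soup = BeautifulSoup(PAGE, "html.parser")

# Unscoped search: matches the row in every container, hence the repeated output.
all_rows = soup.find_all("div", {"class": "text-center"})

# Scoped search: locate one container by id, then search only inside it.
container = soup.find("div", {"id": "batsmen-tests"})
scoped_rows = container.find_all("div", {"class": "text-center"}) if container else []

print(len(all_rows), len(scoped_rows))
for row in scoped_rows:
    print(row.get_text(strip=True))
```

Using soup.find and checking for None, rather than indexing find_all(...)[0], avoids an IndexError if the id is missing or the page layout changes.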