LiveScore BeautifulSoup Python

Asked: 2014-10-21 16:27:48

Tags: python beautifulsoup

I am using BeautifulSoup to parse this website:

http://www.livescore.com/soccer/champions-league/

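I want to get the links from the rows that have a numeric score, such as:

FT  Zenit St. Petersburg    3 - 0   Standard Liege

Here "3 - 0" is a link. What I want to do is find every link that contains numbers (so not results like

15:45  APOEL Nicosia   ? - ?   Paris Saint Germain

), so that I can load those links and parse the minute data (<td class="min">).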

Edit: I am now able to get the links, like this:

import urllib2, re, bs4

sitioweb = urllib2.urlopen('http://www.livescore.com/soccer/champions-league/').read()
soup = bs4.BeautifulSoup(sitioweb)
href_tags = soup.find_all('a', {'class':"scorelink"})

links = []

# Iterate over every tag; starting the index at 1 would skip the first link.
for tag in href_tags:
    links.append(tag.get("href"))

print links

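Now my problem is the following: I want to write all of this to a DB (such as sqlite), including the minute of each goal (I can get that information from the links I collected). This is only possible when the score is not "? - ?", because in that case there are no goals yet.

I hope you can understand me...

Best regards, and thank you very much for your help,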

2 answers:

Answer 0: (score: 1)

The following search matches only the links you want:

import re

# Raw string so that \d reaches the regex engine unchanged.
links = soup.find_all('a', class_='scorelink', href=True,
                      text=re.compile(r'\d+ - \d+'))

The search is limited to:

  • <a> tags
  • with the class scorelink
  • that have an href attribute
  • and whose link text contains two numbers separated by a dash.
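To see what the text filter accepts and rejects without fetching the page, the same pattern can be tried on a few sample strings. This is only a minimal sketch: the sample strings are the link texts quoted in this thread, and is_played is a hypothetical helper name, not part of BeautifulSoup.

```python
import re

# Same pattern as in the search above, written as a raw string.
SCORE_RE = re.compile(r'\d+ - \d+')


def is_played(link_text):
    """Return True if the link text contains a numeric score such as '3 - 0'."""
    return SCORE_RE.search(link_text) is not None


samples = ['3 - 0', '? - ?', '0 - 1']

# Keep only the texts that look like a real score.
played = [s for s in samples if is_played(s)]
print(played)
```

Only the texts with digits on both sides of the dash survive, so fixtures that have not started yet are dropped.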

Extracting just the links is then trivial:

score_urls = [link['href'] for link in soup.find_all(
                  'a', class_='scorelink', href=True, text=re.compile(r'\d+ - \d+'))]

Demo:

>>> import re
>>> from bs4 import BeautifulSoup
>>> import requests
>>> from pprint import pprint
>>> soup = BeautifulSoup(requests.get('http://www.livescore.com/soccer/champions-league/').content)
>>> [link['href'] for link in soup.find_all('a', class_='scorelink', href=True, text=re.compile(r'\d+ - \d+'))]
['/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/', '/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/', '/soccer/champions-league/qualifying-round/apoel-nicosia-vs-aab/1-1801432/', '/soccer/champions-league/qualifying-round/bate-borisov-vs-slovan-bratislava/1-1801436/', '/soccer/champions-league/qualifying-round/celtic-vs-maribor/1-1801428/', '/soccer/champions-league/qualifying-round/fc-porto-vs-lille/1-1801444/', '/soccer/champions-league/qualifying-round/arsenal-vs-besiktas/1-1801438/', '/soccer/champions-league/qualifying-round/athletic-bilbao-vs-ssc-napoli/1-1801446/', '/soccer/champions-league/qualifying-round/bayer-leverkusen-vs-fc-koebenhavn/1-1801442/', '/soccer/champions-league/qualifying-round/malmo-ff-vs-salzburg/1-1801430/', '/soccer/champions-league/qualifying-round/pfc-ludogorets-razgrad-vs-steaua-bucuresti/1-1801434/']
>>> pprint(_)
['/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/',
 '/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/',
 '/soccer/champions-league/qualifying-round/apoel-nicosia-vs-aab/1-1801432/',
 '/soccer/champions-league/qualifying-round/bate-borisov-vs-slovan-bratislava/1-1801436/',
 '/soccer/champions-league/qualifying-round/celtic-vs-maribor/1-1801428/',
 '/soccer/champions-league/qualifying-round/fc-porto-vs-lille/1-1801444/',
 '/soccer/champions-league/qualifying-round/arsenal-vs-besiktas/1-1801438/',
 '/soccer/champions-league/qualifying-round/athletic-bilbao-vs-ssc-napoli/1-1801446/',
 '/soccer/champions-league/qualifying-round/bayer-leverkusen-vs-fc-koebenhavn/1-1801442/',
 '/soccer/champions-league/qualifying-round/malmo-ff-vs-salzburg/1-1801430/',
 '/soccer/champions-league/qualifying-round/pfc-ludogorets-razgrad-vs-steaua-bucuresti/1-1801434/']
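The hrefs in this output are relative to the site root, so they have to be turned into absolute URLs before they can be requested. A minimal sketch using the standard library's urljoin (Python 3): BASE_URL is the page URL from the question, and the two paths are copied from the output above.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.livescore.com/soccer/champions-league/'

# Two of the relative hrefs from the demo output.
score_urls = [
    '/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/',
    '/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/',
]

# A leading slash makes each href replace the whole path of BASE_URL,
# so only the scheme and host are taken from the base.
absolute_urls = [urljoin(BASE_URL, path) for path in score_urls]

for url in absolute_urls:
    print(url)
```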

Answer 1: (score: 0)

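This is also easy to do outside of BeautifulSoup. First find all the links, then filter out the ones whose text is "? - ?", then take the href attribute from each item in the cleaned list. See below.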

In [1]: from bs4 import BeautifulSoup as bsoup

In [2]: import requests as rq

In [3]: url = "http://www.livescore.com/soccer/champions-league/"

In [4]: r = rq.get(url)

In [5]: bs = bsoup(r.text)

In [6]: links = bs.find_all("a", class_="scorelink")

In [7]: links
Out[7]: 
[<a class="scorelink" href="/soccer/champions-league/group-a/atletico-madrid-vs-malmo-ff/1-1821150/" onclick="return false;">? - ?</a>,
 <a class="scorelink" href="/soccer/champions-league/group-a/olympiakos-vs-juventus/1-1821151/" onclick="return false;">? - ?</a>,
...

In [8]: links_clean = [link for link in links if link.get_text() != "? - ?"]

In [9]: links_clean
Out[9]: 
[<a class="scorelink" href="/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/" onclick="return false;">0 - 1</a>,
 <a class="scorelink" href="/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/" onclick="return false;">3 - 0</a>,
...

In [10]: links_final = [link["href"] for link in links_clean]

In [11]: links_final
Out[11]: 
['/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/',
 '/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/',
...

Extracting the minutes from each link is, of course, up to you.
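For the database part of the question, one possible shape is a single table with one row per goal. The sketch below uses the standard library's sqlite3 module (Python 3) and assumes the minutes have already been extracted from the <td class="min"> cells of each match page. The table and column names, the save_goal_minutes helper, and the sample minutes are all hypothetical; they are not data taken from the site.

```python
import sqlite3


def save_goal_minutes(conn, match_url, minutes):
    """Insert one row per goal minute for the given match URL."""
    conn.executemany(
        'INSERT INTO goals (match_url, minute) VALUES (?, ?)',
        [(match_url, minute) for minute in minutes],
    )
    conn.commit()


# In-memory database for illustration; pass a file path to persist the data.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE IF NOT EXISTS goals ('
    'id INTEGER PRIMARY KEY, match_url TEXT NOT NULL, minute TEXT NOT NULL)'
)

# Hypothetical minutes, stored as text so the raw cell content can be kept as-is.
url = '/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/'
save_goal_minutes(conn, url, ['30', '54', '82'])

rows = conn.execute('SELECT match_url, minute FROM goals ORDER BY id').fetchall()
print(len(rows))
```

If the links are filtered first, matches that still show "? - ?" never reach this function, so no empty rows are written for them.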