I'm using BeautifulSoup to parse this site:
http://www.livescore.com/soccer/champions-league/
I want to get the links from the rows that contain numbers, for example:
FT Zenit St. Petersburg 3 - 0 Standard Liege
The "3 - 0" is a link; what I want is to find every link that contains numbers (so not fixtures like 15:45 APOEL Nicosia ? - ? Paris Saint Germain), so that I can load those links and parse the minute data (<td class="min">).
Hello!!! I need to edit my question: I am now able to get the links, like this:
import urllib2
import bs4

sitioweb = urllib2.urlopen('http://www.livescore.com/soccer/champions-league/').read()
soup = bs4.BeautifulSoup(sitioweb)

# Collect the href of every score link; note that iterating with
# xrange(1, len(href_tags)) would skip the first link.
href_tags = soup.find_all('a', {'class': 'scorelink'})
links = [tag.get('href') for tag in href_tags]
print links
Now my problem is the following: I would like to write all of this into a DB (sqlite, for example), including the minute of each goal (information I can get from the links I collected). But that is only possible when the score is not ? - ?, because in that case there are no goals yet.
I hope you can understand me...
Best regards, and many thanks for your help,
马
Answer 0 (score: 1)
The following search matches only the links you want:
import re
links = soup.find_all('a', class_='scorelink', href=True,
                      text=re.compile(r'\d+ - \d+'))
The search is limited to <a> tags that have the class scorelink, an href attribute, and text matching a digits-dash-digits score. Extracting just the links is then trivial:
score_urls = [link['href'] for link in soup.find_all(
    'a', class_='scorelink', href=True, text=re.compile(r'\d+ - \d+'))]
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> from pprint import pprint
>>> soup = BeautifulSoup(requests.get('http://www.livescore.com/soccer/champions-league/').content)
>>> [link['href'] for link in soup.find_all('a', class_='scorelink', href=True, text=re.compile('\d+ - \d+'))]
['/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/', '/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/', '/soccer/champions-league/qualifying-round/apoel-nicosia-vs-aab/1-1801432/', '/soccer/champions-league/qualifying-round/bate-borisov-vs-slovan-bratislava/1-1801436/', '/soccer/champions-league/qualifying-round/celtic-vs-maribor/1-1801428/', '/soccer/champions-league/qualifying-round/fc-porto-vs-lille/1-1801444/', '/soccer/champions-league/qualifying-round/arsenal-vs-besiktas/1-1801438/', '/soccer/champions-league/qualifying-round/athletic-bilbao-vs-ssc-napoli/1-1801446/', '/soccer/champions-league/qualifying-round/bayer-leverkusen-vs-fc-koebenhavn/1-1801442/', '/soccer/champions-league/qualifying-round/malmo-ff-vs-salzburg/1-1801430/', '/soccer/champions-league/qualifying-round/pfc-ludogorets-razgrad-vs-steaua-bucuresti/1-1801434/']
>>> pprint(_)
['/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/',
'/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/',
'/soccer/champions-league/qualifying-round/apoel-nicosia-vs-aab/1-1801432/',
'/soccer/champions-league/qualifying-round/bate-borisov-vs-slovan-bratislava/1-1801436/',
'/soccer/champions-league/qualifying-round/celtic-vs-maribor/1-1801428/',
'/soccer/champions-league/qualifying-round/fc-porto-vs-lille/1-1801444/',
'/soccer/champions-league/qualifying-round/arsenal-vs-besiktas/1-1801438/',
'/soccer/champions-league/qualifying-round/athletic-bilbao-vs-ssc-napoli/1-1801446/',
'/soccer/champions-league/qualifying-round/bayer-leverkusen-vs-fc-koebenhavn/1-1801442/',
'/soccer/champions-league/qualifying-round/malmo-ff-vs-salzburg/1-1801430/',
'/soccer/champions-league/qualifying-round/pfc-ludogorets-razgrad-vs-steaua-bucuresti/1-1801434/']
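To see why the text= filter excludes unplayed matches without hitting the live site, here is a small self-contained sketch; the two-link HTML snippet is made up to imitate the page's markup:

```python
import re
from bs4 import BeautifulSoup

# Made-up snippet imitating livescore's scorelink markup
html = """
<a class="scorelink" href="/match/played/">3 - 0</a>
<a class="scorelink" href="/match/unplayed/">? - ?</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Only links whose text looks like "N - N" survive the regex filter;
# "? - ?" has no digits, so the unplayed fixture is dropped.
urls = [a["href"] for a in soup.find_all(
    "a", class_="scorelink", href=True, text=re.compile(r"\d+ - \d+"))]
print(urls)  # → ['/match/played/']
```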
Answer 1 (score: 0)
This is also easy to do without leaning on BeautifulSoup's filters. First find all the links, then filter out the ones whose text is ? - ?, and finally get the href attribute from each item in the cleaned-up list. See below.
In [1]: from bs4 import BeautifulSoup as bsoup
In [2]: import requests as rq
In [3]: url = "http://www.livescore.com/soccer/champions-league/"
In [4]: r = rq.get(url)
In [5]: bs = bsoup(r.text)
In [6]: links = bs.find_all("a", class_="scorelink")
In [7]: links
Out[7]:
[<a class="scorelink" href="/soccer/champions-league/group-a/atletico-madrid-vs-malmo-ff/1-1821150/" onclick="return false;">? - ?</a>,
<a class="scorelink" href="/soccer/champions-league/group-a/olympiakos-vs-juventus/1-1821151/" onclick="return false;">? - ?</a>,
...
In [8]: links_clean = [link for link in links if link.get_text() != "? - ?"]
In [9]: links_clean
Out[9]:
[<a class="scorelink" href="/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/" onclick="return false;">0 - 1</a>,
<a class="scorelink" href="/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/" onclick="return false;">3 - 0</a>,
...
In [10]: links_final = [link["href"] for link in links_clean]
In [11]: links_final
Out[11]:
['/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/',
'/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/',
...
Extracting the minutes from each link is, of course, up to you.
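To tie this back to the question's goal of storing results in sqlite together with the goal minutes, here is a minimal sketch using the standard-library sqlite3 module. The extract_minutes helper assumes each match page marks minutes with <td class="min"> cells, as the question mentions, but that markup is not verified here; the table name and schema are my own invention:

```python
import sqlite3
from bs4 import BeautifulSoup

def extract_minutes(html):
    """Pull the text of every <td class="min"> cell from a match page.
    The markup is assumed from the question, not verified against the site."""
    soup = BeautifulSoup(html, "html.parser")
    return [td.get_text(strip=True) for td in soup.find_all("td", class_="min")]

def store_match(conn, url, minutes):
    """Insert one match URL with its goal minutes (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS matches (url TEXT PRIMARY KEY, minutes TEXT)")
    conn.execute(
        "INSERT OR REPLACE INTO matches (url, minutes) VALUES (?, ?)",
        (url, ",".join(minutes)))
    conn.commit()

# Example with a made-up page snippet and an in-memory database
page = '<table><td class="min">12\'</td><td class="min">67\'</td></table>'
conn = sqlite3.connect(":memory:")  # use a file path for a persistent DB
store_match(conn, "/soccer/champions-league/some-match/", extract_minutes(page))
print(conn.execute("SELECT minutes FROM matches").fetchone()[0])  # → 12',67'
```

In real use you would fetch each filtered score URL, pass the response body to extract_minutes, and store the result; fixtures still showing ? - ? never make it into the link list, so the "no goals yet" case never reaches the database.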