import requests
from bs4 import BeautifulSoup
data = requests.get("http://www.basketball-reference.com/leagues/NBA_2014_games.html")
soup = BeautifulSoup(data.content, "html.parser")
for link in soup.find_all("a"):
    print("<a href='%s'>%s</a>" % (link.get("href"), link.text))
I'm trying to get the links to the box scores, then run a loop and organize the data from each link into a CSV. I need to save the links as a vector and loop over them ... and that's where I'm stuck. I'm not sure this is even the right approach.
Answer 0 (score: 2)
The idea is to iterate over all links that have an href attribute (the a[href] CSS selector), construct an absolute link whenever the href value does not start with http, collect all the links into a list of lists, and dump them to a CSV with writerows():
import csv
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.basketball-reference.com'
data = requests.get("http://www.basketball-reference.com/leagues/NBA_2014_games.html")
soup = BeautifulSoup(data.content, "html.parser")

# Wrap each URL in a single-item list so writerows() emits one URL per row
links = [[urljoin(base_url, link['href']) if not link['href'].startswith('http') else link['href']]
         for link in soup.select("a[href]")]

with open('output.csv', 'w', newline='') as f:  # on Python 2: open('output.csv', 'wb')
    writer = csv.writer(f)
    writer.writerows(links)
output.csv now contains:
http://www.sports-reference.com
http://www.baseball-reference.com
http://www.sports-reference.com/cbb/
http://www.pro-football-reference.com
http://www.sports-reference.com/cfb/
http://www.hockey-reference.com/
http://www.sports-reference.com/olympics/
http://www.sports-reference.com/blog/
http://www.sports-reference.com/feedback/
http://www.basketball-reference.com/my/auth.cgi
http://twitter.com/bball_ref
...
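As a side note, urljoin() already returns absolute URLs unchanged, so the startswith('http') guard above is harmless but not strictly necessary. A quick illustration of both cases:

```python
from urllib.parse import urljoin

base = 'http://www.basketball-reference.com'

# A relative href is resolved against the base URL
print(urljoin(base, '/boxscores/'))
# -> http://www.basketball-reference.com/boxscores/

# An absolute href passes through unchanged
print(urljoin(base, 'http://twitter.com/bball_ref'))
# -> http://twitter.com/bball_ref
```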
It is not entirely clear what your desired output should be, but this should at least give you a starting point to build on.
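Since the question specifically asks for box score links, here is a minimal sketch of the filtering step, assuming (as an unverified assumption about the site's URL layout) that box score pages live under a /boxscores/ path segment. The hrefs below are hypothetical examples standing in for the values scraped from the schedule page:

```python
from urllib.parse import urljoin

base_url = 'http://www.basketball-reference.com'

def box_score_links(hrefs):
    # Keep only hrefs that look like box score pages, made absolute
    # (assumes box score URLs contain a /boxscores/ segment)
    return [urljoin(base_url, h) for h in hrefs if '/boxscores/' in h]

# Hypothetical hrefs as they might appear on the schedule page
hrefs = ['/boxscores/201310290IND.html',
         '/leagues/NBA_2014_games.html',
         'http://www.sports-reference.com']
print(box_score_links(hrefs))
# -> ['http://www.basketball-reference.com/boxscores/201310290IND.html']
```

The resulting list can then be looped over with requests.get() to fetch each box score page, mirroring the pattern in the answer above.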