Using BeautifulSoup, I have:
from bs4 import BeautifulSoup
import urllib2

url = "http://scores.espn.go.com/ncb/playbyplay?gameId=400551234"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
tr_tags = soup.findAll("tr", attrs={"class": True})
for tag in tr_tags:
    if "even" in tag["class"]: td_tagsa = soup.findAll("td")
    if "odd" in tag["class"]: td_tagsb = soup.findAll("td")
td_tagsa.extend(td_tagsb)
td_tags = td_tagsa
a = ''.join(td_tags.stripped_strings)
At this point, trying the stripped_strings command gives me the error:

'list' object has no attribute 'stripped_strings'
However, when I try to join the elements into a str without stripping the HTML:

a = ''.join(td_tags)

TypeError: sequence item 0: expected string, Tag found
It seems like once BeautifulSoup outputs a list, the HTML gets locked in. Is there any way to get rid of the HTML tags once I've started using the findAll command?
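(For reference, both failures can be reproduced with any throwaway snippet of markup; this small sketch, with invented HTML, is not from the original question:)

from bs4 import BeautifulSoup

soup = BeautifulSoup("<td>19:45</td><td>Jump ball</td>", "html.parser")
td_tags = soup.findAll("td")  # a ResultSet of Tag objects, not strings
# td_tags.stripped_strings   --> AttributeError: the result list has no .stripped_strings
# ''.join(td_tags)           --> TypeError: sequence item 0: expected string, Tag found

(stripped_strings exists on an individual Tag, not on the list-like object that findAll returns.)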
Answer 0 (score: 1)
First, download and install the Requests library: Link. It's generally easier to work with than urllib2, and it comes with all sorts of goodies that are useful for scraping.
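For instance, the whole fetch from the question collapses to a single call (a sketch; the URL is the one from the question):

import requests
page = requests.get("http://scores.espn.go.com/ncb/playbyplay?gameId=400551234").content
## equivalent to: urllib2.urlopen(url).read()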
Second, you should get a grasp of the basics of BeautifulSoup, and of list comprehensions in general. I'm assuming you haven't bothered with the documentation, because if you had, you would know that get_text() gets the text inside an element.
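To illustrate get_text() with a minimal sketch (the markup here is invented, not taken from the game page):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<td>19:45</td><td>Jump ball won by UK</td>", "html.parser")
td = soup.find("td")                                 # a single Tag
print(td.get_text())                                 # 19:45
texts = [t.get_text() for t in soup.findAll("td")]   # a list comprehension over every match
print(", ".join(texts))                              # 19:45, Jump ball won by UK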
That said, my code is below. I used the Requests and csv libraries. It's more or less an advanced version of what you did; notice how it writes the results straight to a file. Make sure you read and understand the comments. I've basically done the work for you, so the least you can do is go through the code and understand every line of it.
from bs4 import BeautifulSoup as bsoup
import requests as rq
import csv

## Create a soup from the URL's markup.
url = "http://scores.espn.go.com/ncb/playbyplay?gameId=400551234"
r = rq.get(url)
soup = bsoup(r.content)

## Find all the rows that have classes. Remove the first one -- it's irrelevant.
trs = soup.find_all("tr", class_=True)[1:]

## Main procedure.
with open("scores.csv", "wb") as ofile:
    f = csv.writer(ofile)
    ## Write the headers.
    f.writerow(["Time", "Kentucky", "Score", "Connecticut"])
    ## For every tr tag in trs, there are anywhere from 2-4 td tags, depending
    ## on what is shown in the markup. For some rows, the third and fourth td
    ## elements don't exist (td[2] and td[3]). This is why we're going to use
    ## a simple try-except-finally block to properly catch this possibility.
    for tr in trs:
        tds = tr.find_all("td")
        time = tds[0].get_text().encode("utf-8")
        kentucky = tds[1].get_text().encode("utf-8")
        ## The following two columns don't always exist (ie. "End of Game" type of lines).
        ## We'll attempt to get them. However, if they're missing...
        try:
            score = tds[2].get_text().encode("utf-8")
            connecticut = tds[3].get_text().encode("utf-8")
        ## ... we assign them an empty string.
        except IndexError:
            score = ""
            connecticut = ""
        ## Finally, regardless of whether there are 2 or more td elements found, we
        ## write the result to the CSV file.
        finally:
            f.writerow([time, kentucky, score, connecticut])
The above will create a file named scores.csv in the folder where the script is stored. Check it, clean it up, do with it what you will. Just make sure you understand the code before celebrating. ;)
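One caveat that isn't from the original answer: the snippet above is Python 2 code (it opens the CSV in "wb" mode and calls .encode("utf-8") on the cell text). Under Python 3, the file handling would look roughly like this sketch:

import csv

## Python 3 variant of the file setup (an assumption; not part of the original answer).
with open("scores.csv", "w", newline="") as ofile:
    f = csv.writer(ofile)
    f.writerow(["Time", "Kentucky", "Score", "Connecticut"])
    ## ... same scraping loop as above, minus the .encode("utf-8") calls --
    ## get_text() already returns str in Python 3.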
Let me know if this helps.