What is the best way to strip HTML tags out of a Python list?

Asked: 2014-06-12 17:54:32

Tags: python beautifulsoup

Using BeautifulSoup, I have:

from bs4 import BeautifulSoup
import urllib2

url = "http://scores.espn.go.com/ncb/playbyplay?gameId=400551234"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page)
tr_tags = soup.findAll("tr", attrs={"class": True})
for tag in tr_tags:
    if "even" in tag["class"]:
        td_tagsa = soup.findAll("td")
    if "odd" in tag["class"]:
        td_tagsb = soup.findAll("td")
td_tagsa.extend(td_tagsb)
td_tags = td_tagsa
a = ''.join(td_tags.stripped_strings)

At this point I tried to use the stripped_strings command and got an error:

'list' object has no attribute 'stripped_strings'

However, when I try to join the elements into a str without stripping the HTML:

a = ''.join(td_tags)
TypeError: sequence item 0: expected string, Tag found

It seems as though once BeautifulSoup outputs a list, the HTML is locked in. Is there a way to get rid of the HTML tags after starting with the findAll command?
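
My guess is that findAll returns a list of Tag objects, and that stripped_strings belongs to each individual tag rather than to the list, so maybe something like this (untested) is needed:

a = ''.join(s for td in td_tags for s in td.stripped_strings)

Is that the right approach, or is there a better way?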

1 Answer:

Answer 0 (score: 1)

First, download and install the Requests library: Link. It is generally easier to use than urllib2, and it comes with all sorts of goodies that are useful for scraping.
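
For instance, grabbing a page's markup with Requests takes just a couple of lines (a minimal sketch, assuming Requests is installed via pip):

import requests

url = "http://scores.espn.go.com/ncb/playbyplay?gameId=400551234"
r = requests.get(url)  # fetch the page
page = r.content       # raw bytes of the markup, ready to feed to BeautifulSoup

Compare that with wiring up urllib2.urlopen(url).read().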

Second, you should get a grasp of the basics of BeautifulSoup, and of list comprehensions in general. I assume you haven't bothered with the documentation, because if you had, you would know that get_text() gets the text inside an element.
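
To illustrate (a toy example, not taken from the actual page):

from bs4 import BeautifulSoup

markup = "<table><tr><td>19:58</td><td>Jumper made</td></tr></table>"  # made-up markup
soup = BeautifulSoup(markup)
texts = [td.get_text() for td in soup.find_all("td")]  # ['19:58', 'Jumper made']
a = "".join(texts)

That list comprehension is exactly the kind of thing your join was reaching for.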

That said, my code is below. I used the Requests and csv libraries. It is pretty much an advanced version of what you did; note how it writes the results straight to a file. Be sure to read and understand the comments. I've basically done the work for you, so the least you can do is walk through the code and understand every line of it.

from bs4 import BeautifulSoup as bsoup
import requests as rq
import csv

## Create a soup from the URL's markup.
url = "http://scores.espn.go.com/ncb/playbyplay?gameId=400551234"
r = rq.get(url)
soup = bsoup(r.content)

## Find all the rows that have classes. Remove the first one -- it's irrelevant.
trs = soup.find_all("tr", class_=True)[1:]

## Main procedure.
with open("scores.csv", "wb") as ofile:

    f = csv.writer(ofile)

    ## Write the headers. 
    f.writerow(["Time","Kentucky","Score","Connecticut"])

    ## For every tr tag in trs, there are anywhere from 2-4 td tags, depending
    ## on what is shown in the markup. For some rows, the third and fourth td
    ## elements don't exist (td[2] and td[3]). This is why we're going to use
    ## a simple try-except-finally block to properly catch this possibility.
    for tr in trs:
        tds = tr.find_all("td")
        time = tds[0].get_text().encode("utf-8")
        kentucky = tds[1].get_text().encode("utf-8")
        ## The following two columns don't always exist (e.g. on "End of Game" lines).
        ## We'll attempt to get them. However, if an IndexError occurs...
        try:
            score = tds[2].get_text().encode("utf-8")
            connecticut = tds[3].get_text().encode("utf-8")
        ## ... we assign them an empty string instead.
        except IndexError:
            score = ""
            connecticut = ""
        ## Finally, regardless of whether there are 2 or more td elements found, we
        ## write the result to the CSV file.
        finally:
            f.writerow([time,kentucky,score,connecticut])

The above will create a file named scores.csv in the folder where the script is stored. Check it, clean it up, do with it what you will. Just make sure you understand the code before you celebrate. ;)
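
If you want to sanity-check the output from Python itself, you can read the file right back (a small sketch using the same csv module, in the same Python 2 style as above):

import csv

with open("scores.csv", "rb") as ifile:
    for row in csv.reader(ifile):
        print row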

Let me know if this helps.