How to extract information in Python using Beautiful Soup

Asked: 2011-11-30 16:00:38

Tags: python beautifulsoup

<font class="detDesc">Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER002</a></font>

I need "Uploaded 10-29 18:50, Size 4.36 GiB" and "NLUPPER002" in two separate arrays. How can I do that?

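Edit:

This is part of an HTML page that contains many font tags like this one, each with different values. I need a general solution, using Beautiful Soup if possible. Also, as suggested, I will look into regular expressions.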

Edit2:

I have a question about this. If we use "class" as a key when searching the soup, won't it clash with the Python keyword class and raise an error?

1 Answer:

Answer 0: (score: 2)

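# Assumes Beautiful Soup 3 on Python 2, which matches the output shown in this answer
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(your_data)
uploaded = []   # text before the link in each <font class="detDesc">
link_data = []  # text of the <a> inside each <font class="detDesc">
for f in soup.findAll("font", {"class": "detDesc"}):
    uploaded.append(f.contents[0])
    link_data.append(f.a.contents[0])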

For example, with the following data:

your_data = """
<font class="detDesc">Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER002</a></font>
<div id="meow">test</div>
<font class="detDesc">Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER003</a></font>
"""

Running the code above gives you:

>>> print uploaded
[u'Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by ', u'Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by ']
>>> print link_data
[u'NLUPPER002', u'NLUPPER003']

To get the text in the exact form you mentioned, you can either post-process the list or parse the data inside the loop. For example:

>>> [",".join(x.split(",")[:2]).replace("&nbsp;", " ") for x in uploaded]
[u'Uploaded 10-29 18:50, Size 4.36 GiB', u'Uploaded 10-26 19:23, Size 1.16 GiB']
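
The same post-processing can also be wrapped in a small helper. What follows is only a sketch: it assumes Python 3 (for html.unescape), and the name clean_upload_text is made up for illustration. It decodes the &nbsp; entities instead of replacing them by hand, then keeps the first two comma-separated fields.

```python
import html


def clean_upload_text(raw):
    """Return 'Uploaded ..., Size ...' from one raw text node (hypothetical helper)."""
    # Decode entities such as &nbsp; (which becomes U+00A0), then use plain spaces.
    text = html.unescape(raw).replace("\xa0", " ")
    # Keep only the first two comma-separated fields; this drops the trailing "ULed by".
    fields = [part.strip() for part in text.split(",")[:2]]
    return ", ".join(fields)


# Sample input copied from the output shown in this answer.
uploaded = [
    "Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by ",
    "Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by ",
]
cleaned = [clean_upload_text(x) for x in uploaded]
print(cleaned)
```

Decoding the entity first means the helper would also cope with other entities in the text, not only &nbsp;.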

P.S. If you are a fan of list comprehensions, the solution can be expressed as a one-liner:

output = [(f.contents[0], f.a.contents[0]) for f in soup.findAll("font", {"class":"detDesc"})]

This gives you:

>>> output  # list of tuples
[(u'Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by ', u'NLUPPER002'), (u'Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by ', u'NLUPPER003')]

>>> uploaded, link_data = zip(*output)  # split into two separate tuples (wrap each in list() if you need lists)
>>> uploaded
(u'Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by ', u'Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by ')
>>> link_data
(u'NLUPPER002', u'NLUPPER003')
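
On the Edit2 question: using "class" as a dictionary key is safe, because there it is an ordinary string and not the Python keyword. It only becomes a problem if you try to write it as a keyword argument. The snippet below is a small standard-library sketch (Python 3, no Beautiful Soup needed) that illustrates the difference; the variable names are made up for the demonstration.

```python
import keyword

# "class" is a reserved word, so it cannot be used as a bare name...
is_reserved = keyword.iskeyword("class")
print(is_reserved)

# ...but as a string it is just data, so using it as a dict key is fine.
attrs = {"class": "detDesc"}
print(attrs["class"])

# Writing it as a keyword argument is what fails, and it fails at compile time.
try:
    compile('dict(class="detDesc")', "<example>", "eval")
    keyword_arg_ok = True
except SyntaxError:
    keyword_arg_ok = False
print(keyword_arg_ok)
```

If you do want the keyword-argument form, Beautiful Soup 4 accepts class_ (with a trailing underscore) for this purpose, e.g. soup.find_all("font", class_="detDesc").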