Searching for items in BeautifulSoup when a class name appears in multiple places

Date: 2015-01-30 22:59:55

Tags: python html beautifulsoup html-parsing

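My problem involves BeautifulSoup and Python. I am trying to scrape a website, but the same div and class names appear in multiple places throughout the HTML, so when I scrape it, only the first match for each class is shown. Here is an example: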

from bs4 import BeautifulSoup
import csv
import urllib2

url= 'http://www.thinkgeek.com/interests/marvel/?icpg=HP_BrandLogos_Top_Color_Marvel'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

for a in soup.findAll("div",{"class": "footer-link-column"}):
    print a.a.contents[0]

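When I run it, it returns only the first link scraped from each section. Any help would be appreciated. (This site is just an example; the real site has the same problem.)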

2 answers:

Answer 0 (score: 0)

I think you want to call find_all() on each div so that you get every link in it, not just the first:

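for div in soup.find_all("div", {"class": "footer-link-column"}):
    # find_all("a") returns every link inside this column, not just the first
    print("\n".join(a.text.strip() for a in div.find_all("a")))
    # separator line between columns
    print(" ")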



Returns & Exchanges
Order Status
Shipping
Accounts
Ordering
Size Charts
Gift Options
Gift Certificates
International Orders
Privacy & Security
Terms of Use
Live Chat

About Us
Jobs
Our Blog
Press
Contact Us
Newsletter
Volume Purchases
Affiliates
Sitemap

Account
Order Management
GeekPoints
Forgot Password
Wish Lists
Return Requests
Address Book
Submit Action Shot
Submit a T-Shirt Design
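
The question's loop prints one link per section because `a.a` is shorthand for `a.find("a")`, which returns only the first matching descendant. The following is a minimal sketch of that difference. It uses a small made-up HTML fragment (the `SAMPLE_HTML` string is an assumption for illustration, not the real page) and Python's built-in `html.parser`, so it runs without network access:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page's footer.
SAMPLE_HTML = """
<div class="footer-link-column">
  <h4>Customer Service</h4>
  <ul>
    <li><a href="/returns">Returns &amp; Exchanges</a></li>
    <li><a href="/status">Order Status</a></li>
  </ul>
</div>
<div class="footer-link-column">
  <h4>About ThinkGeek</h4>
  <ul>
    <li><a href="/about">About Us</a></li>
    <li><a href="/jobs">Jobs</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
columns = soup.find_all("div", {"class": "footer-link-column"})

# column.a is shorthand for column.find("a"): only the first link per column.
first_links = [column.a.get_text(strip=True) for column in columns]

# column.find_all("a") returns every link inside the column.
all_links = [
    [link.get_text(strip=True) for link in column.find_all("a")]
    for column in columns
]

print(first_links)
print(all_links)
```

Here `first_links` holds one entry per column, while `all_links` holds the full list of links for each column.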

Answer 1 (score: 0)

Use select() and loop over the sections and their links separately:

import urllib2
from bs4 import BeautifulSoup

url = 'http://www.thinkgeek.com/interests/marvel/?icpg=HP_BrandLogos_Top_Color_Marvel'
soup = BeautifulSoup(urllib2.urlopen(url))

for section in soup.select("div.footer-link-column"):
    section_name = section.h4.get_text(strip=True)
    print section_name
    print
    for item in section('a'):
        print item.get_text(strip=True)
    print "----"

Prints:

Customer Service

Returns & Exchanges
Order Status
Shipping
Accounts
Ordering
Size Charts
Gift Options
Gift Certificates
International Orders
Privacy & Security
Terms of Use
Live Chat
----
About ThinkGeek

About Us
Jobs
Our Blog
Press
Contact Us
Newsletter
Volume Purchases
Affiliates
Sitemap
----
Come to the "not-so-dark" side

Account
Order Management
GeekPoints
Forgot Password
Wish Lists
Return Requests
Address Book
Submit Action Shot
Submit a T-Shirt Design
----
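
The code above targets Python 2 (`urllib2`, `print` statements). On Python 3 the same section-then-links loop might look like the sketch below. The helper name `links_by_section` and the sample markup are hypothetical, and the parser is set explicitly to `html.parser`. To run it against a live page, you could fetch the HTML with `urllib.request.urlopen(url).read()` and pass the result in.

```python
from bs4 import BeautifulSoup


def links_by_section(html):
    """Map each footer column's heading to the list of its link texts."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for section in soup.select("div.footer-link-column"):
        heading = section.find("h4")
        # Fall back to an empty name if a column has no <h4> heading.
        name = heading.get_text(strip=True) if heading else ""
        result[name] = [a.get_text(strip=True) for a in section.select("a")]
    return result


# Hypothetical markup, used here instead of a live request.
sample = """
<div class="footer-link-column">
  <h4>Customer Service</h4>
  <a href="/shipping">Shipping</a>
  <a href="/chat">Live Chat</a>
</div>
<div class="footer-link-column">
  <h4>About ThinkGeek</h4>
  <a href="/press">Press</a>
</div>
"""

sections = links_by_section(sample)
for name, links in sections.items():
    print(name)
    for text in links:
        print("  " + text)
```

Returning a dictionary rather than printing directly keeps the grouping available for later use, for example when writing rows with the `csv` module that the question imports.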