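My problem is with BeautifulSoup and Python. I am trying to scrape a website, but the div and class name I need appear in several places in the HTML, so when I scrape it I only get the first match inside each one. Here is an example: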
from bs4 import BeautifulSoup
import csv
import urllib2

url = 'http://www.thinkgeek.com/interests/marvel/?icpg=HP_BrandLogos_Top_Color_Marvel'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
for a in soup.findAll("div", {"class": "footer-link-column"}):
    print a.a.contents[0]
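If I run it, it returns only the first link text from each section. Any help would be appreciated. (This website is just an example; the real site has the same problem.)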
Answer 0 (score: 0)
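I think you want find_all here. a.a returns only the first <a> tag inside each div, whereas find_all("a") returns every one of them:

for div in soup.find_all("div", {"class": "footer-link-column"}):
    # Collect the text of every link in this column, one per line.
    print("\n".join([link.text.strip() for link in div.find_all("a")]))
    print(" ")

Output: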
Returns & Exchanges
Order Status
Shipping
Accounts
Ordering
Size Charts
Gift Options
Gift Certificates
International Orders
Privacy & Security
Terms of Use
Live Chat
About Us
Jobs
Our Blog
Press
Contact Us
Newsletter
Volume Purchases
Affiliates
Sitemap
Account
Order Management
GeekPoints
Forgot Password
Wish Lists
Return Requests
Address Book
Submit Action Shot
Submit a T-Shirt Design
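
To see the difference between the two lookups without hitting the network, here is a minimal sketch in Python 3. The HTML snippet and its link names are made up for illustration and are not taken from the real page; only the class name footer-link-column comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real footer (an assumption).
SAMPLE_HTML = """
<div class="footer-link-column">
  <h4>Customer Service</h4>
  <ul>
    <li><a href="/returns">Returns</a></li>
    <li><a href="/shipping">Shipping</a></li>
  </ul>
</div>
<div class="footer-link-column">
  <h4>About</h4>
  <ul>
    <li><a href="/jobs">Jobs</a></li>
    <li><a href="/press">Press</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
columns = soup.find_all("div", {"class": "footer-link-column"})

# Attribute access (div.a) is shorthand for div.find("a"): first match only.
first_links = [div.a.get_text(strip=True) for div in columns]

# find_all("a") returns every <a> descendant of the div.
all_links = [
    [link.get_text(strip=True) for link in div.find_all("a")]
    for div in columns
]

print(first_links)  # ['Returns', 'Jobs']
print(all_links)    # [['Returns', 'Shipping'], ['Jobs', 'Press']]
```

The first list reproduces the symptom from the question (one link per column), and the second shows what find_all returns for the same columns.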
Answer 1 (score: 0)
Use select() and loop over the sections and the links separately:
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.thinkgeek.com/interests/marvel/?icpg=HP_BrandLogos_Top_Color_Marvel'
soup = BeautifulSoup(urllib2.urlopen(url))

for section in soup.select("div.footer-link-column"):
    # The <h4> inside each column holds the section title.
    section_name = section.h4.get_text(strip=True)
    print section_name
    print

    # Calling a tag is shorthand for find_all: every <a> in this column.
    for item in section('a'):
        print item.get_text(strip=True)
    print "----"
Prints:
Customer Service
Returns & Exchanges
Order Status
Shipping
Accounts
Ordering
Size Charts
Gift Options
Gift Certificates
International Orders
Privacy & Security
Terms of Use
Live Chat
----
About ThinkGeek
About Us
Jobs
Our Blog
Press
Contact Us
Newsletter
Volume Purchases
Affiliates
Sitemap
----
Come to the "not-so-dark" side
Account
Order Management
GeekPoints
Forgot Password
Wish Lists
Return Requests
Address Book
Submit Action Shot
Submit a T-Shirt Design
----
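
The code above is Python 2 (urllib2 and the print statement). On Python 3, urllib.request.urlopen takes the place of urllib2.urlopen. As a rough sketch of the same select() approach in Python 3, the helper below groups the link texts by section heading. The function name links_by_section, the "(untitled)" fallback, and the sample markup are all assumptions made for illustration; the sample is parsed from a string so the example runs offline.

```python
from bs4 import BeautifulSoup


def links_by_section(html):
    """Map each footer column's heading to the list of its link texts."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for section in soup.select("div.footer-link-column"):
        heading = section.find("h4")
        # Fall back to a placeholder if a column has no <h4> heading.
        name = heading.get_text(strip=True) if heading else "(untitled)"
        result[name] = [a.get_text(strip=True) for a in section.select("a")]
    return result


# Hypothetical markup (an assumption, not copied from the real site).
SAMPLE_HTML = """
<div class="footer-link-column">
  <h4>Customer Service</h4>
  <a href="/returns">Returns</a>
  <a href="/shipping">Shipping</a>
</div>
<div class="footer-link-column">
  <h4>About</h4>
  <a href="/jobs">Jobs</a>
</div>
<div class="footer-link-column">
  <a href="/sitemap">Sitemap</a>
</div>
"""

sections = links_by_section(SAMPLE_HTML)
for name, links in sections.items():
    print(name)
    for text in links:
        print("  " + text)
    print("----")
```

Returning a dict instead of printing inside the loop makes it easier to write the results out afterwards, for example with the csv module the question already imports.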