Using the Beautiful Soup module, how can I get the data of a div tag whose class name is feeditemcontent cxfeeditemcontent? Is it:
soup.class['feeditemcontent cxfeeditemcontent']
or:
soup.find_all('class')
This is the HTML source:
<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>
Answer 0: (score: 22)
Beautiful Soup 4 treats the value of the "class" attribute as a list rather than a string, which means jadkik94's solution can be simplified:
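from bs4 import BeautifulSoup

def match_class(target):
    # Build a filter that accepts tags carrying every class in `target`.
    def do_match(tag):
        classes = tag.get('class', [])  # bs4 returns the class value as a list
        return all(c in classes for c in target)
    return do_match

soup = BeautifulSoup(html, "html.parser")  # `html` holds the markup from the question
print(soup.find_all(match_class(["feeditemcontent", "cxfeeditemcontent"])))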
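To see what "a list rather than a string" means in practice, here is a minimal sketch, assuming bs4 is installed and using Python's built-in html.parser. The sample markup and the variable names are illustrative only, not part of the question.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the same two classes, written in reverse order.
html = '<div class="cxfeeditemcontent feeditemcontent"><span>data</span></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div")
classes = div.get("class", [])  # a list of class names, not one string
print(classes)

# A set comparison ignores the order in which the classes were written.
wanted = {"feeditemcontent", "cxfeeditemcontent"}
has_all = wanted.issubset(classes)
print(has_all)
```

Because the check is done per class name, it does not depend on the order of the names in the attribute.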
Answer 1: (score: 10)
Try this. It may be overkill for something this simple, but it works:
from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3

def match_class(target):
    target = target.split()
    def do_match(tag):
        try:
            # In Beautiful Soup 3, tag.attrs is a list of (name, value) pairs.
            classes = dict(tag.attrs)["class"]
        except KeyError:
            classes = ""
        classes = classes.split()
        return all(c in classes for c in target)
    return do_match

html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>"""

soup = BeautifulSoup(html)
matches = soup.findAll(match_class("feeditemcontent cxfeeditemcontent"))
for m in matches:
    print(m)
    print("-"*10)
matches = soup.findAll(match_class("feeditembody"))
for m in matches:
    print(m)
    print("-"*10)
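The core of this answer is the split-and-compare step, which can be tried without Beautiful Soup at all. The sketch below isolates that logic in a hypothetical helper, has_classes, which is not part of any library; it only uses plain strings.

```python
def has_classes(class_attr, wanted):
    """Return True if every class named in `wanted` appears in `class_attr`."""
    present = class_attr.split()
    return all(c in present for c in wanted.split())

# Order of the class names does not matter.
print(has_classes("feeditemcontent cxfeeditemcontent",
                  "cxfeeditemcontent feeditemcontent"))
# A partial name is not a match, because whole tokens are compared.
print(has_classes("feeditemcontent cxfeeditemcontent", "feeditem"))
# A missing class attribute is treated as an empty string.
print(has_classes("", "feeditemcontent"))
```

Comparing whole tokens is what keeps "feeditemcontent" from matching inside "cxfeeditemcontent".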
Answer 2: (score: 6)
soup.findAll("div", class_="feeditemcontent cxfeeditemcontent")
So, if I want to get all div tags with the class header (<div class="header">) from stackoverflow.com, an example with BeautifulSoup would be:
from bs4 import BeautifulSoup as bs
import requests
url = "http://stackoverflow.com/"
html = requests.get(url).text
soup = bs(html, "html.parser")  # name the parser explicitly so bs4 does not have to guess
tags = soup.findAll("div", class_="header")
This is already covered in the bs4 documentation.
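One caveat that is worth checking against your installed version: when class_ is given a string that contains a space, bs4 compares it with the exact value of the class attribute, so the same classes written in a different order are not matched. A minimal sketch, assuming bs4 with its CSS selector support (select) is available; the markup is made up for illustration:

```python
from bs4 import BeautifulSoup

# Illustrative markup: two divs with the same classes in a different order.
html = """
<div class="feeditemcontent cxfeeditemcontent">first</div>
<div class="cxfeeditemcontent feeditemcontent">second</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: the order matters.
exact = soup.find_all("div", class_="feeditemcontent cxfeeditemcontent")

# CSS selector: requires both classes, in any order.
both = soup.select("div.feeditemcontent.cxfeeditemcontent")

print(len(exact), len(both))
```

If the order of the classes in the page is not guaranteed, the selector form is the safer of the two.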
Answer 3: (score: 3)
soup.find("div", {"class" : "feeditemcontent cxfeeditemcontent"})
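Since the question asks for the data inside the div, here is a rough sketch of the remaining step, assuming bs4 and the HTML from the question. The variable names are only for illustration, and find() is used on the assumption that a single matching div is expected.

```python
from bs4 import BeautifulSoup

html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>"""
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", {"class": "feeditemcontent cxfeeditemcontent"})
# find() returns None when nothing matches, so guard before drilling down.
span = div.find("span") if div is not None else None
text = span.get_text() if span is not None else None
print(text)
```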
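Answer 4: (score: 3)
from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3

# Parse a local file and collect the divs whose id attribute is exactly 'abc def'.
with open('a.htm') as f:
    soup = BeautifulSoup(f)
divs = soup.findAll('div', attrs={'id': 'abc def'})  # renamed from `list` to avoid shadowing the built-in
print(divs)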
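Answer 5: (score: 0)
Check this bug report: https://bugs.launchpad.net/beautifulsoup/+bug/410304
As you can see, Beautiful Soup can't really understand class="a b" as two classes, a and b.
However, as the first comment there shows, a simple regular expression is enough. In your case: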
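import re
from BeautifulSoup import BeautifulSoup  # assuming Beautiful Soup 3; with bs4 use: from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
for x in soup.findAll("div", {"class": re.compile(r"\bfeeditemcontent\b")}):
    print("result: %s" % x)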
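Note: this has been fixed in a recent beta. I haven't gone through the documentation for the recent versions; maybe you can do that. Or, if you want to get it working with the older version, you can use the code above.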
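For anyone trying the regular expression on bs4 instead, here is a hedged sketch. It assumes bs4 tests the pattern against each class name separately, which matches my understanding of its documentation; the second and third div in the markup are invented to show the edge cases.

```python
import re
from bs4 import BeautifulSoup

# Illustrative markup; only the first div comes from the question.
html = """
<div class="feeditemcontent cxfeeditemcontent">a</div>
<div class="feeditemcontent-old">b</div>
<div class="cxfeeditemcontent">c</div>
"""
soup = BeautifulSoup(html, "html.parser")

# \b treats "-" as a word boundary, so "feeditemcontent-old" also matches.
loose = soup.find_all("div", {"class": re.compile(r"\bfeeditemcontent\b")})

# Anchoring the pattern matches only a class named exactly "feeditemcontent".
strict = soup.find_all("div", {"class": re.compile(r"^feeditemcontent$")})

print([d.get_text() for d in loose])
print([d.get_text() for d in strict])
```

The anchored pattern is the stricter choice when class names on the page may contain hyphens.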