I want to grab a parent tag if it contains a marker, let's say MARKER. For example, I have:
<a>
<b>
<c>
MARKER
</c>
</b>
<b>
<c>
MARKER
MARKER
</c>
</b>
<b>
<c>
stuff
</c>
</b>
</a>
I want to grab:
<b>
<c>
MARKER
</c>
</b>
<b>
<c>
MARKER
MARKER
</c>
</b>
My current code is:
for stuff in soup.find_all(text=re.compile("MARKER")):
    post = stuff.find_parent("b")
This sort of works, but it gives me:
<b>
<c>
MARKER
</c>
</b>
<b>
<c>
MARKER
MARKER
</c>
</b>
<b>
<c>
MARKER
MARKER
</c>
</b>
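The reason this happens is obvious: it prints the entire containing tag once for each MARKER found, so a tag containing two MARKERs gets printed twice. However, I don't know how to tell BeautifulSoup not to search inside a given tag again once it is done with it (which, I suspect, can't be done?), or how to prevent this some other way, other than perhaps indexing everything into a dictionary and rejecting duplicates.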
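A minimal sketch of that duplicate-rejecting fallback, for reference. The helper name `parents_with_marker` and the sample markup are made up for illustration; in particular, the sample assumes the two MARKERs are separated by a `<br/>`, so that they are separate text nodes and each one is matched on its own. It uses `string=`, the newer spelling of the `text=` argument in bs4.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: the <br/> splits the second <c> into two text nodes,
# so the search below matches that <b> twice.
DOC = """
<a>
<b><c>MARKER</c></b>
<b><c>MARKER<br/>MARKER</c></b>
<b><c>stuff</c></b>
</a>
"""

def parents_with_marker(soup, pattern, parent_name):
    """Return each enclosing parent_name tag once, in document order."""
    seen = set()   # ids of parents already collected
    result = []
    for text_node in soup.find_all(string=pattern):
        parent = text_node.find_parent(parent_name)
        # Skip text with no such ancestor, and parents already reported.
        if parent is None or id(parent) in seen:
            continue
        seen.add(id(parent))
        result.append(parent)
    return result

soup = BeautifulSoup(DOC, "html.parser")
posts = parents_with_marker(soup, re.compile("MARKER"), "b")
for post in posts:
    print(post)
```

Tracking `id(parent)` rather than the tag itself is deliberate: bs4 compares tags by content, so two different posts with identical markup would otherwise be treated as duplicates.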
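Edit: Here is the specific case I'm working on that is giving me trouble, because for some reason the stripped-down version above doesn't actually reproduce the error. (If anyone is curious, it's a particular forum thread that I'm going through post by post.)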
from bs4 import BeautifulSoup
import urllib.request
import re

url = 'http://forums.spacebattles.com/threads/asukaquest-3-starfish-eater.258271/page-179'
soup = urllib.request.urlopen(url).read()
sbsoup = BeautifulSoup(soup)
for stuff in sbsoup.find_all(text=re.compile(r"\[[Xx]\]")):
    post = stuff.find_parent("li")
    print(post.find("a", class_="username").string)
    print(post.find("blockquote", class_="messageText ugc baseHtml").get_text())
Answer 0 (score: 0)
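I wrote this with bs3; it should probably work with bs4 too, and the concept is the same either way. Basically, the li tag carries the username in its "data-author" attribute, so you don't need to find a lower-level tag and then look up to the parent li.
You only seem to be interested in the blockquote tags that contain the "marker", so why not specify that?
Lambda functions are generally the most versatile way to query BeautifulSoup.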
# Import system libraries
import re
import urllib.request

# Import custom libraries
from bs4 import BeautifulSoup

# The url variable to be searched
url = 'http://forums.spacebattles.com/threads/asukaquest-3-starfish-eater.258271/page-179'

# Create a request object
request = urllib.request.Request(url)

# Attempt to open the request and read the response
try:
    response = urllib.request.urlopen(request)
    the_page = response.read()
except Exception:
    the_page = ""

# If the response exists, create a BeautifulSoup from it
if the_page:
    soup = BeautifulSoup(the_page, "html.parser")

    # Define the search location for the desired tags.
    # bs4 exposes a tag's "class" attribute as a list of class names.
    li_location = lambda x: x.name == "li" and x.get("class") == ["message"]
    x_location = lambda x: x.name == "blockquote" and bool(re.search(r"\[[Xx]\]", x.text))

    # Iterate through all the found lis
    for li in soup.find_all(li_location):
        # Print the author name
        print(li["data-author"])

        # Iterate through all the found blockquotes containing the marker
        for xs in li.find_all(x_location):
            # Print the text of the found blockquote
            print(xs.text)
        print("")
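Applied to the simplified example from the question, the same idea can be sketched as follows. This is only an illustration under assumptions: the function name `has_marker` is made up, and the sample markup puts a `<br/>` between the two MARKERs so that they are separate text nodes, which is the situation where searching by text reports the same parent twice. Because the filter is evaluated once per tag rather than once per text match, each `<b>` comes back at most once.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the question's example.
DOC = """
<a>
<b><c>MARKER</c></b>
<b><c>MARKER<br/>MARKER</c></b>
<b><c>stuff</c></b>
</a>
"""

def has_marker(tag):
    """Match <b> tags whose combined text contains the marker."""
    return tag.name == "b" and "MARKER" in tag.get_text()

soup = BeautifulSoup(DOC, "html.parser")

# find_all calls has_marker on every tag and keeps those returning True.
matches = soup.find_all(has_marker)
for match in matches:
    print(match)
```

A named function is used here instead of a lambda purely for readability; `find_all` accepts either.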