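I am having a hard time making good use of Beautiful Soup for my use case. The content I want to extract sits in many similar, but not always identical, nested p tags. For example: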
<p><span class="example" data-location="1:20">20</span>normal string</p>
<p><span class="example" data-location="1:21">21</span>this text <strong>belongs together</strong></p>
<p><span class="example" data-location="1:22">22</span>some text (<span class="referencequote">a reference text</span>)that might continue</p>
<p><span class="example" data-location="1:23">23</span>more text</p><div class="linebreak"></div>
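<p><span class="example" data-location="1:22">24</span>text with (<span class="referencequote">first</span>)two references <span class="referencequote">second</span>.</p>
I need to save the string of the span tag as well as the string inside the p tag, regardless of its styling, and, where applicable, the references. So from the example above I would like to extract: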
example = 20, text = 'normal string', reference = []
example = 21, text = 'this text belongs together', reference = []
example = 22, text = 'some text that might continue', reference = ['a reference text']
example = 23, text = 'more text', reference = []
example = 24, text = 'text with two references', reference = ['first', 'second']
What I am trying is to collect all items with the class "example" and then loop over the contents of their parent.
from bs4 import NavigableString

verses = []
for span in bs.find_all("span", {"class": "example"}):
    references = []
    for item in span.parent.contents:
        if isinstance(item, NavigableString):
            text = item  # note: overwritten by every later string in the same <p>
        elif item.get("class", [""])[0] == "example":
            number = int(item.string)
        elif item.get("class", [""])[0] == "referencequote":
            references.append(item.string)
        else:
            pass  # how to handle <strong> tags?
    verses.append(MyClassObject(n=number, t=text, r=references))
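My approach is very error-prone, and there may be more tags such as <strong> and <em> that I am currently ignoring. Unfortunately, the get_text() method returns something like "22some text (a reference text)that might continue".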
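For reference, the get_text() behaviour described above can be reproduced with a minimal sketch. It assumes the bs4 package with the built-in "html.parser" backend and uses only the third sample paragraph; the variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

# Third sample paragraph from the question.
html = (
    '<p><span class="example" data-location="1:22">22</span>some text '
    '(<span class="referencequote">a reference text</span>)that might continue</p>'
)
soup = BeautifulSoup(html, "html.parser")

# get_text() concatenates every descendant string in document order, so the
# number and the reference end up mixed into the sentence.
flat = soup.p.get_text()
print(repr(flat))
```

The number, the body text and the reference cannot be told apart in that single string, which is why the spans have to be handled separately.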
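There must be an elegant way to extract this information. Could you give me some ideas for other approaches? Thanks in advance!
Answer 0: (score: 1)
Try this.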
from simplified_scrapy.core.regex_helper import replaceReg
from simplified_scrapy import SimplifiedDoc,utils
html = '''
<p><span class="example" data-location="1:20">20</span>normal string</p>
<p><span class="example" data-location="1:21">21</span>this text <strong>belongs together</strong></p>
<p><span class="example" data-location="1:22">22</span>some text (<span class="referencequote">a reference text</span>)that might continue</p>
<p><span class="example" data-location="1:23">23</span>more text</p><div class="linebreak"></div>
<p><span class="example" data-location="1:22">24</span>text with (<span class="referencequote">first</span>)two references <span class="referencequote">second</span>.</p>
'''
html = replaceReg(html,"<[/]*strong>","") # Pretreatment
doc = SimplifiedDoc(html)
ps = doc.ps
for p in ps:
    text = ''.join(p.spans.nextText())
    text = replaceReg(text,"[()]+","") # Remove ()
    span = p.span # Get first span
    spans = span.getNexts(tag="span").text # Get references
    print (span["class"], span.text, text, spans)
Result:
example 20 normal string []
example 21 this text belongs together []
example 22 some text that might continue ['a reference text']
example 23 more text []
example 24 text with two references. ['first', 'second']
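There are more examples here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
Answer 1: (score: 0)
I found a different approach, without regular expressions, that is probably more robust against the different spans that may appear: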
# bsItem is assumed to be one <p> element
for s in bsItem.select('span'):
    if s['class'][0] == 'example':
        # do whatever needed with the content of this span
        s.extract()
    elif s['class'][0] == 'referencequote':
        # do whatever needed with the content of this span
        s.extract()
    # check for all spans with a class where you want the text excluded
# finally get all the text, dropping the empty parentheses left by removed references
text = bsItem.text.replace('()', '')
Maybe this approach is interesting for whoever reads this :)
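The extract() idea above can be turned into a complete, runnable sketch against the sample HTML from the question. This is only one possible arrangement: the helper name parse_paragraph, the tuple layout and the whitespace clean-up are assumptions made for illustration, and it assumes exactly one span with class "example" per paragraph.

```python
from bs4 import BeautifulSoup

html = """
<p><span class="example" data-location="1:20">20</span>normal string</p>
<p><span class="example" data-location="1:21">21</span>this text <strong>belongs together</strong></p>
<p><span class="example" data-location="1:22">22</span>some text (<span class="referencequote">a reference text</span>)that might continue</p>
<p><span class="example" data-location="1:23">23</span>more text</p><div class="linebreak"></div>
<p><span class="example" data-location="1:22">24</span>text with (<span class="referencequote">first</span>)two references <span class="referencequote">second</span>.</p>
"""


def parse_paragraph(p):
    """Hypothetical helper: return (number, text, references) for one <p>."""
    # Remove the numbering span from the tree and keep its content.
    number = int(p.find("span", class_="example").extract().get_text())
    # Remove every reference span and collect its text.
    references = [
        s.extract().get_text()
        for s in p.find_all("span", class_="referencequote")
    ]
    # Whatever is left is body text; <strong>, <em> and similar inline tags
    # are flattened by get_text() without special handling.
    text = p.get_text().replace("()", "")
    text = " ".join(text.split())  # collapse runs of whitespace
    return number, text, references


soup = BeautifulSoup(html, "html.parser")
rows = [
    parse_paragraph(p)
    for p in soup.find_all("p")
    if p.find("span", class_="example")
]
for row in rows:
    print(row)
```

Because the spans are removed before get_text() is called, new inline tags do not need their own branch. One caveat: in the last sample paragraph the trailing period survives as " .", so any punctuation clean-up would still have to be added separately.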