Trying to get only the text between two strong tags

Date: 2019-08-15 20:08:58

Tags: python web-scraping beautifulsoup

I am currently trying to get only the HTML text (a list of names) that sits between the first two occurrences of a strong tag.

Here is a short sample of the HTML I scraped:

<h3>Title of Article</h3>

<p><strong>Section Header 1</strong></p>

<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>

<p>PRESENT:</p>

<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>

....
....
....

<p>William, Baller</p>

<p>Roy Williams, Coach</p>

<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
....
....
....
....
....


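Here is some quick code I wrote, with the basic logic of counting occurrences of the strong tag. I know that once the second occurrence has been reached, all the names I want have been collected.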

import requests
from bs4 import BeautifulSoup as BS

html = requests.get('https://www.somewebsite.com')
soup = BS(html.text, 'html.parser')

# Pull only the HTML from the article that I am interested in
notes = soup.find('div', attrs={'id': 'article'})


# Return True if a string contains a closing </strong> tag
def findstrong(i):
    return "</strong>" in i


# Counter for strong tags; after the second strong I know all the
# names I am interested in have been collected
strong_counts = 0

list_of_names = []
for i in range(len(notes.contents)):

    if strong_counts < 2:

        note = notes.contents[i]
        # Convert note to a string so we can use the findstrong function
        note_2_str = str(note)

        if not findstrong(note_2_str):
            list_of_names.append(note)
        else:
            strong_counts += 1

The loop works: it collects all the text before the first strong tag, plus everything up to the next occurrence of a strong tag, i.e.

<h3>Title of Article</h3>

<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>

<p>PRESENT:</p>

<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>

....
....
....

<p>William, Baller</p>

<p>Roy Williams, Coach</p>

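This basically does what I want, but because the result is now a list, I lose some of the functionality of the BeautifulSoup object. Is there a BeautifulSoup function that can help me do this, or some other option? Or should I focus on making this loop more efficient before scaling it up to multiple articles?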
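One possible way to keep working with BeautifulSoup objects is to walk the direct children of the article container and keep the tags that fall between the first two blocks containing a strong tag. The following is only a sketch under assumptions: the sample markup is a shortened version of the HTML above, and `tags_between_first_two_strongs` is a hypothetical helper name, not a BeautifulSoup function.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, assumed version of the article markup from the question.
SAMPLE_HTML = """
<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information</p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text</p>
<p>Later text</p>
</div>
"""


def tags_between_first_two_strongs(container):
    """Return direct child tags located between the 1st and 2nd block holding a <strong>."""
    between = []
    seen_first = False
    # find_all(True, recursive=False) yields only the direct child tags.
    for child in container.find_all(True, recursive=False):
        has_strong = child.name == "strong" or child.find("strong") is not None
        if has_strong:
            if seen_first:
                break  # second strong block reached: stop collecting
            seen_first = True
        elif seen_first:
            between.append(child)
    return between


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
article = soup.find("div", attrs={"id": "article"})
section = tags_between_first_two_strongs(article)
print([tag.get_text(" ", strip=True) for tag in section])
```

Because every item in `section` is still a `Tag`, methods such as `find`, `get_text`, and `select` remain available on each collected element. Unlike the loop in the question, this sketch skips the content before the first strong tag and stops at the second one.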

3 Answers:

Answer 0 (score: 2)

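This assumes the page contains a marker string to anchor on, e.g. PRESENT:. It yields the list of names (taken from the p elements that follow the marker). Requires bs4 4.7.1+.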

from bs4 import BeautifulSoup as bs

html = '''
<h3>Title of Article</h3>    
<p><strong>Section Header 1</strong></p>    
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>   
<p>PRESENT:</p>   
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
<p>Other</p>'''

soup = bs(html, 'lxml')
select_html = soup.select('p:contains("PRESENT:") ~ p:not(p:contains("Section Header 2") ~ p, p:contains("Section Header 2"))')
names = [line for p in select_html for line in p.text.split('\n')]
print(names)


Answer 1 (score: 1)

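Going by the title, Trying to get only the text between two strong tags, if that is really what you want, you can use the code below. We use the CSS level 4 :has() to test that an element contains certain elements, and the CSS level 4 :nth-child(x of s) to target specific instances of a compound selector (in our case, the 1st and 2nd p tags that contain a strong tag).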

from bs4 import BeautifulSoup

html = '''
<h3>Title of Article</h3>

<p><strong>Section Header 1</strong></p>

<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>

<p>PRESENT:</p>

<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>

....
....
....

<p>William, Baller</p>

<p>Roy Williams, Coach</p>

<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
....
....
....
....
....
'''

soup = BeautifulSoup(html, 'html.parser')
print(soup.select('p:nth-child(1 of :has(strong)) ~ *:has(~ p:nth-child(2 of :has(strong)))'))

Output:

[<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>, <p>PRESENT:</p>, <p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>, <p>George Jungle, Savage</p>, <p>William, Baller</p>, <p>Roy Williams, Coach</p>]

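If we really only want the list of names, we change the selector so that it starts collecting elements after the paragraph that contains PRESENT::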

soup.select('p:contains("PRESENT:") ~ *:has(~ p:nth-child(2 of :has(strong)))')

Output:

[<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>, <p>George Jungle, Savage</p>, <p>William, Baller</p>, <p>Roy Williams, Coach</p>]

At this point, you can simply extract whatever content you need.
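As one illustration of that extraction step, the sketch below turns the selected p elements into a flat list of names. It is an assumption-laden example: `selected` stands in for the list returned by `soup.select(...)` and is built here from a small snippet so the code runs on its own, and the "Name, Role" split is a hypothetical post-processing step.

```python
from bs4 import BeautifulSoup

# Stand-in for the <p> elements returned by soup.select(...).
snippet = """
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>Roy Williams, Coach</p>
"""
selected = BeautifulSoup(snippet, "html.parser").find_all("p")

# stripped_strings yields each text node with surrounding whitespace removed,
# so names separated by <br/> come out as separate items.
names = [text for p in selected for text in p.stripped_strings]

# Hypothetical post-processing: split "Name, Role" on the first comma.
people = [tuple(part.strip() for part in name.split(",", 1)) for name in names]
print(people)
```

Using `stripped_strings` rather than splitting `.text` on newlines avoids depending on how the source HTML happens to be line-wrapped.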

Answer 2 (score: 1)

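This answers the question as asked, leaving the scraped "Title of Article" and the footnotes in place. You can use findChildren() and then decompose() to remove the unwanted elements. From the output of this code you can easily extract the data you need. It works even if the texts "PRESENT" and "Section Header" are absent. If needed, it can easily be modified to also remove the elements before the first strong tag.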

from bs4 import BeautifulSoup, element

html = """
<div><p> blah blah</p></div>
<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
<p> blah blah</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# Pull only the HTML from the article that I am interested in
notes = soup.find('div', attrs = {'id' : 'article'})
counter = 0
# Iterate over children.
for i in notes.findChildren():
    if i.name == "strong":
        counter += 1
        if counter == 2:
            i.parent.decompose()  # Remove the second Strong tag's parent.
    if counter > 1:  # Remove all tags after second Strong tag.
        if isinstance(i, element.Tag):
            i.decompose()
print(notes)

Output:

<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>


</div>
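To go from this trimmed div to the names themselves, one option is to locate the marker paragraph and read its following siblings. This is a sketch only: it assumes the "PRESENT:" paragraph exists (the trimming code above does not need it), and the markup below is a shortened copy of the printed output.

```python
from bs4 import BeautifulSoup

# Shortened copy of the trimmed <div> printed above.
trimmed = """
<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes</p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum</p>
<p>George Jungle, Savage</p>
<p>Roy Williams, Coach</p>
</div>
"""
notes = BeautifulSoup(trimmed, "html.parser").find("div", attrs={"id": "article"})

# Find the <p> whose only text is the marker, then take the <p> siblings after it.
marker = notes.find("p", string="PRESENT:")
name_paragraphs = marker.find_next_siblings("p") if marker else []

# One entry per text node, so <br/>-separated names are split apart.
names = [text for p in name_paragraphs for text in p.stripped_strings]
print(names)
```

Since everything after the second strong tag has already been removed, every p following the marker can be treated as part of the name list.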