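I am currently trying to get only the HTML text (a list of names) that sits between the first two occurrences of a strong tag.
Here is a short sample of the HTML I scraped: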
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
....
....
....
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
....
....
....
....
....
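Here is some quick code I wrote with the basic logic of counting occurrences of the strong tag. I know that once the second occurrence is reached, all the names I want have been collected.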
import requests
from bs4 import BeautifulSoup as BS

html = requests.get('https://www.somewebsite.com')
soup = BS(html.text, 'html.parser')
# Pull only the HTML from the article that I am interested in
notes = soup.find('div', attrs={'id': 'article'})

# Return True if a string contains a closing </strong> tag
def findstrong(i):
    return "</strong>" in i

# Initialize a counter for strong tags; after the second strong I know all the
# names I am interested in have been collected
strong_counts = 0
list_of_names = []
for i in range(len(notes)):
    if strong_counts < 2:
        note = notes.contents[i]
        # Convert note to a string so we can use the findstrong function
        note_2_str = str(note)
        if not findstrong(note_2_str):
            list_of_names.append(note)
        else:
            strong_counts += 1
The loop works and collects all the text before the first strong tag, as well as all the text up to the next occurrence of a strong tag, i.e.:
<h3>Title of Article</h3>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
....
....
....
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
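This basically does what I want, but because the result is now a list, I lose some of the functionality of a BeautifulSoup object. Is there a BeautifulSoup function that can help me do this, or some other option? Or should I focus on making this loop more efficient before scaling it up to multiple articles?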
Answer 0 (score: 2)
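This is based on the assumption that a marker string such as PRESENT: is always there to anchor on. It produces a list of names (taken from the p elements that follow the marker). Requires bs4 4.7.1+.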
from bs4 import BeautifulSoup as bs
html = '''
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
<p>Other<p/>'''
soup = bs(html, 'lxml')
select_html = soup.select('p:contains("PRESENT:") ~ p:not(p:contains("Section Header 2") ~ p, p:contains("Section Header 2"))')
l = [y for x in [i.text.split('\n') for i in select_html] for y in x]
print(l)
Answer 1 (score: 1)
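Going by the title, "Trying to get only the text between two strong tags" (if that really is what is needed), you can use the code below. We use the CSS level 4 :has() pseudo-class to test that an element contains a certain element, and the CSS level 4 :nth-child(x of s) pseudo-class to target a specific instance of a compound selector (in our case, the first and second p tags that contain a strong tag).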
from bs4 import BeautifulSoup
html = '''
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
....
....
....
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
....
....
....
....
....
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.select('p:nth-child(1 of :has(strong)) ~ *:has(~ p:nth-child(2 of :has(strong)))'))
Output:
[<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>, <p>PRESENT:</p>, <p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>, <p>George Jungle, Savage</p>, <p>William, Baller</p>, <p>Roy Williams, Coach</p>]
If we really only want the list of names, we change the selector so that collection starts after the paragraph containing PRESENT:, like so:
soup.select('p:contains("PRESENT:") ~ *:has(~ p:nth-child(2 of :has(strong)))')
Output:
[<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>, <p>George Jungle, Savage</p>, <p>William, Baller</p>, <p>Roy Williams, Coach</p>]
At this point, you can simply extract whatever you need.
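As a rough illustration of that last extraction step, here is a minimal sketch that turns selected p elements into a flat list of names. The short sample HTML and the variable names (`selected`, `names`) are assumptions made for this example; in practice `selected` would be the list returned by `soup.select(...)`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the <p> elements returned by soup.select(...)
html = """<p>John Smith, Farmer<br/>
William Dud, Bum</p>
<p>George Jungle, Savage</p>"""

soup = BeautifulSoup(html, "html.parser")
selected = soup.find_all("p")

# stripped_strings yields each text node with surrounding whitespace removed,
# so names separated by <br/> come out as separate items.
names = [text for p in selected for text in p.stripped_strings]
print(names)
```

Using stripped_strings avoids splitting on newlines by hand, which may matter if the line breaks in the source HTML are not consistent.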
Answer 2 (score: 1)
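This answers the question as asked, and leaves you the option of also scraping the "Title of Article" and the "footnotes". You can use findChildren() and then decompose() to remove the unwanted elements. From the output of this code you can easily extract the data you need. It works even if the text "PRESENT" and "Section Header" is not there. If needed, it can easily be modified to also remove the elements before the first "strong" tag.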
from bs4 import BeautifulSoup, element
html = """
<div><p> blah blah</p></div>
<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph with lots of text and footnotes</p>
<p> blah blah</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
# Pull only the HTML from the article that I am interested in
notes = soup.find('div', attrs = {'id' : 'article'})
counter = 0
# Iterate over children.
for i in notes.findChildren():
    if i.name == "strong":
        counter += 1
        if counter == 2:
            i.parent.decompose()  # Remove the second Strong tag's parent.
    if counter > 1:  # Remove all tags after second Strong tag.
        if isinstance(i, element.Tag):
            i.decompose()
print(notes)
Output:
<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>A paragraph with some information and footnotes<a href="#fn1" title="footnote 1"><sup>1</sup></a><a name="f1"></a></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum<br/>
Luke Brain, Terrible Singer<br/>
Charles Evans, Doctor<br/>
Stanley Fish, Fisher</p>
<p>George Jungle, Savage</p>
<p>William, Baller</p>
<p>Roy Williams, Coach</p>
</div>
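If you would rather not modify the parsed tree, a non-destructive variant is possible. The sketch below is only one way to do it: the helper name `between_first_two_strong` and the shortened sample HTML are assumptions made for this example, and it assumes the section headers are direct children of the article div (or nested inside one).

```python
from bs4 import BeautifulSoup

html = """<div id="article">
<h3>Title of Article</h3>
<p><strong>Section Header 1</strong></p>
<p>PRESENT:</p>
<p>John Smith, Farmer<br/>
William Dud, Bum</p>
<p>Roy Williams, Coach</p>
<p><strong>Section Header 2</strong><br/>
A second paragraph</p>
<p>blah blah</p>
</div>"""


def between_first_two_strong(container):
    """Hypothetical helper: return the child tags of `container` that lie
    between the first two children containing a <strong> tag."""
    first = container.find("strong")
    if first is None:
        return []
    # Climb from the <strong> tag up to the direct child of the container.
    start = first
    while start.parent is not container:
        start = start.parent
    collected = []
    # With no arguments, find_next_siblings() returns the following sibling tags.
    for sibling in start.find_next_siblings():
        if sibling.name == "strong" or sibling.find("strong") is not None:
            break  # Reached the second section header; stop collecting.
        collected.append(sibling)
    return collected


soup = BeautifulSoup(html, "html.parser")
article = soup.find("div", attrs={"id": "article"})
section = between_first_two_strong(article)
print([p.get_text() for p in section])
```

The items in `section` are still Tag objects, so methods such as find(), get_text() and stripped_strings remain available on each of them, and the original `article` tree is left untouched.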