Question

I have the following HTML:

<p>
<b>Father:</b> Michael Haughton
<br>
<b>Mother:</b> Diane
<br><b>Brother:</b> 
Rashad Haughton<br>
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>

I have to separate each label from its text, for example Mother: Diane ...

So in the end I would have a list of dictionaries:

[{"label":"Mother","value":"Diane"}]

My attempts so far have not worked.

Answer 1

from bs4 import BeautifulSoup
from urllib.request import urlopen

#html = '''<p>
#<b>Father:</b> Michael Haughton
#<br>
#<b>Mother:</b> Diane
#<br><b>Brother:</b> 
#Rashad Haughton<br>
#<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
#<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>'''

page = urlopen('http://www.nndb.com/people/742/000024670/')
source = page.read()

soup = BeautifulSoup(source)

needed_p = soup.find_all('p')[8]

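bs = needed_p.find_all('b')

res = {}

for b in bs:
    # Start with the text of the next <a> in the document, if there is one
    link = b.find_next('a')
    if link is not None and link.text:
        res[b.text] = link.text.strip()
    # Prefer the text node right after the label unless it is only whitespace
    sibling = b.next_sibling
    if isinstance(sibling, str) and sibling.strip():
        res[b.text] = sibling.strip()

res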

Output:

{'Brother:': 'Rashad Haughton',
 'Mother:': 'Diane',
 'Husband:': 'R. Kelly',
 'Father:': 'Michael Haughton',
 'Boyfriend:': 'Damon Dash'}
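
To try this without a network request, here is a minimal sketch that runs the same kind of loop against the snippet from the question. The function name parse_labels and the choice of the 'html.parser' backend are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

HTML = '''<p>
<b>Father:</b> Michael Haughton
<br>
<b>Mother:</b> Diane
<br><b>Brother:</b>
Rashad Haughton<br>
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>'''


def parse_labels(html):
    """Map each <b> label to the text or link that follows it (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    res = {}
    for b in soup.find_all('b'):
        # Fallback: the text of the next <a> anywhere after the label
        link = b.find_next('a')
        if link is not None and link.text:
            res[b.text] = link.text.strip()
        # A non-blank text node directly after the label takes priority
        sibling = b.next_sibling
        if isinstance(sibling, str) and sibling.strip():
            res[b.text] = sibling.strip()
    return res


print(parse_labels(HTML))
```

Note that the link fallback picks up the next link anywhere in the document, so a label with neither text nor its own link would inherit the value of a later entry.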

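Edit: to also collect the other details near the top of the page:

# ... (code above) ...
soup = BeautifulSoup(source)

# Explicitly select the <p> tags needed for further parsing
needed_p = soup.find_all('p')[1:4] + [soup.find_all('p')[8]]

res = {}

for p in needed_p:
    bs = p.find_all('b')
    for b in bs:
        # Start with the text of the next <a> in the document, if there is one
        link = b.find_next('a')
        if link is not None and link.text:
            res[b.text] = link.text.strip()
        # Prefer the text node right after the label unless it is only whitespace
        sibling = b.next_sibling
        if isinstance(sibling, str) and sibling.strip():
            res[b.text] = sibling.strip()

res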

Output:

{'Race or Ethnicity:': 'Black', 'Husband:': 'R. Kelly', 'Died:': '25-Aug', 'Nationality:': 'United States', 'Executive summary:': 'R&B singer, died in plane crash', 'Mother:': 'Diane', 'Birthplace:': 'Brooklyn, NY', 'Born:': '16-Jan', 'Boyfriend:': 'Damon Dash', 'Sexual orientation:': 'Straight', 'Occupation:': 'Singer', 'Cause of death:': 'Accident - Airplane', 'Brother:': 'Rashad Haughton', 'Remains:': 'Interred,', 'Gender:': 'Female', 'Father:': 'Michael Haughton', 'Location of death:': 'Marsh Harbour, Abaco Island, Bahamas'}

For this particular page, you can also scrape the high school like this:

res['High School'] = soup.find_all('p')[9].text.split(':')[1].strip()
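
One caveat with split(':')[1] is that it drops anything after a second colon and raises IndexError when there is no colon at all. A small sketch of a more forgiving variant using str.partition follows; the helper name value_after_label and the sample strings are made up for illustration and are not taken from the page.

```python
def value_after_label(text):
    """Return the text after the first colon, or None if there is no colon."""
    label, sep, value = text.partition(':')
    return value.strip() if sep else None


# Made-up sample strings, not real page content
print(value_after_label('High School: Example High (made-up value)'))
print(value_after_label('Note: a: b'))
print(value_after_label('no label here'))
```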

Answer 2

You are looking for the next_sibling attribute. It gives you either the next NavigableString or the next Tag, whichever comes first.

Here is how to use it:

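from bs4 import BeautifulSoup

html = """..."""
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')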

bTags = soup.find_all('b')
for it_tag in bTags:
    print(it_tag.string)
    print(it_tag.next_sibling)

Output:

Father:
 Michael Haughton

Mother:
 Diane

Brother:

Rashad Haughton
Husband:

Boyfriend:

This looks a bit off. Part of that is down to line breaks and whitespace, which you can easily get rid of with the str.strip method.
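
As a quick illustration of the stripping step, here is a sketch on a shortened copy of the snippet (only the Father and Brother entries, an abbreviation chosen for this example). The 'html.parser' backend is an assumption.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's HTML, plain-text entries only
html = '<p>\n<b>Father:</b> Michael Haughton\n<br>\n<b>Brother:</b> \nRashad Haughton<br></p>'
soup = BeautifulSoup(html, 'html.parser')

# strip() removes the leading space and the surrounding newlines
cleaned = [(b.string, b.next_sibling.strip()) for b in soup.find_all('b')]
print(cleaned)
```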

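Still, the Boyfriend and Husband entries are missing their values. This is because next_sibling is either a NavigableString (i.e. a str) or a Tag. Here, the whitespace between the <b> tag and the <a> tag is treated as a text node of its own: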

<b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a>
                 ^

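If it were absent, the next sibling of <b>Boyfriend:</b> would be the <a> tag. Since it is present, you have to check:

whether the next sibling is a string or a tag;
if it is a string, whether it contains only whitespace.

If the next sibling is a whitespace-only string, then the information you are looking for is in the next sibling of that NavigableString, which will be the <a> tag.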
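
The difference the single space makes can be checked directly. The following is a minimal sketch, assuming the 'html.parser' backend; the helper name describe_next_sibling is hypothetical.

```python
from bs4 import BeautifulSoup

with_space = BeautifulSoup(
    '<p><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a></p>',
    'html.parser')
without_space = BeautifulSoup(
    '<p><b>Boyfriend:</b><a href="/people/420/000109093/">Damon Dash</a></p>',
    'html.parser')


def describe_next_sibling(soup):
    """Return the class name and text of whatever follows the first <b>."""
    sibling = soup.b.next_sibling
    return type(sibling).__name__, str(sibling)


print(describe_next_sibling(with_space))     # a whitespace-only string
print(describe_next_sibling(without_space))  # the <a> tag itself

# With the space present, the <a> tag is one sibling further along
print(with_space.b.next_sibling.next_sibling.name)
```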

Edited code:

import bs4

bTags = soup.find_all('b')

for it_tag in bTags:
    print(it_tag.string)

    nextSibling = it_tag.next_sibling
    if nextSibling is not None:
        if isinstance(nextSibling, str):
            if nextSibling.isspace():
                # Whitespace-only text: the value is in the tag that follows it
                print(nextSibling.next_sibling.string.strip())
            else:
                print(nextSibling.strip())

        elif isinstance(nextSibling, bs4.Tag):
            print(nextSibling.string)

Output:

Father:
Michael Haughton
Mother:
Diane
Brother:
Rashad Haughton
Husband:
R. Kelly
Boyfriend:
Damon Dash

Now you can easily build the dictionary:

import bs4

entries = {}
bTags = soup.find_all('b')

for it_tag in bTags:
    # Use the label without its trailing colon as the key
    key = it_tag.string.replace(':', '')
    value = None

    nextSibling = it_tag.next_sibling
    if nextSibling is not None:
        if isinstance(nextSibling, str):
            if nextSibling.isspace():
                # Whitespace-only text: the value is in the tag that follows it
                value = nextSibling.next_sibling.string.strip()
            else:
                value = nextSibling.strip()

        elif isinstance(nextSibling, bs4.Tag):
            value = nextSibling.string

    entries[key] = value

Output dictionary:

{'Father': 'Michael Haughton',
 'Mother': 'Diane',
 'Brother': 'Rashad Haughton',
 'Husband': 'R. Kelly',
 'Boyfriend': 'Damon Dash'}
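
Since the question asks for a list of dictionaries with "label" and "value" keys, the same sibling check can be packaged to return that shape. This is a sketch under stated assumptions: the function name label_value_pairs is hypothetical, the 'html.parser' backend is assumed, and trailing notes such as "(m. 1994, annulled that same year)" are deliberately ignored, as in the answer above.

```python
from bs4 import BeautifulSoup, Tag

HTML = '''<p>
<b>Father:</b> Michael Haughton
<br>
<b>Mother:</b> Diane
<br><b>Brother:</b>
Rashad Haughton<br>
<b>Husband:</b> <a href="/people/540/000024468/">R. Kelly</a> (m. 1994, annulled that same year)
<br><b>Boyfriend:</b> <a href="/people/420/000109093/">Damon Dash</a> (Roc-a-Fella co-CEO)<br></p>'''


def label_value_pairs(html):
    """Return [{'label': ..., 'value': ...}, ...] for every <b> label (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    pairs = []
    for b_tag in soup.find_all('b'):
        sibling = b_tag.next_sibling
        # Skip a whitespace-only text node sitting between the label and a tag
        if isinstance(sibling, str) and sibling.isspace():
            sibling = sibling.next_sibling
        if isinstance(sibling, Tag):
            value = sibling.get_text(strip=True)
        elif sibling is not None:
            value = sibling.strip()
        else:
            value = None
        label = b_tag.get_text(strip=True).rstrip(':')
        pairs.append({'label': label, 'value': value})
    return pairs


for pair in label_value_pairs(HTML):
    print(pair)
```

A list keeps the entries in document order and allows repeated labels (two Brother entries, say), which a plain dictionary keyed by label would overwrite.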

Python Beautiful Soup: How to extract the text next to a tag?
