Question

<body>
  <p class="title">
    <b>
      The Dormouse's story
    </b>
  </p>
  <p class="story">
    ....
    <b>
      A tale
    </b>
  </p>  
</body>

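I need to get the tags of all direct children of <body>, but not the grandchildren. So in this case the output should be only <p class="title"> and <p class="story">.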

The closest I have found outputs the tags together with all of their children. How should I do this?

Answer 1

First, you can get all the child tags using find_all(recursive=False). Passing recursive=False gives you only the direct children of the tag. After that, all I do is format the data as a string.

I added a few attributes to the tags to show that it works in all cases.

from bs4 import BeautifulSoup

html = '''
<body>
  <p class="title" id="title">
    <b>
      The Dormouse's story
    </b>
  </p>
  <p class="story stories">
    ....
    <b>
      A tale
    </b>
  </p>  
</body>
'''

soup = BeautifulSoup(html, 'lxml')

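# recursive=False limits the search to the direct children of <body>
for tag in soup.body.find_all(recursive=False):
    # Multi-valued attributes such as class come back as lists; join them with spaces
    attributes = ' '.join('{}="{}"'.format(
        key,
        ' '.join(value) if isinstance(value, list) else value
    ) for key, value in tag.attrs.items())

    # Rebuild only the opening tag, without any of its children
    tag_string = '<{} {}>'.format(tag.name, attributes)
    print(tag_string)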

Output:

<p class="title" id="title">
<p class="story stories">

The reason I use ' '.join(value) if isinstance(value, list) else value instead of using value directly is that the value of the class attribute is returned as a list.
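
To illustrate the point about list-valued attributes, here is a minimal sketch. The one-line snippet and the choice of the built-in html.parser backend are assumptions made for illustration; they do not come from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only to show the attribute types
snippet = '<p class="story stories" id="intro">text</p>'
tag = BeautifulSoup(snippet, 'html.parser').p

# class is a multi-valued attribute, so Beautiful Soup returns a list
class_value = tag.attrs['class']
# id is single-valued, so it comes back as a plain string
id_value = tag.attrs['id']

print(class_value)  # ['story', 'stories']
print(id_value)     # intro
```

Joining a list with ' '.join(...) turns the class value back into the space-separated form used in the HTML source, while plain strings pass through unchanged.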

Answer 2

If all you want is to extract the class of each tag, this will do:

s = '''<body>
    <p class="title">
        <b>
        The Dormouse's story
        </b>
    </p>
    <p class="story">
        ....
            <b>
        A tale
            </b>
    </p>    
</body>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(s, 'html.parser')

for i in soup.find_all('p'):
    print(i.get('class'))

Output:

['title']
['story']

Alternatively, you can use a regular expression to return the whole opening tag:

import re

print(re.findall(r'(?:<p).*?(?:>)', str(soup)))

Output:

['<p class="title">', '<p class="story">']

Answer 3

I now have a messy, ugly answer to my own question. It looks something like this:

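from bs4 import BeautifulSoup

# a is the HTML string from the question
soup = BeautifulSoup(a, 'html5lib')
children = []
# .children also yields text nodes (such as whitespace between tags),
# so the indices used below depend on the exact markup
for child in soup.body.children:
    children.append(child)

# Keep everything up to and including the first '>' (the opening tag)
text = str(children[1])
x, y, z = text.partition('>')
a = x + y
print(a)

text = str(children[2])
x2, y2, z2 = text.partition('>')
a2 = x2 + y2
print(a2)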

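That solves my problem for now. It displays only

<p class="title">

and

<p class="story">

If anyone has a better or neater solution, I would appreciate it. Thanks, everyone :)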
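
One possible tidier variant of the same partition idea is sketched below. It is not the poster's code: the compact HTML string, the html.parser backend, and the name opening_tags are assumptions for illustration. It skips text nodes by checking for Tag instances instead of relying on fixed list indices.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Compact version of the question's HTML (assumed for this sketch)
html = '''<body>
  <p class="title"><b>The Dormouse's story</b></p>
  <p class="story">....<b>A tale</b></p>
</body>'''

soup = BeautifulSoup(html, 'html.parser')

opening_tags = []
for child in soup.body.children:
    # .children also yields whitespace text nodes; keep real tags only
    if isinstance(child, Tag):
        # Everything up to and including the first '>' is the opening tag
        head, sep, _ = str(child).partition('>')
        opening_tags.append(head + sep)

for line in opening_tags:
    print(line)
```

Note that splitting on the first '>' would cut too early if an attribute value itself contained a '>' character, so this stays a sketch for simple markup like the example.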

Answer 4

from bs4 import BeautifulSoup
import re

HTML='''<body>...'''

soup = BeautifulSoup(HTML, 'lxml').body
# The lambda matches any element whose markup contains '<', that is, any tag
child = soup.find_next(lambda x: re.search('<', str(x)))
print(child)
print(child.find_next_sibling(lambda x: re.search('<', str(x))))

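soup.find_next() finds the next element (only the next one). Because you want the next tag without knowing its name, the lambda searches for "<" and, if that returns true, the element is grabbed. find_next_sibling() then finds the tag's next sibling, that is, the next tag on the same level, which in this case is the one with class="story".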
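
The difference between the two calls can be seen in a small sketch. The compact HTML string and the html.parser backend are assumptions for illustration, and passing True as the filter (which matches any tag) is swapped in here for the regex lambda used in the answer.

```python
from bs4 import BeautifulSoup

# Compact version of the question's HTML (assumed for this sketch)
html = '''<body>
  <p class="title"><b>The Dormouse's story</b></p>
  <p class="story">....<b>A tale</b></p>
</body>'''

body = BeautifulSoup(html, 'html.parser').body

# find_next() walks the document in order, so from <body> it reaches the first <p>
first = body.find_next(True)
# From that <p>, find_next() descends into the nested <b>
nested = first.find_next(True)
# find_next_sibling() stays on the same level and reaches the second <p>
sibling = first.find_next_sibling(True)

print(first.name, first.get('class'))      # p ['title']
print(nested.name)                         # b
print(sibling.name, sibling.get('class'))  # p ['story']
```

This is why the answer switches to find_next_sibling() for the second element: calling find_next() again would return the nested <b> rather than the next direct child of <body>.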

Get only the direct elements in Beautiful Soup
