Extracting blocks of text between <p> tags, separated by <br>

Asked: 2019-07-08 21:23:37

Tags: python python-3.x beautifulsoup

I want to parse all of the text blocks (TITLE CONTENT, BODY CONTENT and EXTRA CONTENT) in the example below. As you may notice, these text blocks sit at different positions inside each p tag.

<p class="plans">
      <strong>
       TITLE CONTENT #1
      </strong>
      <br/>
      BODY CONTENT #1
      <br/>
      EXTRA CONTENT #1
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #2
       <br/>
      </strong>
      BODY CONTENT #2
      <br/>
      EXTRA CONTENT #2
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #3
      </strong>
      <br/>
      BODY CONTENT #3
      <br/>
      EXTRA CONTENT #3
</p>

I want to display the final result in a tabular format:

       Col1             Col2               Col3
TITLE CONTENT #1     BODY CONTENT #1     EXTRA CONTENT #1
TITLE CONTENT #2     BODY CONTENT #2     EXTRA CONTENT #2
TITLE CONTENT #3     BODY CONTENT #3     EXTRA CONTENT #3

I have tried:

for i in soup.find_all('p'):
    title = i.find('strong')
    if not isinstance(title.nextSibling, NavigableString):
        body = title.nextSibling.nextSibling
        extra = body.nextSibling.nextSibling
    else:
        if len(title.nextSibling) > 3:
            body = title.nextSibling
            extra = body.nextSibling.nextSibling
        else:
            body = title.nextSibling.nextSibling.nextSibling
            extra = body.nextSibling.nextSibling

but it doesn't look very efficient, and I wonder whether anyone has a better solution.
Any help would be greatly appreciated!

Thanks!

3 answers:

Answer 0 (score: 1)

It's worth noting that .next_sibling can work as well, but since you may need to gather multiple text nodes, you'd have to employ some logic to know how many times to call it. In this example, I found it easier to just walk through the descendants, watching for the important features that signal I should do something different.

You just need to break down the features you are scraping on. In this simple case we know:

  1. When we see a strong element, we want to capture the "title".
  2. When we see the first br element, we want to start capturing the "body".
  3. When we see the second br element, we want to start capturing the "extra content".

So we can:

  1. Target the plans class to get all the plans.
  2. Iterate through all the descendant nodes of each plan.
  3. If we see a tag, check whether it matches one of the criteria above, and get ready to capture text nodes into the correct container.
  4. If we see a text node and a container is ready, store the text.
  5. Strip unnecessary leading and trailing whitespace and store the plan's data.

from bs4 import BeautifulSoup as bs
from bs4 import Tag, NavigableString

html = """
<p class="plans">
      <strong>
       TITLE CONTENT #1
      </strong>
      <br/>
      BODY CONTENT #1
      <br/>
      EXTRA CONTENT #1
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #2
       <br/>
      </strong>
      BODY CONTENT #2
      <br/>
      EXTRA CONTENT #2
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #3
      </strong>
      <br/>
      BODY CONTENT #3
      <br/>
      EXTRA CONTENT #3
</p>
"""

soup = bs(html, 'html.parser')

content = []

# Iterate through all the plans
for plans in soup.select('.plans'):
    # Lists that will hold the text nodes of interest
    title = []
    body = []
    extra = []

    current = None  # Reference to  one of the above lists to store data
    br = 0  # Count number of br tags

    # Iterate through all the descendant nodes of a plan
    for node in plans.descendants:
        # See if the node is a Tag/Element
        if isinstance(node, Tag):
            if node.name == 'strong':
                # Strong tags/elements contain our title
                # So set the current container for text to the title list
                current = title
            elif node.name == 'br':
                # We've found a br Tag/Element
                br += 1
                if br == 1:
                    # If this is the first, we need to set the current
                    # container for text to the body list
                    current = body
                elif br == 2:
                    # If this is the second, we need to set the current
                    # container for text to the extra list
                    current = extra
        elif isinstance(node, NavigableString) and current is not None:
            # We've found a navigable string (not a tag/element), so let's
            # store the text node in the current list container.
            # NOTE: You may have to filter out things like HTML comments in a real world example.
            current.append(node)

    # Store the captured title, body, and extra text for the current plan.
    # For each list, join the text into one string and strip leading and trailing whitespace
    # from each entry in the row.
    content.append([''.join(entry).strip() for entry in (title, body, extra)])

print(content)

You can then print the data however you want, but the point is that it is now captured in a nice, logical structure, as shown below:

[['TITLE CONTENT #1', 'BODY CONTENT #1', 'EXTRA CONTENT #1'], ['TITLE CONTENT #2', 'BODY CONTENT #2', 'EXTRA CONTENT #2'], ['TITLE CONTENT #3', 'BODY CONTENT #3', 'EXTRA CONTENT #3']]

There are multiple ways to approach this; this is just one.
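As one illustration, here is a minimal sketch (hypothetical, reusing the content rows shown above) of rendering the captured rows as aligned columns, padding each column to the width of its longest entry:

```python
content = [
    ['TITLE CONTENT #1', 'BODY CONTENT #1', 'EXTRA CONTENT #1'],
    ['TITLE CONTENT #2', 'BODY CONTENT #2', 'EXTRA CONTENT #2'],
    ['TITLE CONTENT #3', 'BODY CONTENT #3', 'EXTRA CONTENT #3'],
]
header = ['Col1', 'Col2', 'Col3']

# Each column's width is the longest entry in that column (header included).
widths = [max(len(row[i]) for row in [header] + content) for i in range(3)]
for row in [header] + content:
    print('  '.join(cell.ljust(w) for cell, w in zip(row, widths)))
```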

Answer 1 (score: 0)

Another way, using slicing, assuming the extracted strings always come in uniform groups of three:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("test.html"), "html.parser")

def slicing(l):
    new_list = []
    for i in range(0, len(l), 3):
        new_list.append(l[i:i+3])
    return new_list

result = slicing(list(soup.stripped_strings))
print(result)

Output:

[['TITLE CONTENT #1', 'BODY CONTENT #1', 'EXTRA CONTENT #1'], ['TITLE CONTENT #2', 'BODY CONTENT #2', 'EXTRA CONTENT #2'], ['TITLE CONTENT #3', 'BODY CONTENT #3', 'EXTRA CONTENT #3']]
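To see the chunking behaviour in isolation, here is a small sketch where a made-up flat list (the names are illustrative) stands in for list(soup.stripped_strings):

```python
def slicing(l):
    # Group a flat list into consecutive chunks of three.
    new_list = []
    for i in range(0, len(l), 3):
        new_list.append(l[i:i+3])
    return new_list

# Made-up flat input standing in for list(soup.stripped_strings):
flat = ['title 1', 'body 1', 'extra 1', 'title 2', 'body 2', 'extra 2']
print(slicing(flat))  # → [['title 1', 'body 1', 'extra 1'], ['title 2', 'body 2', 'extra 2']]
```

Note that this relies on every plan contributing exactly three strings; a plan with a missing or extra text node would shift every subsequent row.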

Answer 2 (score: 0)

In this case you can use BeautifulSoup's get_text() method with its separator= parameter:

data = '''<p class="plans">
      <strong>
       TITLE CONTENT #1
      </strong>
      <br/>
      BODY CONTENT #1
      <br/>
      EXTRA CONTENT #1
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #2
       <br/>
      </strong>
      BODY CONTENT #2
      <br/>
      EXTRA CONTENT #2
</p>

<p class="plans">
      <strong>
       TITLE CONTENT #3
      </strong>
      <br/>
      BODY CONTENT #3
      <br/>
      EXTRA CONTENT #3
</p>'''


from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

print('{: ^25}{: ^25}{: ^25}'.format('Col1', 'Col2', 'Col3'))
for p in [[i.strip() for i in p.get_text(separator='|').split('|') if i.strip()] for p in soup.select('p.plans')]:
    print(''.join('{: ^25}'.format(i) for i in p))

This prints:

      Col1                     Col2                     Col3           
TITLE CONTENT #1          BODY CONTENT #1         EXTRA CONTENT #1     
TITLE CONTENT #2          BODY CONTENT #2         EXTRA CONTENT #2     
TITLE CONTENT #3          BODY CONTENT #3         EXTRA CONTENT #3
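The column alignment above comes from the format spec {: ^25}, which centers a value in a 25-character field padded with spaces; a minimal sketch of just that piece:

```python
# '{: ^25}' means: fill with ' ', center ('^'), in a field 25 characters wide.
cell = '{: ^25}'.format('Col1')
print(repr(cell))
```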