Extracting a specific section of text between tags from HTML

Time: 2019-05-08 18:09:26

Tags: python python-3.x beautifulsoup

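I want to extract the text of a specific section ("Item 1A") from an HTML file. The text should start at the "Item 1A" heading in the body of the document, not at the entry in the table of contents, and stop at "Item 1B". However, the strings "Item 1A" and "Item 1B" each appear several times in the file. How do I pin down the specific occurrences to start and stop at?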


The output captures text starting from the first "Item 1A" in the table of contents rather than from the heading of the section itself.

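So I would like to know:

  1. How to capture the text of "Item 1A" starting from the section in the body rather than from the table of contents.

  2. Why it captured up to the last "Item 1B" instead of stopping at the "Item 1B" in the table of contents.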

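1 answer:

Answer 0 (score: 1)

Since you already have soup to help you work with the structure of the HTML, why not take advantage of it?

One way to phrase the problem is "find the text between two tags that have specific properties" (the tags that represent the 1A and 1B headings). To do that, you can pass a callable (a function) to soup.find():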
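As a minimal, self-contained sketch of what passing a callable to soup.find() does: the HTML snippet, the variable names, and the function is_styled_heading below are made up for illustration and are not taken from the filing. find() calls the function on each tag in document order and returns the first tag for which it returns a truthy value, so the table-of-contents link is skipped.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical miniature document: a table-of-contents link followed by
# the real section heading, which is a <p> with a style attribute.
html = (
    '<a href="#i1a">Item 1A. Risk Factors</a>'
    '<p style="font-weight: bold">Item 1A. Risk Factors</p>'
    '<p>Body text.</p>'
)
demo_soup = BeautifulSoup(html, "html.parser")

def is_styled_heading(tag: Tag) -> bool:
    # True only for <p style="..."> tags whose text starts with "Item 1A."
    return (
        tag.name == "p"
        and tag.has_attr("style")
        and tag.get_text().startswith("Item 1A.")
    )

# The <a> in the table of contents fails the filter; the styled <p> passes.
heading = demo_soup.find(is_styled_heading)
print(heading.name, heading["style"])
```

Applied to the actual filing, the same idea looks like this: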

import requests
from bs4 import BeautifulSoup
from bs4.element import Tag
import re

url = "https://www.sec.gov/Archives/edgar/data/1606163/000114420416089184/v434424_10k.htm"
res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")

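def is_pstyle(tag: Tag) -> bool:
    # Siblings can be plain strings, so check for a real <p> tag with a style attribute
    return isinstance(tag, Tag) and tag.name == "p" and tag.has_attr("style")

def is_i1a(tag: Tag) -> bool:
    # re.match returns a match object or None, so convert it to bool
    return is_pstyle(tag) and bool(re.match(r"Item 1A\..*", tag.text))

def is_i1b(tag: Tag) -> bool:
    return is_pstyle(tag) and bool(re.match(r"Item 1B\..*", tag.text))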

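def grab_1a_thru_1b(soup: BeautifulSoup) -> str:
    # First <p style=...> whose text starts with "Item 1A."
    start = soup.find(is_i1a)
    if start is None:
        # No matching heading was found, so there is nothing to extract
        return ""
    def gen_t():
        # Walk the siblings that follow the 1A heading until the 1B heading
        for tag in start.next_siblings:
            if is_i1b(tag):
                break
            else:
                if hasattr(tag, "get_text"):
                    yield tag.get_text()  # or get_text("\n") to separate pieces with newlines
                else:
                    yield str(tag)
    return "".join(gen_t())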

if __name__ == "__main__":
    print(grab_1a_thru_1b(soup))

First part of the output:

The risks and uncertainties described below
are those specific to the Company which we currently believe have the potential to be material, but they may not be the only ones
we face. If any of the following risks, or any other risks and uncertainties that we have not yet identified or that we currently
consider not to be material, actually occur or become material risks, our business, prospects, financial condition, results of
operations and cash flows could be materially and adversely affected. Investors are advised to consider these factors along with
the other information included in this Annual Report and to review any additional risks discussed in our filings with the SEC.
 
Risks Associated with Our Business
 
We are a newly formed company with no operating history and, accordingly, you have no basis on which to evaluate our ability to achieve our business
objective.

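You can think of the small functions is_pstyle, is_i1a, and is_i1b as "filters": they are just different ways of pinpointing the start and end tags. You then iterate over the sibling nodes between those two tags. (.get_text() works recursively inside each sibling tag.)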
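To see the sibling walk and the recursive behavior of .get_text() without downloading the filing, here is a small sketch on a hand-written snippet. The HTML and the helper name starts_with are hypothetical and exist only for this illustration; the structure of the real filing may differ.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet: two styled headings with mixed content between them.
html = (
    '<p style="font-weight:bold">Item 1A. Risk Factors</p>'
    '<p>First <b>bold</b> risk.</p>'
    ' loose text '
    '<div><p>Nested paragraph.</p></div>'
    '<p style="font-weight:bold">Item 1B. Unresolved Staff Comments</p>'
    '<p>Not wanted.</p>'
)
demo = BeautifulSoup(html, "html.parser")

def starts_with(node, prefix: str) -> bool:
    # Filter: a styled <p> whose text begins with the given prefix.
    # Plain strings between tags are not Tag instances and are rejected.
    return (
        isinstance(node, Tag)
        and node.name == "p"
        and node.has_attr("style")
        and node.get_text().startswith(prefix)
    )

start = demo.find(lambda t: starts_with(t, "Item 1A."))

parts = []
for node in start.next_siblings:
    if starts_with(node, "Item 1B."):
        break  # reached the end marker, stop collecting
    # get_text() descends into nested tags; bare strings are kept as they are
    parts.append(node.get_text() if isinstance(node, Tag) else str(node))

section = "".join(parts)
print(section)
```

The text inside the nested <b> and <div> elements is collected, while everything from the second heading onward is left out.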