Getting tags and text with BeautifulSoup

Time: 2021-04-29 07:53:55

Tags: python beautifulsoup tags screen-scraping

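I have been trying this for a while now and I am stuck. My website has the following structure (unfortunately I only have a screenshot; for some reason I can't copy and paste the code...)

Edit: sorry, of course, here is one of the URLs: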


https://www.energy.gov/eere/buildings/downloads/new-iglu-high-efficiency-vacuum-insulated-panel-modular-building-system


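I found the div with class="field field etc.... I want to store everything inside 'strong' or "h4" as dataframe column names (I have that part) together with the corresponding text. I partially succeeded: I only lose the content of the second <p> tag under "Project Objective", and I completely lose the text between "Partners" and the <br> tags. This is what I did: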

content = soup.find_all('div', class_='field field--text_default field--body')

# For the headers:
headers = content[0].find_all(["strong","h4"])
col_names = []
for header in headers:
    col_names.append(header.text)

# and for the content:
con = []
divs = content[0].find_all(["strong", "h4"])
for el in divs:
    # next_sibling returns only the single node right after the header
    con.append(el.next_sibling)
con = [el.string for el in con if el is not None]

2 Answers:

Answer 0 (score: 1)

Following furas and working with children, I found that the following is a partial solution:

import bs4

headers, inhalt = [], []
tags = content[0].find_all(["p", "h4"])
for tag in tags:
    for child in tag.children:
        if isinstance(child, bs4.element.Tag):
            if child.name == "strong":
                headers.append(child.get_text().strip(": "))
        elif isinstance(child, bs4.element.NavigableString):
            # section titles that are plain text nodes rather than <strong> tags
            if child in ("Project Objective", "Project Impact", "Contacts"):
                headers.append(child)
            else:
                inhalt.append(child)

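Unfortunately, I still have to put three children under one header in one case and two children under one header in another. The three always start with "--", so that shouldn't be too hard, but how do I select the two separate <p> tags that belong in a single cell?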
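One possible way to group several nodes under a single header is to walk `next_siblings` from the header until the next header appears. The sketch below runs on a small made-up HTML snippet that only imitates the page's structure; the helper name `text_after` and the snippet are assumptions for illustration, not code from the question.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet that only imitates the structure described above.
html = (
    "<div><p><strong>Partners:</strong><br>"
    "-- Partner A<br>-- Partner B</p>"
    "<h4>Project Objective</h4>"
    "<p>First paragraph.</p><p>Second paragraph.</p></div>"
)
soup = BeautifulSoup(html, "html.parser")


def text_after(header):
    """Collect the text of every sibling after `header`, up to the next header."""
    parts = []
    for sib in header.next_siblings:
        if isinstance(sib, Tag) and sib.name in ("strong", "h4"):
            break  # the next header starts a new cell
        text = sib.get_text() if isinstance(sib, Tag) else str(sib)
        text = text.strip()
        if text:  # skip <br> tags and whitespace-only strings
            parts.append(text)
    return parts


strong = soup.find("strong")
h4 = soup.find("h4")

partners = text_after(strong)   # the "--" entries, one per list item
objective = text_after(h4)      # both <p> paragraphs under the <h4>
```

In this snippet, `strong.next_sibling` is the `<br>` tag, whose `.string` is `None`, which would explain why the text after "Partners" went missing when only the first sibling was taken. This only looks at siblings on the same level, so a real page with differently nested headers may need extra handling.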

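Answer 1 (score: 1)

This is a modification of @Sebastian's version.

I put everything into one list, data, as (header, text) pairs, but I don't add the pairs to this list directly.

When I find a header, I keep it in a separate variable header. When I find text, I keep it in a separate list text. Only when I find the next header do I add the previous header, text to data. At the end I have to add the last header, text to data. I also use header = None to recognize whether I have found the first header yet, so that I don't add an empty header, text pair.

Because I keep all text as a list, I can decide later whether to display it on one line or on separate lines (for example the -- entries in Partners).

I also added code for <a> to get the email address. I also wanted to add code for <br>.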
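As a rough idea of what the `<br>` handling could look like, one option is to replace every `<br>` with a newline before reading the text. This is only a sketch on a made-up snippet (the HTML string and the variable names are assumptions), and it works on a whole `<p>` rather than child by child as the script does.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the real page may be structured differently.
html = "<p><strong>Partners:</strong><br>-- Partner A<br>-- Partner B</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# Turn every <br> into a newline string so line breaks survive get_text().
for br in p.find_all("br"):
    br.replace_with("\n")

# Take the header out of the paragraph, then split the remaining text by line.
header_text = p.find("strong").extract().get_text().strip(": ")
lines = [line.strip() for line in p.get_text().split("\n") if line.strip()]
```

Here `lines` keeps one entry per `<br>`-separated line, so it can be joined with `'\n'` or `' '` later, the same way the text list is used in the script.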

import requests
import bs4
from bs4 import BeautifulSoup as BS

url = 'https://www.energy.gov/eere/buildings/downloads/new-iglu-high-efficiency-vacuum-insulated-panel-modular-building-system'

r = requests.get(url)

soup = BS(r.text, 'html.parser')

content = soup.find_all('div', class_='field field--text_default field--body')
#print(content)

data = []   # list for pairs `(header, text)`

header = None  # last found `header`
text = []      # all text found after last `header`


all_tags = content[0].find_all(["p","h4"])

for tag in all_tags:

    for child in tag.children:
        if isinstance(child, bs4.element.Tag):
            if child.name == "strong":
                # store the previous `header + text`
                if header is not None:  # skip before the first header
                    data.append( [header, text] )

                # remember new `header` and make place for new text
                header = child.get_text().strip(": ")
                text = []

            #if child.name == "br":
            #    text.append('\n')

            if child.name == "a":
                text.append(child.get_text().strip())

        if isinstance(child, bs4.element.NavigableString):
            if child in ("Project Objective", "Project Impact", "Contacts"):
                # store the previous `header + text`
                if header is not None:  # skip before the first header
                    data.append( [header, text] )

                # remember new `header` and make place for new text
                header = child.strip()
                text = []
            else:
                # remember `text`
                text.append(child.strip())

# add last `header + text`
if header is not None:  # skip if no header was found at all
    data.append( [header, text] )

# --- display ---

print('len(data):', len(data), '\n')

for header, text in data:
    print('header:', header)
    print('--- text ---')
    #print(' '.join(text).strip('\n'))
    if header == 'Partners':
        print('\n'.join(text))
    else:        
        print(' '.join(text))
    print('====================================')

Result:

Only the Contacts header is empty, because its elements end up under the headers DOE Technology Manager and Lead Performer.

len(data): 11 

header: Lead Performer
--- text ---
Cold Climate Housing Research Center – Fairbanks, AK
====================================
header: Partners
--- text ---
-- Panasonic Corp. – Newark, NJ
-- Taġiuġmiullu Nunamiullu Housing Authority – Utqiagvik, AK
-- National Renewable Energy Laboratory, Golden, CO
====================================
header: DOE Total Funding
--- text ---
$375,161
====================================
header: Cost Share
--- text ---
$95,293
====================================
header: Project Term
--- text ---
July 2020 – May 2022
====================================
header: Funding Type
--- text ---
Advanced Building Construction FOA Award
====================================
header: Project Objective
--- text ---
Vacuum insulated panels (VIPs) are poised to transform the building industry by making homes more energy efficient with little additional upfront cost. However, they are currently uncommon due to their inherent fragility. As the R-value relies on the vacuum inside the panel, any damage to the panel negates the insulation value of the system. With today’s residential construction methods and fastener technology, it is nearly impossible to avoid damaging panels during assembly or over the life of the home. These issues make VIPs incompatible with current construction techniques. To overcome these issues and capitalize on the high R-value of VIPs, the project team will develop a new building system with durable assemblies that can perform in Arctic conditions. The long-term plan is to make the system a mass-market building platform that can address the need for affordable, high-efficiency housing across the nation. This starts with a proof of concept that will be built and tested at the Cold Climate Housing Research Center in Fairbanks, Alaska. Developing this concept in the country’s only Arctic state, which has the coldest temperatures and highest energy costs in the U.S., will ensure its durability and performance in other climates.
====================================
header: Project Impact
--- text ---
The energy-savings payback of this system is estimated to be eight years with applicability and potential benefit in every U.S. climate zone. For remote regions such as central Alaska, the payback would be even shorter as the cost of energy exceeds the assumed retail energy cost. Considering the building envelope alone, this system is expected to achieve a reduction in heating/cooling energy of at least 48% and an annual savings of 1,637 TBtu if implemented nationwide.
====================================
header: Contacts
--- text ---

====================================
header: DOE Technology Manager
--- text ---
Marc LaFrance, Marc.Lafrance@ee.doe.gov 
====================================
header: Lead Performer
--- text ---
Bruno Grunau, Cold Climate Housing Research Center
====================================
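Since the goal in the question was a dataframe with the headers as column names, the (header, text) pairs could be flattened into one row. The sketch below uses only the standard library. The helper `to_row` and its `multiline` parameter are hypothetical names, and the sample `data` is shortened from the output above. Note that Lead Performer occurs twice, so the second occurrence gets a numbered key here (one possible convention, not the only one).

```python
def to_row(data, multiline=("Partners",)):
    """Flatten [header, text_list] pairs into a dict with unique keys."""
    row = {}
    for header, text in data:
        # Headers listed in `multiline` keep one entry per line.
        sep = "\n" if header in multiline else " "
        value = sep.join(part for part in text if part)

        # Avoid overwriting a repeated header such as "Lead Performer".
        key, n = header, 2
        while key in row:
            key = f"{header} ({n})"
            n += 1
        row[key] = value
    return row


# Sample pairs shortened from the result shown above.
data = [
    ["Lead Performer", ["Cold Climate Housing Research Center – Fairbanks, AK"]],
    ["Partners", ["", "-- Panasonic Corp. – Newark, NJ",
                  "-- National Renewable Energy Laboratory, Golden, CO"]],
    ["Contacts", [""]],
    ["Lead Performer", ["Bruno Grunau, Cold Climate Housing Research Center"]],
]

row = to_row(data)
for key, value in row.items():
    print(repr(key), "->", repr(value))
```

If pandas is installed, passing `[row]` to `pandas.DataFrame` should give a one-row frame with these keys as columns; collecting one such row per URL would then build up the full table.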