Extracting the elements under specific tags with Beautiful Soup

Time: 2020-03-23 14:11:26

Tags: python html parsing web-scraping beautifulsoup

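I want to extract the elements under specific tags. For example, a site has four <h2> tags. Each <h2> has other sibling tags such as p, h3, h4, ul, etc. I want to see the elements under h2[1] and the elements under h2[2] separately.

Here is what I have done so far. I know the for loop doesn't make any sense. I also tried appending the text, but couldn't get it to work. Then I tried searching by a specific string, but that returned only the tag containing that string, not all the other elements.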

import requests
from bs4 import BeautifulSoup

page = "https://www.us-cert.gov/ics/advisories/icsma-20-079-01"
resp = requests.get(page)
soup = BeautifulSoup(resp.content, "html5lib")
content_div = soup.find('div', {"class": "content"})
all_p = content_div.find_all('p')
all_h2 = content_div.find_all('h2')
# Pairs the i-th <h2> with the i-th <p>, which does not group content by heading
i = 0
for h2 in all_h2:
    print(all_h2[i], '\n\n')
    print(all_p[i], '\n')
    i = i + 1

I also tried using append:

tags = soup.find_all('div', {"class": "content"})
container = []
for tag in tags:
    try:
        container.append(tag.text)
        print(tag.text)
    except Exception:
        print(tag)
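I am new to programming, so please excuse my poor coding skills. What I want is to see everything under "Mitigations" together, so that if I store it in a database, all of the mitigation-related information ends up in a single column.

1 answer:

Answer 0: (score: 0)

You can use findNext with recursive=False to look for a static list of tags, ["p","ul","h2","div"], so that the search stays at the top level: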

import requests
from bs4 import BeautifulSoup
import json

resp = requests.get("https://www.us-cert.gov/ics/advisories/icsma-20-079-01")
soup = BeautifulSoup(resp.content, "html.parser")

content_div = soup.find('div', {"class": "content"})

h2_list = content_div.find_all("h2")
result = []
search_tags = ["p", "ul", "h2", "div"]

def getChildren(tag):
    """Collect the text that follows a heading until the next <h2> or <div>."""
    text = []
    while tag:
        # Move to the next element whose name is in search_tags
        tag = tag.findNext(search_tags, recursive=False)
        if tag is None:
            break
        elif tag.name in ("div", "h2"):
            # The next section (or a wrapper div) starts here
            break
        else:
            text.append(tag.text.strip())
    return "".join(text)

for i in h2_list:
    result.append({
        "name": i.text.strip(),
        "children": getChildren(i)
    })

print(json.dumps(result, indent=4, sort_keys=True))
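
As a possible variation, the same grouping can be sketched with a sibling walk. findNext moves through the document in parse order, so a <p> nested inside a <ul> can be visited again after the <ul> itself. find_next_siblings only returns tags at the same nesting level as the heading. The markup below is a made-up stand-in for the advisory page, and sections_by_heading is a hypothetical helper name, not something from the question or from Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real advisory page.
SAMPLE_HTML = """
<div class="content">
  <h2>1. EXECUTIVE SUMMARY</h2>
  <ul><li>CVSS v3 7.5</li><li>Vendor: Example</li></ul>
  <h2>4. MITIGATIONS</h2>
  <p>Apply the vendor update.</p>
  <ul><li>Restrict network access.</li><li><p>Use a VPN.</p></li></ul>
  <div class="footer">Contact</div>
</div>
"""


def sections_by_heading(container, heading="h2"):
    """Map each heading's text to the text of the sibling tags that follow it."""
    sections = {}
    for h in container.find_all(heading):
        parts = []
        # find_next_siblings() yields only tags at the heading's own level.
        for sib in h.find_next_siblings():
            if sib.name in (heading, "div"):
                break  # the next section (or a wrapper div) starts here
            parts.append(sib.get_text(" ", strip=True))
        sections[h.get_text(strip=True)] = "\n".join(parts)
    return sections


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sections = sections_by_heading(soup.find("div", {"class": "content"}))

# Pick the one section whose heading mentions "Mitigations".
mitigations = next(
    text for name, text in sections.items() if "MITIGATIONS" in name.upper()
)
print(mitigations)
```

Each value in sections is a single string, so the "Mitigations" entry could be written to one database column as it is. Joining with a newline rather than an empty string is a choice made here to keep paragraphs and list blocks apart.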