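My output should be a separate total for each heading tag used on the page ("H1" through "H6"), plus totals for paragraphs, images, and links.
The numbers I get are wrong: the H tags are not found at all, and the counter prints 1 for every heading tag. How do I count the correct number of HTML tags?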
import re
from bs4 import BeautifulSoup
import requests
from collections import Counter
from string import punctuation
#main program
link_url = input("Please Enter the website address ")
#retrieves url for parsing
r = requests.get(link_url)
b_soup = BeautifulSoup(r.content, features="html.parser")
#Searching/parsing for various sized header content
headerH1 = headH2 = headerH3 = headerH4 = headerH5 = headerH6 = 0
for header_tags in b_soup.findAll():
    if(header_tags.name == "H1" or header_tags.name == "<H1>"):
        headerH1 = headerH1+1
    if(header_tags.name == "H2" or header_tags.name == "<H2 >"):
        headH2 = headH2+1
    if(header_tags.name == "H3" or header_tags.name == "<H3 >"):
        headerH3 = headerH3+1
    if(header_tags.name == "H4" or header_tags.name == "<H4 >"):
        headerH4 = headerH4+1
    if(header_tags.name == "H5" or header_tags.name == "<H5 >"):
        headerH5 = headerH5+1
    if(header_tags.name == "H6" or header_tags.name == "<H6 >"):
        headerH6 = headerH6+1
print("Total Headings in H1: ", headerH1)
print("Total Headings in H2: ", headH2)
print("Total Headings in H3: ", headerH3)
print("Total HeadingS in H4: ", headerH4)
print("Total Headings in H4: ", headerH5)
print("Total Headings in H5: ", headerH6)
count = 0
#counting number of paragraphs
for header_tags in b_soup.findAll():
    if(header_tags.name == 'p' or header_tags.name == '<p>'):
        count = count+1
print("Paragraphs: ", count)
#counting image total
for img in b_soup.findAll():
    if(img.name == 'img'):
        count = count+1
print("Images: ", count)
count = 0
#counting number of links
for link in b_soup.find_all('a', href=True):
    count = count+1
print("Links: ", count)
My output:
Total Headings in H1: 1
Total Headings in H2: 1
Total Headings in H3: 1
Total HeadingS in H4: 1
Total Headings in H4: 1
Total Headings in H5: 1
Paragraphs: 23
Images: 33
Links: 70
The correct output for the website I used should actually look more like this:
Number of H1 Headings: 9
Number of images on this page: 10
You don't need the website I used; any link will do for testing.
Answer 0 (score: 3)
Here is an example that counts the number of <h1> tags in a piece of HTML:
from bs4 import BeautifulSoup
html = "<h1>first</h1><h1>second</h1><h2>third</h2>"
soup = BeautifulSoup(html, 'html.parser')
h1s = soup.find_all('h1')
h1_count = len(h1s) # Gets the number of <h1> tags
In this example, h1_count is 2.
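This also hints at why the comparisons against "H1" in the question never match: with html.parser, BeautifulSoup reports tag names in lower case and without angle brackets, whatever the casing in the source markup. A minimal sketch of that behaviour, using a made-up HTML string purely for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with mixed-case tags (not from the question's site).
html = "<H1>First</H1><h1>Second</h1><H2>Third</H2>"
soup = BeautifulSoup(html, "html.parser")

# find_all(True) returns every tag; .name is the normalised, lower-case tag name.
names = [tag.name for tag in soup.find_all(True)]
print(names)  # ['h1', 'h1', 'h2']

# Comparing against the upper-case name finds nothing...
upper_matches = sum(1 for tag in soup.find_all(True) if tag.name == "H1")
# ...while asking for the lower-case name finds both headings.
lower_matches = len(soup.find_all("h1"))
print(upper_matches, lower_matches)  # 0 2
```

So a check such as header_tags.name == "h1" should match where "H1" or "<H1>" does not.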
You can do the same for other tag types by replacing h1 in find_all('h1'):
h2s = soup.find_all('h2')
h3s = soup.find_all('h3')
...
h2_count = len(h2s)
h3_count = len(h3s)
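Putting this together for everything the question asks for, one possible sketch is below. The helper name count_tags and the sample HTML are assumptions made for illustration; with a real page you would pass in the soup built from r.content instead. Each total is computed independently, which avoids a shared counter carrying over from one category to the next (in the question, count is not reset between paragraphs and images, which would explain 33 images being reported where 10 were expected).

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical sample page, standing in for the downloaded content.
html = """
<h1>Title</h1>
<h2>Section</h2><h2>Another</h2>
<p>Text with a <a href="https://example.com">link</a>.</p>
<p>More text.</p>
<img src="a.png"><img src="b.png">
<a name="anchor-without-href">no href</a>
"""
soup = BeautifulSoup(html, "html.parser")


def count_tags(soup):
    """Return separate totals for h1-h6, p, img and a (with href) tags."""
    # Tally every tag name once; names come back in lower case.
    counts = Counter(tag.name for tag in soup.find_all(True))
    result = {f"h{level}": counts[f"h{level}"] for level in range(1, 7)}
    result["p"] = counts["p"]
    result["img"] = counts["img"]
    # Only count anchors that actually carry an href, as in the question.
    result["a"] = len(soup.find_all("a", href=True))
    return result


totals = count_tags(soup)
for name, total in totals.items():
    print(f"{name}: {total}")
```

For the sample string above this prints 1 for h1, 2 for h2, 0 for h3 to h6, 2 for p, 2 for img, and 1 for a.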
Hope this helps.