Counting the number of HTML tags with web scraping

Time: 2019-11-11 17:48:05

Tags: python html python-3.x parsing beautifulsoup

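My output should be a separate total for each heading tag used on the page (the "H1" through "H6" heading tags), as well as for paragraphs, images, and links.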

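The numbers in my output are wrong: the code does not find the H tags at all, and the counter prints 1 for each heading tag. How do I count the correct number of HTML tags?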

import re
from bs4 import BeautifulSoup
import requests
from collections import Counter
from string import punctuation



#main program



    link_url = input("Please Enter the website address ")
#retrieves url for parsing
    r = requests.get(link_url)

    b_soup = BeautifulSoup(r.content, features="html.parser")

#Searaching/parsing for various sized header content
    headerH1 = headH2 = headerH3 = headerH4 = headerH5 = headerH6 = 0

    for header_tags in b_soup.findAll():

        if(header_tags.name == "H1" or header_tags.name == "<H1>"):

         headerH1 = headerH1+1

    if(header_tags.name == "H2" or header_tags.name == "<H2 >"):

        headH2 = headH2+1

    if(header_tags.name == "H3" or header_tags.name == "<H3 >"):

        headerH3 = headerH3+1

    if(header_tags.name == "H4" or header_tags.name == "<H4 >"):

        headerH4 = headerH4+1

    if(header_tags.name == "H5" or header_tags.name == "<H5 >"):

        headerH5 = headerH5+1

    if(header_tags.name == "H6" or header_tags.name == "<H6 >"):

        headerH6 = headerH6+1

    print("Total Headings in H1: ", headerH1)

    print("Total Headings in H2: ", headH2)

    print("Total Headings in H3: ", headerH3)

    print("Total HeadingS in H4: ", headerH4)

    print("Total Headings in H4: ", headerH5)

    print("Total Headings in H5: ", headerH6)



    count = 0
#counting number of paragraphs
    for header_tags in b_soup.findAll():

        if(header_tags.name == 'p' or header_tags.name == '<p>'):

            count = count+1

    print("Paragraphs: ", count)


#counting image total
    for img in b_soup.findAll():

        if(img.name == 'img'):

            count = count+1

    print("Images: ", count)

    count = 0
#counting number of links
    for link in b_soup.find_all('a', href=True):

        count = count+1

    print("Links: ", count)


My output:


Total Headings in H1:  1
Total Headings in H2:  1
Total Headings in H3:  1
Total HeadingS in H4:  1
Total Headings in H4:  1
Total Headings in H5:  1
Paragraphs:  23
Images:  33
Links:  70

The correct output for the website I am using should actually look something like this:

Number of H1 Headings: 9

Number of images on this page: 10 

You don't need the website I used; any link will do for testing.

1 answer:

Answer 0: (score: 3)

Here is an example that counts the number of <h1> tags in some HTML:

from bs4 import BeautifulSoup
html = "<h1>first</h1><h1>second</h1><h2>third</h2>"
soup = BeautifulSoup(html, 'html.parser')
h1s = soup.find_all('h1')
h1_count = len(h1s) # Gets the number of <h1> tags

In this example, h1_count is 2.
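A likely reason the comparisons in the question never match is that BeautifulSoup with html.parser reports tag names in lowercase, so checking for "H1" or "<H1>" finds nothing. Below is a minimal sketch of that behavior; the sample HTML is made up for illustration and is not the asker's page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; two tags are written in uppercase on purpose.
html = "<H1>first</H1><H1>second</H1><h2>third</h2>"
soup = BeautifulSoup(html, "html.parser")

# find_all(True) returns every tag; .name is lowercased by the parser.
names = [tag.name for tag in soup.find_all(True)]

upper_matches = sum(1 for name in names if name == "H1")  # uppercase never matches
lower_matches = sum(1 for name in names if name == "h1")  # matches both <H1> tags
print(names, upper_matches, lower_matches)
```

Comparing against the lowercase name, or passing the lowercase name to find_all, avoids the problem.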

You can do the same for other tag types by replacing 'h1' in find_all('h1'):

h2s = soup.find_all('h2')
h3s = soup.find_all('h3')
...
h2_count = len(h2s)
h3_count = len(h3s)
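The repeated calls can also be folded into a loop. This is only a sketch under stated assumptions: count_tags and sample_html are hypothetical names introduced here, and the sample markup stands in for a downloaded page.

```python
from bs4 import BeautifulSoup

def count_tags(html, tag_names):
    """Return a dict mapping each tag name to how many times it occurs."""
    soup = BeautifulSoup(html, "html.parser")
    return {name: len(soup.find_all(name)) for name in tag_names}

# Hypothetical sample markup standing in for a downloaded page.
sample_html = (
    "<h1>first</h1><h1>second</h1><h2>third</h2>"
    "<p>text</p><img src='a.png'><a href='/x'>link</a><a>no href</a>"
)

counts = count_tags(sample_html, ["h1", "h2", "h3", "h4", "h5", "h6", "p", "img"])

# Links are counted separately so that only <a> tags with an href are included.
link_count = len(BeautifulSoup(sample_html, "html.parser").find_all("a", href=True))
print(counts, link_count)
```

For a live page, the response body from requests.get(link_url).content, as used in the question, could be passed in place of sample_html. Keeping each total in its own dictionary entry also avoids reusing one counter variable across paragraphs and images.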

Hope this helps.