Extracting HTML tags in Python using a Stack implementation

Asked: 2018-09-09 00:14:43

Tags: python html python-3.x html-parsing

  
      
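  1. Read the file one character at a time, ignoring everything until a "<" is reached (the "<" itself is also discarded).

  2. Keep reading one character at a time, appending each character to a string, until a ">" or a whitespace character is reached (the ">" is also discarded).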

The expected output should be: [..., html, body, h1, /h1, /h2, /body, ...]
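The steps above can be sketched roughly as follows. This is only one possible reading of them, not a definitive solution: the function name `extract_tags` is hypothetical, and it takes any text stream (an open file or an `io.StringIO`) so that it can be tried without a file on disk.

```python
import io

def extract_tags(stream):
    """Collect tag names from a text stream, reading one character at a time."""
    tags = []
    while True:
        ch = stream.read(1)
        if ch == "":            # empty string means end of input
            break
        if ch != "<":           # skip everything up to "<"; the "<" is dropped too
            continue
        name = []
        while True:
            ch = stream.read(1)
            # stop at ">", at whitespace, or at end of input; the ">" is dropped
            if ch == "" or ch == ">" or ch.isspace():
                break
            name.append(ch)
        if name:
            tags.append("".join(name))
    return tags

sample = io.StringIO("<html><body><h1 class='x'>Hi</h1>\n</body></html>")
print(extract_tags(sample))  # ['html', 'body', 'h1', '/h1', '/body', '/html']
```

Stopping at whitespace is what drops attributes such as `class='x'`. Escaped text like `&lt;p&gt;` contains no literal "<", so it is not picked up as a tag. HTML comments and a doctype line would show up as entries such as `!--`, which this sketch does not filter out.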

Get all the tags from this document:

<html>
<head>
    <title>Title</title>
</head>
<body>
    <p><strong><em>Q2. HTML TAG CHECKER</em></strong></p>
    <p></p>
    <p>A <em>markup language</em> is a language that annotates text so that the
    computer can manipulate the text. Most markup languages are human readable
    because the annotations are written in a way to distinguish them from the
    text. The most important feature of a markup language is that the
    <em>tags</em> it uses to indicate annotations should be easy to distinguish
    from the document <em>content</em>.</p>
    <p>One of the most well-known markup languages is the one commonly used to
    create web pages, called <strong>HTML</strong>, or "Hypertext Markup
    Language". In HTML, tags appear in "angle brackets" such as in
    "&lt;html&gt;". When you load a Web page in your browser, you do not see
    the tags themselves: the browser interprets the tags as instructions on how
    to format the text for display.</p>
    <p>Most tags in HTML are used in pairs to indicate where an effect starts
    and ends. For example:</p>
    <p>&lt;p&gt;
    this is a paragraph of text written in HTML
    &lt;/p&gt;</p>
    <p>Here &lt;p&gt; represents the start of a paragraph, and &lt;/p&gt;
    indicates where that paragraph ends.</p>
    <p>Other tags include &lt;b&gt; and &lt;/b&gt; that are used to place the
    enclosed text in <strong>bold</strong> font, and &lt;i&gt; and &lt;/i&gt;
    indicate that the enclosed text is <em>italic</em>.</p>
    <p>Note that "end" tags look just like the "start" tags, except for the
    addition of a backslash &lsquo;/&rsquo;after the &lt; symbol.</p>
    <p>Sets of tags are often nested inside other sets of tags. For example, an
    <em>ordered list</em> is a list of numbered bullets. You specify the start
    of an ordered list with the tag &lt;ol&gt;, and the end with &lt;/ol&gt;.
    Within the ordered list, you identify items to be numbered with the tags
    &lt;li&gt; (for "list item") and &lt;/li&gt;. For example, the following
    specification:</p>
    <p>&lt;ol&gt;</p>
    <p>&lt;li&gt;First item&lt;/li&gt;</p>
    <p>&lt;li&gt;Second item&lt;/li&gt;</p>
    <p>&lt;li&gt;Third item&lt;/li&gt;</p>
    <p>&lt;/ol&gt;</p>
    <p>would result in the following:</p>
    <ol>
        <li>First item</li>
        <li>Second item</li>
        <li>Third item</li>
    </ol>

Stack.py:

class Stack:
    def __init__(self):
        self.items = []
    def is_empty(self):
        return self.items == []
    def size(self):
        return len(self.items)
    def push(self, item):
        self.items.append(item)
    def pop(self):
        return self.items.pop()
    def peek(self):
        return self.items[-1]

    # Returns a string representation of the contents of the stack
    def __str__(self):
        return str(self.items)

main.py

from Stack import Stack

# Processes an HTML file and returns a list of HTML tag objects
def process_html_file(file_name):
    tag_list = []
    s = Stack()
    with open(file_name, 'r') as f:
        all_lines = []
        # loop through all lines in the file
        for line in f:
            new_line = []
            # loop through each character in the line
            for char in line:
                new_line.append(char)
            all_lines.append(new_line)
    # tag extraction is not implemented yet, so tag_list is still empty here
    return tag_list
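The question does not say what the Stack is meant to do once the tags have been collected. Assuming the goal is the one hinted at by the sample document's heading ("HTML TAG CHECKER"), a stack could be used to check that every opening tag is closed in the right order. The sketch below is a guess at that step: the function name `tags_are_balanced` is hypothetical, and a plain list stands in for the `Stack` class so that the snippet runs on its own.

```python
def tags_are_balanced(tags):
    """Return True if every opening tag is closed in the reverse order it was opened."""
    stack = []  # stand-in for Stack: append() is push, pop() is pop
    for tag in tags:
        if tag.startswith("/"):
            # a closing tag must match the most recently opened tag
            if not stack or stack.pop() != tag[1:]:
                return False
        else:
            stack.append(tag)
    # any tag still on the stack was never closed
    return not stack

print(tags_are_balanced(["html", "body", "h1", "/h1", "/body", "/html"]))  # True
print(tags_are_balanced(["html", "body", "/html"]))                        # False
```

This assumes every opening tag has a matching closing tag. Tags that are never closed in practice, such as `br` or `img`, would need to be skipped before pushing.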

0 answers:

No answers