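RegEx for extracting specific textContent in HTML tags

Date: 2019-05-20 14:55:30

Tags: python html regex regex-group regex-greedy

I need to create a Python program that receives an HTML file from standard input and uses regex to output the species names listed under Mammals to standard output, one per line. I also must not output the items marked "#sequence_only".

The file used for standard input looks like this: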

   <!DOCTYPE html>

  <!-- The following setting enables collapsible lists -->
  <p>
  <a href="#human">Human</a></p>

  <p class="collapse-section">
  <a class="collapsed collapse-toggle" data-toggle="collapse" 
  href=#mammals>Mammals</a>
  <div class="collapse" id="mammals">
  <ul>
  <li><a href="#alpaca">Alpaca</a>
  <li><a href="#armadillo">Armadillo</a>
  <li><a href="#sequence_only">Armadillo</a> (sequence only)
  <li><a href="#baboon">Baboon</a>
  <li><a href="#bison">Bison</a>
  <li><a href="#bonobo">Bonobo</a>
  <li><a href="#brown_kiwi">Brown kiwi</a>
  <li><a href="#bushbaby">Bushbaby</a>
  <li><a href="#sequence_only">Bushbaby</a> (sequence only)
  <li><a href="#cat">Cat</a>
  <li><a href="#chimp">Chimpanzee</a>
  <li><a href="#chinese_hamster">Chinese hamster</a>
  <li><a href="#chinese_pangolin">Chinese pangolin</a>
  <li><a href="#cow">Cow</a>
  <li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
  <div class="gbFooterCopyright">
  &copy; 2017 The Regents of the University of California. All 
  Rights Reserved.
  <br>
  <a href="https://genome.ucsc.edu/conditions.html">Conditions of 
  Use</a>
  </div>

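My logic is as follows. I want to parse the value of href. If the line starts with <li> and the value of href starts with "#", then it is a species name and I need to extract the name between the > and < characters. If the value of href starts with "https", I want to substitute it with other characters so that it does not appear in the final output.

    I tried to create a regular expression for extracting the mammal names.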

    #!/usr/bin/env python3

    import sys
    import re

    html = sys.stdin.readlines()

    for line in html:
        mammal_name = re.search(r'\"\>(.*?)\<', line)
        if mammal_name:
            print(mammal_name.group())
    

    I want output like this:

    Alpaca
    Armadillo
    Baboon
    

    I get output like this:

    ">Human<
    ">Alpaca<
    ">Armadillo<
    ">Armadillo<
    ">Baboon<
    

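    I don't want Human in the output, because it is on a line that doesn't start with <li>. I also don't want the duplicates in the output, but for that I need access to the value of href, and that is the part I am struggling with.

    Update: My grader showed me this message: "If the species name is enclosed in the tag, it will be displayed in italics in many browsers, so it is probably a tag used by staff who wanted to display scientific names in italics. In any case, it is not appropriate as part of a species name, so please remove it." I guess this is about the >(species name)< part, so maybe I need to replace the > < around the species name with other characters, perhaps [ ], and then apply my regex after that?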

  • 4 answers:

    Answer 0 (score: 2)

    Here, we just add a left boundary (<li><a.+?>) and a right boundary (<\/.+>), then capture the desired output in the $1 capturing group ():

    <li><a.+?>(.+)?<\/.+>

    Test

    # -*- coding: UTF-8 -*-
    import re
    
    string = """
    <!-- The following setting enables collapsible lists -->
      <p>
      <a href="#human">Human</a></p>
    
      <p class="collapse-section">
      <a class="collapsed collapse-toggle" data-toggle="collapse" 
      href=#mammals>Mammals</a>
      <div class="collapse" id="mammals">
      <ul>
      <li><a href="#alpaca">Alpaca</a>
      <li><a href="#armadillo">Armadillo</a>
      <li><a href="#sequence_only">Armadillo</a> (sequence only)
      <li><a href="#baboon">Baboon</a>
      <li><a href="#bison">Bison</a>
      <li><a href="#bonobo">Bonobo</a>
      <li><a href="#brown_kiwi">Brown kiwi</a>
      <li><a href="#bushbaby">Bushbaby</a>
      <li><a href="#sequence_only">Bushbaby</a> (sequence only)
      <li><a href="#cat">Cat</a>
      <li><a href="#chimp">Chimpanzee</a>
      <li><a href="#chinese_hamster">Chinese hamster</a>
      <li><a href="#chinese_pangolin">Chinese pangolin</a>
      <li><a href="#cow">Cow</a>
      <li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
      <div class="gbFooterCopyright">
      &copy; 2017 The Regents of the University of California. All 
      Rights Reserved.
      <br>
      <a href="https://genome.ucsc.edu/conditions.html">Conditions of 
      Use</a>
      </div>
    """
    expression = r'<li><a.+?>(.+)?<\/.+>'
    match = re.search(expression, string)
    if match:
        print("YAAAY! \"" + match.group(1) + "\" is a match  ")
    else: 
        print(' Sorry! No matches!')
    

    Output

    YAAAY! "Alpaca" is a match

    RegEx

    If this expression isn't what you want, it can be modified or changed on regex101.com.

    RegEx Circuit

    jex.im also helps to visualize expressions.

    Edit:

    To exclude sequence_only, we can modify the expression to:

    <li.+?#[^s].+?>(.+)?<\/.+>

    Demo

    Python

    # coding=utf8
    # the above tag defines encoding for this document and is for Python 2.x compatibility
    
    import re
    
    test_str = '''
    
    <!DOCTYPE html>
    
      <!-- The following setting enables collapsible lists -->
      <p>
      <a href="#human">Human</a></p>
    
      <p class="collapse-section">
      <a class="collapsed collapse-toggle" data-toggle="collapse" 
      href=#mammals>Mammals</a>
      <div class="collapse" id="mammals">
      <ul>
      <li><a href="#alpaca">Alpaca</a>
      <li><a href="#armadillo">Armadillo</a>
      <li><a href="#sequence_only">Armadillo</a> (sequence only)
      <li><a href="#baboon">Baboon</a>
      <li><a href="#bison">Bison</a>
      <li><a href="#bonobo">Bonobo</a>
      <li><a href="#brown_kiwi">Brown kiwi</a>
      <li><a href="#bushbaby">Bushbaby</a>
      <li><a href="#sequence_only">Bushbaby</a> (sequence only)
      <li><a href="#cat">Cat</a>
      <li><a href="#chimp">Chimpanzee</a>
      <li><a href="#chinese_hamster">Chinese hamster</a>
      <li><a href="#chinese_pangolin">Chinese pangolin</a>
      <li><a href="#cow">Cow</a>
      <li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
      <div class="gbFooterCopyright">
      &copy; 2017 The Regents of the University of California. All 
      Rights Reserved.
      <br>
      <a href="https://genome.ucsc.edu/conditions.html">Conditions of 
      Use</a>
      </div>
    
    '''
    regex = r"<li.+?#[^s].+?>(.+)?<\/.+>"
    find_matches = re.findall(regex, test_str)
    for matches in find_matches:
        print(matches)
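
    One caveat with [^s]: it rejects every href whose anchor starts with "s", not only "#sequence_only". A minimal sketch of a narrower variant, assuming the same <li><a href="#..."> layout as the sample, uses a negative lookahead instead. The sample string and its "#sheep" entry are made up for illustration and are not part of the original input.

```python
import re

# Hypothetical sample; the "#sheep" entry is not in the original input and is
# here only to show the difference between the two patterns.
sample = '''
<li><a href="#alpaca">Alpaca</a>
<li><a href="#sequence_only">Armadillo</a> (sequence only)
<li><a href="#sheep">Sheep</a>
'''

# [^s] skips any anchor beginning with "s", so "Sheep" is lost too.
broad = re.findall(r'<li.+?#[^s].+?>(.+)?</.+>', sample)

# (?!sequence_only") skips only the exact anchor "#sequence_only".
narrow = re.findall(r'<li><a href="#(?!sequence_only")[^"]*">([^<]+)</a>', sample)

print(broad)
print(narrow)
```

    With the species list from the question both patterns give the same result, because none of the listed anchors other than "#sequence_only" starts with "s".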
    

    Answer 1 (score: 1)

    Use BeautifulSoup, a powerful package for HTML parsing:

    import re

    from bs4 import BeautifulSoup

    # Change this to your input file
    input_html = "D:/input.html"

    # The with block closes the file automatically
    with open(input_html, encoding="utf-8") as f:
        page = f.read()

    # HTML parsing
    page_soup = BeautifulSoup(page, "html.parser")

    # Find the container that holds the mammal list
    div_tags = page_soup.find_all("div", {"id": "mammals"})

    for div in div_tags:
        # Keep "#..." links, except the exact anchor "#sequence_only"
        mammals = div.find_all("a", href=re.compile(r'#(?!sequence_only$)'))
        for link in mammals:
            print(link.text)
    

    Output:

    Alpaca
    Armadillo
    Baboon
    Bison
    Bonobo
    Brown kiwi
    Bushbaby
    Cat
    Chimpanzee
    Chinese hamster
    Chinese pangolin
    Cow
    Crab-eating_macaque
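
    If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser. This is a swapped-in technique, not the code above: the class name MammalLinkParser and the shortened sample are assumptions for illustration, and the parser simply treats everything after the element with id="mammals" as part of the list, since the sample never closes that div.

```python
from html.parser import HTMLParser


class MammalLinkParser(HTMLParser):
    """Collect link text for "#..." anchors that follow the id="mammals" element."""

    def __init__(self):
        super().__init__()
        self.in_mammals = False   # set once the id="mammals" element has been seen
        self.capturing = False    # True while inside a wanted <a> element
        self.names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id") == "mammals":
            self.in_mammals = True
        elif tag == "a" and self.in_mammals:
            href = attrs.get("href") or ""
            # Keep in-page anchors only, and skip the "#sequence_only" entries.
            if href.startswith("#") and href != "#sequence_only":
                self.capturing = True
                self.names.append("")

    def handle_data(self, data):
        if self.capturing:
            self.names[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self.capturing = False


# Shortened copy of the sample input, used only for illustration.
sample = '''<a href="#human">Human</a>
<div class="collapse" id="mammals">
<ul>
<li><a href="#alpaca">Alpaca</a>
<li><a href="#sequence_only">Armadillo</a> (sequence only)
<li><a href="#baboon">Baboon</a>
<a href="https://genome.ucsc.edu/conditions.html">Conditions of
Use</a>
</div>'''

parser = MammalLinkParser()
parser.feed(sample)
parser.close()
names = parser.names
print("\n".join(names))
```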
    
    
    

    Answer 2 (score: 0)

    Use re.findall to get the text of all the tags, like this:

    import re

    # string holds the HTML input, as in the other answers
    pattern = r'<li><a.*>(.*)</a>'
    find = re.findall(pattern, string)
    if find:
        print(find)
    

    Output

    ['Alpaca', 'Armadillo', 'Armadillo', 'Baboon', 'Bison', 'Bonobo', 'Brown kiwi', 
    'Bushbaby', 'Bushbaby', 'Cat', 'Chimpanzee', 'Chinese hamster', 'Chinese pangolin', 
    'Cow', 'Crab-eating_macaque']
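
    The list above still contains the "#sequence_only" entries (Armadillo and Bushbaby appear twice). One possible way to drop them, sketched here under the assumption that every href is double-quoted as in the sample, is to capture the href as a second group and filter in plain Python. The shortened string is for illustration only.

```python
import re

# Shortened copy of the sample input, used only for illustration.
string = '''
<li><a href="#armadillo">Armadillo</a>
<li><a href="#sequence_only">Armadillo</a> (sequence only)
<li><a href="#baboon">Baboon</a>
'''

# Two groups: the href value and the link text. findall() then returns tuples.
pairs = re.findall(r'<li><a href="([^"]*)">([^<]*)</a>', string)

# Drop the "#sequence_only" entries after matching.
names = [name for href, name in pairs if href != "#sequence_only"]
print(names)
```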
    

    Answer 3 (score: 0)

    You should add some detail to the regex so that it parses the right strings. Regex test website

    Input:

    string = '''   <!DOCTYPE html>
    
      <!-- The following setting enables collapsible lists -->
      <p>
      <a href="#human">Human</a></p>
    
      <p class="collapse-section">
      <a class="collapsed collapse-toggle" data-toggle="collapse" 
      href=#mammals>Mammals</a>
      <div class="collapse" id="mammals">
      <ul>
      <li><a href="#alpaca">Alpaca</a>
      <li><a href="#armadillo">Armadillo</a>
      <li><a href="#sequence_only">Armadillo</a> (sequence only)
      <li><a href="#baboon">Baboon</a>
      <li><a href="#bison">Bison</a>
      <li><a href="#bonobo">Bonobo</a>
      <li><a href="#brown_kiwi">Brown kiwi</a>
      <li><a href="#bushbaby">Bushbaby</a>
      <li><a href="#sequence_only">Bushbaby</a> (sequence only)
      <li><a href="#cat">Cat</a>
      <li><a href="#chimp">Chimpanzee</a>
      <li><a href="#chinese_hamster">Chinese hamster</a>
      <li><a href="#chinese_pangolin">Chinese pangolin</a>
      <li><a href="#cow">Cow</a>
      <li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
      <div class="gbFooterCopyright">
      &copy; 2017 The Regents of the University of California. All 
      Rights Reserved.
      <br>
      <a href="https://genome.ucsc.edu/conditions.html">Conditions of 
      Use</a>
      </div>'''
    

    If you want to process all the text in one expression, use findall():

    results = re.findall("<li><a href=\"(?:(?!#sequence_only).)*\">(.*)</a>", string)
    for s in results:
        print(s)
    

    If you want to check line by line, use search():

    strings = string.splitlines()
    for s in strings:
        substring = re.search("<li><a href=\"(?:(?!#sequence_only).)*\">(.*)</a>", s)
        if substring:
            print(substring.group(1))
    

    Output:

    Alpaca
    Armadillo
    Baboon
    Bison
    Bonobo
    Brown kiwi
    Bushbaby
    Cat
    Chimpanzee
    Chinese hamster
    Chinese pangolin
    Cow
    Crab-eating_macaque
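
    Putting the line-by-line variant together with the question's requirements gives roughly the following sketch. The names LINK, TAG and mammal_names are hypothetical, and io.StringIO stands in for standard input so the example runs on its own. The grader's message in the question does not say which tag wraps the names; the <i> in the demo line is only an assumption, and the sketch strips whatever markup is left inside the link text.

```python
import io
import re

# <li> link whose href starts with "#" but is not exactly "#sequence_only".
LINK = re.compile(r'<li><a href="#(?!sequence_only")[^"]*">(.*?)</a>')
# Any markup left inside the link text (assumed case: an inline tag such as <i>).
TAG = re.compile(r'<[^>]+>')


def mammal_names(lines):
    """Yield one species name per matching line."""
    for line in lines:
        match = LINK.search(line)
        if match:
            # group(1) is only the captured text; group() would be the whole match.
            yield TAG.sub('', match.group(1)).strip()


# Stand-in for sys.stdin; the <i>Cow</i> line is hypothetical.
demo = io.StringIO('''  <a href="#human">Human</a></p>
  <li><a href="#alpaca">Alpaca</a>
  <li><a href="#sequence_only">Armadillo</a> (sequence only)
  <li><a href="#cow"><i>Cow</i></a>
''')

names = list(mammal_names(demo))
print("\n".join(names))
```

    In the actual script, mammal_names(sys.stdin) would take the place of the io.StringIO stand-in, after an import sys.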