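I need to write a Python program that reads an HTML file from standard input and uses regex to print the species names listed under Mammals to standard output, one per line. I also don't want to output the items marked "#sequence_only".
The file fed to standard input looks like this: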
<!DOCTYPE html>
<!-- The following setting enables collapsible lists -->
<p>
<a href="#human">Human</a></p>
<p class="collapse-section">
<a class="collapsed collapse-toggle" data-toggle="collapse"
href=#mammals>Mammals</a>
<div class="collapse" id="mammals">
<ul>
<li><a href="#alpaca">Alpaca</a>
<li><a href="#armadillo">Armadillo</a>
<li><a href="#sequence_only">Armadillo</a> (sequence only)
<li><a href="#baboon">Baboon</a>
<li><a href="#bison">Bison</a>
<li><a href="#bonobo">Bonobo</a>
<li><a href="#brown_kiwi">Brown kiwi</a>
<li><a href="#bushbaby">Bushbaby</a>
<li><a href="#sequence_only">Bushbaby</a> (sequence only)
<li><a href="#cat">Cat</a>
<li><a href="#chimp">Chimpanzee</a>
<li><a href="#chinese_hamster">Chinese hamster</a>
<li><a href="#chinese_pangolin">Chinese pangolin</a>
<li><a href="#cow">Cow</a>
<li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
<div class="gbFooterCopyright">
© 2017 The Regents of the University of California. All
Rights Reserved.
<br>
<a href="https://genome.ucsc.edu/conditions.html">Conditions of
Use</a>
</div>
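My logic is as follows. I want to parse the value of href. If the line starts with
I tried to create a regular expression to extract the mammal names.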
#!/usr/bin/env python3
import sys
import re
html = sys.stdin.readlines()
for line in html:
    mammal_name = re.search(r'\"\>(.*?)\<', line)
    if mammal_name:
        print(mammal_name.group())
I want output like this:
Alpaca
Armadillo
Baboon
I get output like this:
">Human<
">Alpaca<
">Armadillo<
">Armadillo<
">Baboon<
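I don't want Human in the output, because the line it is on does not start with
Update: my grader showed me this message: "If the species name is enclosed in tags, it will be displayed in italics in many browsers, so this is probably a tag used by staff who wanted scientific names shown in italics. In any case, it is not appropriate as part of a species name, so please remove it." I think this is about >(species name)<, so do I need to replace the > < around the species name with other characters, perhaps [], and run my regex after that?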
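Answer 0 (score: 2)
Here, we just want to add two boundaries, a left boundary (<li><a.+?>) and a right boundary (<\/.+>), then capture the desired output and save it in capture group $1, written with parentheses ():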
<li><a.+?>(.+)?<\/.+>
# -*- coding: UTF-8 -*-
import re
string = """
<!-- The following setting enables collapsible lists -->
<p>
<a href="#human">Human</a></p>
<p class="collapse-section">
<a class="collapsed collapse-toggle" data-toggle="collapse"
href=#mammals>Mammals</a>
<div class="collapse" id="mammals">
<ul>
<li><a href="#alpaca">Alpaca</a>
<li><a href="#armadillo">Armadillo</a>
<li><a href="#sequence_only">Armadillo</a> (sequence only)
<li><a href="#baboon">Baboon</a>
<li><a href="#bison">Bison</a>
<li><a href="#bonobo">Bonobo</a>
<li><a href="#brown_kiwi">Brown kiwi</a>
<li><a href="#bushbaby">Bushbaby</a>
<li><a href="#sequence_only">Bushbaby</a> (sequence only)
<li><a href="#cat">Cat</a>
<li><a href="#chimp">Chimpanzee</a>
<li><a href="#chinese_hamster">Chinese hamster</a>
<li><a href="#chinese_pangolin">Chinese pangolin</a>
<li><a href="#cow">Cow</a>
<li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
<div class="gbFooterCopyright">
© 2017 The Regents of the University of California. All
Rights Reserved.
<br>
<a href="https://genome.ucsc.edu/conditions.html">Conditions of
Use</a>
</div>
"""
expression = r'<li><a.+?>(.+)?<\/.+>'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match ")
else:
    print(' Sorry! No matches!')
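Output:
YAAAY! "Alpaca" is a match
If this expression isn't what you want, you can modify or change it at regex101.com. jex.im also helps visualize the expression.
Edit:
To exclude sequence_only, we can modify the expression to: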
<li.+?#[^s].+?>(.+)?<\/.+>
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
test_str = '''
<!DOCTYPE html>
<!-- The following setting enables collapsible lists -->
<p>
<a href="#human">Human</a></p>
<p class="collapse-section">
<a class="collapsed collapse-toggle" data-toggle="collapse"
href=#mammals>Mammals</a>
<div class="collapse" id="mammals">
<ul>
<li><a href="#alpaca">Alpaca</a>
<li><a href="#armadillo">Armadillo</a>
<li><a href="#sequence_only">Armadillo</a> (sequence only)
<li><a href="#baboon">Baboon</a>
<li><a href="#bison">Bison</a>
<li><a href="#bonobo">Bonobo</a>
<li><a href="#brown_kiwi">Brown kiwi</a>
<li><a href="#bushbaby">Bushbaby</a>
<li><a href="#sequence_only">Bushbaby</a> (sequence only)
<li><a href="#cat">Cat</a>
<li><a href="#chimp">Chimpanzee</a>
<li><a href="#chinese_hamster">Chinese hamster</a>
<li><a href="#chinese_pangolin">Chinese pangolin</a>
<li><a href="#cow">Cow</a>
<li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
<div class="gbFooterCopyright">
© 2017 The Regents of the University of California. All
Rights Reserved.
<br>
<a href="https://genome.ucsc.edu/conditions.html">Conditions of
Use</a>
</div>
'''
regex = r"<li.+?#[^s].+?>(.+)?<\/.+>"
find_matches = re.findall(regex, test_str)
for matches in find_matches:
    print(matches)
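One caveat with [^s]: it rejects every anchor whose fragment begins with the letter s, not only #sequence_only. The following is a minimal sketch comparing it with a negative lookahead that rejects just that one fragment. The `sample` string and its `#sheep` entry are made up for illustration and are not part of the original page.

```python
import re

# Hypothetical sample; "#sheep" does not appear in the original page.
sample = '''<li><a href="#alpaca">Alpaca</a>
<li><a href="#sequence_only">Alpaca</a> (sequence only)
<li><a href="#sheep">Sheep</a>
'''

# Character class: rejects any fragment whose first letter is "s".
by_char_class = re.findall(r'<li.+?#[^s].+?>(.+)?<\/.+>', sample)

# Negative lookahead: rejects only the exact fragment "sequence_only".
by_lookahead = re.findall(
    r'<li><a href="#(?!sequence_only")[^"]*">(.+?)</a>', sample
)

print(by_char_class)  # ['Alpaca']
print(by_lookahead)   # ['Alpaca', 'Sheep']
```

For the page in the question the two patterns agree, since no wanted fragment starts with s; the difference only matters if such an entry is ever added.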
Answer 1 (score: 1)
Use BeautifulSoup, a powerful package for parsing HTML:
import re
from bs4 import BeautifulSoup

# Change this to your input file
input_html = "D:/input.html"
with open(input_html, "r", encoding="utf-8") as f:
    page = f.read()

# Parse the HTML
page_soup = BeautifulSoup(page, "html.parser")

# Find the <div id="mammals"> container
div_tags = page_soup.find_all("div", {"id": "mammals"})
for div in div_tags:
    # Keep only anchors whose href is not exactly "#sequence_only"
    mammals = div.find_all("a", href=re.compile(r'#(?!sequence_only$)'))
    for tag in mammals:
        print(tag.text)
Output:
Alpaca
Armadillo
Baboon
Bison
Bonobo
Brown kiwi
Bushbaby
Cat
Chimpanzee
Chinese hamster
Chinese pangolin
Cow
Crab-eating_macaque
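If installing a third-party package is not an option, a similar idea can be sketched with the standard library's html.parser module. This swaps in a different technique from the BeautifulSoup code above. The class name `MammalParser`, the `SAMPLE` string, and the rule that any later div ends the mammals section are all assumptions made for this sketch.

```python
from html.parser import HTMLParser


class MammalParser(HTMLParser):
    """Collect link text inside <div id="mammals">, skipping #sequence_only."""

    def __init__(self):
        super().__init__()
        self.in_mammals = False  # currently inside the mammals section?
        self.capture = False     # currently inside a wanted <a> element?
        self.names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Assumption: any other <div> (such as the footer) ends the
            # section, because the sample never closes the mammals <div>.
            self.in_mammals = attrs.get("id") == "mammals"
        elif tag == "a" and self.in_mammals:
            href = attrs.get("href") or ""
            if href.startswith("#") and href != "#sequence_only":
                self.capture = True
                self.names.append("")

    def handle_data(self, data):
        if self.capture:
            self.names[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self.capture = False


# Shortened, hypothetical stand-in for the page in the question.
SAMPLE = '''<p><a href="#human">Human</a></p>
<div class="collapse" id="mammals">
<ul>
<li><a href="#alpaca">Alpaca</a>
<li><a href="#sequence_only">Alpaca</a> (sequence only)
<li><a href="#brown_kiwi">Brown kiwi</a>
<div class="gbFooterCopyright">
<a href="https://genome.ucsc.edu/conditions.html">Conditions of
Use</a>
</div>'''

parser = MammalParser()
parser.feed(SAMPLE)  # for the real task: parser.feed(sys.stdin.read())
parser.close()
print("\n".join(parser.names))
```

On this sample it prints Alpaca and Brown kiwi, one per line; Human is skipped because it sits outside the mammals div.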
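Answer 2 (score: 0)
Use re.findall to get the text of all the tags, like this:
import re
# string holds the HTML shown in the question
pattern = r'<li><a.*>(.*)</a>'
find = re.findall(pattern, string)
if find:
    print(find)
Output: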
['Alpaca', 'Armadillo', 'Armadillo', 'Baboon', 'Bison', 'Bonobo', 'Brown kiwi',
'Bushbaby', 'Bushbaby', 'Cat', 'Chimpanzee', 'Chinese hamster', 'Chinese pangolin',
'Cow', 'Crab-eating_macaque']
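Note that this list still contains the sequence_only entries (Armadillo and Bushbaby appear twice). One possible way to drop them, sketched here under the assumption that every wanted link has the form href="#...", is to capture the fragment as well and filter on it. The shortened `string` below is a hypothetical stand-in for the full page.

```python
import re

# Shortened, hypothetical stand-in for the `string` variable used above.
string = '''<li><a href="#armadillo">Armadillo</a>
<li><a href="#sequence_only">Armadillo</a> (sequence only)
<li><a href="#baboon">Baboon</a>
'''

# Two groups: the href fragment and the link text.
pattern = r'<li><a href="#([^"]*)">(.*?)</a>'
pairs = re.findall(pattern, string)

# Drop the entries whose fragment is exactly "sequence_only".
names = [text for fragment, text in pairs if fragment != "sequence_only"]
print("\n".join(names))
```

With two capture groups, re.findall returns a list of (fragment, text) tuples, so the filtering happens in plain Python rather than inside the pattern.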
Answer 3 (score: 0)
You should add more detail to your regular expression so that it matches only the right strings. Regex test website.
Input:
string = ''' <!DOCTYPE html>
<!-- The following setting enables collapsible lists -->
<p>
<a href="#human">Human</a></p>
<p class="collapse-section">
<a class="collapsed collapse-toggle" data-toggle="collapse"
href=#mammals>Mammals</a>
<div class="collapse" id="mammals">
<ul>
<li><a href="#alpaca">Alpaca</a>
<li><a href="#armadillo">Armadillo</a>
<li><a href="#sequence_only">Armadillo</a> (sequence only)
<li><a href="#baboon">Baboon</a>
<li><a href="#bison">Bison</a>
<li><a href="#bonobo">Bonobo</a>
<li><a href="#brown_kiwi">Brown kiwi</a>
<li><a href="#bushbaby">Bushbaby</a>
<li><a href="#sequence_only">Bushbaby</a> (sequence only)
<li><a href="#cat">Cat</a>
<li><a href="#chimp">Chimpanzee</a>
<li><a href="#chinese_hamster">Chinese hamster</a>
<li><a href="#chinese_pangolin">Chinese pangolin</a>
<li><a href="#cow">Cow</a>
<li><a href="#crab-eating_macaque">Crab-eating_macaque</a>
<div class="gbFooterCopyright">
© 2017 The Regents of the University of California. All
Rights Reserved.
<br>
<a href="https://genome.ucsc.edu/conditions.html">Conditions of
Use</a>
</div>'''
If you want to process all of the text with a single expression, use findall(). Code:
import re
results = re.findall("<li><a href=\"(?:(?!#sequence_only).)*\">(.*)</a>", string)
for s in results:
    print(s)
If you want to check line by line, you can use search(). Code:
strings = string.splitlines()
for s in strings:
    substring = re.search("<li><a href=\"(?:(?!#sequence_only).)*\">(.*)</a>", s)
    if substring:
        print(substring.group(1))
Output:
Alpaca
Armadillo
Baboon
Bison
Bonobo
Brown kiwi
Bushbaby
Cat
Chimpanzee
Chinese hamster
Chinese pangolin
Cow
Crab-eating_macaque
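Since the question asks for input from standard input, the line-by-line variant can be wrapped roughly as follows. This is a minimal sketch, not a definitive solution: the helper names `mammal_names` and `main` and the `DEMO` text are made up here, and the demo runs on an in-memory stream so that it does not wait for real input.

```python
#!/usr/bin/env python3
import io
import re
import sys

# Same pattern as above: an <li> link whose href does not contain
# "#sequence_only"; group 1 is the link text.
PATTERN = re.compile(r'<li><a href="(?:(?!#sequence_only).)*">(.*)</a>')


def mammal_names(lines):
    """Yield the species name from each matching line (hypothetical helper)."""
    for line in lines:
        match = PATTERN.search(line)
        if match:
            # group(1) is only the captured name; group() would return the
            # whole match, delimiters included.
            yield match.group(1)


def main(stream):
    for name in mammal_names(stream):
        print(name)


# Shortened, hypothetical stand-in for the page in the question.
DEMO = '''<a href="#human">Human</a></p>
<li><a href="#alpaca">Alpaca</a>
<li><a href="#sequence_only">Alpaca</a> (sequence only)
<li><a href="#baboon">Baboon</a>
'''

# Demo on an in-memory stream; for the real task call main(sys.stdin).
main(io.StringIO(DEMO))
```

The Human line is skipped because the pattern requires the <li> prefix, and printing group(1) instead of group() is what removes the "> and < delimiters seen in the question's output.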