我想在python中从html文件中提取标记而不使用BeautifulSoup。例如,我想要
class="el" href="atsc__root__raised__cosine.html" target="_self">atsc_root_raised_cosine
来自
<a class="el" href="atsc__root__raised__cosine.html" target="_self">atsc_root_raised_cosine</a>
有什么想法吗?
答案 0 :(得分:1)
要进行基本的dom解析,可以使用stl。中的xml parser
这是一个使用它(从文档中)将xml转换为html的示例:
import xml.dom.minidom
document = """\
<slideshow>
<title>Demo slideshow</title>
<slide><title>Slide title</title>
<point>This is a demo</point>
<point>Of a program for processing slides</point>
</slide>
<slide><title>Another demo slide</title>
<point>It is important</point>
<point>To have more than</point>
<point>one slide</point>
</slide>
</slideshow>
"""
dom = xml.dom.minidom.parseString(document)
def getText(nodelist):
rc = []
for node in nodelist:
if node.nodeType == node.TEXT_NODE:
rc.append(node.data)
return ''.join(rc)
def handleSlideshow(slideshow):
print "<html>"
handleSlideshowTitle(slideshow.getElementsByTagName("title")[0])
slides = slideshow.getElementsByTagName("slide")
handleToc(slides)
handleSlides(slides)
print "</html>"
def handleSlides(slides):
for slide in slides:
handleSlide(slide)
def handleSlide(slide):
handleSlideTitle(slide.getElementsByTagName("title")[0])
handlePoints(slide.getElementsByTagName("point"))
def handleSlideshowTitle(title):
print "<title>%s</title>" % getText(title.childNodes)
def handleSlideTitle(title):
print "<h2>%s</h2>" % getText(title.childNodes)
def handlePoints(points):
print "<ul>"
for point in points:
handlePoint(point)
print "</ul>"
def handlePoint(point):
print "<li>%s</li>" % getText(point.childNodes)
def handleToc(slides):
for slide in slides:
title = slide.getElementsByTagName("title")[0]
print "<p>%s</p>" % getText(title.childNodes)
handleSlideshow(dom)
答案 1 :(得分:1)
看一下python中提供的这个XML API,它解释了如何访问属性,元素以及一些HTML示例。您还可以生成解析器对象。