I have an XML file. After parsing it with lxml (imported as etree), I can list every tag like this:
root = tree.getroot()
for e in root.iter():
    print(e.tag)
The output is:
'{http://www.w3.org/1999/xhtml}html'
'{http://www.w3.org/1999/xhtml}head'
'{http://www.w3.org/1999/xhtml}meta'
'{http://www.w3.org/1999/xhtml}link'
'{http://www.w3.org/1999/xhtml}meta'
'{http://www.w3.org/1999/xhtml}meta'
'{http://www.w3.org/1999/xhtml}meta'
'{http://www.w3.org/1999/xhtml}script'
'{http://www.w3.org/1999/xhtml}body'
'{http://www.w3.org/1999/xhtml}section'
'{http://www.w3.org/1999/xhtml}h1'
'{http://www.w3.org/1999/xhtml}p'
'{http://www.w3.org/1999/xhtml}em'
'{http://www.w3.org/1999/xhtml}section'
'{http://www.w3.org/1999/xhtml}h1'
'{http://www.w3.org/1999/xhtml}p'
'{http://www.w3.org/1999/xhtml}a'
'{http://www.w3.org/1999/xhtml}p'
'{http://www.w3.org/1999/xhtml}p'
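I want to use Python / lxml / bs4 to get certain elements by a relative path. For example, I want the first p element inside the second section element, and I have the following relative path: /section[2]/p[1].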
But I cannot even get the section elements with the following code, which returns None:
xhtml = "{http://www.w3.org/1999/xhtml}"
section = xhtml + "section"
root.find(section)  # returns None
Edit: here is part of the original file:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="grammar/rash.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml" prefix="schema: http://schema.org/ prism: http://prismstandard.org/namespaces/basic/2.0/">
<head>
<meta charset="UTF-8"/>
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<link rel="stylesheet" href="css/bootstrap.min.css"/>
<link rel="stylesheet" href="css/rash.css"/>
<script src="js/jquery.min.js"><![CDATA[ ]]></script>
<script src="js/bootstrap.min.js"><![CDATA[ ]]></script>
<script src="js/rash.js"><![CDATA[ ]]></script>
<title>It ROCS! -- The RASH Online Conversion Service</title>
<meta about="#affiliation-1" property="schema:name" content="Department of Computer Science and Engineering, University of Bologna, Italy"/>
<meta about="#affiliation-2" property="schema:name" content="Oxford e-Research Centre, University of Oxford, UK"/>
<meta about="#affiliation-3" property="schema:name" content="Knowledge Media Institute, Open University, UK"/>
<meta property="prism:keyword" content="HTML-based format"/>
<meta property="prism:keyword" content="Scholarly HTML"/>
<meta property="prism:keyword" content="RASH"/>
<script src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"><![CDATA[ ]]></script></head>
<body>
<section role="doc-abstract">
<h1>Abstract</h1>
<p>In this poster paper we introduce the <em>RASH Online Conversion Service</em>, i.e., a Web application that allows the conversion of ODT documents into RASH, a HTML-based markup language for writing scholarly articles, and from RASH into LaTeX. This tool allows authors with no experience in HTML to easily produce HTML-based papers and supports the publishing process by generating also a LaTeX version according to the Springer LNCS and ACM ICPS layouts.</p>
</section>
<section>
<h1>Introduction</h1>
<p>The use of HTML as format for writing scholarly papers and submitting them to scholarly venues is a very popular, discussed and trendy topic within the scholarly domain. This is demonstrated by the existence of several posts within technical mailing lists of the Web community<a href="#ftn0"> </a>, by the birth of W3C community groups on such topic<a href="#ftn3"> </a>, by the development of HTML-based formats for scholarly articles<a href="#ftn4"> </a>, and by the increasing number of events that are experimenting with HTML-based formats for submissions, such as the SAVE-SD<a href="#ftn5"> </a> and LDOW<a href="#ftn6"> </a> workshops at WWW 2016, and the Extended Semantic Web Conference<a href="#ftn7"> </a>.</p>
<p>In order to foster a wider adoption of these formats, frameworks for HTML-based papers should support the needs of all the actors involved in the production, delivery and fruition of scholarly articles, with particular regards to authors and publishers. Hence, this solution calls for a number of requirements that go well beyond those used on the Web. </p>
<p>First of all, it is vital to support authors with a variety of tools to provide for an easy transition to the new format. To this end, authors should be allowed to keep using well-known current word processors rather than adopting HTML and/or pure text editors. We thus need to support the conversion from the main word processor formats (e.g., ODT and OOXML) to HTML formats, in particular when authors use only basic features, such as standard styles for paragraphs and tables. In addition, authors should be given the option to focus on the content and let appropriate tools handle the presentation layer after the conversion into the HTML-based format.</p>
In this example, I want to get the <p> element that starts with this sentence: "The use of HTML as format for writing scholarly papers..."
Answer 0 (score: 0)
BeautifulSoup does not support XPath expressions, but lxml, which you mentioned, does.
You can search for elements with XPath like this:
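from lxml import etree

htmlparser = etree.HTMLParser()
# html_content: path to the file (or a file-like object) to parse
tree = etree.parse(html_content, htmlparser)
# xpathselector: the XPath expression, as a string
tree.xpath(xpathselector)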
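For the namespaced XHTML in the question, a minimal sketch of how this might look when the file is parsed as XML. The SAMPLE document below is a hypothetical, trimmed-down stand-in for the real file, and the prefix "x" is an arbitrary name chosen for this example; only the namespace URI comes from the question. It also illustrates a likely reason root.find(section) returned None: a bare tag passed to find() is matched only against direct children of the element, and section sits under body, not directly under html.

```python
from lxml import etree

# Hypothetical, shortened stand-in for the file shown in the question.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>demo</title></head>
  <body>
    <section role="doc-abstract">
      <h1>Abstract</h1>
      <p>In this poster paper we introduce ...</p>
    </section>
    <section>
      <h1>Introduction</h1>
      <p>The use of HTML as format for writing scholarly papers ...</p>
      <p>In order to foster a wider adoption ...</p>
    </section>
  </body>
</html>"""

root = etree.fromstring(SAMPLE)

# "x" is an arbitrary prefix picked for this sketch; XPath needs some
# prefix bound to the default XHTML namespace.
NS = {"x": "http://www.w3.org/1999/xhtml"}

# A bare tag in find() only matches direct children of <html>
# (head and body), so this lookup yields None.
direct = root.find("{http://www.w3.org/1999/xhtml}section")

# ".//" searches all descendants instead of direct children only.
sections = root.findall(".//x:section", namespaces=NS)

# Full XPath: first <p> inside the second <section> of <body>.
matches = root.xpath("//x:body/x:section[2]/x:p[1]", namespaces=NS)
first_p = matches[0]
print(first_p.text)
```

The relative path from the question, /section[2]/p[1], maps onto the last expression once each step carries the namespace prefix and the path is anchored at body.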