My code returns only the text in the body of a web page. I am trying to remove the text inside the class="menu"
items from the body of this page:
<div id="pre-header-links-inner" class="header-links"><ul id="menu-top-bar" class="menu"><li id="menu-item-22" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-22"><a href="tel:000-000-0000">Main Line: +1 000-000-0000</a></li>
<li id="menu-item-23" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-23"><a href="tel:100000000000">Sales: tel:000-000-0000</a></li>
<li id="menu-item-24" class="menu-item menu-item-type-custom menu-item-object-custom menu-item-24"><a href="mailto:info@example.com">Email: info@example.com</a></li>
</ul></div>
</div>
</div>
</div>
<!-- #pre-header -->
<div id="header">
<div id="header-core">
<div id="logo">
<a href="https://www.example.com/" class="custom-logo-link" rel="home" itemprop="url"><img width="253" height="50" src="https://www.example.com/logo.png" class="custom-logo" alt="Domain" itemprop="logo" /></a> </div>
<div id="header-links" class="main-navigation">
<div id="header-links-inner" class="header-links">
<ul id="menu-main-navigation" class="menu"><li id="menu-item-71" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-home current-menu-item page_item page-item-2 current_page_item"><a href="https://www.example.com/"><span>Home</span></a></li>
<li id="menu-item-70" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com"><span>About Us</span></a></li>
<li id="menu-item-108" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/services/"><span>Services</span></a></li>
<li id="menu-item-124" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/api/"><span>API</span></a></li>
<li id="menu-item-68" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/contact-us/"><span>Contact Us</span></a></li>
</ul>
</div>
</div>
<!-- #header-links .main-navigation -->
<div id="header-nav"><a class="btn-navbar" data-toggle="collapse" data-target=".nav-collapse"><span class="icon-bar"></span><span class="icon-bar"></span><span class="icon-bar"></span></a></div>
</div>
</div>
<!-- #header -->
<div id="header-responsive"><div id="header-responsive-inner" class="responsive-links nav-collapse collapse"><ul id="menu-main-navigation-1" class=""><li id="res-menu-item-71" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-home current-menu-item page_item page-item-2 current_page_item"><a href="https://example.com/"><span>Home</span></a></li>
<li id="res-menu-item-70" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/about-us/"><span>About Us</span></a></li>
<li id="res-menu-item-108" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/services/"><span>Services</span></a></li>
<li id="res-menu-item-124" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/api/"><span>API</span></a></li>
<li id="res-menu-item-68" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/contact-us/"><span>Contact Us</span></a></li>
</ul></div></div>
<div id="header-sticky">
<div id="header-sticky-core">
<div id="logo-sticky">
<a href="https://www.example.com/" class="custom-logo-link" rel="home" itemprop="url"><img width="253" height="50" src="https://www.example.com/logo.png" class="custom-logo" alt="Logo" itemprop="logo" /></a> </div>
<div id="header-sticky-links" class="main-navigation">
<div id="header-sticky-links-inner" class="header-links">
<ul id="menu-main-navigation-2" class="menu"><li id="menu-item-71" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-home current-menu-item page_item page-item-2 current_page_item"><a href="https://www.example.com/"><span>Home</span></a></li>
<li id="menu-item-70" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/about-us/"><span>About Us</span></a></li>
<li id="menu-item-108" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/services/"><span>Services</span></a></li>
<li id="menu-item-124" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/api/"><span>API</span></a></li>
<li id="menu-item-68" class="menu-item menu-item-type-post_type menu-item-object-page"><a href="https://www.example.com/contact-us/"><span>Contact Us</span></a></li>
</ul>
Strangely, when I call the following line:
text = "".join(tree.xpath("//body//*[not(@class='menu')]//text()")).strip()
it returns the entire plain-text source as is (including the text inside the class="menu"
elements).
However, when I remove the not
keyword:
text = "".join(tree.xpath("//body//*[(@class='menu')]//text()")).strip()
...it correctly identifies the class="menu"
elements and isolates their text perfectly:
Main Line: +000-000-0000
Sales: +1 000-000-0000
Email: info@example.com
Home
About Us
Services
API
Contact Us
Home
About Us
Services
API
Contact Us
What am I doing wrong? I want it to return the text from all elements except those with class='menu'.
Answer 0 (score: 0)
"it returns the entire plain-text source"
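You need to be clear about the difference between what an XPath expression SELECTS and what the application processing the XPath result DISPLAYS.
XPath returns a set of nodes, and it is very common for the calling application to display each node by showing the entire subtree rooted at that node. But that is not something XPath does; it is the calling application. Your selection criteria determine which nodes the XPath expression selects, but they have no effect on which descendants of those selected nodes the calling application displays.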
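To make the selection point concrete, here is a minimal sketch, assuming lxml (which the tree.xpath calls in the question suggest) and a hypothetical, trimmed-down page standing in for the real one. In the question's expression the not(...) predicate is applied to elements, and //text() then descends from every element that passes. The enclosing div, and the li and a elements inside the menu, all lack class="menu", so the menu text is still reached through them. One possible alternative is to select the text nodes themselves and reject any that has a class="menu" ancestor.

```python
from lxml import html

# Hypothetical, trimmed-down stand-in for the page in the question.
SAMPLE = """
<html><body>
  <div id="header">
    <ul id="menu-top-bar" class="menu">
      <li class="menu-item"><a href="mailto:info@example.com">Email: info@example.com</a></li>
    </ul>
  </div>
  <div id="content"><p>Welcome to the site.</p></div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# The question's expression: text nodes that have SOME ancestor element
# (below body) whose class attribute is not exactly "menu". The <div>,
# <li> and <a> all qualify, so the menu text is still selected.
original = tree.xpath("//body//*[not(@class='menu')]//text()")

# Alternative: test each text node and drop it if ANY ancestor has
# class="menu".
filtered = tree.xpath("//body//text()[not(ancestor::*[@class='menu'])]")


def clean(nodes):
    """Drop whitespace-only text nodes and join the rest with newlines."""
    return "\n".join(s.strip() for s in nodes if s.strip())


print(clean(original))
print(clean(filtered))
```

Note that @class='menu' compares against the whole attribute value, so it matches class="menu" but not something like class="menu main". That is enough for the markup shown in the question; a page that uses multiple classes on the menu element would need a different test.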