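I'm getting quite frustrated. I have tried many variations and searched the existing Stack Overflow questions for an answer, but nothing has helped.
All I need is all the text on the page, except text that belongs to elements with 'menu' in their @class or @id. I have tried the following expression: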
//*[not(descendant-or-self::*[(contains(@id, 'menu')) or (contains(@class, 'menu'))])]/text()[normalize-space()]
But whatever I try, I always get all the text back, including the text of the elements I want to exclude.
PS: I'm using Scrapy, which supports XPath 1.0.
<body>
<div id="top">
<div class="topHeader">
<div class="topHeaderContent">
<a class="headerLogo" href="/Site/Home.de.html"></a>
<a class="headerText" href="/Site/Home.de.html"></a>
<div id="menuSwitch"></div>
</div>
</div>
<div class="topContent">
<div id="menuWrapper">
<nav>
<ul class="" id="menu"><li class="firstChild"><a class="topItem" href="/Site/Home.de.html">Home</a> </li>
<li class="hasChild"><span class="topItem">Produkte</span><ul class=" menuItems"><li class=""><a href="/Site/Managed_Services.de.html">Managed Services</a> </li>
<li class=""><a href="/Site/DMB/Video.de.html">VideoServices</a> </li>
<li class=""><a href="/Site/DMB/Apps.de.html">Mobile Publishing</a> </li>
<li class=""><a href="/Site/Broadcasting.de.html">Broadcasting</a> </li>
<li class=""><a href="/Site/Content_Management.de.html">Content Management</a> </li>
</ul>
</li>
<li class="hasChild"><span class="topItem">Digital Media Base</span><ul class=" menuItems"><li class=""><a href="/Site.de.html">About DMB</a> </li>
<li class=""><a href="/Site/DMB/Quellen.de.html">Quellen</a> </li>
<li class=""><a href="/Site/DMB/Video.de.html">Video</a> </li>
<li class=""><a href="/Site/DMB/Apps.de.html">Apps</a> </li>
<li class=""><a href="/Site/DMB/Web.de.html">Web</a> </li>
<li class=""><a href="/Site/DMB/Archiv.de.html">Archiv</a> </li>
<li class=""><a href="/Site/DMB/Social_Media.de.html">Social Media</a> </li>
<li class=""><a href="/Site/DMB/statistik.de.html">Statistik</a> </li>
<li class=""><a href="/Site/DMB/Payment.de.html">Payment</a> </li>
</ul>
</li>
<li class="activeMenu "><a class="topItem" href="/Site/Karriere.de.html">Karriere</a> </li>
<li class="hasChild"><span class="topItem">Fake-IT</span><ul class=" menuItems"><li class=""><a href="/Site/About.de.html">About</a> </li>
<li class=""><a href="/Site/Management.de.html">Management</a> </li>
<li class=""><a href="/Site/Mission_Statement.de.html">Mission Statement</a> </li>
<li class=""><a href="/Site/Pressemeldungen.de.html">Pressemeldungen</a> </li>
<li class=""><a href="/Site/Referenzen.de.html">Kunden</a> </li>
</ul>
</li>
</ul>
</nav>
<div class="topSearch">
<div class="topSearch">
<form action="/Site/Suchergebnis.html" method="get">
<form action="/Site/Suchergebnis.html" method="get">
<input class="searchText" onblur="processSearch(this, 'Suchbegriff', 'blur')" onfocus="processSearch(this, 'Suchbegriff')" type="text" value="Suchbegriff" name="searchTerm" id="searchTerm" />
<input class="searchSubmit" id="js_searchSubmit" type="submit" name="yt0" />
<div class="stopFloat">
</div>
</form>
</div>
</div>
</div>
</div>
<p> I want to have this text here! </p>
.
.
More elements
.
.
</div>
<p> I want to have this text here! </p>
.
.
More elements
.
.
</body>
I always get back:
['Home',
'Produkte',
'Managed Services',
'VideoServices',
'Mobile Publishing',
'Broadcasting',
'Content Management',
'Digital Media Base',
'About DMB',
'Quellen',
'Video',
'Apps',
'Web',
'Archiv',
'Social Media',
'Statistik',
'Payment',
'Karriere',
'Fake-IT',
'About',
'Management',
'Mission Statement',
'Pressemeldungen',
'Kunden',
' I want to have this text here! ',
' I want to have this text here! ']
But I need this:
[' I want to have this text here! ',
' I want to have this text here! ']
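Answer 0 (score: 2)
This rather complicated XPath 1.0 expression works on the sample HTML. In XPath 2.0 and later it would be somewhat simpler. Still, please try it on your actual page: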
//*[not(descendant-or-self::*[contains(@class,'menu')])]
[not(descendant-or-self::*[contains(@id,'menu')])]
[not(ancestor-or-self::*[contains(@class,'menu')])]
[not(ancestor-or-self::*[contains(@id,'menu')])]//text()
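To check the expression outside a spider, here is a minimal sketch that runs it with lxml (the library Scrapy's selectors are built on). The `SAMPLE` markup is a trimmed stand-in for the page in the question, and the names `SAMPLE`, `EXPR` and `texts` are only illustrative.

```python
from lxml import etree

# Trimmed stand-in for the question's markup (an assumption, not the full page).
SAMPLE = """
<body>
  <div id="top">
    <div class="topHeader"><div id="menuSwitch"></div></div>
    <div id="menuWrapper">
      <ul class="" id="menu">
        <li class="firstChild"><a class="topItem" href="/Site/Home.de.html">Home</a></li>
        <li class="hasChild"><span class="topItem">Produkte</span>
          <ul class=" menuItems">
            <li class=""><a href="/Site/DMB/Video.de.html">VideoServices</a></li>
          </ul>
        </li>
      </ul>
    </div>
    <p> I want to have this text here! </p>
  </div>
  <p> I want to have this text here! </p>
</body>
"""

# The four predicates from the answer, joined into one expression.
EXPR = (
    "//*[not(descendant-or-self::*[contains(@class,'menu')])]"
    "[not(descendant-or-self::*[contains(@id,'menu')])]"
    "[not(ancestor-or-self::*[contains(@class,'menu')])]"
    "[not(ancestor-or-self::*[contains(@id,'menu')])]//text()"
)

root = etree.HTML(SAMPLE)
# Drop whitespace-only text nodes, keep everything else untouched.
texts = [t for t in root.xpath(EXPR) if t.strip()]
print(texts)
```

In a spider the same string should work as `response.xpath(EXPR).getall()`.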
Answer 1 (score: 0)
Well, consider the element
<li class=""><a href="/Site/DMB/Video.de.html">VideoServices</a> </li>
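Neither this element nor any of its descendants has an id or class attribute containing 'menu', so of course it is selected. The menu element is its ancestor, not its descendant.
Perhaps you want //*[not(ancestor-or-self::*[@id='menu' or @class='menu'])]
You wrote "contains", but I'm not sure that is what you really mean. A lot of people use contains() when they should use "=".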
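The difference between the two axes is easy to see side by side. The following is a small sketch using lxml, with a cut-down stand-in for the sample markup; `ORIGINAL` is the expression from the question, and `FIXED` is the ancestor-or-self variant with the question's `/text()[normalize-space()]` step appended. The variable names are only illustrative.

```python
from lxml import etree

# Cut-down stand-in for the question's markup (an assumption, not the full page).
SAMPLE = """
<body>
  <div id="menuWrapper">
    <ul class="" id="menu">
      <li class="hasChild"><span class="topItem">Produkte</span>
        <ul class=" menuItems">
          <li class=""><a href="/Site/DMB/Video.de.html">VideoServices</a></li>
        </ul>
      </li>
    </ul>
  </div>
  <p> I want to have this text here! </p>
</body>
"""

# Expression from the question: looks downwards (self and descendants).
ORIGINAL = (
    "//*[not(descendant-or-self::*[(contains(@id, 'menu')) "
    "or (contains(@class, 'menu'))])]/text()[normalize-space()]"
)

# Variant suggested here: looks upwards (self and ancestors).
FIXED = (
    "//*[not(ancestor-or-self::*[@id='menu' or @class='menu'])]"
    "/text()[normalize-space()]"
)

root = etree.HTML(SAMPLE)
original = root.xpath(ORIGINAL)  # still contains the menu link texts
fixed = root.xpath(FIXED)        # only text outside the menu subtree
print(original)
print(fixed)
```

With `=` the match is on the exact attribute value, so this relies on the menu root being `id="menu"`; a page whose menu is only marked by something like `class=" menuItems"` would need `contains()` instead.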
Answer 2 (score: 0)
You can iterate over the tags of the Scrapy (lxml) tree directly, as in this code sample:
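result = []
for tag in response.css("*"):
    # Skip every tag that carries an id, class, or href attribute.
    if 'id' not in tag.attrib and 'class' not in tag.attrib and 'href' not in tag.attrib:
        # First text node of the tag; the stripped copy is only used to skip blanks.
        text = tag.css("::text").extract_first("").strip("\n ")
        if text:
            result.append(tag.css("::text").extract_first())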
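As you can see, I also excluded tags that have an href attribute, because <a> tags such as this one:
<a href="/Site/DMB/Video.de.html">VideoServices</a>
have no class or id attribute, so technically they do not violate your XPath expression.
So if you plan to use XPath selectors, you also need to exclude tags that have an href attribute.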
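As a rough XPath counterpart of that attribute test, here is a sketch using lxml on a cut-down stand-in for the sample markup. It is not an exact equivalent of the loop: the predicate selects the direct text children of each attribute-free element, and the names `SAMPLE`, `EXPR` and `texts` are only illustrative.

```python
from lxml import etree

# Cut-down stand-in for the question's markup (an assumption, not the full page).
SAMPLE = """
<body>
  <div id="menuWrapper">
    <ul class="" id="menu">
      <li class=""><a href="/Site/DMB/Video.de.html">VideoServices</a></li>
    </ul>
  </div>
  <p> I want to have this text here! </p>
</body>
"""

# Keep only elements that have none of the id, class, or href attributes,
# then take their non-blank direct text nodes.
EXPR = "//*[not(@id) and not(@class) and not(@href)]/text()[normalize-space()]"

root = etree.HTML(SAMPLE)
texts = root.xpath(EXPR)
print(texts)
```

Keep in mind that this filters on the element's own attributes only, so an attribute-free tag nested inside the menu (a bare `<span>`, for example) would still be returned.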