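XPath: get all elements that do not have a specific @class or @id name

Time: 2019-12-22 12:23:08

Tags: xpath scrapy xpath-1.0

I'm quite frustrated. I have tried many variations and searched the existing Stack Overflow questions for an answer, but nothing has helped.

All I need is to get all the text that does not belong to an element with the @class name 'menu' or the @id name 'menu'. I have already tried the following expression: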

//*[not(descendant-or-self::*[(contains(@id, 'menu')) or (contains(@class, 'menu'))])]/text()[normalize-space()]

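But no matter what I try, I always get all the text back, including the text of the elements I am trying to exclude.

PS: I am using Scrapy, which uses XPath 1.0.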

<body>
  <div id="top">
    <div class="topHeader">
      <div class="topHeaderContent">
        <a class="headerLogo" href="/Site/Home.de.html"></a>
        <a class="headerText" href="/Site/Home.de.html"></a>
        <div id="menuSwitch"></div>
      </div>
    </div>

    <div class="topContent">
      <div id="menuWrapper">
        <nav>
          <ul class="" id="menu"><li class="firstChild"><a class="topItem" href="/Site/Home.de.html">Home</a>     </li>
            <li class="hasChild"><span class="topItem">Produkte</span><ul class=" menuItems"><li class=""><a href="/Site/Managed_Services.de.html">Managed Services</a>             </li>
              <li class=""><a href="/Site/DMB/Video.de.html">VideoServices</a>                </li>
              <li class=""><a href="/Site/DMB/Apps.de.html">Mobile Publishing</a>             </li>
              <li class=""><a href="/Site/Broadcasting.de.html">Broadcasting</a>              </li>
              <li class=""><a href="/Site/Content_Management.de.html">Content Management</a>      </li>
            </ul>
          </li>
          <li class="hasChild"><span class="topItem">Digital Media Base</span><ul class=" menuItems"><li class=""><a href="/Site.de.html">About DMB</a>           </li>
            <li class=""><a href="/Site/DMB/Quellen.de.html">Quellen</a>            </li>
            <li class=""><a href="/Site/DMB/Video.de.html">Video</a>                </li>
            <li class=""><a href="/Site/DMB/Apps.de.html">Apps</a>          </li>
            <li class=""><a href="/Site/DMB/Web.de.html">Web</a>            </li>
            <li class=""><a href="/Site/DMB/Archiv.de.html">Archiv</a>              </li>
            <li class=""><a href="/Site/DMB/Social_Media.de.html">Social Media</a>          </li>
            <li class=""><a href="/Site/DMB/statistik.de.html">Statistik</a>                </li>
            <li class=""><a href="/Site/DMB/Payment.de.html">Payment</a>            </li>
          </ul>
        </li>
        <li class="activeMenu "><a class="topItem" href="/Site/Karriere.de.html">Karriere</a>           </li>
        <li class="hasChild"><span class="topItem">Fake-IT</span><ul class=" menuItems"><li class=""><a href="/Site/About.de.html">About</a>             </li>
          <li class=""><a href="/Site/Management.de.html">Management</a>          </li>
          <li class=""><a href="/Site/Mission_Statement.de.html">Mission Statement</a>        </li>
          <li class=""><a href="/Site/Pressemeldungen.de.html">Pressemeldungen</a>            </li>
          <li class=""><a href="/Site/Referenzen.de.html">Kunden</a>              </li>
        </ul>
      </li>
    </ul>
  </nav>
  <div class="topSearch">
    <div class="topSearch">
      <form action="/Site/Suchergebnis.html" method="get">
        <form action="/Site/Suchergebnis.html" method="get">
          <input class="searchText" onblur="processSearch(this, &quot;Suchbegriff&quot;, &quot;blur&quot;)" onfocus="processSearch(this,&quot;Suchbegriff&quot;)" type="text" value="Suchbegriff" name="searchTerm" id="searchTerm" />
          <input class="searchSubmit" id="js_searchSubmit" type="submit" name="yt0" />
          <div class="stopFloat">
          </div>
        </form>
      </div>
    </div>
  </div>
</div>
<p> I want to have this text here! </p>
.
.
More elements
.
.
</div>
<p> I want to have this text here! </p>
.
.
More elements
.
.
</body>

I always get back:

['Home',
 'Produkte',
 'Managed Services',
 'VideoServices',
 'Mobile Publishing',
 'Broadcasting',
 'Content Management',
 'Digital Media Base',
 'About DMB',
 'Quellen',
 'Video',
 'Apps',
 'Web',
 'Archiv',
 'Social Media',
 'Statistik',
 'Payment',
 'Karriere',
 'Fake-IT',
 'About',
 'Management',
 'Mission Statement',
 'Pressemeldungen',
 'Kunden',
 ' I want to have this text here! ',
 ' I want to have this text here! ']

But I need this:

[' I want to have this text here! ',
 ' I want to have this text here! ']

3 answers:

Answer 0 (score: 2)

This rather complicated XPath 1.0 expression works on the sample HTML. In XPath 2.0 and later it would be somewhat simpler. Still, try it on your actual code:

 //*[not(descendant-or-self::*[contains(@class,'menu')])]
 [not(descendant-or-self::*[contains(@id,'menu')])]
 [not(ancestor-or-self::*[contains(@class,'menu')])]
 [not(ancestor-or-self::*[contains(@id,'menu')])]//text()
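A possibly simpler way to express the same filter is to select the text nodes themselves and reject any whose ancestors carry 'menu' in @id or @class. The sketch below is only an illustration under assumptions: it uses lxml on a trimmed-down, well-formed version of the sample HTML, and the names SAMPLE and KEEP are made up here. In a Scrapy callback the same expression string should be usable with response.xpath(KEEP).getall().

```python
from lxml import html

# Trimmed-down, well-formed stand-in for the page in the question (an assumption).
SAMPLE = """
<html><body>
  <div id="top">
    <div id="menuWrapper">
      <ul class="" id="menu">
        <li class="firstChild"><a class="topItem" href="/Site/Home.de.html">Home</a></li>
        <li class="hasChild"><span class="topItem">Produkte</span>
          <ul class=" menuItems">
            <li class=""><a href="/Site/DMB/Video.de.html">VideoServices</a></li>
          </ul>
        </li>
      </ul>
    </div>
    <p> I want to have this text here! </p>
  </div>
  <p> I want to have this text here! </p>
</body></html>
"""

# Non-blank text nodes that have no ancestor whose id or class contains 'menu'.
KEEP = (
    "//text()[normalize-space()]"
    "[not(ancestor::*[contains(@id, 'menu') or contains(@class, 'menu')])]"
)

tree = html.document_fromstring(SAMPLE)
texts = [str(t) for t in tree.xpath(KEEP)]
print(texts)
# [' I want to have this text here! ', ' I want to have this text here! ']
```

Filtering on the ancestor axis of each text node avoids the descendant-or-self checks entirely: a link such as VideoServices is rejected because its enclosing ul has id 'menu', even though the link itself has neither attribute.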

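Answer 1 (score: 0)

Well, if you consider the element

<li class=""><a href="/Site/DMB/Video.de.html">VideoServices</a> </li>

neither it nor any of its descendants has an id or class attribute containing 'menu', so of course it gets selected. The descendant-or-self axis never looks at the element's ancestors.

Perhaps you want //*[not(ancestor-or-self::*[@id='menu' or @class='menu'])]

You wrote contains(), but I am not sure that is what you really mean. Many people use contains() when they should use "=".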
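The difference between "=" and contains() can be checked on a few of the attribute values from the question. This is a small sketch using lxml; the fragment and the helper name labels are assumptions for illustration only. The third expression is the common XPath 1.0 idiom for matching a whole class token rather than a substring.

```python
from lxml import html

# A few elements with the id/class values that appear in the question.
doc = html.document_fromstring("""
<html><body>
  <div id="menuSwitch"></div>
  <div id="menuWrapper">
    <ul class="" id="menu">
      <li><ul class=" menuItems"><li>About</li></ul></li>
    </ul>
  </div>
</body></html>
""")

def labels(expr):
    """Return the id (or, failing that, the stripped class) of each match."""
    return [el.get("id") or (el.get("class") or "").strip() for el in doc.xpath(expr)]

# Exact comparison: only the element whose id is literally 'menu'.
exact = labels("//*[@id='menu' or @class='menu']")

# Substring comparison: also catches menuSwitch, menuWrapper and menuItems.
substring = labels("//*[contains(@id, 'menu') or contains(@class, 'menu')]")

# Whole-token comparison on class: no element has a class token equal to 'menu'.
token = labels("//*[contains(concat(' ', normalize-space(@class), ' '), ' menu ')]")

print(exact)      # ['menu']
print(substring)  # ['menuSwitch', 'menuWrapper', 'menu', 'menuItems']
print(token)      # []
```

Which form is right depends on whether elements such as menuSwitch and menuItems are meant to be excluded as well.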

Answer 2 (score: 0)

You can iterate over the tags directly in Scrapy's lxml tree, as in this code sample:

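result = []
# Walk every element in the response.
for tag in response.css("*"):
    # Skip elements that carry an id, class, or href attribute.
    if 'id' not in tag.attrib and 'class' not in tag.attrib and 'href' not in tag.attrib:
        first_text = tag.css("::text").extract_first("")
        # Keep the first text node only if it is not just whitespace.
        if first_text.strip("\n "):
            result.append(first_text)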

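As you can see, I also excluded tags that have an href attribute, such as <a> tags like this one:
<a href="/Site/DMB/Video.de.html">VideoServices</a> has no class or id attribute, so technically it does not violate your XPath expression.
So if you plan to use XPath selectors, you will also need to exclude tags that have an href attribute.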
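If the filtering is done in Python rather than in XPath, one alternative to excluding href is to make the exclusion follow ancestors, so that text nested anywhere inside a 'menu' element is dropped. The following is a minimal sketch of that idea using only the standard library's html.parser, not Scrapy's selector API. The class name NonMenuText and the sample string are hypothetical, and the sketch assumes the markup has balanced tags.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class NonMenuText(HTMLParser):
    """Collect text that is not inside an element whose id/class contains a needle."""

    def __init__(self, needle="menu"):
        super().__init__()
        self.needle = needle
        self.stack = []   # one flag per open element: True if it or an ancestor is excluded
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        a = dict(attrs)
        own = (self.needle in (a.get("id") or "")
               or self.needle in (a.get("class") or ""))
        inherited = self.stack[-1] if self.stack else False
        self.stack.append(own or inherited)

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        excluded = self.stack[-1] if self.stack else False
        if data.strip() and not excluded:
            self.texts.append(data)

# Hypothetical, well-formed excerpt of the page from the question.
sample = (
    '<div id="top"><div id="menuWrapper"><ul class="" id="menu">'
    '<li class=""><a href="/Site/DMB/Video.de.html">VideoServices</a></li>'
    '</ul></div><p> I want to have this text here! </p></div>'
)

parser = NonMenuText()
parser.feed(sample)
print(parser.texts)  # [' I want to have this text here! ']
```

Because the excluded flag is inherited down the stack, the link text is skipped on account of its enclosing ul, and ordinary links outside the menu would still be kept.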