Question

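I'm trying to find the surrounding text of all hyperlinks within paragraphs on Wikipedia pages, and the way I'm doing that involves the xpath tree.xpath("//p/node()"). This works fine on most links, and I'm able to find most things that are <Element a at $mem_location$>. However, if a hyperlink is italicized (see the example below), the xpath node() only sees it as <Element i at $mem_location$> and doesn't seem to look any deeper.

This causes my code to miss the hyperlink, and it messes up the indexing for the rest of the page.

For example: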

<p>The closely related term, <a href="/wiki/title="Mange">mange</a>,
is commonly used with <a href="/wiki/Domestic_animal" title="Domestic animal" class="mw-redirect">domestic animals</a> 
(pets) and also livestock and wild mammals, whenever hair-loss is involved. 

<i><a href="/wiki/Sarcoptes" title="Sarcoptes">Sarcoptes</a></i> 
and <i><a href="/wiki/Demodex" title="Demodex">Demodex</a></i> 
species are involved in mange, both of these genera are also involved in human skin diseases (by 
convention only, not called mange). <i>Sarcoptes</i> in humans is especially 
severe symptomatically, and causes the condition known as 
<a href="/wiki/Scabies" title="Scabies">scabies</a>.</p>

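node() grabs "Mange", "domestic animals", and "scabies" properly, but pretty much skips "Sarcoptes" and "Demodex" and screws up the indexing, since I'm filtering for nodes that are <Element a at $mem_location$> and not <Element i at $mem_location$>.

Is there a way to make node() look deeper? I couldn't find anything about it in the documentation.

Edit: My xpath is "//p/node()" right now, but it only grabs the outermost element layer. Most of the time that is <a>, which is great, but if the link is wrapped in an <i> layer, it only grabs the <i>. I'm asking whether there's a way to check deeper, so that I can find the <a> inside the <i> wrapper.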
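A minimal reproduction of the behaviour described above may make the problem easier to see. This is only a sketch: it assumes lxml is installed, and the HTML string is a shortened, hypothetical version of the example paragraph (the `/wiki/Mange` href is made up for the demo).

```python
from lxml import etree

# Shortened, hypothetical version of the example paragraph above.
html = (
    '<p>The term <a href="/wiki/Mange" title="Mange">mange</a> and '
    '<i><a href="/wiki/Sarcoptes" title="Sarcoptes">Sarcoptes</a></i> species.</p>'
)
tree = etree.HTML(html)

# node() on the child axis returns only the direct children of <p>:
# text nodes come back as strings, elements as lxml element objects.
children = tree.xpath("//p/node()")
child_tags = [n.tag for n in children if not isinstance(n, str)]

# The <a> nested inside <i> never shows up in child_tags,
# although the descendant search //p//a still finds it.
link_titles = tree.xpath("//p//a/@title")

print(child_tags)
print(link_titles)
```

The mismatch between the two lists is what throws the counters off: the child-axis list has one `a` entry, while the link list has two.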

Relevant code is below:

import re
from lxml import etree

tree = etree.HTML(read)  # `read` holds the page's HTML source

# titles and link text of all /wiki/ hyperlinks in section paragraphs
titles = list(tree.xpath('//p//a[contains(@href,"/wiki/")]/@title'))
hyperlinks = list(tree.xpath('//p//a[contains(@href,"/wiki/")]/text()'))
b = list(tree.xpath("//p/b/text()"))  # all bolded words in section paragraphs
t = list(tree.xpath("//p/node()"))  # direct children of each paragraph only

b_count = 0
a_count = 0
test = []
for items in t:
    print(items)
    items = str(items)
    if "<Element b" in items:
        test.append(b[b_count])
        b_count += 1
        continue
    if "<Element a" in items:
        test.append((hyperlinks[a_count], titles[a_count]))
        a_count += 1
        continue

    if "<Element " not in items:  # plain text node
        pattern = re.compile(r'(\t(.*?)\n)')
        look = pattern.search(items)

        if look is not None:  # if there is a match
            test.append(look.group().partition("\t")[2].partition("\n")[0])

        period_pattern = re.compile(r"(\t(.*?)\.)")
        look_period = period_pattern.search(items)
        if look_period is not None:
            test.append(look_period.group().partition("\t")[2])

Answer 1

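I can't think of a direct xpath that would do this, but you can always loop through the contents and replace such elements with their children, like this: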

flattened = []
for x in t:
    # text nodes are plain strings and have no .tag, hence getattr
    if getattr(x, 'tag', None) == 'i' and x.find('a') is not None:
        # replace the <i> with its own child nodes,
        # taking in text nodes as well as <a> elements
        flattened.extend(x.xpath('node()'))
    else:
        flattened.append(x)
t = flattened

This also handles multiple <a> elements inside a single <i>, for example <i><a>something</a><a>blah</a></i>
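The flattening idea above can be sketched as a small self-contained function. The function name `flatten_italics` and the sample HTML are hypothetical, chosen only for illustration, and lxml is assumed to be installed. The last part of the block shows an alternative that is not from the answer: a single XPath on the descendant axis that keeps links plus any text not already inside a link.

```python
from lxml import etree


def flatten_italics(nodes):
    """Replace each <i> that directly contains an <a> with the <i>'s own child nodes."""
    flat = []
    for node in nodes:
        # Text nodes are plain strings and have no .tag attribute.
        if getattr(node, "tag", None) == "i" and node.find("a") is not None:
            flat.extend(node.xpath("node()"))
        else:
            flat.append(node)
    return flat


# Hypothetical sample: one <i> wrapping two links, plus an <i> without a link.
html = (
    '<p>See <i><a title="A">something</a> or <a title="B">blah</a></i>'
    ' and <i>plain</i>.</p>'
)
tree = etree.HTML(html)

flat = flatten_italics(tree.xpath("//p/node()"))
flat_tags = [n.tag for n in flat if not isinstance(n, str)]

# Alternative (not from the answer): walk all descendants of <p> in one XPath
# and keep <a> elements plus text nodes that are not inside an <a>.
deep = tree.xpath("//p//node()[self::a or self::text()[not(ancestor::a)]]")
deep_titles = [n.get("title") for n in deep if not isinstance(n, str)]

print(flat_tags)
print(deep_titles)
```

Building a new list avoids changing `t` while iterating over it. The descendant-axis variant drops the `<i>` wrappers entirely, so it only fits if the italic elements themselves are not needed in the result.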

xpath node() looking deeper
