Question

Hello, I have some HTML from this site: https://www.oddsportal.com/soccer/argentina/superliga/results/

<td class="name table-participant">
  <a href="/soccer/argentina/superliga/independiente-san-martin-tIuN5Umrd/">
    <span class="bold">Independiente</span>
    "- San Martin T."
  </a>
</td>

<td class="name table-participant">
  <a href="/soccer/argentina/superliga/lanus-huracan-xIDIe0Gr/">
    "Lanus - " 
    <span class="bold">Huracan</span>
  </a>
</td>

<td class="name table-participant">
  <a href="/soccer/argentina/superliga/rosario-central-colon-santa-fe-Q1Ye9Jpr/">Rosario Central - Colon Santa FE</a>
</td>

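I want to select and join a/text() and span/text() so that the result looks like this: "Independiente - San Martin T." As you can see, the span is not always in the same place, and sometimes it is missing (see the last td).

I used the following code:

('//td[@class="name table-participant"]/a/text() | span/text()').extract()

but it only returns a/text(). Can you help me make this work? Thanks.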

Answer 1

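You are searching for span/text() without a scope, so that part of the union is evaluated relative to the context node and matches nothing. Add // at the beginning of that part of the query, so the whole expression becomes: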

('//td[@class="name table-participant"]/a/text() | //span/text()').extract()

But I strongly recommend using this expression instead:

('//td[@class="name table-participant"]//*[self::a/ancestor::td or self::span]/text()').extract()

so that you get spans only from within the td you selected.
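To make the scoping difference concrete, here is a minimal sketch. It assumes lxml is installed and uses it directly instead of Scrapy's Selector (Scrapy's selectors are built on lxml, so the XPath behaves the same way). The PAGE markup is hypothetical: one cell from the question plus an unrelated span outside the table.

```python
from lxml import html

# Hypothetical page: one cell from the question plus an unrelated <span> elsewhere.
PAGE = """
<div><span>unrelated banner</span></div>
<table><tr>
<td class="name table-participant">
  <a href="/soccer/argentina/superliga/lanus-huracan-xIDIe0Gr/">Lanus - <span class="bold">Huracan</span></a>
</td>
</tr></table>
"""

doc = html.document_fromstring(PAGE)
TD = '//td[@class="name table-participant"]'

# Original query: "span/text()" is relative to the root element, so it matches nothing.
unscoped = doc.xpath(TD + '/a/text() | span/text()')

# With "//span/text()" every span in the document matches, including the banner.
global_span = doc.xpath(TD + '/a/text() | //span/text()')

# The recommended expression only looks at elements inside the selected td.
scoped = doc.xpath(TD + '//*[self::a/ancestor::td or self::span]/text()')

print(unscoped)
print(global_span)
print(scoped)
```

In this sketch the second query also returns the banner text, while the scoped expression returns only the two text nodes that belong to the match name.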

Answer 2

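I assume you are using Scrapy to scrape the HTML.

From the structure of your sample HTML, it looks like you want the text of each anchor element, so you need to iterate over the anchors.

Only then can you strip and join the text nodes under each anchor to get a properly formatted string. The inconsistent use of quotes adds some complication, but the following should get you started.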

from scrapy.selector import Selector

HTML="""
<td class="name table-participant">
  <a href="/soccer/argentina/superliga/independiente-san-martin-tIuN5Umrd/">
    <span class="bold">Independiente</span>
    "- San Martin T."
  </a>
</td>

<td class="name table-participant">
  <a href="/soccer/argentina/superliga/lanus-huracan-xIDIe0Gr/">
    "Lanus - "
    <span class="bold">Huracan</span>
  </a>
</td>

<td class="name table-participant">
  <a href="/soccer/argentina/superliga/rosario-central-colon-santa-fe-Q1Ye9Jpr/">Rosario Central - Colon Santa FE</a>
</td>
"""

def strip_and_join(x):
    parts = []
    for s in x:
        # strip whitespace and quotes
        s = s.strip().strip('"').strip()
        # drop now empty strings
        if s:
            parts.append(s)
    return " ".join(parts)

for x in Selector(text=HTML).xpath('//td[@class="name table-participant"]/a'):
    # collect every text node under this anchor, including the span's text
    print(strip_and_join(x.xpath('.//text()').extract()))

Note that, for clarity, I did not condense the code into a single list comprehension, although that is certainly possible.
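For reference, the condensed form might look roughly like the sketch below. It assumes lxml is installed and queries it directly so it runs without Scrapy; with Scrapy's Selector the inner call would be x.xpath('.//text()').extract() as above. The variable names and the table wrapper around the cells are my own additions.

```python
from lxml import html

HTML = """
<table><tr>
<td class="name table-participant">
  <a href="/soccer/argentina/superliga/independiente-san-martin-tIuN5Umrd/">
    <span class="bold">Independiente</span>
    "- San Martin T."
  </a>
</td>
<td class="name table-participant">
  <a href="/soccer/argentina/superliga/lanus-huracan-xIDIe0Gr/">
    "Lanus - "
    <span class="bold">Huracan</span>
  </a>
</td>
<td class="name table-participant">
  <a href="/soccer/argentina/superliga/rosario-central-colon-santa-fe-Q1Ye9Jpr/">Rosario Central - Colon Santa FE</a>
</td>
</tr></table>
"""

anchors = html.document_fromstring(HTML).xpath('//td[@class="name table-participant"]/a')

# One string per anchor: strip whitespace and quotes from each text node,
# drop the pieces that end up empty, then join the rest with a space.
names = [
    " ".join(
        part
        for part in (t.strip().strip('"').strip() for t in a.xpath('.//text()'))
        if part
    )
    for a in anchors
]

for name in names:
    print(name)
```

The nested generator does the same work as strip_and_join, so which version to keep is mostly a readability choice.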
