Question

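I am trying to extract the date "07/18/16" from the web page markup shown in the comment below. I am not clear on XPath syntax; how would you grab the date?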

#<p>Opened <a class="timeline" href="/trac3/timeline?from=2016-07-    
#18T14%3A46%3A43-04%3A00&amp;precision=second" title="See timeline at   
#07/18/16 14:46:43">6 weeks ago</a></p>

from lxml import html
import requests

page = requests.get(webpage)
tree = html.fromstring(page.content)

openDate = tree.xpath('//Opened/text()')

print('Open Date: ', openDate)

Answer 1

Like this?

import re
from lxml import html

data = """<p>Opened <a class="timeline" href="/trac3/timeline?from=2016-07-18T14%3A46%3A43-04%3A00&amp;precision=second" title="See timeline at 07/18/16 14:46:43">6 weeks ago</a></p>"""

tree = html.fromstring(data)
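try:
    # Read the href attribute of the first anchor with class "timeline".
    href = tree.xpath("//a[@class='timeline']/@href")[0]
    # Capture the YYYY-MM-DD part that follows "from=".
    openDate = re.search(r'from=(\d+-\d+-\d+)', href).group(1)
    print('Open Date: ', openDate)
    # Open Date:  2016-07-18
except (IndexError, AttributeError):
    # IndexError: no matching anchor; AttributeError: the regex found nothing.
    print("Something went wrong")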

First get the @href attribute, then parse it with a regular expression.
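As an alternative to the regular expression, the query string can be parsed with the standard library. This is only a sketch: it assumes the href always carries a `from` parameter holding an ISO 8601 timestamp, as in the sample markup, and it starts from the href string that lxml would return rather than calling lxml itself.

```python
from datetime import datetime
from urllib.parse import urlparse, parse_qs

# The href value as lxml would return it (the &amp; entity is already decoded).
href = "/trac3/timeline?from=2016-07-18T14%3A46%3A43-04%3A00&precision=second"

# parse_qs percent-decodes the values and maps each parameter to a list.
params = parse_qs(urlparse(href).query)
timestamp = params["from"][0]

# datetime.fromisoformat needs Python 3.7 or later.
opened = datetime.fromisoformat(timestamp)
print("Open Date:", opened.date().isoformat())
```

Parsing into a `datetime` keeps the time and UTC offset available if they are needed later, which the regex version discards.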

After reading the question again, you may prefer to look at the title attribute:

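try:
    # Read the title attribute of the first anchor with class "timeline".
    title = tree.xpath("//a[@class='timeline']/@title")[0]
    # The first MM/DD/YY pattern in the title is the date.
    openDate = re.search(r'\d+/\d+/\d+', title).group(0)
    print('Open Date: ', openDate)
    # Open Date:  07/18/16
except (IndexError, AttributeError):
    # IndexError: no matching anchor; AttributeError: the regex found nothing.
    print("Something went wrong")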

Answer 2

Here is a solution using only XPath 1.0:

substring-before(substring-after(normalize-space(//a[contains(concat(' ',normalize-space(@class),' '),' timeline ')]/@title),'See timeline at '), ' ')

contains(concat(' ',normalize-space(@class),' '),' timeline ') may look like overkill, but it accounts for the possibility that classes other than "timeline" are also present in the class attribute.
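To see why the longer predicate can matter, here is a small sketch comparing the two forms. The markup is hypothetical: it adds a second class, "ext", which is not in the original page.

```python
from lxml import html

# Hypothetical markup: the anchor carries a second class, "ext".
snippet = ('<p>Opened <a class="timeline ext" '
           'title="See timeline at 07/18/16 14:46:43">6 weeks ago</a></p>')
tree = html.fromstring(snippet)

# Exact comparison: the attribute value must equal "timeline" exactly.
exact = tree.xpath("//a[@class='timeline']/@title")

# Token match: "timeline" may be one of several space-separated classes.
token = tree.xpath(
    "//a[contains(concat(' ',normalize-space(@class),' '),' timeline ')]/@title"
)

print(exact)  # the exact comparison misses "timeline ext"
print(token)  # the token match still finds the anchor
```

The padding spaces around both the attribute value and the token also stop the predicate from matching a class such as "timeline-old" by substring.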

XPath test: http://www.xpathtester.com/xpath/7805b0601b1468ea17209127e14fa470

lxml example

from lxml import html

page = """<p>Opened <a class="timeline" href="/trac3/timeline?from=2016-07-18T14%3A46%3A43-04%3A00&amp;precision=second" title="See timeline at 07/18/16 14:46:43">6 weeks ago</a></p>"""
tree = html.fromstring(page)

try:
    openDate = tree.xpath("substring-before(substring-after(normalize-space(//a[contains(concat(' ',normalize-space(@class),' '),' timeline ')]/@title),'See timeline at '), ' ')")
    print('Open Date: ', openDate)
    # Open Date:  07/18/16
except:
    print("Something went wrong")

Answer 3

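XPath works by matching elements in an XML-structured document.

Your XPath fails because what it says is: search the entire document ("//") for any element named "Opened" (i.e. <Opened/>) and return its inner text ("text()").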
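A quick way to confirm this is to run the expression against the sample markup. The sketch below uses a shortened copy of the snippet (the href is left out for brevity) and compares it with a query for the element that actually holds the text.

```python
from lxml import html

# Shortened copy of the sample markup; the href attribute is omitted.
snippet = ('<p>Opened <a class="timeline" '
           'title="See timeline at 07/18/16 14:46:43">6 weeks ago</a></p>')
tree = html.fromstring(snippet)

# There is no <Opened> element, so nothing matches.
no_such_element = tree.xpath('//Opened/text()')

# "Opened" is text inside the <p> element, so query that instead.
paragraph_text = tree.xpath('//p/text()')

print(no_such_element)
print(paragraph_text)
```

An empty list here is not an error: XPath simply reports that no node matched.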

Assuming your HTML is consistent, what you actually want is to grab the content of the anchor's title, which holds the date, like this:

//p[contains(text(),'Opened')]/a[@class='timeline']/@title

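This searches the whole document for any anchor with class "timeline" that sits inside a paragraph containing the word "Opened", and returns the content of its "title" attribute.

Note that I said "any anchor"; your result will be a list of matching titles, so if you get more than one match you need to decide what to do.

Once you have the title, you need to do some string slicing in Python to retrieve the date part.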
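The slicing step might look like the sketch below. It assumes every title starts with "See timeline at " followed by the date and the time, as in the sample; `date_from_title` and `PREFIX` are hypothetical names, and the `titles` list stands in for what `tree.xpath(...)` would return.

```python
PREFIX = "See timeline at "

def date_from_title(title):
    """Return the date part of a title such as 'See timeline at 07/18/16 14:46:43'."""
    if not title.startswith(PREFIX):
        raise ValueError("unexpected title: %r" % title)
    # Drop the prefix, then keep what precedes the first space (the date).
    return title[len(PREFIX):].split(" ", 1)[0]

# Stand-in for the list of matches returned by the XPath query.
titles = ["See timeline at 07/18/16 14:46:43"]

if titles:
    # One simple policy for multiple matches: use the first one.
    open_date = date_from_title(titles[0])
    print("Open Date:", open_date)
else:
    print("No matching anchor found")
```

Taking the first match is only one option; looping over `titles` or raising when there is more than one match may suit other pages better.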

I assume it is only the XPath you are struggling with, so I have left out any Python example. I recommend this site as a good starting point for XPath: http://dh.obdurodon.org/introduction-xpath.xhtml

Answer 4

You can't do it that way. XPath selects tags directly, not the fields inside them. So //p/a[text()] returns the whole element: <a class="timeline" href="/trac3/timeline?from=2016-07-18T14%3A46%3A43-04%3A00&precision=second" title="See timeline at 07/18/16 14:46:43">6 weeks ago</a>. Or you can select with a condition, such as //p/a[text()="6 weeks ago"]. So get that <a></a> tag, then parse it with Python.
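A minimal sketch of that approach with lxml follows: select the anchor element, then read its title in Python. The markup is a shortened copy of the sample, and splitting on whitespace assumes the title always ends with the date followed by the time.

```python
from lxml import html

# Shortened copy of the sample markup; the href attribute is omitted.
snippet = ('<p>Opened <a class="timeline" '
           'title="See timeline at 07/18/16 14:46:43">6 weeks ago</a></p>')
tree = html.fromstring(snippet)

# Select the <a> element itself rather than one of its attributes.
anchors = tree.xpath('//p/a[text()="6 weeks ago"]')
anchor = anchors[0]

# Element.get reads an attribute from the selected element.
title = anchor.get("title")

# The last two whitespace-separated tokens are the date and the time.
date_part = title.split()[-2]
print("Open Date:", date_part)
```

Matching on the link text "6 weeks ago" only works while the page shows exactly that text, so a class-based predicate is likely to be more stable over time.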

Extracting a date using XPath
