Question

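I'm scraping a website using XPath and have been able to get most of the information I need, except for the date. The date is text inside a div, in the following format.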

October 13, 2018 / 1:31 AM / Updated 5 hours ago

I only want the date, not the time or anything else. With my current code, however, I get the entire text of the div. My code is below.

item['datePublished'] = response.xpath("//div[contains(@class, 'ArticleHeader_date') and substring-before(., '/')]/text()").extract()

Answer 1

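As already mentioned, there are several ways to do this in XPath 2.0+. However, this is better done in the host language.

One approach is to retrieve the value first and then extract the date with a regular expression, for example: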

\w+\ \d\d?,\ \d{4}

Code sample:

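import re

# Month name, one- or two-digit day, comma, four-digit year
regex = r"\w+\ \d\d?,\ \d{4}"
test_str = "October 13, 2018 / 1:31 AM / Updated 5 hours ago"

# search() returns the first match, or None if the pattern is not found
matches = re.search(regex, test_str)
if matches:
    print(matches.group())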
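
If you want to go one step further, here is a minimal sketch that wraps the same regular expression in a helper and turns the match into a real date object. The name extract_date is made up for this example, and it assumes the month is always a full English month name as in the sample text. The commented-out Scrapy lines are only a guess at how it would plug into the spider from the question.

```python
import re
from datetime import datetime

# Same pattern as above: month name, day, comma, four-digit year.
DATE_RE = re.compile(r"\w+ \d\d?, \d{4}")


def extract_date(raw_text):
    """Return the date in raw_text as a datetime.date, or None if absent.

    Hypothetical helper for illustration; assumes a full English month
    name such as "October 13, 2018".
    """
    match = DATE_RE.search(raw_text)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(), "%B %d, %Y").date()
    except ValueError:
        # The pattern matched, but the first word was not a month name.
        return None


# Possible use inside the Scrapy callback (selector taken from the question):
# raw = response.xpath("//div[contains(@class, 'ArticleHeader_date')]/text()").get(default="")
# item['datePublished'] = extract_date(raw)

print(extract_date("October 13, 2018 / 1:31 AM / Updated 5 hours ago"))  # 2018-10-13
```

Returning None instead of raising keeps the spider running when a page has no date or uses a different format, so you can decide in the callback how to handle missing values.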
