Question

So I am not sure how to proceed here. I have an example of a page I am trying to scrape:

http://www.yonhapnews.co.kr/sports/2015/06/05/1001000000AKR20150605128600007.HTML?template=7722

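Right now I have an XPath that selects the div with class 'article' and then the <p> tags beneath it. I can always eliminate the first one, since it is the same stock news text (city, yonhapnews, reporter, etc.). I am evaluating word density, so this could be a problem for me :(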

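The problem comes at the end of the article. If you look, at the end there is the reporter's email address and the date and time of publication...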

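The problem is that different pages on this site have different numbers of <p> tags at the end, so I can't just eliminate the last two, because that sometimes messes up my results.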

How would you go about eliminating those specific <p> elements at the end? Do I just have to try to scrub my data afterwards?

Here is the code snippet that selects the path and removes the first <p> and the last two. How should I change it?

# Get all the text from the <p> tags in the listed div, then apply the regex to find every word in the Hangul syllable range
hangul_syllables = response.xpath('//*[@class="article"]/p//text()').re(u'[\uac00-\ud7af]+')

# For yonhapnews the first and the last two <p>'s are useless, everything else should be good
hangul_syllables = hangul_syllables[1:-2]

Answer 1

You can tweak the XPath expression so that it does not include the p tag with the following class (the publication date):

class="adrs"
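The answer's idea could be applied to the snippet from the question roughly as follows. This is a minimal sketch, not the answerer's exact code: the helper name `extract_hangul_words` and the two constants are hypothetical, and the predicate `not(contains(@class, "adrs"))` assumes the publication-date paragraph is the only one carrying that class.

```python
# -*- coding: utf-8 -*-
# Sketch only: `extract_hangul_words` is a hypothetical helper name.

# XPath 1.0: every <p> under the "article" element except those whose
# class attribute contains "adrs" (the publication-date paragraph).
ARTICLE_TEXT_XPATH = (
    '//*[@class="article"]/p[not(contains(@class, "adrs"))]//text()'
)

# Same Hangul syllable range as in the question.
HANGUL_PATTERN = u'[\uac00-\ud7af]+'


def extract_hangul_words(response):
    """Return the Hangul words found in the remaining article paragraphs.

    `response` is assumed to be a Scrapy response, or any object whose
    .xpath() returns a selector list that offers a .re() method.
    """
    return response.xpath(ARTICLE_TEXT_XPATH).re(HANGUL_PATTERN)
```

Because the date paragraph is excluded by class rather than by position, the result no longer depends on how many trailing <p> tags a given page has. The email paragraph is not covered by this predicate and still has to be handled separately.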

Answer 2

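Adding to alecxe's answer, you can exclude the p that contains the email address by testing for content that consists of an email address (possibly surrounded by whitespace). How to do this depends on whether you have XPath 2.0 or 1.0. In 2.0, you could do the following: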

//*[@class="article"]/p[not(contains(@class, "adrs")
       or text()[matches(normalize-space(.),
                   "^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", "i")])]//text()

Modify the regular expression for email addresses according to http://docs.silverstripe.org/en/3.1/developer_guides/model/data_model_and_orm/#filterany. If you like, you could change \.[A-Z]{2,4} to \.kr.
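If only XPath 1.0 is available (Scrapy's selectors are built on lxml, which implements XPath 1.0, so matches() cannot be used there), one option is to do the email check in Python after selecting the paragraphs. The following is a rough sketch under that assumption; `normalize_space` and `drop_email_paragraphs` are hypothetical helper names, and the regex is the one from the expression above. Note that it tests the whole text of each paragraph, whereas the XPath 2.0 version tests each text node.

```python
# -*- coding: utf-8 -*-
import re

# Same pattern as in the XPath 2.0 expression; re.IGNORECASE plays the
# role of the "i" flag passed to matches().
EMAIL_ONLY = re.compile(
    r'^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$', re.IGNORECASE
)


def normalize_space(text):
    """Mimic XPath normalize-space(): trim and collapse whitespace."""
    return u' '.join(text.split())


def drop_email_paragraphs(paragraphs):
    """Keep only the paragraph strings that are not a bare email address."""
    return [
        p for p in paragraphs
        if not EMAIL_ONLY.match(normalize_space(p))
    ]
```

The input would be one string per <p>, for example built by joining the text nodes of each paragraph selected from the "article" element. The Hangul regex from the question can then be applied to the paragraphs that remain.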

How to eliminate certain elements when scraping?
