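How to strip part of the text obtained from Web-Harvest

Time: 2013-09-12 07:14:28

Tags: java javascript web-scraping screen-scraping webharvest

I am new to Web-Harvest and am using it to get article data from a website, with the following statement: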

let $text := data($doc//div[@id="articleBody"])

This is the data I get from the statement above:

The Refine Spa (Furman's Mill) was built as a stone grist mill along the on a tributary of Capoolong Creek by Moore Furman, quartermaster general of George Washington's army

Notable people

Notable current and former residents of Pittstown include:

My question is: can I remove everything after "Notable people" using the configuration? If so, please tell me how. Thanks.

Edit: desired output:

The Refine Spa (Furman's Mill) was built as a stone grist mill along the on a tributary of Capoolong Creek by Moore Furman, quartermaster general of George Washington's army

Notable people

1 answer:

Answer 0 (score: 1)

You just need to change your let statement to something like:

let $text := substring-before(data($doc//div[@id="articleBody"]/text()), 'Notable people')

to get the desired output.
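
One detail worth checking: the standard XPath substring-before function returns only the text in front of the marker, so the "Notable people" heading itself is dropped, and it returns an empty string when the marker does not occur at all. The desired output in the question appears to keep the heading. The Java sketch below only illustrates those two behaviours on a plain string; the class and method names are hypothetical and are not part of Web-Harvest.

```java
// Hypothetical helper for illustration only; not part of Web-Harvest.
final class TrimAfterMarker {

    // Mirrors XPath substring-before: returns the text preceding the first
    // occurrence of marker, or an empty string when marker does not occur.
    static String substringBefore(String text, String marker) {
        int idx = text.indexOf(marker);
        return idx < 0 ? "" : text.substring(0, idx);
    }

    // Variant that keeps the marker itself and drops only what follows it.
    // Returns the text unchanged when marker does not occur.
    static String keepThrough(String text, String marker) {
        int idx = text.indexOf(marker);
        return idx < 0 ? text : text.substring(0, idx + marker.length());
    }

    public static void main(String[] args) {
        // Shortened stand-in for the scraped article text.
        String article = "The Refine Spa (Furman's Mill) was built as a stone grist mill\n\n"
                + "Notable people\n\n"
                + "Notable current and former residents of Pittstown include:";

        // Heading removed, as substring-before would do.
        System.out.println(substringBefore(article, "Notable people").trim());
        System.out.println("----");
        // Heading kept, matching the desired output shown in the question.
        System.out.println(keepThrough(article, "Notable people"));
    }
}
```

If the heading has to stay, the same effect can probably be had inside the configuration by wrapping the expression in concat(..., 'Notable people'), since concat is also a standard XPath function; that variant is an assumption and was not part of the original answer.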