Question

I am using a proxy to scrape data from this URL: CNN Article

I want to get the whole article (not necessarily the headline), so I tried this:

$(data).find("div:contains('Across the river from Cairo')");

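This finds the text, but when I use it in my code, myThing = $(this).text();, I seem to get more than the article. That probably has to do with how the HTML is built. If I view the source, I see that the article text is confined to p elements. However, changing div:contains to p:contains only gets the first few lines (obviously).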

So my question is how to get the article text regardless of the HTML structure. I am looking for something (code) that would say:

find.('Across the river from Cairo') and get this text and all the text underneath this text();

Answer 1

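I got the desired result from that article using the selector p.cnn_storypgraphtxt. To get the whole article, you can use $("p.cnn_storypgraphtxt").text(), or, to keep a line break between paragraphs: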

$("p.cnn_storypgraphtxt").map(function(){return $(this).text();}).get().join("\n");

To get the text that follows a given expression, you can use .last() to take the last matched node (that is, the lowest one in the DOM), and then use .nextAll():

$(":contains('Across the river from Cairo')").last().nextAll().text()

But this will include a lot of unwanted content.
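One way to avoid the unwanted content is to collect the paragraph texts first and then cut the list at the paragraph containing the phrase. The following is only a minimal sketch of that idea, not something from the original answer: the helper name textFromMarker and the sample paragraphs are hypothetical, and it assumes the article body lives in p elements in document order.

```javascript
// Hypothetical helper: given paragraph texts in document order, return the
// paragraph containing `marker` plus every paragraph after it.
function textFromMarker(paragraphs, marker) {
  const start = paragraphs.findIndex(function (text) {
    return text.indexOf(marker) !== -1;
  });
  // No paragraph contains the marker: return an empty string rather than guess.
  if (start === -1) {
    return "";
  }
  return paragraphs.slice(start).join("\n");
}

// Made-up sample data standing in for the scraped paragraphs.
const sample = [
  "Intro paragraph.",
  "Across the river from Cairo, the story begins.",
  "Second paragraph of the story."
];
const articleText = textFromMarker(sample, "Across the river from Cairo");
```

With jQuery, the input array could be built with $(data).find("p").map(function(){return $(this).text();}).get() and passed in place of sample.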

Answer 2

Try using

$someString = $(data).find("div:contains('Across the river from Cairo')").html();

Then manipulate that string or do whatever else you need with it.