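I am using the Web connector to crawl data from a site (https://www.silverhavenjewellery.com/categories/silver-jewellery-designs.html). The page contains many elements (div, ul, li, etc.) nested inside the body tag. According to the Lucidworks documentation (https://doc.lucidworks.com/fusion-server/4.1/reference-guides/parser-stages/html-parser.html), the built-in HTML parser only extracts data from the following tags: title, meta, a, link, and body (but not the children of body). To work around this, I followed the instructions in the Lucidworks blog post https://lucidworks.com/2017/01/24/extracting-values-from-element-attributes-using-jsoup-and-a-javascript-stage/. Below is the JavaScript code I am using to try to extract the child elements of the body tag. Note that when I change "div" to "body" in the following line of code, the code works as expected. However, if I select any child element inside body, the code does not even recognize that it exists. Any help overcoming this would be greatly appreciated.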
divs = jdoc.select("div");
function(doc) {
    // Java types made available to the stage
    var File = java.io.File;
    var Iterator = java.util.Iterator;
    var Jsoup = org.jsoup.Jsoup;
    var Document = org.jsoup.nodes.Document;
    var Element = org.jsoup.nodes.Element;
    var Elements = org.jsoup.select.Elements;
    // Value of the "body" field produced by the HTML parser
    var content = doc.getFirstFieldValue("body");
    var jdoc = org.jsoup.nodes.Document;
    var e = java.lang.Exception;
    var div = org.jsoup.nodes.Element;
    var img = org.jsoup.nodes.Element;
    var iter = java.util.Iterator;
    var divs = org.jsoup.select.Elements;
    var counter = 1;
    try {
        jdoc = Jsoup.parse(content);
        divs = jdoc.select("div");
        iter = divs.iterator();
        div = null; // initialize our value to null
        while (iter.hasNext()) {
            doc.addField(counter, "woohoo"); // for debugging purposes
            div = iter.next();
            counter++; // was "counter = counter++", which never advances the counter
        }
        if (div != null) {
            doc.addField("found it", "woohoo"); // for debugging purposes
        } else {
            doc.addField("error", "this is an error"); // for debugging purposes
            logger.warn("Div was null");
        }
    } catch (e) {
        logger.error(e);
    }
    return doc;
}
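
One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: whether the "body" field still holds raw HTML, or text from which the parser has already stripped the tags. If it is plain text, Jsoup.parse would wrap it in a synthetic body element with no div children, which would match the behaviour described above. The sketch below is a plain JavaScript check that counts opening tags in a string. The helper name countOpeningTags and the sample strings are hypothetical and are not part of Fusion or Jsoup.

```javascript
// Hypothetical helper (not part of Fusion or Jsoup): counts opening tags of
// a given name in a string, to see whether the value still contains markup.
function countOpeningTags(content, tagName) {
    if (content === null || content === undefined) {
        return 0;
    }
    // String() also converts a java.lang.String when run under Nashorn.
    var text = String(content);
    // Match "<tagName" only when followed by whitespace, ">" or "/",
    // so that "<divider>" is not counted as a "<div>".
    var pattern = new RegExp("<" + tagName + "(?=[\\s>/])", "gi");
    var matches = text.match(pattern);
    return matches ? matches.length : 0;
}

// Sample inputs, made up for illustration only.
var rawHtml = "<body><div><ul><li>Ring</li></ul></div><div>Chain</div></body>";
var strippedText = "Ring Chain";

console.log(countOpeningTags(rawHtml, "div"));      // prints 2
console.log(countOpeningTags(strippedText, "div")); // prints 0
```

Inside the stage, the same helper could be called on the content variable and the result written with doc.addField under a debug field name of your choosing. A count of 0 for "div" would suggest the field contains no markup for Jsoup to select from.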