How to scrape data from within an HTML page using Lucidworks Fusion 4.1

Time: 2019-01-16 01:59:53

Tags: javascript html-parsing html-parser lucidworks

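I am using the Web connector to scrape data from a site (https://www.silverhavenjewellery.com/categories/silver-jewellery-designs.html). The page contains many items (div, ul, li, etc.) nested inside the body tag. From the Lucidworks documentation, I found that the built-in HTML parser only extracts data from the following tags: title, meta, a, link, and body (but not the children of body). This is the page: https://doc.lucidworks.com/fusion-server/4.1/reference-guides/parser-stages/html-parser.html.

To work around this, I followed the instructions in the Lucidworks blog post https://lucidworks.com/2017/01/24/extracting-values-from-element-attributes-using-jsoup-and-a-javascript-stage/. Below is the JavaScript code I used to try to scrape the child elements of the body tag. When I search for any child element inside body, the code does not even recognize that it exists. Any help in overcoming this would be much appreciated. Note that when I change "div" to "body" in the following line of code, the code works as expected: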

divs = jdoc.select("div");

function (doc) {
    // Java class accessed through the JavaScript stage's Java interop
    var Jsoup = org.jsoup.Jsoup;

    // Value of the "body" field on the pipeline document
    var content = doc.getFirstFieldValue("body");
    var div = null;   // last <div> visited; stays null if nothing matched
    var counter = 1;  // number used to name the debug fields

    try {
        var jdoc = Jsoup.parse(content);
        var divs = jdoc.select("div");
        var iter = divs.iterator();

        while (iter.hasNext()) {
            div = iter.next();
            doc.addField(String(counter), "woohoo"); // for debugging purposes
            counter++; // was "counter = counter++", which never advanced
        }

        if (div != null) {
            doc.addField("found it", "woohoo"); // for debugging purposes
        } else {
            doc.addField("error", "this is an error"); // for debugging purposes
            logger.warn("Div was null");
        }
    } catch (e) {
        logger.error(e);
    }

    return doc;
}
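
A side note on the debug counter used in the loop, separate from the selector problem itself: in JavaScript, assigning a post-increment back to the same variable (`counter = counter++`) leaves the value unchanged, so every debug field would end up with the same name. The standalone sketch below illustrates this outside of Fusion; the function names and the sample array are hypothetical and only stand in for the matched elements.

```javascript
// Standalone illustration; nothing here is part of the Fusion or jsoup APIs.

// Mimics a loop that writes "counter = counter++" once per item.
function countWithSelfAssignment(items) {
    var counter = 1;
    for (var i = 0; i < items.length; i++) {
        // counter++ evaluates to the OLD value, which is then assigned back,
        // overwriting the increment.
        counter = counter++;
    }
    return counter;
}

// Same loop with a plain increment.
function countWithIncrement(items) {
    var counter = 1;
    for (var i = 0; i < items.length; i++) {
        counter++;
    }
    return counter;
}

var sample = ["a", "b", "c"]; // hypothetical stand-ins for matched <div> elements
console.log(countWithSelfAssignment(sample)); // prints 1
console.log(countWithIncrement(sample));      // prints 4
```

This does not explain why `select("div")` finds nothing, but it does mean the numbered debug fields cannot be used to count matches unless the counter is advanced with `counter++` or `counter += 1`.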

0 answers:

No answers