Extracting data from HTML with JavaScript / nodejs?

Asked: 2012-03-20 01:05:49

Tags: javascript html parsing node.js

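I'm building a CLI for Google search for myself. I've already used nodejs while working on a chat bot, so I'd like this to work with nodejs too. I can fetch the data fine, and I end up with a string containing all of the page's HTML. Looking at the HTML, it is easy to see which parts I want: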

<div class="jd"><a class="p" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://leagueoflegends.com/&amp;ved=0CBAQFjAA&amp;usg=AFQjCNEEnWGHwxNnuwKenqm4ajKfTM6Xxw" ><b>League of Legends</b> - Free Online Game | <b>LoL</b> - <b>League of Legends</b></a> </div> <div class="kd">3 days ago&nbsp;… Official website for <b>League of Legends</b>. Join millions of players in an award   winning Multiplayer Online Battle Arena. </div> <div class="qdlmxn"><a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://leagueoflegends.com/board&amp;ved=0CBEQ0gIoADAA&amp;usg=AFQjCNHpmmAdFFbTgm8C_gJvsjVhMzVKUQ" >Community</a> - <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://signup.leagueoflegends.com/en/signup/redownload&amp;ved=0CBIQ0gIoATAA&amp;usg=AFQjCNFHGUtn4ItgQIzODgIZRv_237Mq0A" >PVP.NET</a> - <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://na.leagueoflegends.com/board/forumdisplay.php?f%3D2&amp;ved=0CBMQ0gIoAjAA&amp;usg=AFQjCNHpycJ8WGh7xvWw1qNu8NjjU1EA0Q" >General Discussion</a> - <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://na.leagueoflegends.com/champions&amp;ved=0CBQQ0gIoAzAA&amp;usg=AFQjCNEpeBzNefwag5xmkFcFhCW27FoAew" >Champions</a> </div><span class="c">leagueoflegends.com/</span> -  <div class="txnles" onclick="_popup('web_result_popup_10836585','inline');"> <div class="wx4xyp" id="web_result_popup_10836585"> <div class="vfc7iu"><a class="s" href="/search?q=cache:GCRD1wy5e3QJ:leagueoflegends.com/" >Cached</a> <br/><a class="s" href="/m/?q=related:leagueoflegends.com/&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;ved=0CBYQHzAA" >Similar</a> <br/><a class="s" href="/gwt/x?q=lol&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;hl=en&amp;source=m&amp;u=http://leagueoflegends.com/" >Mobile formatted</a> </div> </div><a class="s" href="javascript:void(0)" >Options</a> <div class="m6u8fq"> </div> </div> </div> </div> <div> <div class="r ld"> <div class="jd"><a class="p" 
href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://en.wikipedia.org/wiki/LOL&amp;ved=0CBgQFjAB&amp;usg=AFQjCNFOhgg5Y2E5SFuS5I-8830OJ9VR9Q" ><b>LOL</b> - Wikipedia, the free encyclopedia</a> </div> <div class="kd"><b>LOL</b>, an abbreviation for <b>laughing out loud</b>, or <b>laugh out loud</b>, is a common   element of Internet slang. It was used historically on Usenet&nbsp;… </div><span class="c">en.wikipedia.org/wiki/LOL</span> -  <div class="txnles" onclick="_popup('web_result_popup_30597472','inline');"> <div class="wx4xyp" id="web_result_popup_30597472"> <div class="vfc7iu"><a class="s" href="/search?q=cache:mhIpOeXQp38J:en.wikipedia.org/wiki/LOL" >Cached</a> <br/><a class="s" href="/m/?q=related:en.wikipedia.org/wiki/LOL&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;ved=0CBkQHzAB" >Similar</a> <br/><a class="s" href="/gwt/x?q=lol&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;hl=en&amp;source=m&amp;u=http://en.wikipedia.org/wiki/LOL" >Mobile formatted</a> </div> </div><a class="s" href="javascript:void(0)" >Options</a> <div class="m6u8fq"> </div> </div> </div> </div> <div> <div class="r ld">

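Every .jd is a result, so I first need to separate them, and then split out the URL and the description. I've never done string manipulation this extreme, so I don't know where to start.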

Here is the HTML in a more readable format, although what I'm actually working with is one long string.

<div>
  <div class="r ld">
    <div class="jd">
      <a class="p" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://leagueoflegends.com/&amp;ved=0CBAQFjAA&amp;usg=AFQjCNEEnWGHwxNnuwKenqm4ajKfTM6Xxw" >
        <b>League of Legends</b> - Free Online Game | <b>LoL</b> - <b>League of Legends</b>
      </a>
    </div>
    <div class="kd">
      3 days ago&nbsp;… Official website for <b>League of Legends</b>. Join millions of players in an award   winning Multiplayer Online Battle Arena. 
    </div>
    <div class="qdlmxn">
      <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://leagueoflegends.com/board&amp;ved=0CBEQ0gIoADAA&amp;usg=AFQjCNHpmmAdFFbTgm8C_gJvsjVhMzVKUQ" >Community</a> - 
      <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://signup.leagueoflegends.com/en/signup/redownload&amp;ved=0CBIQ0gIoATAA&amp;usg=AFQjCNFHGUtn4ItgQIzODgIZRv_237Mq0A" >PVP.NET</a> - 
      <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://na.leagueoflegends.com/board/forumdisplay.php?f%3D2&amp;ved=0CBMQ0gIoAjAA&amp;usg=AFQjCNHpycJ8WGh7xvWw1qNu8NjjU1EA0Q" >General Discussion</a> - 
      <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://na.leagueoflegends.com/champions&amp;ved=0CBQQ0gIoAzAA&amp;usg=AFQjCNEpeBzNefwag5xmkFcFhCW27FoAew" >Champions</a> 
    </div>
    <span class="c">leagueoflegends.com/</span> -  
    <div class="txnles" onclick="_popup('web_result_popup_10836585','inline');">
      <div class="wx4xyp" id="web_result_popup_10836585"> <div class="vfc7iu">
        <a class="s" href="/search?q=cache:GCRD1wy5e3QJ:leagueoflegends.com/" >Cached</a> 
        <br/>
        <a class="s" href="/m/?q=related:leagueoflegends.com/&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;ved=0CBYQHzAA" >Similar</a> 
        <br/>
        <a class="s" href="/gwt/x?q=lol&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;hl=en&amp;source=m&amp;u=http://leagueoflegends.com/" >Mobile formatted</a> 
      </div> 
    </div>
    <a class="s" href="javascript:void(0)" >Options</a> 
    <div class="m6u8fq"> </div> 
    </div> 
  </div> 
</div> 
<div> 
  <div class="r ld"> 
    <div class="jd">
      <a class="p" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://en.wikipedia.org/wiki/LOL&amp;ved=0CBgQFjAB&amp;usg=AFQjCNFOhgg5Y2E5SFuS5I-8830OJ9VR9Q" >
        <b>LOL</b> - Wikipedia, the free encyclopedia
      </a>
    </div>
    <div class="kd">
      <b>LOL</b>, an abbreviation for <b>laughing out loud</b>, or <b>laugh out loud</b>, is a common   element of Internet slang. It was used historically on Usenet&nbsp;… 
    </div>
    <span class="c">en.wikipedia.org/wiki/LOL</span> -  
    <div class="txnles" onclick="_popup('web_result_popup_30597472','inline');"> 
      <div class="wx4xyp" id="web_result_popup_30597472"> 
        <div class="vfc7iu">
          <a class="s" href="/search?q=cache:mhIpOeXQp38J:en.wikipedia.org/wiki/LOL" >Cached</a> 
          <br/>
          <a class="s" href="/m/?q=related:en.wikipedia.org/wiki/LOL&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;ved=0CBkQHzAB" >Similar</a> 
          <br/>
          <a class="s" href="/gwt/x?q=lol&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;hl=en&amp;source=m&amp;u=http://en.wikipedia.org/wiki/LOL" >Mobile formatted</a> 
        </div> 
      </div>
      <a class="s" href="javascript:void(0)" >Options</a> 
      <div class="m6u8fq"> </div> 
    </div> 
  </div> 
</div>

2 answers:

Answer 0 (score: 0)

Luxun pointed me back to the Google API, and I found out how to get full web search results rather than results from included sites only: http://support.google.com/customsearch/bin/answer.py?hl=en&answer=1210656

To create a search engine that searches the entire web:

From the Google Custom Search homepage, click Create a Custom Search Engine.
Type a name and description for your search engine.
Under Define your search engine, in the Sites to Search box, enter at least one valid URL (e.g. www.google.com).
Select the CSE edition you want and accept the Terms of Service, then click Next. Select the layout option you want, and then click Next.
Click any of the links under the Next steps section to navigate to your Control panel.
In the left-hand menu, under Control Panel, click Basics.
In the Search Preferences section, select Search the entire web but emphasize included sites.
Click Save Changes.
In the left-hand menu, under Control Panel, click Sites.
Delete the site you entered during the initial setup process.
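
Once the engine exists, it can be queried over HTTP instead of scraping the results page. Below is a minimal sketch, assuming the Custom Search JSON API endpoint (https://www.googleapis.com/customsearch/v1) with its key, cx and q parameters, and a JSON response whose items array carries title, link and snippet fields. The function names are made up for illustration and are not part of the original answer; the API key and engine ID are values you would have to supply yourself.

```javascript
"use strict";

// Assumed endpoint of the Custom Search JSON API.
const ENDPOINT = "https://www.googleapis.com/customsearch/v1";

// Build the request URL. apiKey and engineId (the "cx" value) come from your
// own Google account; they are placeholders here.
function buildSearchUrl(apiKey, engineId, query) {
    const url = new URL(ENDPOINT);
    url.searchParams.set("key", apiKey);
    url.searchParams.set("cx", engineId);
    url.searchParams.set("q", query);
    return url.toString();
}

// Turn a parsed JSON response into plain { title, url, description } objects.
// Assumes the results live in response.items with title/link/snippet fields.
function toResults(response) {
    const items = Array.isArray(response.items) ? response.items : [];
    return items.map(function (item) {
        return { title: item.title, url: item.link, description: item.snippet };
    });
}

// Fetch and convert the results (needs Node 18+ for the global fetch).
async function search(apiKey, engineId, query) {
    const res = await fetch(buildSearchUrl(apiKey, engineId, query));
    if (!res.ok) throw new Error("Search request failed: HTTP " + res.status);
    return toResults(await res.json());
}
```

Calling search("YOUR_API_KEY", "YOUR_ENGINE_ID", "lol").then(console.log) would then print the result list, with no HTML parsing involved.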

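Answer 1 (score: 0)

I was in much the same situation, but with XML, so I wrote this HTML/XML parser for my own needs.

I tried to make it resemble the browser's DOM model, so most things work the same way.

First, the Node class; just copy and paste it into a js file. Don't forget to declare "use strict"; at the top of the file.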

class Node {
    constructor(nodeName, nodeType) {
        this.nodeName = nodeName;

        this.nodeType = nodeType;
        this.attributes = {};
        this.childNodes = [];
        this.parentNode = null;


    }

    removeChild(node) {
        if (node.parentNode != null) {
            for (var i = 0; i < this.childNodes.length; i++) {
                if (node == this.childNodes[i]) {
                    this.childNodes.splice(i, 1);
                    node.parentNode = null;
                }
            }
        }
    }

    appendChild(child) {
        if (child.parentNode == null) {
            this.childNodes.push(child);
            child.parentNode = this;

        } else {
            child.parentNode.removeChild(child);
            this.childNodes.push(child);
            child.parentNode = this;

        }
    }

    returnMyChildNodes() {
        return this.childNodes;
    }

    returnElementCollection() {
        var array = [];
        array.push(this);
        for (var i = 0; i < this.childNodes.length; i++) {
            var tmparray = [];
            tmparray = this.childNodes[i].returnElementCollection();
            array = array.concat(tmparray);
        }

        return array;
    }

    getElementsByAttributeValue(attribute, value) {
        var matchedElements = [];
        var Elements = this.returnElementCollection();
        for (var i = 0; i < Elements.length; i++) {
            if (typeof Elements[i].attributes[attribute] != "undefined") {
                if (Elements[i].attributes[attribute] == value) {
                    matchedElements.push(Elements[i]);
                }
            }
        }

        return matchedElements;
    }

}

This Node object behaves like an HTML node. Next we declare two more classes that extend it.

class Html_Node extends Node {
    constructor(name) {
        super(name, "HTML_ELEMENT");
    }

    toString() {

    }
}

class Xml_Node extends Node {
    constructor(name) {
        super(name, "XML_ELEMENT");

        this.innerText = "";
    }

}

Now that we have classes for both, we move on to the hard part: reading the document and building our nodes into a single document tree.

class XML_Reader {
    constructor() {
        this.rawContents = "";
        this.Document = null;
    }

    loadXML(documentPath) {
        if (documentPath != null && documentPath != "") {
            var fs = require("fs");
            var fc = fs.readFileSync(documentPath, {
                encoding: "utf-8"
            });
            if (typeof fc != "undefined" && fc != null) {
                this.rawContents = fc;
            } else {
                this.rawContents = null;
            }

            delete require.cache[require.resolve("fs")];
        } else {
            this.rawContents = null;
        }


    }

    processXML() {

        var XML_DOC = new Xml_Node("root");
        var rawElements = [];
        var TagStart_index = 0;
        var TagEnd_index = 0;

        var innerContent_Start = 0;
        var innerContent_End = 0;
        for (var i = 0; i < this.rawContents.length; i++) {
            // get starting tags
            if (this.rawContents[i] == "<") {
                TagStart_index = i;
                innerContent_End = i - 1;

                var innerContent = "";
                // ">=" so that a single character of text between two tags is kept
                if (innerContent_End >= innerContent_Start) {
                    for (var n = innerContent_Start; n <= innerContent_End; n++) {
                        innerContent += this.rawContents[n];
                    }

                    if (/\S/.test(innerContent)) {
                        // do smth with innerContent of tag
                        rawElements.push(innerContent);
                    }
                }


            } else if (this.rawContents[i] == ">") {

                TagEnd_index = i;
                innerContent_Start = i + 1;
                var contents = "";

                for (var n = TagStart_index; n <= TagEnd_index; n++) {
                    contents += this.rawContents[n];
                }

                rawElements.push(contents);
            }

        }

        var currentParent = XML_DOC;
        for (var i = 0; i < rawElements.length; i++) {
            if (/>/.test(rawElements[i]) && /</.test(rawElements[i])) {
                if (rawElements[i].indexOf("/") == 1) {
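                    // Closing tag: step back up, but never above the root (guards against stray closing tags)
                    if (currentParent.parentNode != null) currentParent = currentParent.parentNode;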

                } else {
                    var str = rawElements[i];
                    str = str.replace("<", "");
                    str = str.replace(">", "");
                    var IgnoreSpace = false;
                    var tempString = "";
                    var InnerNodeContents = [];
                    var wordIndex = 0;
                    for (var n = 0; n < str.length; n++) {
                        if (!IgnoreSpace) {

                            if (str[n] == "/") {
                                InnerNodeContents[wordIndex] = tempString;
                                tempString = "";
                                wordIndex++;
                            }
                            if (n + 1 == str.length) {
                                tempString +=
                                    str[n];
                                InnerNodeContents[wordIndex] = tempString;
                                tempString = "";
                                wordIndex++;
                            } else if (!/\S/.test(str[n])) {
                                InnerNodeContents[wordIndex] = tempString;
                                tempString = "";
                                wordIndex++;
                            } else {
                                tempString += str[n];
                            }

                            if (str[n] == '"') IgnoreSpace = true;

                        } else {
                            // Inside a quoted attribute value a "/" is part of the value (e.g. a URL), so do not split on it
                            if (n + 1 == str.length) {
                                tempString += str[n];
                                InnerNodeContents[wordIndex] = tempString;
                                tempString = "";
                                wordIndex++;
                            } else {
                                tempString += str[n];
                            }
                            if (str[n] == '"') IgnoreSpace = false;

                        }
                    }

                    var node = new Xml_Node(InnerNodeContents[0]);

                    // add attributes
                    var switchParent = false;

                    // A trailing "/" token marks a self-closing tag such as <br/>
                    var attrEnd = InnerNodeContents.length;
                    if (InnerNodeContents[attrEnd - 1] == "/") {
                        switchParent = true;
                        attrEnd--;
                    }

                    for (var n = 1; n < attrEnd; n++) {
                        // Split on the first "=" only, so values containing "=" (such as URLs) stay intact
                        var eq = InnerNodeContents[n].indexOf("=");
                        if (eq < 1) continue; // skip empty tokens and tokens without a name
                        var attrName = InnerNodeContents[n].slice(0, eq);
                        node.attributes[attrName] = InnerNodeContents[n].slice(eq + 1).replaceAll('"', "").trim();
                    }

                    currentParent.appendChild(node);
                    if (!switchParent) currentParent = node;


                }
            } else {
                currentParent.innerText = rawElements[i];
            }

        }

        this.Document = XML_DOC;

    }

}
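
The first pass of processXML, which cuts the raw string into tags and the text between them, can also be written with a single regular expression. This is only a sketch of that one step under the same simplifying assumption as the loop version, namely that no ">" appears inside a quoted attribute value; the function name splitTagsAndText is made up for this example.

```javascript
"use strict";

// Split markup into tag tokens ("<...>") and text tokens (everything between tags).
// Whitespace-only text is dropped, as in the loop-based version.
function splitTagsAndText(markup) {
    const tokens = markup.match(/<[^>]*>|[^<]+/g) || [];
    return tokens.filter(function (token) {
        return /\S/.test(token);
    });
}

console.log(splitTagsAndText("<b>LOL</b> - Wikipedia"));
// [ '<b>', 'LOL', '</b>', ' - Wikipedia' ]
```

The resulting array has the same shape as rawElements, so the second pass that builds the node tree would not need to change.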

Then all we need to do is:

var xr = new XML_Reader();
xr.loadXML("Path to HTML file");
xr.processXML();

var Elements = xr.Document.getElementsByAttributeValue("class", "jd"); 

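Now the Elements variable holds every single element with the jd class.

Then, to get the URL of each one, loop over them with a for loop and read the href attribute: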
var myUrl = Elements[0].attributes.href;

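I made this script for myself, so feel free to use it :)

One more thing: to get at the children of the DIV with the jd class, you need to take that div, search its child nodes for the nodeName a, and read their href attribute.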
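
That last step could look roughly like the sketch below. It assumes nodes shaped like the Node class in this answer (nodeName, attributes, childNodes), that attribute values are stored verbatim so "&amp;" is still encoded, and that the real target sits in the q parameter of the /m/url?... links, as in the HTML from the question. The helper names findFirstByName, targetUrl and resultUrls are hypothetical.

```javascript
"use strict";

// Depth-first search for the first descendant with the given nodeName.
function findFirstByName(node, name) {
    for (const child of node.childNodes) {
        if (child.nodeName === name) return child;
        const found = findFirstByName(child, name);
        if (found) return found;
    }
    return null;
}

// Decode "&amp;" and pull the wrapped target out of the "q" query parameter.
// Falls back to the decoded href when there is no "q" parameter.
function targetUrl(href) {
    const decoded = href.replace(/&amp;/g, "&");
    // The base URL is only there so that the relative href can be parsed.
    const parsed = new URL(decoded, "http://www.google.com");
    return parsed.searchParams.get("q") || decoded;
}

// Collect the target URL of every result div.
function resultUrls(resultDivs) {
    const urls = [];
    for (const div of resultDivs) {
        const link = findFirstByName(div, "a");
        if (link && link.attributes.href) urls.push(targetUrl(link.attributes.href));
    }
    return urls;
}
```

With the variables from above, resultUrls(Elements) would give one URL per search result.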