HTML semantic analyzer fails

Time: 2019-06-03 08:42:41

Tags: java html dom semantic-analysis

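I wrote a Java web framework similar to Jsoup, and I am now running into problems with DOM analysis. I tested my HTML analyzer on the page https://list.youku.com/show/id_zcc001f06962411de83b1.html, but it does not produce the correct result.

I have published my code on GitHub, and the project includes some test cases. I hope anyone can clone the code and run the test cases to find the cause.

Here is my GitHub address: https://github.com/sunyue1380/QuickHttp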

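HTMLParser: analyzes the source string and generates a list of HTMLToken objects.

HTMLTokenParser: analyzes the HTMLToken list and builds the DOM tree.

1: The HTMLParser tests pass, so QuickHttp successfully splits the source string into HTMLTokens (a type I defined in the project).

2: The HTMLTokenParser tests fail, so the DOM-building process fails.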

HTMLToken:

public class HTMLToken {
    public int start;
    public int end;
    public String value;
    public TokenType tokenType;

    public String toString() {
        return value.replaceAll("\r\n", "[line break]") + "[" + tokenType.name + "]";
    }

    public enum TokenType {
        openTag("open tag"),
        tagName("tag name"),
        attribute("tag attribute"),
        openTagClose("end of open tag"),
        textContent("tag text content"),
        closeTag("close tag"),
        literal("whitespace between a close tag and an open tag"),
        commentTag("comment tag");

        private String name;

        TokenType(String name) {
            this.name = name;
        }
    }
}

HTMLParser:

/** Lexical analysis */
    private void parseHTML(){
        while(pos<chars.length){
            switch(state){
                case openingTag:{
                    if(isNextMatch("!--")){
                        //<!--comment-->
                        addToken(HTMLToken.TokenType.openTag);
                        state = State.inComment;
                    }else if(pos>0&&chars[pos-1]=='<'){
                        //<body
                        addToken(HTMLToken.TokenType.openTag);
                        state = State.inTagName;
                    }
                }break;
                case inTagName:{
                    if(chars[pos]==' '){
                        //<body id="identify">
                        addToken(HTMLToken.TokenType.tagName);
                        String tagName = tokenList.get(tokenList.size()-1).value.toLowerCase();
                        if(isSingleNode(tagName)){
                            singleNode = true;
                        }else {
                            singleNode = false;
                        }
                        state = State.inAttribute;
                    }else if(chars[pos]=='>'){
                        //<body> <input> <br/>
                        addToken(HTMLToken.TokenType.tagName);
                        String tagName = tokenList.get(tokenList.size()-1).value.toLowerCase();
                        if(isSingleNode(tagName)){
                            singleNode = true;
                            state = State.closingTag;
                        }else {
                            singleNode = false;
                            state = State.openTagClosing;
                        }
                    }else if(isNextMatch("/>")){
                        addToken(HTMLToken.TokenType.tagName);
                        state = State.closingTag;
                    }
                }break;
                case inComment:{
                    //<!--comment-->
                    if(chars[pos]=='>'&&chars[pos-1]=='-'&&chars[pos-2]=='-'){
                        addToken(HTMLToken.TokenType.commentTag);
                        singleNode = true;
                        state = State.closingTag;
                    }
                }break;
                case inAttribute:{
                    if(chars[pos]=='>'||(isNextMatch("?>"))){
                        addToken(HTMLToken.TokenType.attribute);
                        state = singleNode? State.closingTag: State.openTagClosing;
                    }else if(isNextMatch("/>")){
                        addToken(HTMLToken.TokenType.attribute);
                        state = State.closingTag;
                    }
                }break;
                case openTagClosing:{
                    //<input>
                    if(chars[pos-1]=='>'&&chars[pos]!='<'){
                        //<body>text</body>
                        addToken(HTMLToken.TokenType.openTagClose);
                        state = State.inTextContent;
                    }else if(isNextMatch("</")){
                        //<body></body>
                        addToken(HTMLToken.TokenType.openTagClose);
                        state = State.closingTag;
                    }else if(chars[pos]=='<'){
                        //<body><p></p>
                        addToken(HTMLToken.TokenType.openTagClose);
                        state = State.openingTag;
                    }
                }break;
                case inTextContent:{
                    if(isInStyleOrScript){
                        if(isNextMatch("</script>")||isNextMatch("</style>")){
                            addToken(HTMLToken.TokenType.textContent);
                            isInStyleOrScript = false;
                            state = State.closingTag;
                        }
                    }else if(isNextMatch("</")){
                        //<body>textContent</body>
                        addToken(HTMLToken.TokenType.textContent);
                        state = State.closingTag;
                    }else if(chars[pos]=='<'){
                        //<body>textContent<p></p>
                        addToken(HTMLToken.TokenType.textContent);
                        state = State.openingTag;
                    }
                }break;
                case closingTag:{
                    if(chars[pos-1]=='>'&&isNextMatch("</")){
                        //</body></html>
                        addToken(HTMLToken.TokenType.closeTag);
                    }else if(chars[pos-1]=='>'&&chars[pos]!='<'){
                        //</body>  </html>
                        addToken(HTMLToken.TokenType.closeTag);
                        state = State.inLiteral;
                    }else if(chars[pos-1]=='>'&&chars[pos]=='<'){
                        //</body><script>
                        addToken(HTMLToken.TokenType.closeTag);
                        state = State.openingTag;
                    }else if(pos==chars.length-1){
                        //</html>$
                        addToken(HTMLToken.TokenType.closeTag);
                        break;
                    }
                }break;
                case inLiteral:{
                    if(isNextMatch("</")){
                        //</body> </html>
                        addToken(HTMLToken.TokenType.literal);
                        state = State.closingTag;
                    }else if(chars[pos]=='<'){
                        //</body>   <p>
                        addToken(HTMLToken.TokenType.literal);
                        state = State.openingTag;
                    }
                }break;
            }
            pos++;
        }
        logger.trace("[Token list]{}",tokenList.toString());
    }

HTMLTokenParser:

AbstractElement current = root;
        for(int i=0;i<htmlTokenList.size();i++){
            HTMLToken htmlToken = htmlTokenList.get(i);
            try {
                switch(htmlToken.tokenType){
                    case openTag:{
                        AbstractElement newElement = new AbstractElement();
                        allElements.add(newElement);
                        if(current==null){
                            root = newElement;
                        }else{
                            newElement.parent = current;
                            newElement.parent.childList.add(newElement);
                        }
                        current = newElement;
                    }break;
                    case tagName:{
                        current.tagName = htmlToken.value.toLowerCase();
                    }break;
                    case commentTag:{
                        current.isComment = true;
                        current.ownOriginText = htmlToken.value;
                        current.ownText = escapeOwnOriginText(current.ownOriginText);
                    }break;
                    case attribute:{
                        current.attribute = htmlToken.value;
                        current.attributes.putAll(AttributeParser.parse(htmlToken.value));
                    }break;
                    case openTagClose:{
                    }break;
                    case textContent:{
                        current.originTextNodes.add(htmlToken.value);
                        current.textNodes.add(escapeOwnOriginText(htmlToken.value));
                    }break;
                    case closeTag:{
                        if(htmlToken.value.equals(">")||htmlToken.value.equals("/>")){
                            current.isSingleNode = true;
                        }
//sometimes current may be null and i don't know why
                        current = current.parent;
                    }break;
                }
            }catch (Exception e){
                break;
            }
        }

Element:

class AbstractElement implements Element {
        /** Tag name of the node */
        private String tagName;
        /** Whether this is a single (void) node */
        private boolean isSingleNode;
        /** Whether this is a comment node */
        private boolean isComment;
        /** Parent node */
        private AbstractElement parent;
        /** Parsed attributes */
        private Map<String,String> attributes = new HashMap<>();
        /** Raw attribute text */
        private String attribute = "";
        /** Original text content */
        private String ownOriginText;
        /** Text content after unescaping */
        private String ownText;
        /** Child nodes */
        private List<Element> childList = new ArrayList<>();

        /** Elements collected by depth-first traversal */
        private Elements allElements;
        /** Text of all nodes */
        private String textContent;
        /** List of original text nodes */
        private List<String> originTextNodes = new ArrayList<>();
        /** List of unescaped text nodes */
        private List<String> textNodes = new ArrayList<>();
        /** Index of this node among its parent's children */
        private int elementSiblingpos = -1;
        /** Used during depth-first traversal */
        private boolean isVisited;
}

url:https://list.youku.com/show/id_zcc001f06962411de83b1.html

Actual result:

case closeTag:{
                       if(htmlToken.value.equals(">")||htmlToken.value.equals("/>")){
                            current.isSingleNode = true;
                        }
//sometimes current may be null and i don't know why
                        current = current.parent;
                    }break;

current can be null here, which causes the DOM-building process to fail, and I would like to know why.
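One way to narrow this down is to check whether the token list ever contains a closeTag with no matching openTag, because each closeTag moves current up one level and an extra one walks past the root. The following is only a diagnostic sketch: `TokenDepthCheck` and `firstUnmatchedClose` are hypothetical names that are not part of QuickHttp, and the enum is a reduced copy of HTMLToken.TokenType.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical diagnostic helper for illustration; not part of QuickHttp. */
class TokenDepthCheck {
    /** Reduced copy of HTMLToken.TokenType from the question. */
    enum TokenType { openTag, tagName, attribute, openTagClose, textContent, closeTag, literal, commentTag }

    /**
     * Returns the index of the first closeTag that has no matching openTag,
     * or -1 if no closeTag ever steps above the root.
     */
    static int firstUnmatchedClose(List<TokenType> types) {
        int depth = 0; // number of elements currently open
        for (int i = 0; i < types.size(); i++) {
            TokenType t = types.get(i);
            if (t == TokenType.openTag) {
                depth++;
            } else if (t == TokenType.closeTag) {
                if (depth == 0) {
                    return i; // this closeTag would move current above the root
                }
                depth--;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Assumed example sequence: one element opened, then two close tokens.
        List<TokenType> types = Arrays.asList(
                TokenType.openTag, TokenType.tagName, TokenType.attribute,
                TokenType.closeTag, TokenType.closeTag);
        System.out.println(firstUnmatchedClose(types)); // prints 4
    }
}
```

Printing the token at the reported index, together with its start and end offsets, should point at the place in the source HTML where the lexer emitted one closeTag too many.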

1 answer:

Answer 0 (score: 0)

I have solved this problem. The HTML code

<input value="<iframe src='http://player.youku.com/embed/XNTQwMTgxMTE2' allowfullscreen></iframe>"/>

was previously split into these tokens:

<,input,value="<iframe src='http://player.youku.com/embed/XNTQwMTgxMTE2' allowfullscreen,>,</iframe>,",/>

when it should actually be split into:

<,input,value="<iframe src='http://player.youku.com/embed/XNTQwMTgxMTE2' allowfullscreen></iframe>",/>
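In other words, a `>` or `<` that appears inside a quoted attribute value has to be treated as an ordinary character. A minimal sketch of that idea follows, assuming attribute values are delimited by matching single or double quotes. `QuoteAwareScan` and `findTagEnd` are hypothetical names used for illustration and do not show the actual change made in QuickHttp.

```java
/** Hypothetical helper illustrating quote-aware scanning; not QuickHttp's actual code. */
class QuoteAwareScan {
    /**
     * Returns the index of the first '>' at or after 'from' that is not inside a
     * single- or double-quoted attribute value, or -1 if there is none.
     */
    static int findTagEnd(char[] chars, int from) {
        char quote = 0; // 0 means "not inside a quoted value"
        for (int i = from; i < chars.length; i++) {
            char c = chars[i];
            if (quote != 0) {
                if (c == quote) {
                    quote = 0; // matching closing quote
                }
            } else if (c == '"' || c == '\'') {
                quote = c;     // opening quote; remember which kind
            } else if (c == '>') {
                return i;      // a '>' outside quotes ends the tag
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<input value=\"<iframe src='http://player.youku.com/embed/XNTQwMTgxMTE2'"
                + " allowfullscreen></iframe>\"/>";
        int end = findTagEnd(html.toCharArray(), 1);
        // The only unquoted '>' is the last character of the string.
        System.out.println(end == html.length() - 1); // prints true
    }
}
```

In the lexer from the question, the same quote tracking would apply in the inAttribute state, so that the `>` and `</` checks are skipped while a quote is open.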

Since the problem is solved, I am closing this question.