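I wrote a Java HTTP framework similar to Jsoup, and I have run into a problem in its DOM parsing. I tested the HTML parser against the page https://list.youku.com/show/id_zcc001f06962411de83b1.html, but I cannot find the cause of the failure.
I have published my code on GitHub, and the project includes test cases. Anyone can clone the repository and run them to reproduce the problem.
My GitHub repository: https://github.com/sunyue1380/QuickHttp
HTMLParser: scans the source string and produces a list of HTMLToken objects.
HTMLTokenParser: consumes the HTMLToken list and builds the DOM tree.
1: The HTMLParser test passes, so QuickHttp appears to split the source string into HTMLToken objects (a type defined in my project) successfully.
2: The HTMLTokenParser test fails, so the failure occurs while building the DOM.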
HTMLToken:
public class HTMLToken {
    public int start;
    public int end;
    public String value;
    public TokenType tokenType;

    public String toString() {
        return value.replaceAll("\r\n", "[newline]") + "[" + tokenType.name + "]";
    }

    public enum TokenType {
        openTag("open tag"),
        tagName("tag name"),
        attribute("tag attribute"),
        openTagClose("end of open tag"),
        textContent("tag text content"),
        closeTag("close tag"),
        literal("whitespace between a close tag and the next open tag"),
        commentTag("comment tag");

        private String name;

        TokenType(String name) {
            this.name = name;
        }
    }
}
HTMLParser:
/** Lexical analysis */
private void parseHTML(){
    while(pos<chars.length){
        switch(state){
            case openingTag:{
                if(isNextMatch("!--")){
                    //<!--comment-->
                    addToken(HTMLToken.TokenType.openTag);
                    state = State.inComment;
                }else if(pos>0&&chars[pos-1]=='<'){
                    //<body
                    addToken(HTMLToken.TokenType.openTag);
                    state = State.inTagName;
                }
            }break;
            case inTagName:{
                if(chars[pos]==' '){
                    //<body id="identify">
                    addToken(HTMLToken.TokenType.tagName);
                    String tagName = tokenList.get(tokenList.size()-1).value.toLowerCase();
                    if(isSingleNode(tagName)){
                        singleNode = true;
                    }else {
                        singleNode = false;
                    }
                    state = State.inAttribute;
                }else if(chars[pos]=='>'){
                    //<body> <input> <br/>
                    addToken(HTMLToken.TokenType.tagName);
                    String tagName = tokenList.get(tokenList.size()-1).value.toLowerCase();
                    if(isSingleNode(tagName)){
                        singleNode = true;
                        state = State.closingTag;
                    }else {
                        singleNode = false;
                        state = State.openTagClosing;
                    }
                }else if(isNextMatch("/>")){
                    addToken(HTMLToken.TokenType.tagName);
                    state = State.closingTag;
                }
            }break;
            case inComment:{
                //<!--comment-->
                if(chars[pos]=='>'&&chars[pos-1]=='-'&&chars[pos-2]=='-'){
                    addToken(HTMLToken.TokenType.commentTag);
                    singleNode = true;
                    state = State.closingTag;
                }
            }break;
            case inAttribute:{
                if(chars[pos]=='>'||(isNextMatch("?>"))){
                    addToken(HTMLToken.TokenType.attribute);
                    state = singleNode? State.closingTag: State.openTagClosing;
                }else if(isNextMatch("/>")){
                    addToken(HTMLToken.TokenType.attribute);
                    state = State.closingTag;
                }
            }break;
            case openTagClosing:{
                //<input>
                if(chars[pos-1]=='>'&&chars[pos]!='<'){
                    //<body>text</body>
                    addToken(HTMLToken.TokenType.openTagClose);
                    state = State.inTextContent;
                }else if(isNextMatch("</")){
                    //<body></body>
                    addToken(HTMLToken.TokenType.openTagClose);
                    state = State.closingTag;
                }else if(chars[pos]=='<'){
                    //<body><p></p>
                    addToken(HTMLToken.TokenType.openTagClose);
                    state = State.openingTag;
                }
            }break;
            case inTextContent:{
                if(isInStyleOrScript){
                    if(isNextMatch("</script>")||isNextMatch("</style>")){
                        addToken(HTMLToken.TokenType.textContent);
                        isInStyleOrScript = false;
                        state = State.closingTag;
                    }
                }else if(isNextMatch("</")){
                    //<body>textContent</body>
                    addToken(HTMLToken.TokenType.textContent);
                    state = State.closingTag;
                }else if(chars[pos]=='<'){
                    //<body>textContent<p></p>
                    addToken(HTMLToken.TokenType.textContent);
                    state = State.openingTag;
                }
            }break;
            case closingTag:{
                if(chars[pos-1]=='>'&&isNextMatch("</")){
                    //</body></html>
                    addToken(HTMLToken.TokenType.closeTag);
                }else if(chars[pos-1]=='>'&&chars[pos]!='<'){
                    //</body> </html>
                    addToken(HTMLToken.TokenType.closeTag);
                    state = State.inLiteral;
                }else if(chars[pos-1]=='>'&&chars[pos]=='<'){
                    //</body><script>
                    addToken(HTMLToken.TokenType.closeTag);
                    state = State.openingTag;
                }else if(pos==chars.length-1){
                    //</html>$
                    addToken(HTMLToken.TokenType.closeTag);
                    break;
                }
            }break;
            case inLiteral:{
                if(isNextMatch("</")){
                    //</body> </html>
                    addToken(HTMLToken.TokenType.literal);
                    state = State.closingTag;
                }else if(chars[pos]=='<'){
                    //</body> <p>
                    addToken(HTMLToken.TokenType.literal);
                    state = State.openingTag;
                }
            }break;
        }
        pos++;
    }
    logger.trace("[Token list]{}",tokenList.toString());
}
HTMLTokenParser:
AbstractElement current = root;
for(int i=0;i<htmlTokenList.size();i++){
    HTMLToken htmlToken = htmlTokenList.get(i);
    try {
        switch(htmlToken.tokenType){
            case openTag:{
                AbstractElement newElement = new AbstractElement();
                allElements.add(newElement);
                if(current==null){
                    root = newElement;
                }else{
                    newElement.parent = current;
                    newElement.parent.childList.add(newElement);
                }
                current = newElement;
            }break;
            case tagName:{
                current.tagName = htmlToken.value.toLowerCase();
            }break;
            case commentTag:{
                current.isComment = true;
                current.ownOriginText = htmlToken.value;
                current.ownText = escapeOwnOriginText(current.ownOriginText);
            }break;
            case attribute:{
                current.attribute = htmlToken.value;
                current.attributes.putAll(AttributeParser.parse(htmlToken.value));
            }break;
            case openTagClose:{
            }break;
            case textContent:{
                current.originTextNodes.add(htmlToken.value);
                current.textNodes.add(escapeOwnOriginText(htmlToken.value));
            }break;
            case closeTag:{
                if(htmlToken.value.equals(">")||htmlToken.value.equals("/>")){
                    current.isSingleNode = true;
                }
                //sometimes current may be null and i don't know why
                current = current.parent;
            }break;
        }
    }catch (Exception e){
        break;
    }
}
Element:
class AbstractElement implements Element {
    /** Tag name */
    private String tagName;
    /** Whether this is a single (void or self-closing) node */
    private boolean isSingleNode;
    /** Whether this is a comment node */
    private boolean isComment;
    /** Parent node */
    private AbstractElement parent;
    /** Parsed attributes */
    private Map<String,String> attributes = new HashMap<>();
    /** Raw attribute text */
    private String attribute = "";
    /** Original (unescaped) text content */
    private String ownOriginText;
    /** Escaped text content */
    private String ownText;
    /** Child nodes */
    private List<Element> childList = new ArrayList<>();
    /** Elements collected by depth-first traversal */
    private Elements allElements;
    /** Text of this node and all of its descendants */
    private String textContent;
    /** Original text nodes */
    private List<String> originTextNodes = new ArrayList<>();
    /** Escaped text nodes */
    private List<String> textNodes = new ArrayList<>();
    /** Index of this node among its parent's children */
    private int elementSiblingpos = -1;
    /** Visited flag used during depth-first traversal */
    private boolean isVisited;
}
URL: https://list.youku.com/show/id_zcc001f06962411de83b1.html
Actual result:
case closeTag:{
    if(htmlToken.value.equals(">")||htmlToken.value.equals("/>")){
        current.isSingleNode = true;
    }
    //sometimes current may be null and i don't know why
    current = current.parent;
}break;
Here current is sometimes null, which makes the DOM construction fail, and I would like to understand why.
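One way to narrow this down is to check whether the token list ever contains a closeTag that has no matching openTag, since each such token moves current one level above where it should be. The following is only a minimal diagnostic sketch: TokenBalanceCheck, Kind and firstUnmatchedClose are hypothetical names, and Kind is a simplified stand-in for HTMLToken.TokenType, not part of QuickHttp.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of QuickHttp.
final class TokenBalanceCheck {
    // Simplified stand-in for HTMLToken.TokenType (assumption).
    enum Kind { OPEN_TAG, CLOSE_TAG, OTHER }

    /**
     * Returns the index of the first CLOSE_TAG that has no matching
     * OPEN_TAG before it, or -1 if every close is matched.
     */
    static int firstUnmatchedClose(List<Kind> kinds) {
        int depth = 0; // number of currently open elements
        for (int i = 0; i < kinds.size(); i++) {
            Kind kind = kinds.get(i);
            if (kind == Kind.OPEN_TAG) {
                depth++;
            } else if (kind == Kind.CLOSE_TAG) {
                if (depth == 0) {
                    return i; // this close would step above the root
                }
                depth--;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Kind> kinds = Arrays.asList(
                Kind.OPEN_TAG, Kind.OTHER, Kind.CLOSE_TAG, Kind.CLOSE_TAG);
        System.out.println(firstUnmatchedClose(kinds));
    }
}
```

Mapping the real token list onto these three kinds and printing the token at the returned index should point at the part of the page where the lexer emitted a spurious close.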
Answer 0 (score: 0)
I have solved this problem. The page contains the following HTML:
<input value="<iframe src='http://player.youku.com/embed/XNTQwMTgxMTE2' allowfullscreen></iframe>"/>
Previously, my lexer split it into these tokens:
<,input,value="<iframe src='http://player.youku.com/embed/XNTQwMTgxMTE2' allowfullscreen,>,</iframe>,",/>
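when it should actually be split as follows, with the quoted attribute value kept as a single token:
<,input,value="<iframe src='http://player.youku.com/embed/XNTQwMTgxMTE2' allowfullscreen></iframe>",/>
Since the problem is solved, I am closing this question.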
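For readers who hit the same issue, the underlying idea is that a '>' inside a quoted attribute value must not end the tag. Below is a minimal sketch of a quote-aware scan. It is not the actual QuickHttp fix: QuoteAwareTagScanner and findTagEnd are hypothetical names, and the sample input shortens the iframe URL to 'x'.

```java
// Hypothetical helper, not the actual QuickHttp implementation.
final class QuoteAwareTagScanner {
    /**
     * Returns the index of the '>' that really ends the tag starting at
     * 'from', skipping any '>' inside a single- or double-quoted attribute
     * value. Returns -1 if no such '>' exists.
     */
    static int findTagEnd(String html, int from) {
        char quote = 0; // 0 means "not inside a quoted value"
        for (int i = from; i < html.length(); i++) {
            char c = html.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0; // the matching quote closes the value
                }
            } else if (c == '"' || c == '\'') {
                quote = c; // remember which quote opened the value
            } else if (c == '>') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<input value=\"<iframe src='x' allowfullscreen></iframe>\"/>";
        int end = findTagEnd(html, 0);
        System.out.println(html.substring(0, end + 1));
    }
}
```

In a state machine like parseHTML, the same effect could be achieved by tracking the open quote character while in the inAttribute state and ignoring '>' and "/>" until that quote is closed.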