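I'm using the Jericho HTML Parser to parse some malformed HTML. Specifically, I'm trying to get all the text nodes, process the text, and then replace it.
I want to skip certain elements during processing. For example, I want to skip all <a> elements, as well as any element with the attribute class="noProcess". So if a div has class="noProcess", I want to skip that div and all of its children. However, I do want these skipped elements to appear unchanged in the output after processing.
Jericho provides an iterator over all nodes, but I'm not sure how to skip a complete element from within the iterator. Here is my code: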
private String doProcessHtml(String html) {
    Source source = new Source(html);
    OutputDocument outputDocument = new OutputDocument(source);
    for (Segment segment : source) {
        if (segment instanceof Tag) {
            Tag tag = (Tag) segment;
            System.out.println("FOUND TAG: " + tag.getName());
            // DO SOMETHING HERE TO SKIP ENTIRE ELEMENT IF IS <A> OR CLASS="noProcess"
        } else if (segment instanceof CharacterReference) {
            CharacterReference characterReference = (CharacterReference) segment;
            System.out.println("FOUND CHARACTERREFERENCE: " + characterReference.getCharacterReferenceString());
        } else {
            System.out.println("FOUND PLAIN TEXT: " + segment.toString());
            outputDocument.replace(segment, doProcessText(segment.toString()));
        }
    }
    return outputDocument.toString();
}
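The ignoreWhenParsing() method doesn't look right for my case, because the parser then just treats the "ignored" elements as text.
I was thinking that if I could convert the iterator loop into a for (int i = 0; ...) loop, I could skip an element and all of its children by moving i forward to point at the EndTag and then continuing the loop, but I'm not sure.
Answer 0 (score: 0)
This should work.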
String skipTag = null;
for (Segment segment : source) {
    if (skipTag != null) { // is skipping ON?
        if (segment instanceof EndTag && // if EndTag found for the
                skipTag.equals(((EndTag) segment).getName())) { // tag we're skipping
            skipTag = null; // set skipping OFF
        }
        continue; // continue skipping (or skip the EndTag)
    } else if (segment instanceof Tag) { // is tag?
        Tag tag = (Tag) segment;
        System.out.println("FOUND TAG: " + tag.getName());
        if (HTMLElementName.A.equals(tag.getName())) { // if <a> ?
            skipTag = tag.getName(); // set
            continue; // skipping ON
        } else if (tag instanceof StartTag) {
            if ("noProcess".equals( // if <tag class="noProcess" ..> ?
                    ((StartTag) tag).getAttributeValue("class"))) {
                skipTag = tag.getName(); // set
                continue; // skipping ON
            }
        }
    } // ...
}
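One case the single skipTag flag does not cover is an excluded element that contains another element of the same name (for example a <div class="noProcess"> wrapping an inner <div>): the inner end tag would switch skipping off too early. A minimal sketch of a depth counter that handles this, using a toy Token type rather than Jericho's classes (Token, Kind, textToProcess and sample are hypothetical names made up for this illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: Token stands in for Jericho's StartTag / EndTag / plain-text segments.
class NestedSkipSketch {

    enum Kind { START, END, TEXT }

    static final class Token {
        final Kind kind;
        final String value;     // tag name for START/END, content for TEXT
        final boolean excluded; // true if this start tag opens an element to skip

        Token(Kind kind, String value, boolean excluded) {
            this.kind = kind;
            this.value = value;
            this.excluded = excluded;
        }
    }

    // Returns the text tokens that lie outside every excluded element.
    static List<String> textToProcess(List<Token> tokens) {
        List<String> result = new ArrayList<>();
        String skipTag = null; // name of the element being skipped, or null
        int depth = 0;         // open tags named skipTag since skipping began
        for (Token t : tokens) {
            if (skipTag != null) {
                if (t.kind == Kind.START && t.value.equals(skipTag)) {
                    depth++; // nested element with the same name
                } else if (t.kind == Kind.END && t.value.equals(skipTag)) {
                    depth--;
                    if (depth == 0) {
                        skipTag = null; // matching end tag reached, skipping OFF
                    }
                }
                continue; // everything inside the excluded element is ignored
            }
            if (t.kind == Kind.START && t.excluded) {
                skipTag = t.value; // skipping ON
                depth = 1;
            } else if (t.kind == Kind.TEXT) {
                result.add(t.value);
            }
        }
        return result;
    }

    // <div class="noProcess">a<div>b</div>c</div>d
    static List<Token> sample() {
        return Arrays.asList(
                new Token(Kind.START, "div", true),
                new Token(Kind.TEXT, "a", false),
                new Token(Kind.START, "div", false),
                new Token(Kind.TEXT, "b", false),
                new Token(Kind.END, "div", false),
                new Token(Kind.TEXT, "c", false),
                new Token(Kind.END, "div", false),
                new Token(Kind.TEXT, "d", false));
    }

    public static void main(String[] args) {
        System.out.println(textToProcess(sample())); // prints [d]
    }
}
```

With only the flag, "c" would be processed because the inner </div> ends the skip; with the counter, only "d" is. In the Jericho loop the same counter would be incremented on a StartTag and decremented on an EndTag whose name equals skipTag.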
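Answer 1 (score: 0)
I think you may want to consider reworking the way your segments are structured. Is there a way to parse the HTML so that each segment is a parent element containing a nested list of child elements? That way you could do something like: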
for (Segment segment : source) {
    if (segment instanceof Tag) {
        Tag tag = (Tag) segment;
        System.out.println("FOUND TAG: " + tag.getName());
        // DO SOMETHING HERE TO SKIP ENTIRE ELEMENT IF IS <A> OR CLASS="noProcess"
        continue;
    } else if (segment instanceof CharacterReference) {
        CharacterReference characterReference = (CharacterReference) segment;
        System.out.println("FOUND CHARACTERREFERENCE: " + characterReference.getCharacterReferenceString());
        // childNodes() is a placeholder for whatever child accessor the
        // restructured segments would expose; it is pseudocode, not a confirmed Jericho method.
        for (Segment child : segment.childNodes()) {
            // Use recursion to process child elements.
            // You will want to put your for loop in a separate method so it can be called recursively.
        }
    } else {
        System.out.println("FOUND PLAIN TEXT: " + segment.toString());
        outputDocument.replace(segment, doProcessText(segment.toString()));
    }
}
Without more code to look at, it's hard to say whether restructuring the segments this way is feasible or worth the effort.
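To make the recursive idea concrete, here is a minimal sketch over a toy tree, assuming the document has already been turned into nested nodes. Node, element, text and collectText are hypothetical names for this illustration and are not Jericho types:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy tree only: Node is a stand-in for a parsed element or text node.
class TreeSkipSketch {

    static final class Node {
        final String name;         // element name, or null for a text node
        final String cssClass;     // value of the class attribute, may be null
        final String text;         // content of a text node, or null for an element
        final List<Node> children; // empty for text nodes

        private Node(String name, String cssClass, String text, List<Node> children) {
            this.name = name;
            this.cssClass = cssClass;
            this.text = text;
            this.children = children;
        }

        static Node element(String name, String cssClass, Node... children) {
            return new Node(name, cssClass, null, Arrays.asList(children));
        }

        static Node text(String text) {
            return new Node(null, null, text, Collections.<Node>emptyList());
        }
    }

    // Collects the text that should be processed, skipping excluded subtrees entirely.
    static void collectText(Node node, List<String> out) {
        if (node.text != null) {
            out.add(node.text);
            return;
        }
        if ("a".equals(node.name) || "noProcess".equals(node.cssClass)) {
            return; // do not descend: the element and all its children are skipped
        }
        for (Node child : node.children) {
            collectText(child, out);
        }
    }

    public static void main(String[] args) {
        Node body = Node.element("body", null,
                Node.text("x"),
                Node.element("a", null, Node.text("link")),
                Node.element("div", "noProcess",
                        Node.element("div", null, Node.text("y"))),
                Node.element("p", null, Node.text("z")));
        List<String> out = new ArrayList<>();
        collectText(body, out);
        System.out.println(out); // prints [x, z]
    }
}
```

Because the recursion simply returns at an excluded element, nested children never need an explicit "skip until end tag" state.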
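Answer 2 (score: 0)
I managed to get a working solution by using the getEnd() method of the Tag's Element object. The idea is to skip every segment that begins before a position you have set. You record the end position of each element you want to exclude, and nothing before that position gets processed: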
final ArrayList<String> excludeTags = new ArrayList<String>(Arrays.asList("head", "script", "a"));
final ArrayList<String> excludeClasses = new ArrayList<String>(Arrays.asList("noProcess"));
Source.LegacyIteratorCompatabilityMode = true;
Source source = new Source(htmlToProcess);
OutputDocument outputDocument = new OutputDocument(source);
int skipToPos = 0; // segments that begin before this offset are left untouched
for (Segment segment : source) {
    if (segment.getBegin() >= skipToPos) {
        if (segment instanceof Tag) {
            Tag tag = (Tag) segment;
            Element element = tag.getElement();
            // check excludeTags
            if (excludeTags.contains(tag.getName().toLowerCase())) {
                skipToPos = element.getEnd();
            }
            // check excludeClasses
            String classes = element.getAttributeValue("class");
            if (classes != null) {
                for (String theClass : classes.split(" ")) {
                    // compare as written: lower-casing here would never match "noProcess"
                    if (excludeClasses.contains(theClass)) {
                        skipToPos = element.getEnd();
                    }
                }
            }
        } else if (segment instanceof CharacterReference) { // for future use. Source.LegacyIteratorCompatabilityMode = true;
            CharacterReference characterReference = (CharacterReference) segment;
        } else {
            outputDocument.replace(segment, doProcessText(segment.toString()));
        }
    }
}
return outputDocument.toString();
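The class check in the loop above can be pulled out into a small helper. This is a minimal sketch, assuming class names are compared case-sensitively and may be separated by any run of whitespace rather than a single space; ClassMatchSketch and hasExcludedClass are hypothetical names and the helper uses only the standard library:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ClassMatchSketch {

    // Returns true if any whitespace-separated token in classAttr is in the excluded set.
    // classAttr is the raw value of the class attribute and may be null.
    static boolean hasExcludedClass(String classAttr, Set<String> excluded) {
        if (classAttr == null) {
            return false;
        }
        for (String name : classAttr.trim().split("\\s+")) {
            if (excluded.contains(name)) { // case-sensitive comparison (an assumption)
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> excluded = new HashSet<>(Arrays.asList("noProcess"));
        System.out.println(hasExcludedClass("intro  noProcess\twide", excluded)); // prints true
        System.out.println(hasExcludedClass("intro wide", excluded));             // prints false
    }
}
```

In the loop, the inner for statement would then become a single call such as hasExcludedClass(element.getAttributeValue("class"), excluded), followed by skipToPos = element.getEnd() when it returns true.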