使用Java从HTML中提取文本,包括源代码行和代码

时间:2014-09-26 09:30:52

标签: java html html-parsing jsoup jericho-html-parser

如何使用Java从HTML中提取文本的问题已被查看和复制了数万次: Text Extraction from HTML Java

Thanks to在Stackoverflow上找到的答案我目前的状况是我正在使用JSoup

<!-- Jsoup maven dependency -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.3</version>
</dependency>

和这件作品或代码:

// parse the html from the givne string
Document doc = Jsoup.parse(html);
// loop over children elements of the body tag
for (Element el:doc.select("body").select("*")) {
  // loop over all textnodes of these children
  for (TextNode textNode:el.textNodes()) {
    // make sure there is some text other than whitespace
    if (textNode.text().trim().length()>0) {
        // show:
        //    the original node name
        //    the name of the subnode witht the text 
        //    the text 
        System.out.println(el.nodeName()+"."+textNode.nodeName()+":"+textNode.text());
    }
  }
}

现在我还要显示手头的textNode来自的行号和原始html源代码。我怀疑JSoup可以做到这一点(e.g. see

尝试解决方法:

int pos = html.indexOf(textNode.outerHtml());

无法可靠地找到原始的html。所以我假设我可能不得不切换到另一个库或方法。 Jericho-html: is it possible to extract text with reference to positions in source file?有一个答案说“杰里科可以做到这一点”#34;正如上面的链接也指出。但是缺少指向实际工作代码的指针。

我是杰里科,我得到了:

Source htmlSource=new Source(html);
boolean bodyFound=false;
// loop over all elements
for (net.htmlparser.jericho.Element el:htmlSource.getAllElements()) {
    if (el.getName().equals("body")) {
        bodyFound=true;
    }
    if (bodyFound) {
        TagType tagType = el.getStartTag().getTagType();
        if (tagType==StartTagType.NORMAL) {
            String text=el.getTextExtractor().toString();
            if (!text.trim().equals("")) {
                int cpos = el.getBegin();               
                System.out.println(el.getName()+"("+tagType.toString()+") line "+   htmlSource.getRow(cpos)+":"+text);
            }
        } // if
    } // if
} // for

这已经非常好了,因为它会为您提供如下输出:

body(normal) line 91: Some Header. Some Text
div(normal) line 93: Some Header
div(normal) line 95: Some Text

但现在后续问题是TextExtractor以递归方式输出所有子节点的全文,以便文本多次显示。

什么是过滤器以及上述JSoup解决方案的工作解决方案(请注意文本元素的正确顺序),但显示上面的Jericho Code代码片段的源代码行?

2 个答案:

答案 0 :(得分:2)

你需要和jsoup缺乏的功能是恕我直言更难实现。 和杰里科一起去实现这样的东西,找到直接的文本节点。

package main.java.com.adacom.task;

import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.EndTag;
import net.htmlparser.jericho.Segment;
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.StartTag;
import net.htmlparser.jericho.StartTagType;
import net.htmlparser.jericho.Tag;
import net.htmlparser.jericho.TagType;

public class MainParser {

    /**
     * @param args
     */
    public static void main(String[] args) {

        String html = "<body><div>divtextA<span>spanTextA<p>pText</p>spanTextB</span>divTextB</div></body>";

        Source htmlSource=new Source(html);
        boolean bodyFound=false;
        // loop over all elements
        for (net.htmlparser.jericho.Element el:htmlSource.getAllElements()) {
            if (el.getName().equals("body")) {
                bodyFound=true;
            }
            if (bodyFound) {
                TagType tagType = el.getStartTag().getTagType();
                if (tagType==StartTagType.NORMAL) {
                    String text = getOwnTextSegmentsString(el);
                    if (!text.trim().equals("")) {
                        int cpos = el.getBegin();               
                        System.out.println(el.getName()+"("+tagType.toString()+") line "+   htmlSource.getRow(cpos)+":"+text);
                    }
                } // if
            } // if
        } // for

    }

    /**
     * this function is not used it's shown here only for reference
     */ 
    public static Iterator<Segment> getOwnTextSegmentsIterator(Element elem) {
        final Iterator<Segment> it = elem.getContent().getNodeIterator();
        final List<Segment> results = new LinkedList<Segment>();
        int tagCounter = 0;
        while (it.hasNext()) {
            Segment cur = it.next();            
            if(cur instanceof StartTag) 
                tagCounter++;
            else if(cur instanceof EndTag) 
                tagCounter--;

            if (!(cur instanceof Tag) && tagCounter == 0) {
                System.out.println(cur);
                results.add(cur);
            }
        }
        return results.iterator();
    }

    public static String getOwnTextSegmentsString(Element elem) {
        final Iterator<Segment> it = elem.getContent().getNodeIterator();
        StringBuilder strBuilder = new StringBuilder();
        int tagCounter = 0;
        while (it.hasNext()) {
            Segment cur = it.next();            
            if(cur instanceof StartTag) 
                tagCounter++;
            else if(cur instanceof EndTag) 
                tagCounter--;

            if (!(cur instanceof Tag) && tagCounter == 0) {
                strBuilder.append(cur.toString() + ' ');
            }
        }
        return strBuilder.toString().trim();
    }

}

答案 1 :(得分:1)

这是一个测试预期输出的Junit测试和基于Jericho的SourceTextExtractor,它使JUnit测试工作基于原始的Jericho TextExtractor源代码。

@Test
public void testTextExtract() {
    // https://github.com/paepcke/CorEx/blob/master/src/extraction/HTMLUtils.java
    String htmls[] = {
            "<!DOCTYPE html>\n" + "<html>\n" + "<body>\n" + "\n"
                    + "<h1>My First Heading</h1>\n" + "\n"
                    + "<p>My first paragraph.</p>\n" + "\n" + "</body>\n" + "</html>",
            "<html>\n"
                    + "<body>\n"
                    + "\n"
                    + "<div id=\"myDiv\" name=\"myDiv\" title=\"Example Div Element\">\n"
                    + "  <h5>Subtitle</h5>\n"
                    + "  <p>This paragraph would be your content paragraph...</p>\n"
                    + "  <p>Here's another content article right here.</p>\n"
                    + "</div>" + "\n" + "Text at end of body</body>\n" + "</html>" };
    int expectedSize[] = { 2, 4 };
    String expectedInfo[][]={
        { 
            "line 5 col 5 to  line 5 col 21: My First Heading",
            "line 7 col 4 to  line 7 col 23: My first paragraph."
        },
        { 
            "line 5 col 7 to  line 5 col 15: Subtitle",
            "line 6 col 6 to  line 6 col 55: This paragraph would be your content paragraph...",
            "line 7 col 6 to  line 7 col 48: Here's another content article right here.",
            "line 8 col 7 to  line 9 col 20: Text at end of body"
        }
    };
    int i = 0;
    for (String html : htmls) {
        SourceTextExtractor extractor=new SourceTextExtractor();
        List<TextResult> textParts = extractor.extractTextSegments(html);
        // List<String> textParts = HTMLCleanerTextExtractor.extractText(html);
        int j=0;
        for (TextResult textPart : textParts) {
            System.out.println(textPart.getInfo());
            assertTrue(textPart.getInfo().startsWith(expectedInfo[i][j]));
            j++;
        }
        assertEquals(expectedSize[i], textParts.size());
        i++;
    }
}

这是一个改编的TextExtractor见 http://grepcode.com/file_/repo1.maven.org/maven2/net.htmlparser.jericho/jericho-html/3.3/net/htmlparser/jericho/TextExtractor.java/?v=source

/**
 * TextExtractor that makes source line and col references available
 * http://grepcode.com/file_/repo1.maven.org/maven2/net.htmlparser.jericho/jericho-html/3.3/net/htmlparser/jericho/TextExtractor.java/?v=source
 */
public class SourceTextExtractor {

    public static class TextResult {
        private String text;
        private Source root;
        private Segment segment;
        private int line;
        private int col;

        /**
         * get a textResult
         * @param root
         * @param segment
         */
        public TextResult(Source root,Segment segment) {
            this.root=root;
            this.segment=segment;
            final StringBuilder sb=new StringBuilder(segment.length());
            sb.append(segment);
            setText(CharacterReference.decodeCollapseWhiteSpace(sb));
            int spos = segment.getBegin();  
            line=root.getRow(spos);
            col=root.getColumn(spos);

        }

        /**
         * gets info about this TextResult
         * @return
         */
        public String getInfo() {
            int epos=segment.getEnd();

            String result=
                    " line "+   line+" col "+col+
                    " to "+
                    " line "+   root.getRow(epos)+" col "+root.getColumn(epos)+
                    ":"+getText();
            return result;
        }

        /**
         * @return the text
         */
        public String getText() {
            return text;
        }

        /**
         * @param text the text to set
         */
        public void setText(String text) {
            this.text = text;
        }

        public int getLine() {
            return line;
        }

        public int getCol() {
            return col;
        }

    }

    /**
     * extract textSegments from the given html
     * @param html
     * @return
     */
    public List<TextResult> extractTextSegments(String html) {
        Source htmlSource=new Source(html);
        List<TextResult> result = extractTextSegments(htmlSource);
        return result;
    }

    /**
     * get the TextSegments from the given root segment
     * @param root
     * @return
     */
    public List<TextResult> extractTextSegments(Source root) {
        List<TextResult> result=new ArrayList<TextResult>();
        for (NodeIterator nodeIterator=new NodeIterator(root); nodeIterator.hasNext();) {
            Segment segment=nodeIterator.next();
            if (segment instanceof Tag) {
                final Tag tag=(Tag)segment;
                if (tag.getTagType().isServerTag()) {
                    // elementContainsMarkup should be made into a TagType property one day.
                    // for the time being assume all server element content is code, although this is not true for some Mason elements.
                    final boolean elementContainsMarkup=false;
                    if (!elementContainsMarkup) {
                        final net.htmlparser.jericho.Element element=tag.getElement();
                        if (element!=null && element.getEnd()>tag.getEnd()) nodeIterator.skipToPos(element.getEnd());
                    }
                    continue;
                }
                if (tag.getTagType()==StartTagType.NORMAL) {
                    final StartTag startTag=(StartTag)tag;
                    if (tag.name==HTMLElementName.SCRIPT || tag.name==HTMLElementName.STYLE ||  (!HTMLElements.getElementNames().contains(tag.name))) {
                        nodeIterator.skipToPos(startTag.getElement().getEnd());
                        continue;
                    }

                }
                // Treat both start and end tags not belonging to inline-level elements as whitespace:
                if (tag.getName()==HTMLElementName.BR || !HTMLElements.getInlineLevelElementNames().contains(tag.getName())) {
                    // sb.append(' ');
                }
            } else {
                if (!segment.isWhiteSpace())
                    result.add(new TextResult(root,segment));
            }
        }
        return result;
    }

    /**
     * extract the text from the given segment
     * @param segment
     * @return
     */
    public String extractText(net.htmlparser.jericho.Segment pSegment) {

        // http://grepcode.com/file_/repo1.maven.org/maven2/net.htmlparser.jericho/jericho-html/3.3/net/htmlparser/jericho/TextExtractor.java/?v=source
        // this would call the code above
        // String result=segment.getTextExtractor().toString();
        final StringBuilder sb=new StringBuilder(pSegment.length());
        for (NodeIterator nodeIterator=new NodeIterator(pSegment); nodeIterator.hasNext();) {
            Segment segment=nodeIterator.next();
            if (segment instanceof Tag) {
                final Tag tag=(Tag)segment;
                if (tag.getTagType().isServerTag()) {
                    // elementContainsMarkup should be made into a TagType property one day.
                    // for the time being assume all server element content is code, although this is not true for some Mason elements.
                    final boolean elementContainsMarkup=false;
                    if (!elementContainsMarkup) {
                        final net.htmlparser.jericho.Element element=tag.getElement();
                        if (element!=null && element.getEnd()>tag.getEnd()) nodeIterator.skipToPos(element.getEnd());
                    }
                    continue;
                }
                if (tag.getTagType()==StartTagType.NORMAL) {
                    final StartTag startTag=(StartTag)tag;
                    if (tag.name==HTMLElementName.SCRIPT || tag.name==HTMLElementName.STYLE ||  (!HTMLElements.getElementNames().contains(tag.name))) {
                        nodeIterator.skipToPos(startTag.getElement().getEnd());
                        continue;
                    }

                }
                // Treat both start and end tags not belonging to inline-level elements as whitespace:
                if (tag.getName()==HTMLElementName.BR || !HTMLElements.getInlineLevelElementNames().contains(tag.getName())) {
                    sb.append(' ');
                }
            } else {
                sb.append(segment);
            }
        }
        final String result=net.htmlparser.jericho.CharacterReference.decodeCollapseWhiteSpace(sb);
        return result;
    }
}