Why is the HTML shown in Chrome DevTools different from the HTML parsed by Jsoup?

Time: 2019-07-11 04:11:46

Tags: java html google-chrome-devtools jsoup html-parsing

I am trying to extract the creation date of an issue from the HADOOP Jira issue site (https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues).

As you can see in this screenshot, the created date is the text between the time tag whose class is livestamp (e.g. <time class=livestamp ...> 'this text' </time>).

So I tried to parse it with the code below.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CreatedDateExtractor {
    public static void main(String[] args) {
        String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
        Document doc;

        try {
            // Fetch the page and parse it into a Jsoup document.
            doc = Jsoup.connect(url).get();
        } catch (IOException e) {
            // Without a document there is nothing to select from, so stop here.
            e.printStackTrace();
            return;
        }

        // Select the <time> elements that carry the "livestamp" class.
        Elements elements = doc.select("time.livestamp");
        System.out.println("# of elements : " + elements.size());
        for (Element e : elements) {
            System.out.println(e.text());
        }
    }
}

I expected the output to be the created date, but the actual output is "# of elements : 0".

That looked wrong to me, so I then tried parsing the whole HTML of that page.

I compared the HTML shown in Chrome DevTools with the HTML I parsed, line by line, and found that they are different.

Could you explain why this happens and give me some advice on how to extract the created date?

2 Answers:

Answer 0: (score: 0)

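I suggest first getting the elements with the "time" tag, and then using select on that result to get the time tags that have the "livestamp" class, i.e. doc.select("time").select(".livestamp").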

I don't know why, but when I use Jsoup's .select() method with multiple selectors combined (as you did with time.livestamp), I get interesting output like this.
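
One way to narrow down the "why" is to check whether the markup the server actually sends contains the tag at all. The Elements panel in DevTools shows the live DOM after scripts have run, while Jsoup only parses the HTML it downloads and does not execute JavaScript, so an element added by a script will be missing from Jsoup's document whichever selector is used. The sketch below is a rough check under stated assumptions: the class RawHtmlCheck, the method hasLivestampTag, and both sample strings are made up for illustration and are not taken from the Jira page, and the regular expression is only a quick probe, not an HTML parser.

```java
import java.util.regex.Pattern;

class RawHtmlCheck {
    // Matches an opening <time ...> tag whose class attribute contains the word "livestamp".
    // Works for quoted and unquoted attribute values; this is a probe, not a parser.
    private static final Pattern TIME_LIVESTAMP = Pattern.compile(
            "<time\\b[^>]*\\bclass\\s*=\\s*[\"']?[^\"'>]*\\blivestamp\\b",
            Pattern.CASE_INSENSITIVE);

    public static boolean hasLivestampTag(String html) {
        return TIME_LIVESTAMP.matcher(html).find();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the HTML a server might send:
        // an empty container that a script fills in later.
        String served = "<html><body><div id=\"issue-content\"></div>"
                + "<script src=\"app.js\"></script></body></html>";

        // Hypothetical stand-in for the DOM after the script has run.
        String rendered = "<html><body><div id=\"issue-content\">"
                + "<time class=\"livestamp\" datetime=\"2019-01-01T00:00:00+0000\">01/Jan/19</time>"
                + "</div></body></html>";

        System.out.println("served:   " + hasLivestampTag(served));
        System.out.println("rendered: " + hasLivestampTag(rendered));
    }
}
```

In practice the string to test would be what Jsoup received (for example doc.outerHtml()) or the text from the browser's "View Source". If the check is false there while DevTools shows the tag, the element is most likely created by JavaScript after the page loads.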

Answer 1: (score: -1)

import Torello.HTML.*;
import Torello.HTML.NodeSearch.*;
import Torello.Java.*;

import java.util.*;
import java.io.*;
import java.net.*;

public class Scrape
{
    public static void main(String[] argv) throws IOException
    {
        // This URL does not appear to have an HTML Element with a "TimeStamp" as you have stated.
        // ==> Go to any browser and view it for yourself!  (Click "View Source" in Google-Chrome, I.E., Safari, etc...)
        // URL url = new URL("https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues");

        URL url = new URL("https://some.url.org/");

        // This scrapes the web-page into a standard Java-Vector.
        // HTMLNode is abstract, and has only 2 classes that inherit it.  (3 actually, but one is the "CommentNode")
        Vector<HTMLNode> page = HTMLPage.getPageTokens(url, false);

        // This will output each & every node in the page to a text/html file called "output.html"
        // Read Documentation Files for "Util.pageToString" and "FileRW.writeFile"
        FileRW.writeFile(Util.pageToString(page), "output.html");

        // If this is the question to identify:
        // As you can see in this Screenshot, created date is the text between the time tag whose class is
        // live stamp(e.g. <time class=livestamp ...> 'this text' </time>)
        //
        // Using the "NodeSearch.InnerTagGetInclusive" class will retrieve the information you need
        Vector<HTMLNode> liveStamp = InnerTagGetInclusive.first(page, "time", "class", TextComparitor.CN_CI, "livestamp");

        // This will eliminate all the "TagNode" elements when building this String.
        // It will leave you with only the "TextNode" elements.
        // The remaining TextNodes should, indeed, be the "this text" as a String.
        String liveStampStr = Util.textNodesString(liveStamp);

        System.out.println("Live-Stamp: " + liveStampStr);
    }
}
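
The comments in the code above point out that the time element is not present in the page source. If that is the case, a different route to the created date is Jira's REST API, which returns an issue as JSON with a "created" field. The following is a minimal sketch, not something either answer proposes. It assumes Java 11+ and that this Jira instance allows anonymous access to /rest/api/2/issue/{key}. The class JiraCreatedDate, its method names, and the sample JSON (including its timestamp) are hypothetical. The regular expression is a shortcut to keep the sketch dependency-free; a real JSON parser would be more robust.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JiraCreatedDate {
    // Jira timestamps are assumed to look like 2019-01-01T12:34:56.000+0000.
    private static final DateTimeFormatter JIRA_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ");

    // Captures the string value of the first "created" key in the JSON text.
    private static final Pattern CREATED =
            Pattern.compile("\"created\"\\s*:\\s*\"([^\"]+)\"");

    public static OffsetDateTime extractCreated(String json) {
        Matcher m = CREATED.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("no \"created\" field found");
        }
        return OffsetDateTime.parse(m.group(1), JIRA_TIME);
    }

    // Downloads the response body as text; nothing is executed or rendered.
    public static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // With a URL argument, query the server; otherwise use a made-up sample response.
        // Assumed URL shape:
        //   https://issues.apache.org/jira/rest/api/2/issue/HADOOP-16381?fields=created
        String json = args.length > 0
                ? fetch(args[0])
                : "{\"key\":\"DEMO-1\",\"fields\":{\"created\":\"2019-01-01T12:34:56.000+0000\"}}";

        System.out.println("Created: " + extractCreated(json));
    }
}
```

Run without arguments, the program only parses the sample string, so it works offline. Passing the REST URL as the first argument makes it query the server instead.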