Question

I am trying to extract the created date of an issue from the HADOOP Jira issue site (https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues).

As you can see in this screenshot, the created date is the text inside the time tag whose class is livestamp (e.g. <time class=livestamp ...> 'this text' </time>).

So I tried to parse it with jsoup, targeting this element:

<time class=livestamp ...> 'this text' </time>

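I expected the created date to be extracted, but the actual output is "# of elements : 0".

Something was clearly wrong, so I tried to parse the whole HTML of that page with the code below.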

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CreatedDateExtractor {
    public static void main(String[] args) {
        String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
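        Document doc;

        try {
            doc = Jsoup.connect(url).get();
        } catch (IOException e) {
            // Without the document there is nothing to parse, so report the error and stop
            // instead of hitting a NullPointerException on doc.select(...) below.
            e.printStackTrace();
            return;
        }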

        // Select every <time> element that has the "livestamp" class.
        Elements elements = doc.select("time.livestamp");
        System.out.println("# of elements : "+ elements.size());
        for(Element e: elements) {
            System.out.println(e.text());
        }   
    }
}

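I compared the HTML shown in Chrome DevTools with the HTML that jsoup parsed, one by one, and found that they are different.

Could you explain why this happens, and give me some advice on how to extract the created date?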

Answer 1

I suggest you get the elements with the "time" tag first, and then use select on them to get the time tags that have the "livestamp" class.

I don't know why, but when I use jsoup's .select() method with more than one selector at once (like the time.livestamp you used), I get strange output like this.

Answer 2

import Torello.HTML.*;
import Torello.HTML.NodeSearch.*;
import Torello.Java.*;

import java.util.*;
import java.io.*;
import java.net.*;

public class Scrape
{
    public static void main(String[] argv) throws IOException
    {
        // This URL does not appear to have an HTML Element with a "TimeStamp" as you have stated.
        // ==> Go to any browser and view it for yourself!  (Click "View Source" in Google-Chrome, I.E., Safari, etc...)
        // URL url = new URL("https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues");

        URL url = new URL("https://some.url.org/");

        // This scrapes the web-page into a standard Java-Vector.
        // HTMLNode is abstract, and has only 2 classes that inherit it.  (3 actually, but one is the "CommentNode")
        Vector<HTMLNode> page = HTMLPage.getPageTokens(url, false);

        // This will output each & every node in the page to a text/html file called "output.html"
        // Read Documentation Files for "Util.pageToString" and "FileRW.writeFile"
        FileRW.writeFile(Util.pageToString(page), "output.html");

        // If this is the element you are trying to find:
        // "As you can see in this Screenshot, created date is the text between the time tag whose class is
        // livestamp (e.g. <time class=livestamp ...> 'this text' </time>)"
        //
        // then the "NodeSearch.InnerTagGetInclusive" class will retrieve the information you need.
        Vector<HTMLNode> liveStamp = InnerTagGetInclusive.first(page, "time", "class", TextComparitor.CN_CI, "livestamp");

        // This drops all the "TagNode" elements when building the String,
        // leaving only the "TextNode" elements.
        // The remaining TextNodes should be the "this text" part, as a String.
        String liveStampStr = Util.textNodesString(liveStamp);

        System.out.println("Live-Stamp: " + liveStampStr);
    }
}
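
The "View Source" check mentioned in the comments above can also be done in code. The following is only a rough sketch using the Java standard library, not part of either answer: LivestampCheck and livestampTexts are hypothetical names, both HTML strings are made-up samples rather than the real Jira page, and the regex is a crude stand-in for a real HTML parser. It illustrates one plausible explanation for the "0 elements" result: if the time element is added by JavaScript after the page loads, it appears in the DevTools DOM but not in the raw HTML that a plain HTTP fetch returns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LivestampCheck {
    // Rough match for <time ... class=...livestamp... ...>text</time>.
    // This is a quick diagnostic, not a substitute for a real HTML parser.
    private static final Pattern LIVESTAMP = Pattern.compile(
            "<time\\b[^>]*\\bclass\\s*=\\s*[\"']?[^\"'>]*\\blivestamp\\b[^>]*>(.*?)</time>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed text of every livestamp <time> element found in the given HTML.
    static List<String> livestampTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = LIVESTAMP.matcher(html);
        while (m.find()) {
            texts.add(m.group(1).trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Made-up sample of what a browser might show after scripts have run.
        String renderedDom = "<span><time class=\"livestamp\">sample date text</time></span>";
        // Made-up sample of a raw response that only contains a script placeholder.
        String rawSource = "<div id=\"content\"></div><script src=\"app.js\"></script>";

        System.out.println("rendered DOM: " + livestampTexts(renderedDom));
        System.out.println("raw source:   " + livestampTexts(rawSource).size() + " elements");
    }
}
```

Running the same check on the string returned by your own HTTP fetch tells you whether the element is present before any JavaScript runs. If it is not, changing the selector will not help, and the date has to come from a source that does not depend on client-side rendering.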

Why is the HTML in Chrome DevTools different from the HTML parsed by jsoup?
