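I am trying to extract the created date of an issue from the HADOOP Jira issue site (https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues).
As you can see in this screenshot, the created date is the text inside the time tag whose class is livestamp (e.g. <time class=livestamp ...> 'this text' </time>).
So I tried to select that element with jsoup. I expected to get the created date, but the actual output is "# of elements : 0".
That looked wrong to me, so I also parsed the whole HTML that jsoup fetched from that page. This is the selector code: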
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class CreatedDateExtractor {
    public static void main(String[] args) {
        String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
        Document doc;
        try {
            // Fetch and parse the HTML as the server returns it (jsoup does not run JavaScript)
            doc = Jsoup.connect(url).get();
        } catch (IOException e) {
            // Without a document there is nothing to select from, so stop here
            e.printStackTrace();
            return;
        }
        // Select the <time> elements that carry the livestamp class
        Elements elements = doc.select("time.livestamp");
        System.out.println("# of elements : " + elements.size());
        for (Element e : elements) {
            System.out.println(e.text());
        }
    }
}
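I compared the HTML shown in Chrome DevTools with the HTML I parsed, line by line, and found that they are different.
Could you explain why this happens and give me some advice on how to extract the created date?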
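The comparison described above can be approximated with a quick check of whether the HTML the server sends contains the "livestamp" marker at all. The following is a minimal sketch, assuming Java 11+ for java.net.http; the class name RawHtmlCheck and the helpers countOccurrences and fetch are hypothetical names chosen for illustration. A count of zero would suggest, but not prove, that the time element is added later by JavaScript in the browser, which would explain why DevTools shows it and jsoup does not.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class for illustration; not part of the original question.
class RawHtmlCheck {

    // Counts non-overlapping occurrences of needle in html.
    static int countOccurrences(String html, String needle) {
        if (needle.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = html.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length();
        }
        return count;
    }

    // Downloads the response body as sent by the server; no JavaScript is executed.
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        String url = "https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues";
        String html = fetch(url);
        System.out.println("length of raw HTML        : " + html.length());
        System.out.println("occurrences of livestamp  : " + countOccurrences(html, "livestamp"));
    }
}
```

Running the same count on the markup copied from DevTools (Elements panel, "Copy outerHTML") gives the other side of the comparison.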
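Answer 0 (score: 0)
I suggest getting the elements with the "time" tag first, and then using select on that result to keep the time tags that have the "livestamp" class.
I don't know why, but when I use Jsoup's .select() method with more than one selector (like the time.livestamp you used), I get odd output like this.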
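A minimal sketch of the two-step selection described in this answer, run against a small inline HTML string rather than the live page. The sample markup, the class name TwoStepSelect and the method names are hypothetical and only serve the illustration. On static markup like this sample, the two-step form and the combined time.livestamp selector match the same single element, so trying both is mainly a way to rule out the selector itself as the cause.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical example class; the markup below is a made-up sample, not the Jira page.
class TwoStepSelect {

    static final String SAMPLE_HTML =
            "<html><body>"
            + "<time class=\"livestamp\">sample created date</time>"
            + "<time>unrelated</time>"
            + "</body></html>";

    // Step 1: collect every <time> element. Step 2: keep those with the livestamp class.
    static Elements twoStep(Document doc) {
        return doc.select("time").select(".livestamp");
    }

    // The combined selector used in the question.
    static Elements combined(Document doc) {
        return doc.select("time.livestamp");
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        Elements a = twoStep(doc);
        Elements b = combined(doc);
        System.out.println("two-step : " + a.size() + " -> " + a.text());
        System.out.println("combined : " + b.size() + " -> " + b.text());
    }
}
```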
Answer 1 (score: -1)
import Torello.HTML.*;
import Torello.HTML.NodeSearch.*;
import Torello.Java.*;
import java.util.*;
import java.io.*;
import java.net.*;

public class Scrape
{
    public static void main(String[] argv) throws IOException
    {
        // This URL does not appear to contain an HTML element with a "livestamp" class, as you stated.
        // ==> Open it in any browser and check for yourself (click "View Source" in Chrome, IE, Safari, etc.)
        // URL url = new URL("https://issues.apache.org/jira/projects/HADOOP/issues/HADOOP-16381?filter=allopenissues");
        URL url = new URL("https://some.url.org/");

        // This scrapes the web page into a standard Java Vector.
        // HTMLNode is abstract and has only 2 classes that inherit from it (3 actually, but one is the "CommentNode").
        Vector<HTMLNode> page = HTMLPage.getPageTokens(url, false);

        // This writes every node in the page to a text/html file called "output.html".
        // See the documentation for "Util.pageToString" and "FileRW.writeFile".
        FileRW.writeFile(Util.pageToString(page), "output.html");

        // If this is what you need to identify:
        // "As you can see in this Screenshot, created date is the text between the time tag whose class is
        // live stamp (e.g. <time class=livestamp ...> 'this text' </time>)"
        //
        // then the "NodeSearch.InnerTagGetInclusive" class will retrieve the information you need.
        Vector<HTMLNode> liveStamp = InnerTagGetInclusive.first(page, "time", "class", TextComparitor.CN_CI, "livestamp");

        // This drops all the "TagNode" elements when building the String
        // and leaves you with only the "TextNode" elements.
        // The remaining TextNodes should be the "this text" part, as a String.
        String liveStampStr = Util.textNodesString(liveStamp);
        System.out.println("Live-Stamp: " + liveStampStr);
    }
}