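A document usually contains many internal links. I want to parse an HTML file so that I end up with a map from each heading of the page to its corresponding data.
The steps I took:
1) Got all the internal link elements.
2) Looked up the element with id = XXX in the document, where XXX comes from <a href="#XXX">.
3) That takes me to <span id="XXX">little text here </span> <some tags here too ><p> actual text here </p> <p> here too </p>
4) How do I get from the <span> to the <p> elements?
5) I tried going to the span's parent, assuming that one of its children would be the <p>, which is true. But it also picks up the <p> elements that belong to other internal links.
Edit: added part of a sample HTML file: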
<li class="toclevel-1 tocsection-1"><a href="#Enforcing_mutual_exclusion">
<span class="tocnumber">1</span> <span class="toctext">Enforcing mutual exclusion</span> </a><ul>
<li class="toclevel-2 tocsection-2"><a href="#Hardware_solutions">
<span class="tocnumber">1.1</span> <span class="toctext">Hardware solutions</span>
</a></li>
<li class="toclevel-2 tocsection-3"><a href="#Software_solutions">
<h2><span class="editsection">[<a href="/w/index.php?title=Mutual_exclusion&amp;action=edit&section=1" title="Edit section: Enforcing mutual exclusion">
edit</a>]</span> <span class="mw-headline" id="Enforcing_mutual_exclusion">
<!--
See the id above, Enforcing_mutual_exclusion, which is the same as the first internal
link. Jsoup takes me to this span element. I want to access every <p> element after
this <span> tag, up to the next <span> tag whose id matches any of the internal links.
-->
Enforcing mutual exclusion</span></h2>
<p>There are both software and hardware solutions for enforcing mutual exclusion.
The different solutions are shown below.</p>
<h3><span class="editsection">[<a href="/w/index.php?title=Mutual_exclusion&amp;action=edit&section=2" title="Edit section: Hardware solutions">
edit</a>]</span> <span class="mw-headline" id="Hardware_solutions">Hardware
solutions</span></h3>
<p>On a <a href="/wiki/Uniprocessor" title="Uniprocessor" class="mw-redirect">uniprocessor</a> system a common way to achieve mutual exclusion inside
<a href="/wiki/Kernel_(computing)" title="Kernel (computing)">kernels</a> is
disable <a href="/wiki/Interrupt" title="Interrupt">
Here is my code:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public final class Website {
    private URL websiteURL;
    private Document httpDoc;
    LinkedHashMap<String, ArrayList<String>> internalLinks =
            new LinkedHashMap<String, ArrayList<String>>();

    public Website(URL __websiteURL) throws MalformedURLException, IOException, Exception {
        if (__websiteURL == null)
            throw new Exception();
        websiteURL = __websiteURL;
        httpDoc = Jsoup.parse(connect());
        System.out.println("Parsed the http file to Document");
    }
    /*
     * Here is my function: I first collect all the internal links in internalLinksElements.
     * I then take the href of each <a ...> tag so that I can search for it in the document.
     */
    public void getDataWithHeadingsTogether() {
        Elements internalLinksElements;
        internalLinksElements = httpDoc.select("a[href^=#]");
        for (Element element : internalLinksElements) {
            // Some inline links were bad; I only keep those that have a span as a child.
            Elements spanElements = element.select("span");
            if (!spanElements.isEmpty()) {
                System.out.println("Text(): " + element.text()); // this cannot give what I want
                // OK, I take the href value, which is the id to look up
                String href = element.attr("href");
                href = href.replace("#", "");
                System.out.println(href);
                // select the element that has that id
                Element data = httpDoc.getElementById(href);
                // got the span
                if (data == null)
                    continue;
                Elements children = new Elements();
                // The problem is here.
                while (children.isEmpty()) {
                    // walk up to the parent element until some <p> data is found
                    data = data.parent();
                    System.out.println(data);
                    children = data.select("p");
                }
                // This gives me all the data in the file. That's bad.
                System.out.println(children.text());
            }
        }
    }
    /**
     * Downloads the page at websiteURL.
     *
     * @return the HTML of the page as a single string
     * @throws MalformedURLException
     * @throws IOException
     */
    @SuppressWarnings("CallToThreadDumpStack")
    public String connect() throws MalformedURLException, IOException {
        // Is this thread safe? url.openStream();
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(new InputStreamReader(websiteURL.openStream()));
            System.out.println("Got the reader");
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println("Bye");
            // fall back to a small hard-coded page when the URL cannot be opened
            String html = "<html><h1>Heading 1</h1><body><h2>Heading 2</h2><p>hello</p></body></html>";
            return html;
        }
        String inputLine, result = new String();
        while ((inputLine = reader.readLine()) != null) {
            result += inputLine;
        }
        reader.close();
        System.out.println("Made the html file");
        return result;
    }
    /**
     *
     * @param argv all the command line parameters.
     * @throws MalformedURLException
     * @throws IOException
     */
    public static void main(String[] argv) throws MalformedURLException, IOException, Exception {
        System.setProperty("proxyHost", "172.16.0.3");
        System.setProperty("proxyPort", "8383");
        System.out.println("Sending url");
        // put an HTML file or any URL here ------------------------------------
        URL url = new URL("put a html file here ");
        Website website = new Website(url);
        System.out.println(url.toString());
        System.out.println("++++++++++++++++++++++++++++++++++++++++++++++++");
        website.getDataWithHeadingsTogether();
    }
}
Answer 0 (score: -1)
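I think you need to understand that the <span> you are locating is a child of a heading element, and that the data you want to store is made up of the siblings of that heading.
So you need to get the parent of the <span>, and then walk that parent's following siblings to collect the nodes that make up the data for that <span>. You need to stop collecting when you run out of siblings, or when you encounter another heading element, because another heading marks the start of the next item's data.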
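The approach described above could be sketched roughly as follows. This is only an illustration, not the asker's code: the class name SectionCollector and the method collect are made up for the example, and it assumes that every target <span> sits directly inside an <h1> to <h6> element and that the section's <p> elements are direct siblings of that heading, as in the sample HTML. It uses jsoup's Element.parent() and Element.nextElementSibling().

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, named only for this example.
public final class SectionCollector {

    /** Maps each internal-link id to the text of the <p> elements in its section. */
    public static Map<String, List<String>> collect(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, List<String>> sections = new LinkedHashMap<>();

        for (Element link : doc.select("a[href^=#]")) {
            String id = link.attr("href").substring(1); // drop the leading '#'
            if (id.isEmpty())
                continue;
            Element span = doc.getElementById(id);
            if (span == null || span.parent() == null)
                continue;

            // Assumption: the span's parent is the heading (<h2>, <h3>, ...).
            Element heading = span.parent();
            List<String> paragraphs = new ArrayList<>();

            // Walk the heading's following siblings until they run out
            // or the next heading starts a new section.
            for (Element sib = heading.nextElementSibling();
                    sib != null;
                    sib = sib.nextElementSibling()) {
                if (sib.tagName().matches("h[1-6]"))
                    break;
                if (sib.tagName().equals("p"))
                    paragraphs.add(sib.text());
            }
            sections.put(id, paragraphs);
        }
        return sections;
    }

    public static void main(String[] args) {
        String html =
                "<ul><li><a href=\"#A\"><span>1</span> A</a></li>"
              + "<li><a href=\"#B\"><span>2</span> B</a></li></ul>"
              + "<h2><span id=\"A\">A</span></h2><p>a1</p><p>a2</p>"
              + "<h3><span id=\"B\">B</span></h3><p>b1</p>";
        collect(html).forEach((id, text) -> System.out.println(id + " -> " + text));
    }
}
```

With the small inline document in main, section A should contain a1 and a2, and section B should contain b1. Stopping at any heading level means a section's text ends where its first subsection begins; if subsections should be included in the parent section, the stop condition would need to compare heading levels instead.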