I want to access a table contained in an HTML page. Here is my code:
import java.io.*;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTable;
import com.gargoylesoftware.htmlunit.html.*;
import com.gargoylesoftware.htmlunit.WebClient;
public class test {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        HtmlPage currentPage = client.getPage("http://www.mysite.com");
        client.waitForBackgroundJavaScript(10000);
        final HtmlDivision div = (HtmlDivision) currentPage.getByXPath("//div[@id='table-matches-time']");
        String textSource = div.toString();
        //String textSource = currentPage.asXml();
        FileWriter fstream = new FileWriter("index.txt");
        BufferedWriter out = new BufferedWriter(fstream);
        out.write(textSource);
        out.close();
        client.closeAllWindows();
    }
}
The table has the following form:
<div id="table-matches-time" class="">
<table class=" table-main">
But I get this error:
Exception in thread "main" java.lang.ClassCastException: java.util.ArrayList cannot be cast to com.gargoylesoftware.htmlunit.html.HtmlDivision
at test.main(test.java:20)
How can I read this table?
Answer 0: (score: 5)
This works (and produces a CSV file ;)):
import java.io.*;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTable;
import com.gargoylesoftware.htmlunit.html.HtmlTableRow;
import com.gargoylesoftware.htmlunit.html.*;
import com.gargoylesoftware.htmlunit.WebClient;
public class test {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        HtmlPage currentPage = client.getPage("http://www.mysite.com");
        // Give background JavaScript up to 10 seconds to finish
        client.waitForBackgroundJavaScript(10000);
        FileWriter fstream = new FileWriter("index.txt");
        BufferedWriter out = new BufferedWriter(fstream);
        // The page contains two tables with this class; process both
        for (int i=0;i<2;i++){
            // The leading space in ' table-main' matches the class attribute exactly
            final HtmlTable table = (HtmlTable) currentPage.getByXPath("//table[@class=' table-main']").get(i);
            for (final HtmlTableRow row : table.getRows()) {
                // Write each cell followed by a comma, one table row per line
                for (final HtmlTableCell cell : row.getCells()) {
                    out.write(cell.asText()+',');
                }
                out.write('\n');
            }
        }
        out.close();
        client.closeAllWindows();
    }
}
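One caveat about the loop above: it appends a comma after every cell, including the last one in a row, and it does not quote cells that themselves contain commas. A minimal sketch of a helper that handles both, in plain Java with no HtmlUnit involved. The class name `CsvLine` and its methods are made up for this example, and the quoting follows common CSV practice rather than anything the page requires:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of HtmlUnit.
class CsvLine {
    // Quote a cell only when it contains a comma, a double quote, or a line break;
    // embedded double quotes are doubled.
    static String escape(String cell) {
        if (cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r")) {
            return "\"" + cell.replace("\"", "\"\"") + "\"";
        }
        return cell;
    }

    // Join the cells of one row with commas, without a trailing comma.
    static String toCsvLine(List<String> cells) {
        return cells.stream().map(CsvLine::escape).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // prints: Team A,2:1,"Smith, J."
        System.out.println(toCsvLine(List.of("Team A", "2:1", "Smith, J.")));
    }
}
```

In the loop, you would collect the `cell.asText()` values of one row into a list and write `CsvLine.toCsvLine(cells)` followed by a newline, instead of writing each cell with a comma.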
Answer 1: (score: 0)
It looks like your query returns a list of nodes, not a single div. Do you have multiple elements with that ID?
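To see why that produces a ClassCastException rather than a compile error, here is a minimal sketch in plain Java. `Div` and `findAll()` are made-up stand-ins for `HtmlDivision` and a list-returning query such as `getByXPath`; they are not HtmlUnit API. The cast compiles because the compiler cannot rule out a `Div` subclass that implements `List`, but it fails at runtime because the object really is an `ArrayList`:

```java
import java.util.ArrayList;
import java.util.List;

class CastDemo {
    // Hypothetical stand-in for an element type such as HtmlDivision.
    static class Div {
        final String id;
        Div(String id) { this.id = id; }
    }

    // Hypothetical stand-in for a query method that always returns a list.
    static List<?> findAll() {
        List<Object> matches = new ArrayList<>();
        matches.add(new Div("table-matches-time"));
        return matches;
    }

    // Returns true if casting the list itself to Div fails at runtime.
    static boolean castingListFails() {
        try {
            Div d = (Div) findAll();   // compiles, but the object is an ArrayList
            return d == null;          // not reached
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(castingListFails());   // prints: true
        Div first = (Div) findAll().get(0);       // cast the element, not the list
        System.out.println(first.id);             // prints: table-matches-time
    }
}
```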
Answer 2: (score: 0)
Replace this part of the code:
(HtmlDivision) currentPage.getByXPath("//div[@id='table-matches-time']");
With:
(HtmlDivision) currentPage.getFirstByXPath("//div[@id='table-matches-time']");
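The first method always returns a collection of elements, even if there is only one match, while the second always returns a single element, even if there are more matches.
Edit:
Since you have two elements with the same id (which is not advisable at all), you should use this: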
(HtmlDivision) currentPage.getByXPath("//div[@id='table-matches-time']").get(0);
This way you get the first element of the collection. .get(1); will give you the second one.
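The difference between asking for all matches and asking for only the first one can be tried out without HtmlUnit or a live page. The sketch below swaps in the JDK's own `javax.xml.xpath` as a stand-in, so it is an analogy and not HtmlUnit's API: `XPathConstants.NODESET` plays the role of `getByXPath` and `XPathConstants.NODE` the role of `getFirstByXPath`. The markup, the `class` values and the class name `XPathListVsFirst` are made up for the example:

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathListVsFirst {
    // Made-up, well-formed markup with two divs sharing one id.
    static final String MARKUP =
        "<html><body>"
        + "<div id='table-matches-time' class='first'/>"
        + "<div id='table-matches-time' class='second'/>"
        + "</body></html>";

    static final String QUERY = "//div[@id='table-matches-time']";

    // Parse MARKUP and evaluate QUERY, asking for the given result type.
    static Object evaluate(QName returnType) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(MARKUP)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(QUERY, doc, returnType);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate XPath", e);
        }
    }

    // All matches: always a list, even for a single match.
    static NodeList all() {
        return (NodeList) evaluate(XPathConstants.NODESET);
    }

    // A single node: the first match in document order.
    static Element first() {
        return (Element) evaluate(XPathConstants.NODE);
    }

    public static void main(String[] args) {
        NodeList matches = all();
        System.out.println(matches.getLength());
        System.out.println(first().getAttribute("class"));
        // item(1) corresponds to .get(1) on the list in the answer above
        System.out.println(((Element) matches.item(1)).getAttribute("class"));
    }
}
```

With duplicate ids, the list form is the only way to reach the second element; the single-node form silently gives you the first one.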