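I am trying to parse an HTML document, saved in a Java project, that contains a simple HTML table of crime statistics for a Garda station (Garda is the Irish word for police). At the moment I am trying to parse the content of the HTML document and print it to the console. The problem is that I can only print the figures in the table (excluding the years), whereas what I want is the crime name from the table followed by its 6 figures.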
HTML TABLE
<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>Recorded Crime Offences (Number) by Garda Station, Type of Offence and<BR>
Year</title>
</head>
<body>
<table border="">
<tbody><tr align="LEFT">
<th colspan="8">Recorded Crime Offences (Number) by Garda Station, Type of Offence and<br>
Year</th>
</tr>
<tr align="LEFT">
<th colspan="2"> </th>
<th valign="TOP" colspan="1">2011</th>
<th valign="TOP" colspan="1">2012</th>
<th valign="TOP" colspan="1">2013</th>
<th valign="TOP" colspan="1">2014</th>
<th valign="TOP" colspan="1">2015</th>
<th valign="TOP" colspan="1">2016</th>
</tr>
<tr align="RIGHT">
<th align="LEFT" valign="TOP" rowspan="12">Balbriggan, D.M.R. Northern Division</th>
<th align="LEFT">03 ,Attempts/threats to murder, assaults, harassments and related offences</th>
<td>96</td>
<td>89</td>
<td>70</td>
<td>97</td>
<td>103</td>
<td>103</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">04 ,Dangerous or negligent acts</th>
<td>72</td>
<td>67</td>
<td>50</td>
<td>53</td>
<td>45</td>
<td>43</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">05 ,Kidnapping and related offences</th>
<td>0</td>
<td>0</td>
<td>1</td>
<td>3</td>
<td>0</td>
<td>7</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">06 ,Robbery, extortion and hijacking offences</th>
<td>16</td>
<td>19</td>
<td>16</td>
<td>7</td>
<td>11</td>
<td>13</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">07 ,Burglary and related offences</th>
<td>177</td>
<td>190</td>
<td>157</td>
<td>140</td>
<td>151</td>
<td>139</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">08 ,Theft and related offences</th>
<td>510</td>
<td>466</td>
<td>495</td>
<td>542</td>
<td>445</td>
<td>302</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">09 ,Fraud, deception and related offences</th>
<td>66</td>
<td>76</td>
<td>126</td>
<td>114</td>
<td>98</td>
<td>66</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">10 ,Controlled drug offences</th>
<td>113</td>
<td>100</td>
<td>64</td>
<td>55</td>
<td>44</td>
<td>80</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">11 ,Weapons and Explosives Offences</th>
<td>22</td>
<td>18</td>
<td>13</td>
<td>10</td>
<td>19</td>
<td>17</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">12 ,Damage to property and to the environment</th>
<td>257</td>
<td>266</td>
<td>269</td>
<td>203</td>
<td>213</td>
<td>177</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">13 ,Public order and other social code offences</th>
<td>168</td>
<td>115</td>
<td>93</td>
<td>78</td>
<td>79</td>
<td>92</td>
</tr>
<tr align="RIGHT">
<th align="LEFT">15 ,Offences against government, justice procedures and organisation of crime</th>
<td>45</td>
<td>48</td>
<td>39</td>
<td>39</td>
<td>66</td>
<td>50</td>
</tr>
<tr align="LEFT">
<td colspan="8"><a href="http://www.cso.ie/en/methods/crime/recordedcrime/">See Background Notes</a>
</td>
</tr>
</tbody></table>
</body></html>
The code I have so far prints the figures like this:
Figure 0 : 96
Figure 1 : 89
Figure 2 : 70
Figure 3 : 97
Figure 4 : 103
Figure 5 : 103
Figure 6 : 72
Figure 7 : 67
Figure 8 : 50
Figure 9 : 53
Figure 10 : 45
... (Figures 11-66 omitted for conciseness)
Figure 67 : 48
Figure 68 : 39
Figure 69 : 39
Figure 70 : 66
Figure 71 : 50
However, I would like the output to look more like this:
Crime: 03 ,Attempts/threats to murder, assaults, harassments and related offences
Figure 0 : 96
Figure 1 : 89
Figure 2 : 70
Figure 3 : 97
Figure 4 : 103
Figure 5 : 103
Crime: 04 ,Dangerous or negligent acts
Figure 6 : 72
Figure 7 : 67
Figure 8 : 50
Figure 9 : 53
Figure 10 : 45
etc, etc
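I have tried many different approaches, such as adding one for loop to access the th elements containing the crimes and another to access the td elements containing the figures, but this usually results in an error such as: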
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
Working parser class:
import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseCrimeStatistics {
    public static void main(String[] args) {
        try {
            int count = 0;
            File input = new File("Balbriggan.html");
            Document doc = Jsoup.parse(input, "UTF-8", "http://www.cso.ie");
            Elements title = doc.select("td");
            for (Element sectd1 : title) {
                Elements ths = sectd1.select("td");
                String result = ths.get(0).text();
                System.out.println("Figure " + count + " : " + result);
                count++;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Does anyone have any suggestions on how I could solve this? Thanks.
Answer 0 (score: 2)
Try this:
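int count = 0;
File input = new File("Balbriggan.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.cso.ie");
Elements numbers = doc.select("td");
Elements titles = doc.select("th");
// th indices 0-8 are the table title, the blank cell, the six years and
// the station name, so the crime names start at index 9.
for (int i = 9; i < titles.size(); i++)
{
    System.out.println("Crime: " + titles.get(i).text());
    // Each crime row has 6 figures, one per year (2011-2016).
    for (int j = 0; j < 6; j++)
    {
        // Crime (i-9) owns the td cells (i-9)*6 .. (i-9)*6+5.
        System.out.println("Figure " + count + " : " + numbers.get((i - 9) * 6 + j).text());
        count++;
    }
}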
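To make the index arithmetic above easier to follow, here is a minimal sketch of the same grouping in plain Java, with no jsoup involved. The class `CrimeReport` and its `format` method are hypothetical names made up for this illustration. The two lists stand in for the text of the crime `th` cells and the figure `td` cells. It assumes every crime has the same number of figures (6 in this table), and it stops early if the figures run out rather than throwing an `IndexOutOfBoundsException`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of the original answer): pairs each crime
// title with its block of figures taken from one flat list.
class CrimeReport {

    // Crime k owns figures k*perCrime .. k*perCrime + perCrime - 1.
    static List<String> format(List<String> crimes, List<String> figures, int perCrime) {
        List<String> lines = new ArrayList<>();
        int count = 0; // running figure number across all crimes
        for (int k = 0; k < crimes.size(); k++) {
            lines.add("Crime: " + crimes.get(k));
            for (int j = 0; j < perCrime; j++) {
                int index = k * perCrime + j;
                if (index >= figures.size()) {
                    return lines; // fewer figures than expected: stop instead of throwing
                }
                lines.add("Figure " + count + " : " + figures.get(index));
                count++;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // First two rows of the table from the question.
        List<String> crimes = Arrays.asList(
                "03 ,Attempts/threats to murder, assaults, harassments and related offences",
                "04 ,Dangerous or negligent acts");
        List<String> figures = Arrays.asList(
                "96", "89", "70", "97", "103", "103",
                "72", "67", "50", "53", "45", "43");
        for (String line : format(crimes, figures, 6)) {
            System.out.println(line);
        }
    }
}
```

Passing the group size as a parameter keeps the magic number 6 in one place; the offset 9 from the snippet above would still have to be applied when building the list of crime names.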