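I'm working on a project in Java 1.5 that moves player data from the NFL's web pages into a personal data store. I read the page source into a string and parse out the data I want to extract. I can capture the first block of regularly formatted player information, but I'm struggling to adapt my pattern to some irregularly structured whitespace. The commented-out lines begin where the pattern stops parsing correctly.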
pattern = Pattern.compile(
sTag + "(.*?)" + eTag + "\n"//position 1-group
+sTag + "(.*?)" + eTag + "\n" //number 2
+ "<td><a href=\"(.*?)/profile\">(.*?)</a>" + eTag + "\n" //name 4 (3 not used)
+sTag + "(.*?)" + eTag + "\n" //active status 5
// +"(.*?)" //6
// +sTag + "(.*?)" + eTag + "\n" //tackles 7
// +"(.*?)" //8
// +sTag + "(.*?)" + eTag //sacks 9
// +"(.*?)" //10
// +sTag + "(.*?)" + eTag //ff 11 (not used)
// +"(.*?)" //12
// +sTag + "(.*?)" + eTag //int 13
);
The HTML I'm trying to parse is formatted like this:
<td class="tbdy1"><a href="/teams/atlantafalcons/profile?team=ATL">ATL</a></td></tr>
<tr class="even">
<td class="tbdy">SS</td>
<td class="tbdy">20</td>
<td><a href="/player/willallen/2506088/profile">Allen, Will</a></td>
<td class="tbdy">ACT</td>
<td class="ra">
TCKL
</td>
<td class="tbdy">36</td>
<td class="ra">
SCK
</td>
<td class="tbdy">0.0</td>
<td class="ra">
FF
</td>
<td class="tbdy">1</td>
<td class="ra">
INT
</td>
<td class="tbdy">--</td>
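Any help?
Answer 0: (score: 0)
After some digging, I decided to solve the problem a different way. The thread "Removing whitespace from strings in Java" showed me how to strip out all of the whitespace, which made the pattern matching much easier. My final setup ended up looking like this: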
line = line.replaceAll("\\s", "");
String sTag = "<tdclass=\"tbdy\">";
String eTag = "</td>";
Pattern pattern;
Matcher matcher;
pattern = Pattern.compile(
// pattern //stat group#
sTag + "(.*?)" + eTag //position 1
+sTag + "(.*?)" + eTag //number 2
+ "<td><ahref=\"(.*?)/profile\">(.*?)</a>" + eTag //name 4 (3 not used)
+sTag + "(.*?)" + eTag //status 5
+"(.*?)" //6
+sTag + "(.*?)" + eTag //tackles 7
+"(.*?)" //8
+sTag + "(.*?)" + eTag //sacks 9
+"(.*?)" //10
+sTag + "(.*?)" + eTag //ff 11 (not used)
+"(.*?)" //12
+sTag + "(.*?)" + eTag //int 13
);
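matcher = pattern.matcher(line);
// find() advances to the next player row; each pass prints one row's stats
while (matcher.find()) {
    System.out.println(" " + matcher.group(1) + " " + matcher.group(2) + " " + matcher.group(4) + " " + matcher.group(5) + " " + matcher.group(7) + " " + matcher.group(9) + " " + matcher.group(13));
}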
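For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same approach, run against the sample row from the question. The class name PlayerRowParser, the parseRow method, and the SAMPLE constant are names I made up for illustration; only the whitespace stripping and the pattern itself come from the setup above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; not part of the original project.
class PlayerRowParser {
    // Tags as they look AFTER every whitespace character has been removed.
    static final String S_TAG = "<tdclass=\"tbdy\">";
    static final String E_TAG = "</td>";

    static final Pattern ROW = Pattern.compile(
          S_TAG + "(.*?)" + E_TAG                          // position 1
        + S_TAG + "(.*?)" + E_TAG                          // number 2
        + "<td><ahref=\"(.*?)/profile\">(.*?)</a>" + E_TAG // profile path 3, name 4
        + S_TAG + "(.*?)" + E_TAG                          // status 5
        + "(.*?)" + S_TAG + "(.*?)" + E_TAG                // label cell 6, tackles 7
        + "(.*?)" + S_TAG + "(.*?)" + E_TAG                // label cell 8, sacks 9
        + "(.*?)" + S_TAG + "(.*?)" + E_TAG                // label cell 10, ff 11
        + "(.*?)" + S_TAG + "(.*?)" + E_TAG                // label cell 12, int 13
    );

    // Sample row copied from the question, line breaks included.
    static final String SAMPLE =
          "<tr class=\"even\">\n"
        + "<td class=\"tbdy\">SS</td>\n"
        + "<td class=\"tbdy\">20</td>\n"
        + "<td><a href=\"/player/willallen/2506088/profile\">Allen, Will</a></td>\n"
        + "<td class=\"tbdy\">ACT</td>\n"
        + "<td class=\"ra\">\nTCKL\n</td>\n"
        + "<td class=\"tbdy\">36</td>\n"
        + "<td class=\"ra\">\nSCK\n</td>\n"
        + "<td class=\"tbdy\">0.0</td>\n"
        + "<td class=\"ra\">\nFF\n</td>\n"
        + "<td class=\"tbdy\">1</td>\n"
        + "<td class=\"ra\">\nINT\n</td>\n"
        + "<td class=\"tbdy\">--</td>\n";

    // Returns {position, number, name, status, tackles, sacks, int},
    // or null when no player row is found in the input.
    static String[] parseRow(String html) {
        String line = html.replaceAll("\\s", ""); // strip ALL whitespace first
        Matcher m = ROW.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] {
            m.group(1), m.group(2), m.group(4), m.group(5),
            m.group(7), m.group(9), m.group(13)
        };
    }

    public static void main(String[] args) {
        String[] row = parseRow(SAMPLE);
        if (row == null) {
            System.out.println("no match");
            return;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < row.length; i++) {
            out.append(' ').append(row[i]);
        }
        System.out.println(out);
    }
}
```

One side effect to be aware of: because replaceAll("\\s", "") removes every whitespace character, spaces inside the data go too, so the name in this sample comes back as "Allen,Will" rather than "Allen, Will". If that matters, you would have to re-insert the space after the comma or keep the whitespace and match it in the pattern instead.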