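I have some code that takes a URL, reads the file, and searches for strings matching a given regular expression, adding any matches to an ArrayList until it reaches the end of the file. How can I modify my code so that, while reading the file, I can also check for strings matching other regular expressions in the same pass, rather than having to read the file once for each regular expression?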
//Pattern currently being checked for
Pattern name = Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>");
//Pattern I want to check for as well, currently not implemented
Pattern date = Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}");
Matcher m;
InputStream inputStream = null;
arrayList = new ArrayList<String>();
try {
    URL url = new URL(
            "URL to be read");
    inputStream = (InputStream) url.getContent();
} catch (Exception e) {
    e.printStackTrace();
} finally {
    InputStreamReader isr = new InputStreamReader(inputStream);
    BufferedReader buf = new BufferedReader(isr);
    String str = null;
    String s = null;
    try {
        while ((str = buf.readLine()) != null) {
            m = name.matcher(str);
            while(m.find()){
                s = m.group();
                arrayList.add(s);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
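Answer 0 (score: 6)
Once you have two or more Matchers, you should use a List. You should not do the reading in the finally block, because one of the streams may have failed. Instead, the finally block should be used to close resources.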
List <Pattern> patterns = new ArrayList <Pattern> ();
//Pattern currently being checked for
patterns.add (Pattern.compile ("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>"));
//Pattern I want to check for as well, currently not implemented
patterns.add (Pattern.compile ("[0-9]{2}/[0-9]{2}/[0-9]{4}"));
BufferedReader buf = null;
List <String> matches = new ArrayList <String> ();
try {
    URL url = new URL ("URL to be read");
    InputStream inputStream = (InputStream) url.getContent ();
    InputStreamReader isr = new InputStreamReader (inputStream);
    buf = new BufferedReader (isr);
    String str = null;
    while ((str = buf.readLine ()) != null)
    {
        for (Pattern p : patterns)
        {
            Matcher m = p.matcher (str);
            while (m.find ())
                matches.add (m.group ());
        }
    }
}
catch (Exception e)
{
    e.printStackTrace();
}
finally
{
    if (buf != null)
        try { buf.close (); } catch (IOException ignored) { /*empty*/}
}
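Not corrected in the code: you should catch specific exceptions rather than "Exception". The Matcher is only used in the innermost loop, so declare it there rather than in a wider scope. A small scope makes it easy to reason about how a variable is used.
I'm not sure whether java.util.Scanner could be used to read from the URL more simply. Have a look at the docs.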
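The two points this answer leaves uncorrected (specific exceptions, narrow variable scope) could look roughly like the sketch below. It is only an illustration under assumptions: `MultiPatternScan` and `collectMatches` are hypothetical names, try-with-resources (Java 7+) replaces the explicit finally block, and a `StringReader` over made-up sample text stands in for the URL stream so the example runs offline.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiPatternScan {

    // Hypothetical helper: reads the input once and applies every pattern to each line.
    static List<String> collectMatches(Reader source, List<Pattern> patterns) throws IOException {
        List<String> matches = new ArrayList<>();
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader buf = new BufferedReader(source)) {
            String line;
            while ((line = buf.readLine()) != null) {
                for (Pattern p : patterns) {
                    Matcher m = p.matcher(line); // declared in the narrowest scope
                    while (m.find()) {
                        matches.add(m.group());
                    }
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Pattern> patterns = new ArrayList<>();
        patterns.add(Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>"));
        patterns.add(Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}"));
        // Sample text is an assumption for the demo; it stands in for the page behind the URL
        String sample = "<a id=\"dg__ct01_hpl1\">Alice</a> 01/02/2012\nno match here\n";
        try {
            System.out.println(collectMatches(new StringReader(sample), patterns));
        } catch (IOException e) { // a specific exception rather than Exception
            e.printStackTrace();
        }
    }
}
```

With a real URL, the reader would presumably be an `InputStreamReader` wrapped around `url.openStream()`, and `MalformedURLException` (a subclass of `IOException`) would be the other specific exception to handle.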
Answer 1 (score: 2)
Use a Java library that knows how to parse HTML properly instead of using regular expressions.
For example, see the answers to: Java HTML Parsing
Answer 2 (score: 1)
Just get a new matcher for the other pattern:
Matcher m2 = date.matcher(str);
... // do whatever you want to do with this pattern match
By the way, parsing HTML with regular expressions is generally not a very good idea.
(ob. link, by Assistant Don't Parse HTML With Regex Officer in charge)
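If the name and date matches need to stay apart rather than land in one list, one option is to key the results by a label. The sketch below is an assumption-laden illustration, not part of the answer: `LabeledMatches`, `scan`, the labels "name" and "date", and the sample lines are all hypothetical, and the lines are passed in directly so it runs without a network connection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabeledMatches {

    // Hypothetical helper: one pass over the lines, one result list per labeled pattern.
    static Map<String, List<String>> scan(Iterable<String> lines, Map<String, Pattern> patterns) {
        Map<String, List<String>> results = new LinkedHashMap<>();
        for (String label : patterns.keySet()) {
            results.put(label, new ArrayList<>());
        }
        for (String line : lines) {
            for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
                // A fresh matcher per pattern and line, as the answer suggests
                Matcher m = entry.getValue().matcher(line);
                while (m.find()) {
                    results.get(entry.getKey()).add(m.group());
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        patterns.put("name", Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>"));
        patterns.put("date", Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}"));
        // Sample lines are made up for the demo
        List<String> lines = Arrays.asList(
                "<a id=\"dg__ct01_hpl1\">Alice</a> 01/02/2012",
                "05/06/2014 and 07/08/2015");
        System.out.println(scan(lines, patterns));
    }
}
```

A `LinkedHashMap` is used here only so the labels print in insertion order; any `Map` would do if order does not matter.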
Answer 3 (score: 1)
Create two Matcher objects:
//Pattern currently being checked for
//(Pattern.matcher requires an input; start with an empty string and reset it per line)
Matcher nameMatcher = Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>").matcher("");
//Pattern I want to check for as well, currently not implemented
Matcher dateMatcher = Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}").matcher("");
// other stuff...
Check each line that is read against each matcher:
while ((str = buf.readLine()) != null) {
    nameMatcher.reset(str);
    while(nameMatcher.find()){
        s = nameMatcher.group();
        arrayList.add(s);
    }
    dateMatcher.reset(str);
    // use dateMatcher here, not nameMatcher
    while(dateMatcher.find()){
        s = dateMatcher.group();
        arrayList.add(s);
    }
}
Important: use reset(CharSequence) each time instead of allocating a new Matcher object.
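As a self-contained illustration of the reuse advice, the sketch below creates one Matcher up front and points it at each new line with `reset(CharSequence)`. The class name `ReusedMatcher`, the method `findDates`, and the sample lines are assumptions made for the demo.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReusedMatcher {

    // Hypothetical helper: collects every date-like substring using a single Matcher.
    static List<String> findDates(List<String> lines) {
        // Created once with an empty input, then reset for every line
        Matcher dateMatcher = Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}").matcher("");
        List<String> found = new ArrayList<>();
        for (String line : lines) {
            dateMatcher.reset(line); // discards previous state and sets the new input
            while (dateMatcher.find()) {
                found.add(dateMatcher.group());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample lines are made up for the demo
        System.out.println(findDates(Arrays.asList("due 01/02/2012", "none", "03/04/2013 05/06/2014")));
    }
}
```

A reused Matcher carries mutable state, so this pattern suits a single-threaded loop; Matcher instances are not safe to share between threads.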