Question

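I have some code that takes a URL, reads the file, and searches for strings that match a given regular expression, adding any matches to an ArrayList until it reaches the end of the file. How can I modify my code so that, while reading the file, I can also check for strings matching other regular expressions in the same pass, rather than having to read the file several times, once per regular expression?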

    //Pattern currently being checked for
    Pattern name = Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>");

    //Pattern I want to check for as well, currently not implemented
    Pattern date = Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}");

    Matcher m;
    InputStream inputStream = null;
    arrayList = new ArrayList<String>();
    try {
        URL url = new URL(
                "URL to be read");
        inputStream = (InputStream) url.getContent();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        InputStreamReader isr = new InputStreamReader(inputStream);
        BufferedReader buf = new BufferedReader(isr);
        String str = null;
        String s = null;

        try {
            while ((str = buf.readLine()) != null) {

                m = name.matcher(str);
                while(m.find()){
                    s = m.group();
                    arrayList.add(s);
                }

            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

Answer 1

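Once you have two or more Matchers, you should use a List. You also should not process the input in the finally block: that block runs even if opening the stream failed, so inputStream may still be null there. Instead, the finally block should be used to close resources.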

    List <Pattern> patterns = new ArrayList <Pattern> ();
    //Pattern currently being checked for
    patterns.add (Pattern.compile ("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>"));
    //Pattern I want to check for as well, currently not implemented
    patterns.add (Pattern.compile ("[0-9]{2}/[0-9]{2}/[0-9]{4}"));
    BufferedReader buf = null;
    List <String> matches = new ArrayList <String> ();
    try {
        URL url = new URL ("URL to be read");
        InputStream inputStream = (InputStream) url.getContent ();
        InputStreamReader isr = new InputStreamReader (inputStream);
        buf = new BufferedReader (isr);
        String str = null;
        while ((str = buf.readLine ()) != null) 
        {
            for (Pattern p : patterns) 
            {
                Matcher m = p.matcher (str);
                while (m.find ()) 
                    matches.add (m.group ());
            }
        }       
    } 
    catch (Exception e) 
    {
        e.printStackTrace();
    }
    finally  
    {
        if (buf != null) 
            try { buf.close (); } catch (IOException ignored) { /*empty*/}
    }

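Not corrected in the code: you should catch the specific exceptions rather than `Exception`. The Matcher is only used in the innermost loop, so declare it there rather than in a wider scope. A small scope makes it easy to reason about how a variable is used.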
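
The two remarks above can be sketched roughly as follows. This assumes Java 7 or later, where try-with-resources can replace the explicit finally block. The class name `MultiPatternScan` and the method `findAll` are made up for illustration, and a `StringReader` with sample text stands in for the stream obtained from the URL.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this sketch.
public class MultiPatternScan {

    // Reads the input once and collects the matches of every pattern, line by line.
    static List<String> findAll(Reader in, List<Pattern> patterns) throws IOException {
        List<String> matches = new ArrayList<>();
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader buf = new BufferedReader(in)) {
            String line;
            while ((line = buf.readLine()) != null) {
                for (Pattern p : patterns) {
                    Matcher m = p.matcher(line); // declared in the smallest possible scope
                    while (m.find()) {
                        matches.add(m.group());
                    }
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Pattern> patterns = Arrays.asList(
                Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>"),
                Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}"));
        // Sample input invented for the sketch; real input would come from the URL.
        String sample = "<a id=\"dg__ct01_hpl1\">Alice</a> 01/02/2013\nno match here\n";
        try {
            System.out.println(findAll(new StringReader(sample), patterns));
        } catch (IOException e) { // the specific exception, not Exception
            e.printStackTrace();
        }
    }
}
```

With a real URL, the `Reader` would typically be an `InputStreamReader` wrapped around `url.openStream()`, and a `MalformedURLException` from the `URL` constructor could be caught separately.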

I'm not sure whether java.util.Scanner can be used to read from the URL more simply. Have a look at the documentation.
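
For what it is worth, here is a minimal sketch of what a Scanner-based line reader might look like. It is not tested against a real URL: a `ByteArrayInputStream` stands in for the network stream, UTF-8 is an assumed encoding, and the names `ScannerLines` and `findDates` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this sketch.
public class ScannerLines {

    // Reads the stream line by line with a Scanner and collects every date match.
    static List<String> findDates(InputStream in) {
        Pattern date = Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}");
        List<String> matches = new ArrayList<>();
        // Scanner is Closeable, so try-with-resources also closes the underlying stream
        try (Scanner sc = new Scanner(in, "UTF-8")) { // UTF-8 is an assumption
            while (sc.hasNextLine()) {
                Matcher m = date.matcher(sc.nextLine());
                while (m.find()) {
                    matches.add(m.group());
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // With a real URL this stream would come from url.openStream() instead.
        InputStream in = new ByteArrayInputStream(
                "seen 01/02/2013\nnothing here\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(findDates(in));
    }
}
```

One thing to keep in mind: Scanner does not throw the IOException of the underlying stream from its read methods; it is only available afterwards through `ioException()`, so read errors are easier to miss than with a BufferedReader.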

Answer 2

Use a Java library that knows how to parse HTML properly, rather than using regular expressions.

For example, have a look at the answers to: Java HTML Parsing

Answer 3

Just get a new matcher for the other pattern:

   Matcher m2 = date.matcher(str);
   ... // do whatever you want to do with this pattern match
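
As one possible way to fill in the elided part, the sketch below keeps the results of the two patterns in separate lists. The names `TwoMatchersPerLine`, `collect`, `names` and `dates` are invented for the example, and the sample line is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this sketch.
public class TwoMatchersPerLine {
    static final Pattern NAME = Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>");
    static final Pattern DATE = Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}");

    // Checks one line against both patterns, so each line only has to be read once.
    static void collect(String str, List<String> names, List<String> dates) {
        Matcher m = NAME.matcher(str);
        while (m.find()) {
            names.add(m.group());
        }
        Matcher m2 = DATE.matcher(str); // a second matcher for the same line
        while (m2.find()) {
            dates.add(m2.group());
        }
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        List<String> dates = new ArrayList<>();
        collect("<a id='dg__ct7_hpl1'>Bob</a> on 05/06/2012", names, dates);
        System.out.println(names);
        System.out.println(dates);
    }
}
```

In the original loop, `collect(str, names, dates)` would be called once per line returned by `buf.readLine()`.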

By the way, parsing HTML with regular expressions is generally not a very good idea. (ob. link, by Assistant Don't Parse HTML With Regex Officer in charge)

Answer 4

Create two Matcher objects:

    //Pattern currently being checked for
    //matcher() needs an input, so start with an empty string and reset(...) it for each line
    Matcher nameMatcher = Pattern.compile("<a id=.dg__ct(.+?)_hpl1.>(.+?)</a>").matcher("");

    //Pattern I want to check for as well, currently not implemented
    Matcher dateMatcher = Pattern.compile("[0-9]{2}/[0-9]{2}/[0-9]{4}").matcher("");

    // other stuff...

Check each string that is read against every matcher:

    while ((str = buf.readLine()) != null) {

        // first pattern: point the existing matcher at the new line
        nameMatcher.reset(str);

        while (nameMatcher.find()) {
            s = nameMatcher.group();
            arrayList.add(s);
        }

        // second pattern, checked against the same line
        dateMatcher.reset(str);

        while (dateMatcher.find()) {
            s = dateMatcher.group();
            arrayList.add(s);
        }
    }

Important

Use reset(CharSequence) each time instead of allocating a new Matcher object.

Scanning a file with multiple regular expressions
