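Cancelling a long-running regex match?

Asked: 2009-05-26 13:40:28

Tags: java regex multithreading

Say I'm running a service where users can submit a regex to search through lots of data. If a user submits a regex that is very slow (i.e. Matcher.find() takes minutes to return), I want a way to cancel that match. The only way I can think of is to have another thread monitor how long the match is taking and, if necessary, cancel it with Thread.stop().

Member variables: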

long REGEX_TIMEOUT = 30000L;
Object lock = new Object();
boolean finished = false;
Thread matcherThread;

Matcher thread:

try {
    matcherThread = Thread.currentThread();

    // imagine code to start monitor thread is here

    try {
        matched = matcher.find();
    } finally {
        synchronized (lock) {
            finished = true;
            lock.notifyAll();
        }
    }
} catch (ThreadDeath td) {
    // send angry message to client
    // handle error without rethrowing td
}

Monitor thread:

synchronized (lock) {
    while (! finished) {
        try {
            lock.wait(REGEX_TIMEOUT);

            if (! finished) {
                matcherThread.stop();
            }
        } catch (InterruptedException ex) {
            // ignore, top level method in dedicated thread, etc..
        }
    }
}

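I've read java.sun.com/j2se/1.4.2/docs/guide/misc/threadPrimitiveDeprecation.html and I think this usage is safe: I control where ThreadDeath is thrown through synchronisation, I handle it, and the only objects that could be damaged are my Pattern and Matcher instances, which are discarded anyway. I think this breaks the contract of Thread.stop() because I don't rethrow the error, but I don't actually want the thread to die, only to abort the find() call.

I've managed to avoid these deprecated API components so far, but Matcher.find() does not seem to be interruptible and can take a very long time to return. Is there a better way to do this?

7 answers:

Answer 0: (score: 41)

From Heritrix (crawler.archive.org):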

/**
 * CharSequence that notices thread interrupts -- as might be necessary
 * to recover from a loose regex on unexpected challenging input.
 * 
 * @author gojomo
 */
public class InterruptibleCharSequence implements CharSequence {
    CharSequence inner;
    // public long counter = 0; 

    public InterruptibleCharSequence(CharSequence inner) {
        super();
        this.inner = inner;
    }

    public char charAt(int index) {
        if (Thread.interrupted()) { // clears flag if set
            throw new RuntimeException(new InterruptedException());
        }
        // counter++;
        return inner.charAt(index);
    }

    public int length() {
        return inner.length();
    }

    public CharSequence subSequence(int start, int end) {
        return new InterruptibleCharSequence(inner.subSequence(start, end));
    }

    @Override
    public String toString() {
        return inner.toString();
    }
}

Wrap your CharSequence with this one and thread interrupts will work...
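
As a rough sketch of how the wrapper might be driven: run the match on a worker thread and cancel the Future when the wait times out, which interrupts the worker so that its next charAt() call throws. The helper name findWithTimeout, the class names and the 500 ms limit are illustrative assumptions, not part of Heritrix. The wrapper is repeated here (as InterruptibleChars) only so the example compiles on its own.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

// Same idea as the Heritrix class above, repeated so this sketch is self-contained.
class InterruptibleChars implements CharSequence {
    private final CharSequence inner;

    InterruptibleChars(CharSequence inner) { this.inner = inner; }

    @Override public char charAt(int index) {
        if (Thread.interrupted()) { // clears the flag if it was set
            throw new RuntimeException(new InterruptedException());
        }
        return inner.charAt(index);
    }
    @Override public int length() { return inner.length(); }
    @Override public CharSequence subSequence(int start, int end) {
        return new InterruptibleChars(inner.subSequence(start, end));
    }
    @Override public String toString() { return inner.toString(); }
}

class InterruptibleFindDemo {
    /** Hypothetical helper: the result of find(), or null if the timeout expired first. */
    static Boolean findWithTimeout(Pattern pattern, CharSequence input, long timeoutMillis)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> result =
                    executor.submit(() -> pattern.matcher(new InterruptibleChars(input)).find());
            try {
                return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                result.cancel(true); // interrupts the worker; its next charAt() call throws
                return null;
            } catch (ExecutionException e) {
                throw new RuntimeException(e.getCause());
            }
        } finally {
            executor.shutdownNow(); // do not leave the worker thread behind
        }
    }

    public static void main(String[] args) throws InterruptedException {
        String input = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
        // Prints null if the 500 ms limit fires before find() returns.
        System.out.println(findWithTimeout(Pattern.compile("(x+x+)+y"), input, 500));
        System.out.println(findWithTimeout(Pattern.compile("b+"), "abbbc", 500));
    }
}
```

Creating an executor per call keeps the sketch short; a real service would more likely share one pool across requests.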

Answer 1: (score: 4)

With a slight variation you can avoid using an additional thread:

public class RegularExpressionUtils {

    // demonstrates behavior for regular expression running into catastrophic backtracking for given input
    public static void main(String[] args) {
        Matcher matcher = createMatcherWithTimeout(
                "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y", 2000);
        System.out.println(matcher.matches());
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, String regularExpression, int timeoutMillis) {
        Pattern pattern = Pattern.compile(regularExpression);
        return createMatcherWithTimeout(stringToMatch, pattern, timeoutMillis);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, Pattern regularExpressionPattern, int timeoutMillis) {
        CharSequence charSequence = new TimeoutRegexCharSequence(stringToMatch, timeoutMillis, stringToMatch,
                regularExpressionPattern.pattern());
        return regularExpressionPattern.matcher(charSequence);
    }

    private static class TimeoutRegexCharSequence implements CharSequence {

        private final CharSequence inner;

        private final int timeoutMillis;

        private final long timeoutTime;

        private final String stringToMatch;

        private final String regularExpression;

        public TimeoutRegexCharSequence(CharSequence inner, int timeoutMillis, String stringToMatch, String regularExpression) {
            super();
            this.inner = inner;
            this.timeoutMillis = timeoutMillis;
            this.stringToMatch = stringToMatch;
            this.regularExpression = regularExpression;
            timeoutTime = System.currentTimeMillis() + timeoutMillis;
        }

        public char charAt(int index) {
            if (System.currentTimeMillis() > timeoutTime) {
                throw new RuntimeException("Timeout occurred after " + timeoutMillis + "ms while processing regular expression '"
                                + regularExpression + "' on input '" + stringToMatch + "'!");
            }
            return inner.charAt(index);
        }

        public int length() {
            return inner.length();
        }

        public CharSequence subSequence(int start, int end) {
            return new TimeoutRegexCharSequence(inner.subSequence(start, end), timeoutMillis, stringToMatch, regularExpression);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

}

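Many thanks for pointing me to this solution in response to an unnecessarily complicated question.

Answer 2: (score: 0)

Another workaround is to limit the matcher's region, then call find(), and repeat until the thread is interrupted or a match is found.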
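
A minimal sketch of that idea, assuming a fixed window size: the helper name findInChunks and the chunkSize parameter are made up for illustration. Note the trade-offs: a match that straddles two windows is missed, anchors such as ^ and $ treat each region boundary as the start or end of input by default, and a single find() inside one window is still not interruptible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionFindDemo {
    /**
     * Hypothetical helper: searches the input one window of chunkSize characters at a time
     * and checks the interrupt flag between windows.
     * Returns the start index of the first match found, or -1 if no window contains one.
     */
    static int findInChunks(Pattern pattern, CharSequence input, int chunkSize)
            throws InterruptedException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        Matcher matcher = pattern.matcher(input);
        for (int start = 0; start < input.length(); start += chunkSize) {
            if (Thread.interrupted()) { // cancellation point between windows
                throw new InterruptedException("regex search cancelled");
            }
            int end = Math.min(input.length(), start + chunkSize);
            matcher.region(start, end); // restrict the next find() to [start, end)
            if (matcher.find()) {
                return matcher.start();
            }
        }
        return -1;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(findInChunks(Pattern.compile("b+"), "aaaaabbb", 4));
    }
}
```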

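Answer 3: (score: 0)

What you may need is a different library, one that implements the NFA algorithm.

The NFA algorithm can be hundreds of times faster than the backtracking algorithm used by the Java standard library.

The Java standard library's engine is sensitive to the input regexp, which may be what causes your problem: some inputs keep the CPU busy for years.

With the NFA algorithm, a timeout can be expressed as a limit on the number of steps it takes. That is more efficient than the thread solution. Trust me, I used a thread timeout for a related problem and it was terrible for performance. I finally fixed it by modifying the main loop of my algorithm implementation: I inserted some checkpoints into the main loop to test the time.

Details can be found here: https://swtch.com/~rsc/regexp/regexp1.html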
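
To illustrate what a step limit inside the main loop might look like, here is a toy NFA-style matcher. It is an assumption-laden sketch, not the answerer's code and not java.util.regex: it supports only literal characters, '.' and a postfix '*', and always matches the whole input. The class name BudgetedNfa and the maxSteps parameter are invented for the example.

```java
class BudgetedNfa {
    /**
     * Whole-input match for a tiny regex subset (literals, '.', postfix '*').
     * Tracks the set of reachable pattern positions, so the work is at most
     * input length times pattern length. Throws IllegalStateException once
     * more than maxSteps state checks have been performed.
     */
    static boolean matches(String pattern, String input, long maxSteps) {
        // Parse into symbols plus a "starred" flag per symbol.
        char[] sym = new char[pattern.length()];
        boolean[] star = new boolean[pattern.length()];
        int m = 0;
        for (int i = 0; i < pattern.length(); i++) {
            char ch = pattern.charAt(i);
            if (ch == '*') {
                if (m == 0) throw new IllegalArgumentException("'*' has nothing to repeat");
                star[m - 1] = true;
            } else {
                sym[m++] = ch;
            }
        }

        boolean[] current = new boolean[m + 1]; // state m means "whole pattern consumed"
        current[0] = true;
        skipStars(current, star, m);

        long steps = 0;
        for (int pos = 0; pos < input.length(); pos++) {
            char ch = input.charAt(pos);
            boolean[] next = new boolean[m + 1];
            for (int s = 0; s < m; s++) {
                if (++steps > maxSteps) { // checkpoint inside the main loop
                    throw new IllegalStateException("step budget of " + maxSteps + " exceeded");
                }
                if (current[s] && (sym[s] == '.' || sym[s] == ch)) {
                    next[star[s] ? s : s + 1] = true; // a starred symbol may repeat
                }
            }
            skipStars(next, star, m);
            current = next;
        }
        return current[m];
    }

    // A starred symbol may match zero times, so its state also enables the following one.
    private static void skipStars(boolean[] states, boolean[] star, int m) {
        for (int s = 0; s < m; s++) {
            if (states[s] && star[s]) states[s + 1] = true;
        }
    }

    public static void main(String[] args) {
        System.out.println(matches("a*a*a*b", "aaaaaaaaaaaaaaaaaaaaaaaa", 1_000));
    }
}
```

Counting steps instead of reading the clock keeps the check cheap and deterministic; a wall-clock test every few thousand steps would fit in the same place.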

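Answer 4: (score: 0)

To reduce overhead, I added a counter so that the timeout is checked only on every n-th read of charAt.

Note:

Some people said that charAt may not be called frequently enough. I added the foo variable only to demonstrate how often charAt is called, and that it is frequent enough. If you are going to use this in production, remove that counter: it decreases performance and would eventually overflow a long if it ran in a server for long enough. In this example, charAt is called 30 million times every 0.8 seconds or so (not measured under proper microbenchmarking conditions; this is just a proof of concept). If you need higher precision you can set a lower checkInterval, at the cost of performance (in the long run, System.currentTimeMillis() > timeoutTime is more expensive than the if clause).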

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.goikosoft.test.RegexpTimeoutException;

/**
 * Allows creating regular expression matchers that time out.
 *
 * Limitations: Can only throw RuntimeException. Decreases performance.
 *
 * Posted by Kris on Stack Overflow.
 *
 * Modified by dgoiko to execute the timeout check only every n chars.
 * Now timeout < 0 means no timeout.
 *
 * @author Kris https://stackoverflow.com/a/910798/9465588
 *
 */
public class RegularExpressionUtils {

    public static long foo = 0;

    // demonstrates behavior for regular expression running into catastrophic backtracking for given input
    public static void main(String[] args) {
        long millis = System.currentTimeMillis();
        // This checkInterval produces a < 500 ms delay. Higher checkInterval will produce higher delays on timeout.
        Matcher matcher = createMatcherWithTimeout(
                "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y", 10000, 30000000);
        try {
            System.out.println(matcher.matches());
        } catch (RuntimeException e) {
            System.out.println("Operation timed out after " + (System.currentTimeMillis() - millis) + " milliseconds");
        }
        System.out.print(foo);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, String regularExpression, long timeoutMillis,
                                                      int checkInterval) {
        Pattern pattern = Pattern.compile(regularExpression);
        return createMatcherWithTimeout(stringToMatch, pattern, timeoutMillis, checkInterval);
    }

    public static Matcher createMatcherWithTimeout(String stringToMatch, Pattern regularExpressionPattern,
                                                    long timeoutMillis, int checkInterval) {
        if (timeoutMillis < 0) {
            return regularExpressionPattern.matcher(stringToMatch);
        }
        CharSequence charSequence = new TimeoutRegexCharSequence(stringToMatch, timeoutMillis, stringToMatch,
                regularExpressionPattern.pattern(), checkInterval);
        return regularExpressionPattern.matcher(charSequence);
    }

    private static class TimeoutRegexCharSequence implements CharSequence {

        private final CharSequence inner;

        private final long timeoutMillis;

        private final long timeoutTime;

        private final String stringToMatch;

        private final String regularExpression;

        private int checkInterval;

        private int attemps;

        TimeoutRegexCharSequence(CharSequence inner, long timeoutMillis, String stringToMatch,
                                  String regularExpression, int checkInterval) {
            super();
            this.inner = inner;
            this.timeoutMillis = timeoutMillis;
            this.stringToMatch = stringToMatch;
            this.regularExpression = regularExpression;
            timeoutTime = System.currentTimeMillis() + timeoutMillis;
            this.checkInterval = checkInterval;
            this.attemps = 0;
        }

        public char charAt(int index) {
            if (this.attemps == this.checkInterval) {
                foo++;
                if (System.currentTimeMillis() > timeoutTime) {
                    throw new RegexpTimeoutException(regularExpression, stringToMatch, timeoutMillis);
                }
                this.attemps = 0;
            } else {
                this.attemps++;
            }

            return inner.charAt(index);
        }

        public int length() {
            return inner.length();
        }

        public CharSequence subSequence(int start, int end) {
            return new TimeoutRegexCharSequence(inner.subSequence(start, end), timeoutMillis, stringToMatch,
                                                regularExpression, checkInterval);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

}

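There is also a custom exception, so that you can catch only that exception and avoid swallowing other runtime exceptions that Pattern or Matcher might throw.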

public class RegexpTimeoutException extends RuntimeException {
    private static final long serialVersionUID = 6437153127902393756L;

    private final String regularExpression;

    private final String stringToMatch;

    private final long timeoutMillis;

    public RegexpTimeoutException() {
        super();
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String message, Throwable cause) {
        super(message, cause);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String message) {
        super(message);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(Throwable cause) {
        super(cause);
        regularExpression = null;
        stringToMatch = null;
        timeoutMillis = 0;
    }

    public RegexpTimeoutException(String regularExpression, String stringToMatch, long timeoutMillis) {
        super("Timeout occurred after " + timeoutMillis + "ms while processing regular expression '"
                + regularExpression + "' on input '" + stringToMatch + "'!");
        this.regularExpression = regularExpression;
        this.stringToMatch = stringToMatch;
        this.timeoutMillis = timeoutMillis;
    }

    public String getRegularExpression() {
        return regularExpression;
    }

    public String getStringToMatch() {
        return stringToMatch;
    }

    public long getTimeoutMillis() {
        return timeoutMillis;
    }

}

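Based on Andreas' answer. The main credit should go to him and his source.

Answer 5: (score: 0)

Before executing a user-submitted regex, check it for "evil" patterns using one or more regex patterns of your own (this could be a method that is called first and decides whether the regex is executed at all):

This regex: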

\(.+\+\)[\+\*]

Will match:

(a+)+
(ab+)+
([a-zA-Z]+)*

This regex:

\((.+)\|(\1\?|\1{2,})\)\+

Will match:

(a|aa)+
(a|a?)+

This regex:

\(\.\*.\)\{\d{2,}\}

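Will match:

(.*a){x} for x > 10

I realise I'm a bit naive about regex and regex DoS, but I can't help thinking that a little pre-screening for known "evil" patterns would go a long way toward preventing problems at execution time, especially when the regex is input provided by an end user. Since I'm far from a regex expert, the patterns above are probably not refined enough. This is just food for thought, because everything else I've found seems to say it can't be done and focuses instead on putting a timeout on the regex engine or limiting the number of iterations it is allowed to execute.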
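
As a sketch of the pre-screening method mentioned above, the three patterns can be compiled once and run against the submitted regex before it is ever executed. The class name EvilRegexScreen and the method name looksEvil are hypothetical, and the pattern list is only as complete as the three examples given here.

```java
import java.util.regex.Pattern;

class EvilRegexScreen {
    // The three screening patterns from this answer, escaped for Java string literals.
    private static final Pattern[] SUSPICIOUS = {
        Pattern.compile("\\(.+\\+\\)[\\+\\*]"),              // e.g. (a+)+  or ([a-zA-Z]+)*
        Pattern.compile("\\((.+)\\|(\\1\\?|\\1{2,})\\)\\+"), // e.g. (a|aa)+ or (a|a?)+
        Pattern.compile("\\(\\.\\*.\\)\\{\\d{2,}\\}")        // e.g. (.*a){11}
    };

    /** Returns true if the user-supplied regex contains one of the known risky shapes. */
    static boolean looksEvil(String userRegex) {
        for (Pattern suspicious : SUSPICIOUS) {
            if (suspicious.matcher(userRegex).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {"(a+)+", "(a|aa)+", "(.*a){11}", "abc"};
        for (String regex : samples) {
            System.out.println(regex + " -> " + looksEvil(regex));
        }
    }
}
```

A screen like this only rejects shapes it already knows about, so it is probably best treated as a first filter in front of a timeout rather than a replacement for one.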

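Answer 6: (score: -1)

A long-running pattern match can be stopped with the following approach.

  • Create a StateFulCharSequence class that manages the state of the pattern match. Once that state is changed, the next call to the charAt method throws an exception.
  • The state change can be scheduled with a ScheduledExecutorService, using whatever timeout is required.
  • Here the pattern matching happens in the main thread, and there is no need to check the thread's interrupt status on every call.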

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TimedPatternMatcher {
        public static void main(String[] args) {
            ScheduledExecutorService executorService = Executors.newScheduledThreadPool(1);
            Pattern pattern = Pattern.compile("some regex pattern");
            StateFulCharSequence stateFulCharSequence = new StateFulCharSequence("some character sequence");
            Matcher matcher = pattern.matcher(stateFulCharSequence);
            // Flip the timed-out flag after 10 ms.
            executorService.schedule(stateFulCharSequence, 10, TimeUnit.MILLISECONDS);
            try {
                boolean isMatched = matcher.find();
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                // Without this, the pool's non-daemon thread keeps the JVM alive.
                executorService.shutdownNow();
            }
        }

        /*
        When this runnable is executed, it sets isTimedOut to true; the next charAt call then
        stops the pattern matching by throwing an exception.
         */
        public static class StateFulCharSequence implements CharSequence, Runnable {
            private final CharSequence inner;

            // volatile so that the write from the scheduler thread is visible to the matching thread
            private volatile boolean isTimedOut = false;

            public StateFulCharSequence(CharSequence inner) {
                super();
                this.inner = inner;
            }

            @Override
            public char charAt(int index) {
                if (isTimedOut) {
                    throw new RuntimeException(new TimeoutException("Pattern matching timeout occurs"));
                }
                return inner.charAt(index);
            }

            @Override
            public int length() {
                return inner.length();
            }

            @Override
            public CharSequence subSequence(int start, int end) {
                // Note: the returned sequence has its own flag and is not affected by the timer.
                return new StateFulCharSequence(inner.subSequence(start, end));
            }

            @Override
            public String toString() {
                return inner.toString();
            }

            public void setTimedOut() {
                this.isTimedOut = true;
            }

            @Override
            public void run() {
                this.isTimedOut = true;
            }
        }
    }