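Suppose I'm running a service where users can submit a regex to search through lots of data. If a user submits a regex that is very slow (i.e. Matcher.find() takes minutes to return), I want a way to cancel that match. The only way I can think of is to have another thread monitor how long the match is taking and, if necessary, cancel it with Thread.stop().
Member variables: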
long REGEX_TIMEOUT = 30000L;
Object lock = new Object();
boolean finished = false;
Thread matcherThread;
Matcher thread:
try {
matcherThread = Thread.currentThread();
// imagine code to start monitor thread is here
try {
matched = matcher.find();
} finally {
synchronized (lock) {
finished = true;
lock.notifyAll();
}
}
} catch (ThreadDeath td) {
// send angry message to client
// handle error without rethrowing td
}
Monitor thread:
synchronized (lock) {
while (! finished) {
try {
lock.wait(REGEX_TIMEOUT);
if (! finished) {
matcherThread.stop();
}
} catch (InterruptedException ex) {
// ignore, top level method in dedicated thread, etc..
}
}
}
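I've read java.sun.com/j2se/1.4.2/docs/guide/misc/threadPrimitiveDeprecation.html and I think this usage is safe: I control where ThreadDeath can be thrown via synchronization, I handle it, and the only objects that could be damaged are my Pattern and Matcher instances, which are discarded anyway. I think this goes against the contract of Thread.stop() because I don't rethrow the error, but I don't actually want the thread to die, only to abort the find() call.
So far I've managed to avoid these deprecated API components, but Matcher.find() does not seem to be interruptible and can take a very long time to return. Is there a better way to do this?
Answer 0 (score: 41)
From Heritrix (crawler.archive.org):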
/**
* CharSequence that notices thread interrupts -- as might be necessary
* to recover from a loose regex on unexpected challenging input.
*
* @author gojomo
*/
public class InterruptibleCharSequence implements CharSequence {
CharSequence inner;
// public long counter = 0;
public InterruptibleCharSequence(CharSequence inner) {
super();
this.inner = inner;
}
public char charAt(int index) {
if (Thread.interrupted()) { // clears flag if set
throw new RuntimeException(new InterruptedException());
}
// counter++;
return inner.charAt(index);
}
public int length() {
return inner.length();
}
public CharSequence subSequence(int start, int end) {
return new InterruptibleCharSequence(inner.subSequence(start, end));
}
@Override
public String toString() {
return inner.toString();
}
}
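Wrap your CharSequence in this one and thread interrupts will work: the matcher reads its input through charAt(), so an interrupt is noticed on the next character access.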
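As a rough sketch of how the wrapper might be used, the match can run on a worker thread that is interrupted when it overruns. The class name InterruptibleRegexDemo and the helper findWithTimeout are hypothetical names chosen for this illustration, not part of Heritrix; the wrapper is repeated in compact form only so the sketch compiles on its own.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

// Hypothetical demo class: runs find() on a worker thread and interrupts it on timeout.
class InterruptibleRegexDemo {

    // Compact copy of the wrapper from the answer, so this sketch is self-contained.
    static final class InterruptibleCharSequence implements CharSequence {
        private final CharSequence inner;

        InterruptibleCharSequence(CharSequence inner) {
            this.inner = inner;
        }

        @Override
        public char charAt(int index) {
            if (Thread.interrupted()) { // clears the flag if it was set
                throw new RuntimeException(new InterruptedException());
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new InterruptibleCharSequence(inner.subSequence(start, end));
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Returns the result of find(), or throws TimeoutException if the match overran. */
    static boolean findWithTimeout(Pattern pattern, CharSequence input, long timeoutMillis)
            throws TimeoutException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Boolean> result = executor.submit(
                () -> pattern.matcher(new InterruptibleCharSequence(input)).find());
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; its next charAt() call throws and ends the match.
            result.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            throw new IllegalStateException("Matching failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would catch TimeoutException and report the slow regex to the client. A real service would probably reuse a shared thread pool rather than create an executor per match; that choice is left out here to keep the sketch short.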
Answer 1 (score: 4)
A slight variation avoids the need for an additional thread:
public class RegularExpressionUtils {
// demonstrates behavior for regular expression running into catastrophic backtracking for given input
public static void main(String[] args) {
Matcher matcher = createMatcherWithTimeout(
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y", 2000);
System.out.println(matcher.matches());
}
public static Matcher createMatcherWithTimeout(String stringToMatch, String regularExpression, int timeoutMillis) {
Pattern pattern = Pattern.compile(regularExpression);
return createMatcherWithTimeout(stringToMatch, pattern, timeoutMillis);
}
public static Matcher createMatcherWithTimeout(String stringToMatch, Pattern regularExpressionPattern, int timeoutMillis) {
CharSequence charSequence = new TimeoutRegexCharSequence(stringToMatch, timeoutMillis, stringToMatch,
regularExpressionPattern.pattern());
return regularExpressionPattern.matcher(charSequence);
}
private static class TimeoutRegexCharSequence implements CharSequence {
private final CharSequence inner;
private final int timeoutMillis;
private final long timeoutTime;
private final String stringToMatch;
private final String regularExpression;
public TimeoutRegexCharSequence(CharSequence inner, int timeoutMillis, String stringToMatch, String regularExpression) {
super();
this.inner = inner;
this.timeoutMillis = timeoutMillis;
this.stringToMatch = stringToMatch;
this.regularExpression = regularExpression;
timeoutTime = System.currentTimeMillis() + timeoutMillis;
}
public char charAt(int index) {
if (System.currentTimeMillis() > timeoutTime) {
throw new RuntimeException("Timeout occurred after " + timeoutMillis + "ms while processing regular expression '"
+ regularExpression + "' on input '" + stringToMatch + "'!");
}
return inner.charAt(index);
}
public int length() {
return inner.length();
}
public CharSequence subSequence(int start, int end) {
return new TimeoutRegexCharSequence(inner.subSequence(start, end), timeoutMillis, stringToMatch, regularExpression);
}
@Override
public String toString() {
return inner.toString();
}
}
}
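Many thanks for pointing me to this solution, which came up in answer to an unnecessarily complicated question!
Answer 2 (score: 0)
Another workaround is to limit the matcher's region and then call find(), repeating until the thread is interrupted or a match is found.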
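The region idea might be sketched as follows. The class RegionSearch, the method findInRegions and the parameters chunkSize and overlap are assumptions made for this illustration. Interruption is only checked between windows, so a single window that triggers catastrophic backtracking still blocks; matches longer than overlap that straddle a window boundary can be missed, and anchors or lookarounds may behave differently at region bounds.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: searches the input window by window using Matcher.region().
class RegionSearch {

    /**
     * Returns the start index of the first match found, or -1 if there is none.
     * Each window covers chunkSize characters plus overlap extra characters,
     * so short matches that cross a window boundary are still seen.
     */
    static int findInRegions(Pattern pattern, CharSequence input, int chunkSize, int overlap)
            throws InterruptedException {
        if (chunkSize <= 0 || overlap < 0) {
            throw new IllegalArgumentException("chunkSize must be > 0 and overlap >= 0");
        }
        Matcher matcher = pattern.matcher(input);
        int length = input.length();
        int start = 0;
        do {
            // Cancellation point between windows.
            if (Thread.interrupted()) {
                throw new InterruptedException("Regex search cancelled");
            }
            int end = Math.min(length, start + chunkSize + overlap);
            matcher.region(start, end); // resets the matcher and restricts it to [start, end)
            if (matcher.find()) {
                return matcher.start();
            }
            start += chunkSize;
        } while (start < length);
        return -1;
    }
}
```

For example, findInRegions(Pattern.compile("needle"), text, 4096, 64) would scan text in 4 KB windows and give the thread a chance to stop after each one.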
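Answer 3 (score: 0)
What you may need is a different library, one that implements an NFA-based algorithm.
On pathological patterns, an NFA algorithm can be hundreds of times faster than the backtracking algorithm used by the Java standard library.
The Java standard library is sensitive to the regex it is given, which is what causes your problem: some inputs can keep the CPU busy for years.
With an NFA algorithm the timeout can be expressed as a limit on the number of steps it takes, which is more efficient than the thread-based solution. Trust me: I used a thread timeout for a related problem and the performance was terrible. In the end I solved it by modifying the main loop of the algorithm's implementation, inserting checkpoints in the main loop that test the elapsed time.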
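To make the step-limit idea concrete, here is a toy state-set (NFA-style) matcher with a checkpoint in its main loop. It is a hypothetical illustration, not the library the answer refers to and not how java.util.regex works: StepLimitedNfa supports only literal characters, '.', and a postfix '*' on a single character, and it tests whether the whole input matches.

```java
// Hypothetical toy matcher: literals, '.', and a postfix '*' only.
class StepLimitedNfa {

    /** Full-match test that aborts once more than maxSteps transitions have been examined. */
    static boolean matches(String pattern, String input, long maxSteps) {
        // Parse the pattern into tokens: symbol[i], optionally repeated (star[i]).
        char[] symbol = new char[pattern.length()];
        boolean[] star = new boolean[pattern.length()];
        int n = 0;
        for (int p = 0; p < pattern.length(); p++) {
            char c = pattern.charAt(p);
            if (c == '*') {
                throw new IllegalArgumentException("'*' must follow a character");
            }
            symbol[n] = c;
            if (p + 1 < pattern.length() && pattern.charAt(p + 1) == '*') {
                star[n] = true;
                p++; // consume the '*'
            }
            n++;
        }

        // current[i] is true when the NFA may be waiting at token i; index n means "accepted".
        boolean[] current = new boolean[n + 1];
        addState(current, 0, star, n);
        long steps = 0;
        for (int pos = 0; pos < input.length(); pos++) {
            char c = input.charAt(pos);
            boolean[] next = new boolean[n + 1];
            for (int state = 0; state < n; state++) {
                // Checkpoint in the main loop: give up once the step budget is spent.
                if (++steps > maxSteps) {
                    throw new IllegalStateException("Step budget of " + maxSteps + " exceeded");
                }
                if (current[state] && (symbol[state] == '.' || symbol[state] == c)) {
                    addState(next, star[state] ? state : state + 1, star, n);
                }
            }
            current = next;
        }
        return current[n];
    }

    // Marks a state and every state reachable from it by skipping starred tokens.
    private static void addState(boolean[] set, int state, boolean[] star, int n) {
        set[state] = true;
        while (state < n && star[state]) {
            state++;
            set[state] = true;
        }
    }
}
```

Because every input character is processed once against every token, the work is bounded by input length times pattern length, and the step counter gives a deterministic cut-off that does not depend on wall-clock time. A time-based checkpoint, as the answer describes, would compare System.currentTimeMillis() against a deadline at the same place.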
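Answer 4 (score: 0)
To reduce overhead, I added a counter so that the timeout is checked only on every n-th call to charAt.
Note:
Some people have said that charAt may not be called often enough. I added the foo variable only to demonstrate how many times charAt is called, and that it is called often enough. If you use this in production, remove that counter: it degrades performance, and on a long-running server it would eventually overflow the long. In this example, charAt is called 30 million times roughly every 0.8 seconds (not measured under proper microbenchmark conditions; this is only a proof of concept). If you need finer precision, set a lower checkInterval at the cost of performance (in the long run, evaluating System.currentTimeMillis() > timeoutTime is more expensive than the if clause).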
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.goikosoft.test.RegexpTimeoutException;
/**
* Allows to create timeoutable regular expressions.
*
* Limitations: Can only throw RuntimeException. Decreases performance.
*
* Posted by Kris in stackoverflow.
*
* Modified by dgoiko to execute timeout check only every n chars.
* Now timeout < 0 means no timeout.
*
* @author Kris https://stackoverflow.com/a/910798/9465588
*
*/
public class RegularExpressionUtils {
public static long foo = 0;
// demonstrates behavior for regular expression running into catastrophic backtracking for given input
public static void main(String[] args) {
long millis = System.currentTimeMillis();
// This checkInterval produces a < 500 ms delay. Higher checkInterval will produce higher delays on timeout.
Matcher matcher = createMatcherWithTimeout(
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y", 10000, 30000000);
try {
System.out.println(matcher.matches());
} catch (RuntimeException e) {
System.out.println("Operation timed out after " + (System.currentTimeMillis() - millis) + " milliseconds");
}
System.out.print(foo);
}
public static Matcher createMatcherWithTimeout(String stringToMatch, String regularExpression, long timeoutMillis,
int checkInterval) {
Pattern pattern = Pattern.compile(regularExpression);
return createMatcherWithTimeout(stringToMatch, pattern, timeoutMillis, checkInterval);
}
public static Matcher createMatcherWithTimeout(String stringToMatch, Pattern regularExpressionPattern,
long timeoutMillis, int checkInterval) {
if (timeoutMillis < 0) {
return regularExpressionPattern.matcher(stringToMatch);
}
CharSequence charSequence = new TimeoutRegexCharSequence(stringToMatch, timeoutMillis, stringToMatch,
regularExpressionPattern.pattern(), checkInterval);
return regularExpressionPattern.matcher(charSequence);
}
private static class TimeoutRegexCharSequence implements CharSequence {
private final CharSequence inner;
private final long timeoutMillis;
private final long timeoutTime;
private final String stringToMatch;
private final String regularExpression;
private int checkInterval;
private int attemps;
TimeoutRegexCharSequence(CharSequence inner, long timeoutMillis, String stringToMatch,
String regularExpression, int checkInterval) {
super();
this.inner = inner;
this.timeoutMillis = timeoutMillis;
this.stringToMatch = stringToMatch;
this.regularExpression = regularExpression;
timeoutTime = System.currentTimeMillis() + timeoutMillis;
this.checkInterval = checkInterval;
this.attemps = 0;
}
public char charAt(int index) {
if (this.attemps == this.checkInterval) {
foo++;
if (System.currentTimeMillis() > timeoutTime) {
throw new RegexpTimeoutException(regularExpression, stringToMatch, timeoutMillis);
}
this.attemps = 0;
} else {
this.attemps++;
}
return inner.charAt(index);
}
public int length() {
return inner.length();
}
public CharSequence subSequence(int start, int end) {
return new TimeoutRegexCharSequence(inner.subSequence(start, end), timeoutMillis, stringToMatch,
regularExpression, checkInterval);
}
@Override
public String toString() {
return inner.toString();
}
}
}
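There is also a custom exception, so that you can catch only that exception and avoid swallowing other exceptions that Pattern or Matcher may throw.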
public class RegexpTimeoutException extends RuntimeException {
private static final long serialVersionUID = 6437153127902393756L;
private final String regularExpression;
private final String stringToMatch;
private final long timeoutMillis;
public RegexpTimeoutException() {
super();
regularExpression = null;
stringToMatch = null;
timeoutMillis = 0;
}
public RegexpTimeoutException(String message, Throwable cause) {
super(message, cause);
regularExpression = null;
stringToMatch = null;
timeoutMillis = 0;
}
public RegexpTimeoutException(String message) {
super(message);
regularExpression = null;
stringToMatch = null;
timeoutMillis = 0;
}
public RegexpTimeoutException(Throwable cause) {
super(cause);
regularExpression = null;
stringToMatch = null;
timeoutMillis = 0;
}
public RegexpTimeoutException(String regularExpression, String stringToMatch, long timeoutMillis) {
super("Timeout occurred after " + timeoutMillis + "ms while processing regular expression '"
+ regularExpression + "' on input '" + stringToMatch + "'!");
this.regularExpression = regularExpression;
this.stringToMatch = stringToMatch;
this.timeoutMillis = timeoutMillis;
}
public String getRegularExpression() {
return regularExpression;
}
public String getStringToMatch() {
return stringToMatch;
}
public long getTimeoutMillis() {
return timeoutMillis;
}
}
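Based on Andreas' answer. The main credit should go to him and his source.
Answer 5 (score: 0)
Check the user-submitted regex for "evil" patterns before executing it, using one or more regex patterns (this could take the form of a method that is called before the regex is conditionally executed):
This regex: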
\(.+\+\)[\+\*]
will match:
(a+)+
(ab+)+
([a-zA-Z]+)*
This regex:
\((.+)\|(\1\?|\1{2,})\)\+
will match:
(a|aa)+
(a|a?)+
This regex:
\(\.\*.\)\{\d{2,}\}
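will match:
(.*a){x} for x >= 10
I'm somewhat naive about regex and regex DoS, but I can't help thinking that a little pre-screening for known "evil" patterns would go a long way toward preventing problems at execution time, especially when the regex is input supplied by an end user. Since I'm far from a regex expert, the patterns above may not be complete. This is just a thought, because everything else I found seems to say it can't be done and focuses on timing out the regex engine or limiting the number of iterations it is allowed to execute.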
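The pre-screening step could be wrapped in a small method like the one below. The three screening patterns are the ones listed in this answer, escaped for Java string literals; the class name EvilPatternScreen and the method looksEvil are hypothetical. Such a check is only a heuristic: it can reject harmless patterns and let harmful ones through.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: heuristic screen for regexes prone to catastrophic backtracking.
class EvilPatternScreen {

    // The three screening patterns from the answer, written as Java string literals.
    private static final List<Pattern> SUSPICIOUS = Arrays.asList(
            Pattern.compile("\\(.+\\+\\)[\\+\\*]"),                 // e.g. (a+)+ or ([a-zA-Z]+)*
            Pattern.compile("\\((.+)\\|(\\1\\?|\\1{2,})\\)\\+"),    // e.g. (a|aa)+ or (a|a?)+
            Pattern.compile("\\(\\.\\*.\\)\\{\\d{2,}\\}"));         // e.g. (.*a){11}

    /** Returns true if the user-supplied regex contains one of the known risky shapes. */
    static boolean looksEvil(String userRegex) {
        for (Pattern suspicious : SUSPICIOUS) {
            if (suspicious.matcher(userRegex).find()) {
                return true;
            }
        }
        return false;
    }
}
```

A service could call looksEvil(userRegex) before Pattern.compile(userRegex) and reject the request when it returns true, ideally in addition to a timeout rather than instead of one.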
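Answer 6 (score: -1)
A long-running pattern matching process can be stopped with the following approach.
Create a StateFulCharSequence class that manages the state of the pattern matching. Once that state has changed, an exception is thrown on the next call to the charAt method.
The state change is scheduled with a ScheduledExecutorService using the required timeout. Here the pattern matching happens in the main thread, and there is no need to check the thread's interrupt status on every call.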
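import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class TimedPatternMatcher {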
public static void main(String[] args) {
ScheduledExecutorService executorService = Executors.newScheduledThreadPool(1);
Pattern pattern = Pattern.compile("some regex pattern");
StateFulCharSequence stateFulCharSequence = new StateFulCharSequence("some character sequence");
Matcher matcher = pattern.matcher(stateFulCharSequence);
executorService.schedule(stateFulCharSequence, 10, TimeUnit.MILLISECONDS);
try {
boolean isMatched = matcher.find();
} catch (Exception e) {
e.printStackTrace();
} finally {
// stop the scheduler thread so the JVM can exit
executorService.shutdownNow();
}
}
/*
When this runnable is executed, it will set timeOut to true and pattern matching is stopped by throwing exception.
*/
public static class StateFulCharSequence implements CharSequence, Runnable{
private CharSequence inner;
// volatile: written by the scheduler thread, read by the matching thread
private volatile boolean isTimedOut = false;
public StateFulCharSequence(CharSequence inner) {
super();
this.inner = inner;
}
public char charAt(int index) {
if (isTimedOut) {
throw new RuntimeException(new TimeoutException("Pattern matching timeout occurs"));
}
return inner.charAt(index);
}
@Override
public int length() {
return inner.length();
}
@Override
public CharSequence subSequence(int start, int end) {
return new StateFulCharSequence(inner.subSequence(start, end));
}
@Override
public String toString() {
return inner.toString();
}
public void setTimedOut() {
this.isTimedOut = true;
}
@Override
public void run() {
this.isTimedOut = true;
}
}}