使用PHP将用户输入的全文搜索查询解析为MySQL的WHERE子句

时间:2010-07-06 10:02:23

标签: php mysql full-text-search

我想将用户输入的FTS查询转换为MySQL的WHERE子句。所以功能将类似Gmail的搜索。因此用户可以输入:

from:me AND (to:john OR to:jenny) dinner

虽然我认为它不重要,但表结构将类似于:

Message
 - id
 - from
 - to
 - title
 - description
 - time_created

MessageComment
 - id
 - message_id
 - comment
 - time_created

由于这是一个常见问题,我认为可能已有解决方案。有没有?

P.S。有一个类似的问题,如here,但它适用于SQL Server。

2 个答案:

答案 0 :(得分:3)

以下代码由Tokenizer,Token和QueryBuilder类组成。 它可能不是有史以来最优雅的解决方案,但它确实可以满足您的要求:

<?
// QueryBuilder Grammar:
// =====================
// SearchRule       := SimpleSearchRule { KeyWord }
// SimpleSearchRule := Expression | SimpleSearchRule { 'OR' Expression }
// Expression       := SimpleExpression | Expression { 'AND' SimpleExpression }
// SimpleExpression := '(' SimpleSearchRule ')' | FieldExpression

$input = 'from:me AND (to:john OR to:jenny) dinner party';

$fieldMapping = array(
    'id' => 'id',
    'from' => 'from',
    'to' => 'to',
    'title' => 'title',
    'description' => 'description',
    'time_created' => 'time_created'
);
$fullTextFields = array('title','description');

$qb = new QueryBuilder($fieldMapping, $fullTextFields);
try {
    echo $qb->parseSearchRule($input);
} catch(Exception $error) {
    echo 'Error occurred while parsing search query: <br/>'.$error->getMessage();
}

class Token {
    const   KEYWORD = 'KEYWORD',
            OPEN_PAR='OPEN_PAR',
            CLOSE_PAR='CLOSE_PAR',
            FIELD='FIELD',
            AND_OP='AND_OP',
            OR_OP='OR_OP';
    public $type;
    public $chars;
    public $position;

    function __construct($type,$chars,$position) {
        $this->type = $type;
        $this->chars = $chars;
        $this->position = $position;
    }

    function __toString() {
        return 'Token[ type='.$this->type.', chars='.$this->chars.', position='.$this->position.' ]';
    }
}

class Tokenizer {
    private $tokens = array();
    private $input;
    private $currentPosition;

    function __construct($input) {
        $this->input = trim($input);
        $this->currentPosition = 0;
    }

    /**
     * @return Token
     */
    function getToken() {
        if(count($this->tokens)==0) {
            $token = $this->nextToken();
            if($token==null) {
                return null;
            }
            array_push($this->tokens, $token);
        }
        return $this->tokens[0];
    }

    function consumeToken() {
        $token = $this->getToken();
        if($token==null) {
            return null;
        }
        array_shift($this->tokens);
        return $token;
    }

    protected function nextToken() {
        $reservedCharacters = '\:\s\(\)';
        $fieldExpr = '/^([^'.$reservedCharacters.']+)\:([^'.$reservedCharacters.']+)/';
        $keyWord = '/^([^'.$reservedCharacters.']+)/';
        $andOperator = '/^AND\s/';
        $orOperator = '/^OR\s/';
        // Remove whitespaces ..
        $whiteSpaces = '/^\s+/';
        $remaining = substr($this->input,$this->currentPosition);
        if(preg_match($whiteSpaces, $remaining, $matches)) {
            $this->currentPosition += strlen($matches[0]);
            $remaining = substr($this->input,$this->currentPosition);
        }
        if($remaining=='') {
            return null;
        }
        switch(substr($remaining,0,1)) {
            case '(':
                return new Token(Token::OPEN_PAR,'(',$this->currentPosition++);
            case ')':
                return new Token(Token::CLOSE_PAR,')',$this->currentPosition++);
        }
        if(preg_match($fieldExpr, $remaining, $matches)) {
            $token = new Token(Token::FIELD, $matches[0], $this->currentPosition);
            $this->currentPosition += strlen($matches[0]);
        } else if(preg_match($andOperator, $remaining, $matches)) {
            $token = new Token(Token::AND_OP, 'AND', $this->currentPosition);
            $this->currentPosition += 3;
        } else if(preg_match($orOperator, $remaining, $matches)) {
            $token = new Token(Token::OR_OP, 'OR', $this->currentPosition);
            $this->currentPosition += 2;
        } else if(preg_match($keyWord, $remaining, $matches)) {
            $token = new Token(Token::KEYWORD, $matches[0], $this->currentPosition);
            $this->currentPosition += strlen($matches[0]);
        } else throw new Exception('Unable to tokenize: '.$remaining);
        return $token;
    }
}

class QueryBuilder {
    private $fieldMapping;
    private $fulltextFields;

    function __construct($fieldMapping, $fulltextFields) {
        $this->fieldMapping = $fieldMapping;
        $this->fulltextFields = $fulltextFields;
    }

    function parseSearchRule($input) {
        $t = new Tokenizer($input);
        $token = $t->getToken();
        if($token==null) {
            return '';
        }
        $token = $t->getToken();
        if($token->type!=Token::KEYWORD) {
            $searchRule = $this->parseSimpleSearchRule($t);
        } else {
            $searchRule = '';
        }
        $keywords = '';
        while($token = $t->consumeToken()) {
            if($token->type!=Token::KEYWORD) {
                throw new Exception('Only keywords allowed at end of search rule.');
            }
            if($keywords!='') {
                $keywords .= ' ';
            }
            $keywords .= $token->chars;
        }
        if($keywords!='') {
            $matchClause = 'MATCH (`'.(implode('`,`',$this->fulltextFields)).'`) AGAINST (';
            $keywords = $matchClause.'\''.mysql_real_escape_string($keywords).'\' IN BOOLEAN MODE)';
            if($searchRule=='') {
                $searchRule = $keywords;
            } else {
                $searchRule = '('.$searchRule.') AND ('.$keywords.')';
            }
        }
        return $searchRule;
    }

    protected function parseSimpleSearchRule(Tokenizer $t) {
        $expressions = array();
        do {
            $repeat = false;
            $expressions[] = $this->parseExpression($t);
            $token = $t->getToken();
            if($token->type==Token::OR_OP) {
                $t->consumeToken();
                $repeat = true;
            }
        } while($repeat);
        return implode(' OR ', $expressions);
    }

    protected function parseExpression(Tokenizer $t) {
        $expressions = array();
        do {
            $repeat = false;
            $expressions[] = $this->parseSimpleExpression($t);
            $token = $t->getToken();
            if($token->type==Token::AND_OP) {
                $t->consumeToken();
                $repeat = true;
            }
        } while($repeat);
        return implode(' AND ', $expressions);
    }

    protected function parseSimpleExpression(Tokenizer $t) {
        $token = $t->consumeToken();
        if($token->type==Token::OPEN_PAR) {
            $spr = $this->parseSimpleSearchRule($t);
            $token = $t->consumeToken();
            if($token==null || $token->type!=Token::CLOSE_PAR) {
                throw new Exception('Expected closing parenthesis, found: '.$token->chars);
            }
            return '('.$spr.')';
        } else if($token->type==Token::FIELD) {
            $fieldVal = explode(':', $token->chars,2);
            if(isset($this->fieldMapping[$fieldVal[0]])) {
                return '`'.$this->fieldMapping[$fieldVal[0]].'` = \''.mysql_real_escape_string($fieldVal[1]).'\'';
            }
            throw new Exception('Unknown field selected: '.$token->chars);
        } else {
            throw new Exception('Expected opening parenthesis or field-expression, found: '.$token->chars);
        }
    }
}
?>

更合适的解决方案是首先构建一个解析树,然后在进行进一步分析之后将其转换为查询。

答案 1 :(得分:0)

您的问题分为两部分

  1. 如何解析查询
  2. 如何根据我解析的查询构建全文搜索
  3. 第一个是相当困难的主题。快速搜索没有发现任何等同于你想要的东西。你可以单独使用那个

    在问题1正确之前,不要打扰问题2。

    而不是创建一个可以处理您提出的查询语法的解析器,例如: from:me AND (to:john OR to:jenny) dinner也许一个简单的形式可能就是答案。提供用户搜索的选项列表。

    通过这种方式,您可以启动并运行服务,并在未来的修订版攻击中,更难以解决如何创建解析器以执行您想要的操作。

    在做第2部分时要非常小心,以防止SQL注入攻击。例如,不要直接从查询中获取表名,而是使用查找。

    不是你想要的答案,但我不知道你是否会找到一个开箱即用的答案。更好地定义你的问题是线索。和谷歌是你的朋友。

    DC

    你可能想看......

    http://code.google.com/p/xerxes-portal/source/browse/trunk/lib/Xerxes/QueryParser.php?r=1205

    另外

    http://www.cmsmadesimple.org/api/class_zend___search___lucene___search___query_parser.html
    

    上面链接到2个非常不同的解析器实现(第二个链接打破了stackoverflow,所以我把它编码了)

    DC