Regular expression to find valid Sphinx fields

Time: 2010-04-20 18:41:03

Tags: python regex django sphinx

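I'm trying to validate that the fields given to Sphinx are valid, but I'm having trouble.

Imagine that the valid fields are cat, mouse, dog and puppy.

Valid searches would then be: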

  • @cat search terms
  • @(cat) search terms
  • @(cat, dog) search term
  • @cat searchterm1 @dog searchterm2
  • @(cat, dog) searchterm1 @mouse searchterm2

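So I want to use a regular expression to find terms such as cat, dog and mouse in the examples above, and check them against a list of valid terms.

Thus, a query such as: @(goat)

would produce an error, because goat is not a valid term.

I've gotten as far as finding simple queries such as @cat with this regex: (?:@)([^( ]*)

But I can't figure out how to find the rest.

I'm using Python and Django, for what it's worth.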

6 answers:

Answer 0: (score: 3)

To match all allowed fields, the following rather scary regex works:

@((?:cat|mouse|dog|puppy)\b|\((?:(?:cat|mouse|dog|puppy)(?:, *|(?=\))))+\))

Applied to the example searches, it returns these matches, in order: @cat, @(cat), @(cat, dog), @cat, @dog, @(cat, dog), @mouse

The regex breaks down as follows:

@                               # the literal character "@"
(                               # match group 1
  (?:cat|mouse|dog|puppy)       #  one of your valid search terms (not captured)
  \b                            #  a word boundary
  |                             #  or...
  \(                            #  a literal opening paren
  (?:                           #  non-capturing group
    (?:cat|mouse|dog|puppy)     #   one of your valid search terms (not captured)
    (?:                         #   non-capturing group
      , *                       #    a comma "," plus any number of spaces
      |                         #    or...
      (?=\))                    #    a position followed by a closing paren
    )                           #   end non-capture group
  )+                            #  end non-capture group, repeat
  \)                            #  a literal closing paren
)                               # end match group one.

Now, to identify any invalid search, you can wrap all of that in a negative look-ahead:

@(?!(?:cat|mouse|dog|puppy)\b|\((?:(?:cat|mouse|dog|puppy)(?:, *|(?=\))))+\))
--^^

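This identifies any @ character after which an invalid search term (or combination of terms) was attempted. Modifying it so that it also matches the invalid attempt, instead of just pointing at it, is not that hard any more.

You would have to build the (?:cat|mouse|dog|puppy) part dynamically from your fields and plug it into the static rest of the regex. That should not be too hard either.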
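
As a rough sketch of that last step, the pattern could be assembled from a list of field names like this. The function name `build_invalid_field_regex` and the trailing capture group that grabs the offending text are my own assumptions for illustration, not part of the regex above:

```python
import re

def build_invalid_field_regex(fields):
    """Compile a regex matching an "@" that is followed by an invalid field spec."""
    # Escape each name so regex metacharacters in a field name stay literal.
    names = "(?:" + "|".join(re.escape(f) for f in fields) + ")"
    # Same structure as the look-ahead above: a bare valid name, or a
    # parenthesised, comma-separated list of valid names.
    valid_spec = names + r"\b|\((?:" + names + r"(?:, *|(?=\))))+\)"
    # Assumption: the final group captures the invalid attempt for reporting.
    return re.compile(r"@(?!" + valid_spec + r")(\([^)]*\)|\w*)")

invalid_re = build_invalid_field_regex(["cat", "mouse", "dog", "puppy"])
query = "@cat a @(goat) b @caterpillar @(cat, dog) @(cat, doggerel)"
print(invalid_re.findall(query))
# ['(goat)', 'caterpillar', '(cat, doggerel)']
```

Note that a whole parenthesised list is reported as soon as one of its names is invalid; splitting it into individual names would be a separate step.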

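Answer 1: (score: 2)

This pyparsing solution follows a logic path similar to your posted answer. All tags are matched and then checked against the list of known valid tags, which are removed from the reported results. Only matches that still have values left after the valid ones are removed get reported.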

from pyparsing import *

# define the pattern of a tag, setting internal results names for easy validation
AT,LPAR,RPAR = map(Suppress,"@()")
term = Word(alphas,alphanums).setResultsName("terms",listAllMatches=True)
sphxTerm = AT + ~White() + ( term | LPAR + delimitedList(term) + RPAR )

# define tags we consider to be valid
valid = set("cat mouse dog".split())

# define a parse action to filter out valid terms, and attach to the sphxTerm
def filterValid(tokens):
    tokens = [t for t in tokens.terms if t not in valid]
    if not(tokens):
        raise ParseException("",0,"")
    return tokens
sphxTerm.setParseAction(filterValid)


##### Test out the parser #####

test = """@cat search terms @ house
    @(cat) search terms 
    @(cat, dog) search term @(goat)
    @cat searchterm1 @dog searchterm2 @(cat, doggerel)
    @(cat, dog) searchterm1 @mouse searchterm2 
    @caterpillar"""

# scan for invalid terms, and print out the terms and their locations
for t,s,e in sphxTerm.scanString(test):
    print("Terms:%s Line: %d Col: %d" % (t, lineno(s, test), col(s, test)))
    print(line(s, test))
    print(" "*(col(s,test)-1)+"^")
    print()

With these lovely results:

Terms:['goat'] Line: 3 Col: 29
    @(cat, dog) search term @(goat)
                            ^

Terms:['doggerel'] Line: 4 Col: 39
    @cat searchterm1 @dog searchterm2 @(cat, doggerel)
                                      ^

Terms:['caterpillar'] Line: 6 Col: 5
    @caterpillar
    ^

This last snippet does all the scanning for you and just gives you the list of invalid tags that were found:

# print out all of the found invalid terms
print(list(set(sum(sphxTerm.searchString(test), ParseResults([])))))

Prints:

['caterpillar', 'goat', 'doggerel']

Answer 2: (score: 1)

This should work:

@\((cat|dog|mouse|puppy)\b(,\s*(cat|dog|mouse|puppy)\b)*\)|@(cat|dog|mouse|puppy)\b

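It matches either a single @parameter or a parenthesized @(par1, par2) list that contains only allowed words (one or more).

It also makes sure that partial matches such as @caterpillar are not accepted.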
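
A minimal sketch of how this pattern might be used from Python. The helper names `valid_field_specs` and `all_fields_valid` are hypothetical, and the validity check assumes that "@" only ever appears in a query as the field operator:

```python
import re

FIELD_RE = re.compile(
    r"@\((cat|dog|mouse|puppy)\b(,\s*(cat|dog|mouse|puppy)\b)*\)"
    r"|@(cat|dog|mouse|puppy)\b"
)

def valid_field_specs(query):
    """Return every @field or @(field, ...) spec made up of allowed names only."""
    return [m.group(0) for m in FIELD_RE.finditer(query)]

def all_fields_valid(query):
    """True when every "@" in the query starts a valid field spec."""
    # Assumption: "@" is never used for anything but the field operator.
    return len(valid_field_specs(query)) == query.count("@")

print(valid_field_specs("@(cat, dog) searchterm1 @mouse searchterm2"))
print(all_fields_valid("@cat searchterm1 @(goat) searchterm2"))
```

Comparing the number of matches with the number of "@" characters is just one way to turn a "find the valid ones" regex into a pass/fail check.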

Answer 3: (score: 0)

This will match all fields that are cat, dog, mouse or puppy, and combinations thereof.

import re
sphinx_term = "@goat some words to search"
regex = re.compile(r"@\(?(cat|dog|mouse|puppy)(, ?(cat|dog|mouse|puppy))*\)? ")
if regex.search(sphinx_term):
    pass  # send the query to sphinx...

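Answer 4: (score: 0)

Try this:

field_re = re.compile(r"@(?:([^()\s]+)|\(([^()]+)\))")

A single field name (like cat in @cat) is captured in group #1, while the names in a parenthesized list like @(cat, dog) are stored in group #2. In the latter case you'll need to break up the list with split() or something similar; there is no way to capture the names individually with a Python regex.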
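
To illustrate the split() step, here is a small sketch. It assumes the list alternative sits in its own capture group (group #2), and the `VALID` set and the `invalid_fields` helper are made-up names for this example:

```python
import re

# Group 1: bare field name; group 2: contents of a parenthesised list.
FIELD_RE = re.compile(r"@(?:([^()\s]+)|\(([^()]+)\))")
VALID = {"cat", "mouse", "dog", "puppy"}  # assumed list of valid fields

def invalid_fields(query):
    """Return the invalid field names in the query, in first-seen order."""
    bad = []
    for single, grouped in FIELD_RE.findall(query):
        # findall yields ("name", "") for @name and ("", "a, b") for @(a, b).
        names = [single] if single else [n.strip() for n in grouped.split(",")]
        for name in names:
            if name not in VALID and name not in bad:
                bad.append(name)
    return bad

print(invalid_fields("@cat searchterm1 @dog searchterm2 @(cat, doggerel) @caterpillar"))
# ['doggerel', 'caterpillar']
```

Because the regex only extracts names and the membership test does the validation, the list of valid fields can change without touching the pattern.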

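Answer 5: (score: 0)

I ended up doing this a different way, since none of the above worked. First I found the fields like @cat, with this:

attributes = re.findall(r'(?:@)([^\( ]*)', query)

Next, I found the more complicated ones, with this: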

regex0 = re.compile(r'''
    @               # at sign
    (?:             # start non-capturing group
        \w+             # word characters, one or more
        \b              # a word boundary (i.e. no more \w)
        |               # OR
        (               # capturing group
            \(              # left paren
            [^@(),]+        # anything but @ ( ) ,
            (?:                 # another non-capturing group
                ,[ ]*           # a comma, then optional spaces ([ ] because re.VERBOSE ignores bare spaces)
                [^@(),]+        # anything but @ ( ) ,
            )*              # some quantity of this non-capturing group
            \)              # a right paren
        )               # end of capturing group
    )           # end of non-capturing group
    ''', re.VERBOSE)

# and this puts them into the attributes list.
groupedAttributes = re.findall(regex0, query)
for item in groupedAttributes:
    attributes.extend(item.strip("(").strip(")").split(", "))

Next, I checked whether the attributes I found were valid, and added the invalid ones (uniquely) to a list:

# check if the values are valid.
validRegex = re.compile(r'^mice$|^mouse$|^cat$|^dog$')

# if they aren't add them to a new list.
badAttrs = []
for attribute in attributes:
    if len(attribute) == 0:
        # if it's a zero length attribute, we punt
        continue
    if validRegex.search(attribute.lower()) is None:
        # if the attribute from the search isn't in the valid list
        if attribute not in badAttrs:
            # and the attribute isn't already in the list
            badAttrs.append(attribute)
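
For what it's worth, the same check can probably be written with a plain set lookup instead of an anchored regex. This is only a sketch: `VALID_ATTRS` and `bad_attributes` are names invented here, and it assumes `attributes` is the list of strings collected in the earlier steps:

```python
# Assumed to mirror the names in validRegex above.
VALID_ATTRS = frozenset(["mice", "mouse", "cat", "dog"])

def bad_attributes(attributes, valid=VALID_ATTRS):
    """Return the invalid attributes once each, in first-seen order."""
    bad = []
    for attribute in attributes:
        if not attribute:
            # zero-length attribute: skip it, as in the loop above
            continue
        if attribute.lower() not in valid and attribute not in bad:
            bad.append(attribute)
    return bad

print(bad_attributes(["cat", "", "Goat", "dog", "Goat", "doggerel"]))
# ['Goat', 'doggerel']
```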

Thanks to everyone for the help. I'm very glad to have it!