Question

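I'm trying to validate the fields given to Sphinx, but I'm having difficulty.

Imagine that the valid fields are cat, mouse, dog, and puppy.

Valid searches would then be:

@cat search terms
@(cat) search terms
@(cat, dog) search term
@cat searchterm1 @dog searchterm2
@(cat, dog) searchterm1 @mouse searchterm2

So I want to use a regular expression to find terms such as cat, dog, and mouse in the examples above, and check them against a list of valid terms.

Thus, a query such as: @(goat)

would produce an error, because goat is not a valid term.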

I've gotten far enough that I can find simple queries such as @cat with this regex: (?:@)([^(]*)

But I can't figure out how to find the rest.

I'm using Python & Django, for what it's worth.

Answer 1

To match all allowed fields, the following rather fearsome regex works:

@((?:cat|mouse|dog|puppy)\b|\((?:(?:cat|mouse|dog|puppy)(?:, *|(?=\))))+\))

For the example searches in the question, it returns these matches, in order: @cat, @(cat), @(cat, dog), @cat, @dog, @(cat, dog), @mouse

The regex breaks down as follows:

@                               # the literal character "@"
(                               # match group 1
  (?:cat|mouse|dog|puppy)       #  one of your valid search terms (not captured)
  \b                            #  a word boundary
  |                             #  or...
  \(                            #  a literal opening paren
  (?:                           #  non-capturing group
    (?:cat|mouse|dog|puppy)     #   one of your valid search terms (not captured)
    (?:                         #   non-capturing group
      , *                       #    a comma "," plus any number of spaces
      |                         #    or...
      (?=\))                    #    a position followed by a closing paren
    )                           #   end non-capture group
  )+                            #  end non-capture group, repeat
  \)                            #  a literal closing paren
)                               # end match group one.

Now, to identify any invalid search, you can wrap all of that in a negative look-ahead:

@(?!(?:cat|mouse|dog|puppy)\b|\((?:(?:cat|mouse|dog|puppy)(?:, *|(?=\))))+\))
--^^

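This identifies any @ character after which an invalid search term (or combination of terms) was attempted. Modifying it so that it matches the invalid attempt itself, instead of just pointing at it, is not that hard anymore.

You would have to build the (?:cat|mouse|dog|puppy) alternation from your fields dynamically and plug it into the static rest of the regex. That should not be too hard either.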
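
The dynamic construction described above could look roughly like the following. This is only a sketch: the helper name build_field_patterns and the sample query are made up for illustration, and re.escape is added on the assumption that a field name might one day contain a regex metacharacter.

```python
import re

def build_field_patterns(fields):
    """Return (valid_re, invalid_re) built from an iterable of field names.

    Hypothetical helper, not part of Sphinx or Django.
    """
    # Escape each name so regex metacharacters in a field name stay literal.
    alternation = "(?:" + "|".join(re.escape(f) for f in fields) + ")"
    # Static remainder of the pattern, shared by both variants.
    body = r"{alt}\b|\((?:{alt}(?:, *|(?=\))))+\)".format(alt=alternation)
    valid_re = re.compile("@(" + body + ")")      # captures each valid field spec
    invalid_re = re.compile("@(?!" + body + ")")  # matches an "@" that starts an invalid one
    return valid_re, invalid_re

valid_re, invalid_re = build_field_patterns(["cat", "mouse", "dog", "puppy"])

query = "@(cat, dog) searchterm1 @(goat) searchterm2 @caterpillar"
# Valid field specs found in the query.
print(valid_re.findall(query))
# Offsets of the "@" characters that start an invalid field spec.
print([m.start() for m in invalid_re.finditer(query)])
```

Because the negative look-ahead consumes only the @ itself, the second pattern reports positions rather than the offending names; those still have to be read out of the query starting at each offset.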

Answer 2

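This pyparsing solution follows a logic path similar to that of the answer you posted. All tags are matched and then checked against the list of known valid tags, which are removed from the reported results. Only matches that still have values left over after the valid ones are removed get reported.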

from pyparsing import *

# define the pattern of a tag, setting internal results names for easy validation
AT,LPAR,RPAR = map(Suppress,"@()")
term = Word(alphas,alphanums).setResultsName("terms",listAllMatches=True)
sphxTerm = AT + ~White() + ( term | LPAR + delimitedList(term) + RPAR )

# define tags we consider to be valid
valid = set("cat mouse dog".split())

# define a parse action to filter out valid terms, and attach to the sphxTerm
def filterValid(tokens):
    tokens = [t for t in tokens.terms if t not in valid]
    if not(tokens):
        raise ParseException("",0,"")
    return tokens
sphxTerm.setParseAction(filterValid)


##### Test out the parser #####

test = """@cat search terms @ house
    @(cat) search terms 
    @(cat, dog) search term @(goat)
    @cat searchterm1 @dog searchterm2 @(cat, doggerel)
    @(cat, dog) searchterm1 @mouse searchterm2 
    @caterpillar"""

# scan for invalid terms, and print out the terms and their locations
for t,s,e in sphxTerm.scanString(test):
    print("Terms:%s Line: %d Col: %d" % (t, lineno(s, test), col(s, test)))
    print(line(s, test))
    print(" "*(col(s,test)-1)+"^")
    print()

With these lovely results:

Terms:['goat'] Line: 3 Col: 29
    @(cat, dog) search term @(goat)
                            ^

Terms:['doggerel'] Line: 4 Col: 39
    @cat searchterm1 @dog searchterm2 @(cat, doggerel)
                                      ^

Terms:['caterpillar'] Line: 6 Col: 5
    @caterpillar
    ^

This last snippet does all the scanning for you and just gives you the list of invalid tags that were found:

# print out all of the found invalid terms
print(list(set(sum(sphxTerm.searchString(test), ParseResults([])))))

Prints:

['caterpillar', 'goat', 'doggerel']

Answer 3

This should work:

@\((cat|dog|mouse|puppy)\b(,\s*(cat|dog|mouse|puppy)\b)*\)|@(cat|dog|mouse|puppy)\b

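It matches either a single @parameter or a parenthesized @(par1, par2) list containing only allowed words (one or more).

It also makes sure that partial matches (@caterpillar) are not accepted.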
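
One way to turn this pattern into a yes/no check is sketched below. The helper all_fields_valid is a made-up name, and it assumes that every @ in the query is meant to start a field specification, so a stray @ would also count as invalid.

```python
import re

# The pattern from this answer, split across two lines for readability.
field_re = re.compile(
    r"@\((cat|dog|mouse|puppy)\b(,\s*(cat|dog|mouse|puppy)\b)*\)"
    r"|@(cat|dog|mouse|puppy)\b"
)

def all_fields_valid(query):
    # Every "@" in the query must be the start of a match of field_re.
    # Pattern.match(string, pos) anchors the match at the given offset.
    return all(
        field_re.match(query, i) is not None
        for i, ch in enumerate(query)
        if ch == "@"
    )

for q in ["@cat searchterm1 @dog searchterm2",
          "@(cat, dog) search term @(goat)",
          "@caterpillar"]:
    print(q, "->", all_fields_valid(q))
```

A query with no @ at all passes trivially, which seems like the right behavior for a plain search.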

Answer 4

This will match all fields that are cat, dog, mouse, or puppy, and combinations thereof.

import re
sphinx_term = "@goat some words to search"
regex = re.compile(r"@\(?(cat|dog|mouse|puppy)(, ?(cat|dog|mouse|puppy))*\)? ")
if regex.search(sphinx_term):
    # send the query to sphinx...
    pass

Answer 5

Try this:

field_re = re.compile(r"@(?:([^()\s]+)|\(([^()]+)\))")

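A single field name (like cat in @cat) will be captured in group #1, while the names in a parenthesized list like @(cat, dog) will be stored in group #2. In the latter case you'll need to break up the list with split() or something similar; there's no way to capture the names individually with a Python regex.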
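
The split step could be sketched as follows. This assumes the two-group form of the pattern (bare name in group #1, list contents in group #2); VALID_FIELDS and invalid_fields are illustrative names, not anything provided by Sphinx.

```python
import re

# Assumed set of valid field names, taken from the question.
VALID_FIELDS = {"cat", "mouse", "dog", "puppy"}

# Group 1: a bare field name. Group 2: the contents of a parenthesized list.
field_re = re.compile(r"@(?:([^()\s]+)|\(([^()]+)\))")

def invalid_fields(query):
    """Return the invalid field names in the query, in order of first appearance."""
    bad = []
    # With two groups, findall yields (group1, group2) tuples; the group
    # that did not take part in a match comes back as an empty string.
    for single, grouped in field_re.findall(query):
        names = [single] if single else [n.strip() for n in grouped.split(",")]
        for name in names:
            if name not in VALID_FIELDS and name not in bad:
                bad.append(name)
    return bad

print(invalid_fields("@cat searchterm1 @(cat, goat) searchterm2 @caterpillar"))
```

Checking membership in a set, rather than baking the names into the regex, keeps the pattern fixed when the list of valid fields changes.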

Answer 6

I ended up doing this a different way, since none of the above worked. First I found the fields like @cat, with this:

attributes = re.findall(r'(?:@)([^\( ]*)', query)

Next, I found the more complicated ones, with this:

regex0 = re.compile(r'''
    @               # at sign
    (?:             # start non-capturing group
        \w+             # word characters, one or more
        \b              # a word boundary (i.e. no more \w)
        |               # OR
        (               # capturing group
            \(              # left paren
            [^@(),]+        # not an @(),
            (?:                 # another non-capturing group
                ,[ ]*           # a comma, then some spaces
                [^@(),]+        # not @(),
            )*              # some quantity of this non-capturing group
            \)              # a right paren
        )               # end of capturing group
    )           # end of non-capturing group
    ''', re.VERBOSE)

# and this puts them into the attributes list.
groupedAttributes = re.findall(regex0, query)
for item in groupedAttributes:
    attributes.extend(item.strip("(").strip(")").split(", "))

Next, I checked whether the attributes I found were valid, and added the invalid ones (uniquely) to an array:

# check if the values are valid.
validRegex = re.compile(r'^mice$|^mouse$|^cat$|^dog$')

# if they aren't add them to a new list.
badAttrs = []
for attribute in attributes:
    if len(attribute) == 0:
        # if it's a zero length attribute, we punt
        continue
    if validRegex.search(attribute.lower()) is None:
        # if the attribute from the search isn't in the valid list
        if attribute not in badAttrs:
            # and the attribute isn't already in the list
            badAttrs.append(attribute)

Thanks for all the help. I'm very glad to have it!

Regular expression to find valid Sphinx fields
