I'm trying to validate that the fields given to sphinx are valid, but I'm having difficulty.
Imagine that the valid fields are cat, mouse, dog, puppy.
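Valid searches would then be, for example: @cat search terms, @(cat) search terms, or @(cat, dog) search term.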
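So, I want to use a regular expression to find terms such as cat, dog, and mouse in the examples above, and check them against a list of valid terms.
Thus, a query like: @(goat)
would produce an error, because goat is not a valid term.
I've gotten as far as finding simple queries like @cat with this regex: (?:@)([^( ]*)
But I can't figure out how to find the rest.
I'm using Python & Django, for what it's worth.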
Answer 0 (score: 3)
To match all allowed fields, the following rather fearsome regex works:
@((?:cat|mouse|dog|puppy)\b|\((?:(?:cat|mouse|dog|puppy)(?:, *|(?=\))))+\))
It returns these matches, in order: @cat, @(cat), @(cat, dog), @cat, @dog, @(cat, dog), @mouse
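A minimal sketch of how this pattern might be exercised from Python. The sample text is an assumption modelled on the question's examples, and the name FIELD_RE is purely illustrative:

```python
import re

# The pattern from this answer, unchanged, split over two lines for readability.
FIELD_RE = re.compile(
    r"@((?:cat|mouse|dog|puppy)\b"
    r"|\((?:(?:cat|mouse|dog|puppy)(?:, *|(?=\))))+\))"
)

# Sample queries (an assumption, modelled on the question's examples).
sample = (
    "@cat search terms\n"
    "@(cat) search terms\n"
    "@(cat, dog) search term\n"
    "@cat searchterm1 @dog searchterm2\n"
    "@(cat, dog) searchterm1 @mouse searchterm2\n"
)

# group(0) is the whole field expression, including the leading "@".
matches = [m.group(0) for m in FIELD_RE.finditer(sample)]
print(matches)
```

Using finditer with group(0) keeps the leading "@" in each result; findall would return only the contents of capture group 1.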
The regex breaks down as follows:
@                           # the literal character "@"
(                           # match group 1
  (?:cat|mouse|dog|puppy)   # one of your valid search terms (not captured)
  \b                        # a word boundary
  |                         # or...
  \(                        # a literal opening paren
  (?:                       # non-capturing group
    (?:cat|mouse|dog|puppy) # one of your valid search terms (not captured)
    (?:                     # non-capturing group
      , *                   # a comma "," plus any number of spaces
      |                     # or...
      (?=\))                # a position followed by a closing paren
    )                       # end non-capture group
  )+                        # end non-capture group, repeat
  \)                        # a literal closing paren
)                           # end match group one.
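Now, to identify any invalid search, you can wrap all of that in a negative lookahead:
@(?!(?:cat|mouse|dog|puppy)\b|\((?:(?:cat|mouse|dog|puppy)(?:, *|(?=\))))+\))
--^^
This identifies any @ character after which an invalid search term (or term combination) was attempted. Modifying it so that it matches the invalid attempt itself, instead of just pointing at it, is not that hard anymore.
You would have to prepare (?:cat|mouse|dog|puppy) from your fields dynamically and plug it into the static rest of the regex. That should not be too hard either.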
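One way the dynamic construction could look, as a rough sketch rather than a definitive implementation. The helper name build_invalid_field_re and the sample query are hypothetical; the pattern shape is the negative-lookahead regex above with the field list plugged in:

```python
import re

def build_invalid_field_re(valid_fields):
    """Build a regex matching each "@" that starts an invalid field expression."""
    # re.escape guards against field names containing regex metacharacters.
    alternation = "|".join(re.escape(f) for f in valid_fields)
    terms = "(?:%s)" % alternation
    # Same shape as the negative-lookahead pattern, with the field list inserted.
    return re.compile(
        r"@(?!" + terms + r"\b|\((?:" + terms + r"(?:, *|(?=\))))+\))"
    )

invalid_re = build_invalid_field_re(["cat", "mouse", "dog", "puppy"])

# Hypothetical query mixing one valid and two invalid field expressions.
query = "@(cat, dog) search term @(goat) @caterpillar"

# Each match is a bare "@", so report the offset where the bad field begins.
bad_positions = [m.start() for m in invalid_re.finditer(query)]
print(bad_positions)
```

Because the lookahead consumes nothing, each match is only the "@" itself; the offsets can then be used to slice out or highlight the offending part of the query.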
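Answer 1 (score: 2)
This pyparsing solution follows a similar line of logic to your posted answer. All tags are matched and then checked against the list of known valid tags, which are removed from the reported results. Only matches that still have values left over after the valid ones are removed get reported.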
from pyparsing import *

# define the pattern of a tag, setting internal results names for easy validation
AT,LPAR,RPAR = map(Suppress,"@()")
term = Word(alphas,alphanums).setResultsName("terms",listAllMatches=True)
sphxTerm = AT + ~White() + ( term | LPAR + delimitedList(term) + RPAR )

# define tags we consider to be valid
valid = set("cat mouse dog".split())

# define a parse action to filter out valid terms, and attach to the sphxTerm
def filterValid(tokens):
    tokens = [t for t in tokens.terms if t not in valid]
    if not(tokens):
        raise ParseException("",0,"")
    return tokens
sphxTerm.setParseAction(filterValid)

##### Test out the parser #####
test = """@cat search terms @ house
    @(cat) search terms
    @(cat, dog) search term @(goat)
    @cat searchterm1 @dog searchterm2 @(cat, doggerel)
    @(cat, dog) searchterm1 @mouse searchterm2
    @caterpillar"""

# scan for invalid terms, and print out the terms and their locations
for t,s,e in sphxTerm.scanString(test):
    print("Terms:%s Line: %d Col: %d" % (t, lineno(s, test), col(s, test)))
    print(line(s, test))
    print(" "*(col(s,test)-1)+"^")
    print()
With these lovely results:
Terms:['goat'] Line: 3 Col: 29
    @(cat, dog) search term @(goat)
                            ^
Terms:['doggerel'] Line: 4 Col: 39
    @cat searchterm1 @dog searchterm2 @(cat, doggerel)
                                      ^
Terms:['caterpillar'] Line: 6 Col: 5
    @caterpillar
    ^
This last bit of code does all the scanning for you and simply gives you the list of invalid tags found:
# print out all of the found invalid terms
print(list(set(sum(sphxTerm.searchString(test), ParseResults([])))))
Prints:
['caterpillar', 'goat', 'doggerel']
Answer 2 (score: 1)
This should work:
@\((cat|dog|mouse|puppy)\b(,\s*(cat|dog|mouse|puppy)\b)*\)|@(cat|dog|mouse|puppy)\b
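It will match either a single @parameter or a parenthesized @(par1, par2) list containing only allowed words (one or more). It also makes sure that partial matches (such as @caterpillar) are not accepted.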
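As an illustration only, the pattern could be used to validate individual field expressions with re.fullmatch. The helper is_valid_field and the list of test expressions are hypothetical:

```python
import re

# The pattern from this answer, split over two lines for readability.
FIELD_RE = re.compile(
    r"@\((cat|dog|mouse|puppy)\b(,\s*(cat|dog|mouse|puppy)\b)*\)"
    r"|@(cat|dog|mouse|puppy)\b"
)

def is_valid_field(expr):
    """Return True if expr is exactly one valid field expression."""
    # fullmatch requires the whole string to match, not just a prefix.
    return FIELD_RE.fullmatch(expr) is not None

for expr in ["@cat", "@(cat, dog)", "@caterpillar", "@(goat)", "@(cat, goat)"]:
    print(expr, is_valid_field(expr))
```

This checks one expression at a time; finding the expressions inside a longer query string is a separate step.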
Answer 3 (score: 0)
This will match all fields that are cat, dog, mouse, or puppy, and combinations thereof.
import re
sphinx_term = "@goat some words to search"
regex = re.compile(r"@\(?(cat|dog|mouse|puppy)(, ?(cat|dog|mouse|puppy))*\)? ")
if regex.search(sphinx_term):
    pass  # send the query to sphinx...
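A small sketch of how this check behaves on a few inputs. The helper name accepted and the sample queries are assumptions. Note that search() only needs one hit, so a query that mixes a valid field with an invalid one would still pass this particular check:

```python
import re

# The pattern from this answer, written as a raw string.
regex = re.compile(r"@\(?(cat|dog|mouse|puppy)(, ?(cat|dog|mouse|puppy))*\)? ")

def accepted(query):
    """True if the pattern finds at least one valid-looking field in query."""
    return regex.search(query) is not None

print(accepted("@cat some words"))        # a valid field followed by a space
print(accepted("@goat some words"))       # no valid field anywhere
print(accepted("@cat words @goat more"))  # search() stops at the first hit
```

The pattern also ends with a literal space, so a field at the very end of the query string would not be matched.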
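Answer 4 (score: 0)
Try this:
import re
field_re = re.compile(r"@(?:([^()\s]+)|\(([^()]+)\))")
A single field name (like cat in @cat) will be captured in group #1, while the names in a parenthesized list like @(cat, dog) will be stored in group #2. In the latter case you'll need to break up the list with split() or something similar; there is no way to capture the names individually with a Python regex.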
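A hedged sketch of the two-group approach with the split() step included. It assumes the parenthesized alternative is wrapped in its own capturing group so that group 2 exists; VALID_FIELDS and invalid_fields are illustrative names, with the field list taken from the question:

```python
import re

# Assumption: the parenthesized alternative has its own capturing group (group 2).
field_re = re.compile(r"@(?:([^()\s]+)|\(([^()]+)\))")

VALID_FIELDS = {"cat", "mouse", "dog", "puppy"}  # taken from the question

def invalid_fields(query):
    """Return field names used in query that are not in VALID_FIELDS, in order."""
    bad = []
    # With two groups, findall yields (group1, group2) tuples; the group
    # that did not take part in a match comes back as an empty string.
    for single, grouped in field_re.findall(query):
        names = [single] if single else [n.strip() for n in grouped.split(",")]
        for name in names:
            if name not in VALID_FIELDS and name not in bad:
                bad.append(name)
    return bad

print(invalid_fields("@cat foo @(cat, doggerel) bar @(goat) @caterpillar"))
```

Checking names against a set after extraction keeps the regex independent of the field list, at the cost of a second pass over the matches.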
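Answer 5 (score: 0)
I ended up doing this a different way, since none of the above worked for me. First I found the simple fields like @cat, with this:
attributes = re.findall(r'(?:@)([^\( ]*)', query)
Next, I found the more complicated ones, with this: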
regex0 = re.compile(r'''
    @                   # at sign
    (?:                 # start non-capturing group
        \w+             # word characters, one or more
        \b              # a word boundary (i.e. no more \w)
    |                   # OR
        (               # capturing group
            \(          # left paren
            [^@(),]+    # not an @(),
            (?:         # another non-capturing group
                ,[ ]*   # a comma, then some spaces (bracketed: VERBOSE ignores bare spaces)
                [^@(),]+    # not @(),
            )*          # some quantity of this non-capturing group
            \)          # a right paren
        )               # end of capturing group
    )                   # end of non-capturing group
    ''', re.VERBOSE)

# and this puts them into the attributes list.
# (simple @field matches yield '' here; zero-length entries are skipped later)
groupedAttributes = re.findall(regex0, query)
for item in groupedAttributes:
    attributes.extend(item.strip("(").strip(")").split(", "))
Next, I checked whether the attributes I found were valid, and added the invalid ones (uniquely) to a list:
# check if the values are valid.
validRegex = re.compile(r'^mice$|^mouse$|^cat$|^dog$')
# if they aren't add them to a new list.
badAttrs = []
for attribute in attributes:
    if len(attribute) == 0:
        # if it's a zero length attribute, we punt
        continue
    if validRegex.search(attribute.lower()) is None:
        # if the attribute from the search isn't in the valid list
        if attribute not in badAttrs:
            # and the attribute isn't already in the list
            badAttrs.append(attribute)
Thanks, everyone, for the help. I'm very glad to have had it!