How can I use a regular expression to parse search queries that contain boolean operators?

Time: 2013-12-10 15:10:03

Tags: regex perl search booleanquery

I have a list of search queries, each of which hits one or more "indexes":

ISBN for Equiv.= "9780138156763"
Words= Journal of strategic studies
ISBN for Equiv.= "9781443845120"
Words= dying to live
Words= On the air : Kunst im öffentlichen datenraum
Words= WSE%3D%28Early%20American%20histories%29
Words= curing studies of a polyimide
ISBN for Equiv.= "9781894717021"
sys= 002568960
Words= worship gia
Words= vera cruz and W-Type= Visual
W-title= "THE" AND "INFALLIBILITY" AND "OF" AND "THE" AND "POPE"
Words= it's alawys sunny
OCLC/Control No.= "ocolc10975211"
W-title= "small" AND "high" AND "school"
Words= galleria borghese dipinti
ISBN for Equiv.= "9780898128574"
sys= 003057761
Words= russkie skazki
ISBN for Equiv.= "9781416589945"
ISSN for Equiv.= "0332-4117"
Words= metal oxides biomass conversion
Words= presence de catulle
ISBN for Equiv.= "9780230231702"
ISBN for Equiv.= "9781458421227"
ISBN for Equiv.= "9780199583126"
ISBN for Equiv.= "9781459237957"
ISBN for Equiv.= "9780545572064"

The more complex ones can include parentheses:

( Words= creat? OR Words= intelligent? design? OR W-LC= bl or bt or p or pa or qh ) and W-Language= ENG and W-Type= Book and W-new date= 20100601->20100607
( W-publ.= ( ave maria ) AND Words= ( alldocuments ) ) and ( W-Type= ( Book ) ) AND Words= pope
( Words= ( jazz trombone ) AND Words= ( jazz trombone ) ) and ( W-Type= ( Score ) ) AND Words= miles davis
( Words= ( texas vernacular ) AND Words= ( alldocuments ) ) and ( W-sublibrary= ( HESB OR LIFES ) ) AND Words= alldocuments and W-sublibrary= HESB OR LIFES
(Words= ( paleopathology ) AND xxx=alldocuments) and ( W-Language=ENG and W-sublibrary=HESB OR LIFES)
( Words= ( paleopathology ) AND Words= ( alldocuments ) ) and ( W-Language= ( ENG ) and W-sublibrary= ( HESB OR LIFES ) ) AND Words= Jordan
( Words= ( paleopathology ) AND Words= ( alldocuments ) ) and ( W-Language= ( ENG ) and W-sublibrary= ( HESB OR LIFES ) ) AND Words= Middle East
(((Words=("selected?) AND Words=(ensemble?)) AND Words=(sonatas"?)) AND Words=(castello?)) AND Words=(parts?)

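I need to generate statistics based on which index or indexes each search hits (the part before the '='). I am getting tripped up by multi-index searches and by searches with parentheses. In general I need everything between a space and an equals sign, but I can't simply take that because of indexes such as "ISBN for Equiv." or "OCLC/Control No.", whose names contain spaces themselves. Right now I have the following code (Perl):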

while ($q =~ /(?:\( ?|(?:AND|NOT|OR|and|not|or) )?(.+?)=/g) {
    $indexHits{$1}{$dt->year." ".$dt->month_abbr()}++;
    $hitcount++;
}

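The problem is that this also captures the actual search terms of the previous index search on the same line, so it does not count the index hits cleanly.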
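
One possible way around this, offered as a sketch under stated assumptions rather than a verified solution: instead of matching lazily forward from the previous match, look at the text directly in front of each '=' and then discard everything up to and including the last boolean operator in it. The helper name index_names is hypothetical. The sketch assumes that every index name after the first is preceded by AND, OR, NOT, or an opening parenthesis, that index names never contain those operators as whole words, and that search terms never contain a literal '=' (the sample data only shows it URL-encoded as %3D).

```perl
use strict;
use warnings;

# Hypothetical helper: return the index names found in one query line.
sub index_names {
    my ($q) = @_;
    my @names;
    # Capture the run of text directly before each '=' that contains
    # no parentheses, double quotes, or other '=' characters.
    while ($q =~ /([^()="]*)=/g) {
        my $chunk = $1;
        # Drop leftover search terms: everything up to and including the
        # last whole-word boolean operator (any case) in the captured run.
        $chunk =~ s/^.*\b(?:AND|OR|NOT)\s+//i;
        # Trim surrounding whitespace.
        $chunk =~ s/^\s+|\s+$//g;
        push @names, $chunk if length $chunk;
    }
    return @names;
}

# Usage: count hits per index for a few of the sample queries.
my %indexHits;
my $hitcount = 0;
for my $q (
    'ISBN for Equiv.= "9780138156763"',
    'Words= vera cruz and W-Type= Visual',
    '( W-publ.= ( ave maria ) AND Words= ( alldocuments ) ) and ( W-Type= ( Book ) ) AND Words= pope',
) {
    for my $name (index_names($q)) {
        $indexHits{$name}++;
        $hitcount++;
    }
}
printf "%-16s %d\n", $_, $indexHits{$_} for sort keys %indexHits;
print "total: $hitcount\n";
```

Because the operator match requires a word boundary before it and whitespace after it, the "for" in "ISBN for Equiv." is not mistaken for "or". A query that places two indexes side by side with no operator or parenthesis between them would still be miscounted, so the counts are only as reliable as the assumptions above.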
