简约地解析成分表

时间:2018-07-27 16:00:57

标签: python parsing peg parsimonious

from parsimonious.grammar import Grammar

grammar = Grammar(
     '''
     # item = ( ( ingredient+ '(' ingredient+ ')' comma ws?) + / comma+ / ws+ / ingredient )+
     item = ((ingredient '(' ingredient ')') / ingredient )+

     ingredient = ( ( bold_tag? word+ ws? percent? )+ comma? ws? )
     word = ~"[A-Z0-9:]+"i
     ws = ~"\s*"+
     comma = ","
     bold_tag = "<b>"
     percent = ~"\([\d\.%]+\)"
     ''')

grammar = Grammar(
     """
     ingredient_item = words open_bracket (subingredient)* closed_bracket comma
     text_preceding_colon = (text colon)*
     text       = ~"[A-Z 0-9]*"i
     space = ~"[\\s]*"
     colon = ":"
     words = text+
     open_bracket = "("
     closed_bracket = ")"
     subingredient = text / b_tag / percentage
     percentage = open_bracket percentage_num closed_bracket
     percentage_num = ~"[0-9\.%]*"
     b_tag = "<b>"
     comma = ","
     """)

grammar = Grammar(
     '''
     item = (ingredient+ comma? )+
     ingredient = word? / bold_tag? / ws? / percent?
     comma = ','
     word = ~"[A-Z0-9:]+"i
     bold_tag = "<b>"
     ws = ~"\s*"
     percent = ~"\([\d\.%]+\)"
     '''
)

test_string = '''Partially Inverted <b>Brown <b>Sugar Syrup (43.3%), Salt, Acidity Regulator: Tripotassium Phosphate, Sunflower Oil'''

test_string = test_string.decode('utf-8')

tree = grammar.parse(test_string)

嗨,我试图用简约的方法来解析成分表,但没有任何运气。以上是我迄今为止尝试编写的所有语法。我想通过逗号将其拆分为组成部分,而忽略嵌套方括号,如下面更复杂的测试字符串所示:

test_string2 = '''Cereal Grains (Whole Grain <b>Oat Flour (28.3%), Whole Grain <b>Wheat (28.3%), Whole Grain <b>Barley Flour (17.1%), Whole Grain Maize Flour (2.0%), Whole Grain Rice Flour (2.0%)), Sugar, <b>Wheat Starch, Partially Inverted Brown Sugar Syrup, Salt, Acidity Regulator: Tripotassium Phosphate, Sunflower Oil, Colours: Carotene, Caramel and Annatto, Antioxidant: Tocopherols, Vitamins and Minerals: Vitamin C, Niacin (B3), Pantothenic Acid, Riboflavin (B2), Vitamin B6, Folic Acid, Vitamin D, Calcium Carbonate, Iron, To produce 100g of this product we have used 77.7g of Whole Grain, We guarantee every Nestlé Cereal with the green banner contains at least 8g of Whole Grain per serving'''

每种成分都可能包含由'('或百分号或类似'B2'的单词组成的子成分列表,并且由一个或多个单词组成。

0 个答案:

没有答案