How to group non-empty lines with PEG.js

Date: 2014-10-12 00:41:37

Tags: parsing pegjs

I'm trying to parse a categories file with PEG.js.

How can I group the categories (a set of non-empty lines followed by a blank line)?

stopwords:fr:aux,au,de,le,du,la,a,et,avec

synonyms:en:flavoured, flavored

synonyms:en:sorbets, sherbets

en:Artisan products
fr:Produits artisanaux

< en:Artisan products
fr:Gressins artisanaux

en:Baby foods
fr:Aliments pour bébé, aliment pour bébé, alimentation pour bébé, aliment bébé, alimentation bébé, aliments bébé

< en:Baby foods
fr:Céréales pour bébé, céréales bébé

< en:Whisky
fr:Whisky écossais
es:Whiskies escoceses
wikipediacategory:Q8718387

Right now I can parse the file line by line with this grammar:

start = stopwords* synonyms* category+

language_and_words = l:[^:]+ ":" w:[^\n]+ {return {language: l.join(''), words: w.join('')};}

stopwords = "stopwords:" w:language_and_words "\n"+ {return {stopwords: w};}

synonyms = "synonyms:" w:language_and_words "\n"+ {return {synonyms: w};}

category_line = "< "? w:language_and_words "\n"+ {return w;}

category = c:category_line+ {return c;}

I get:

{
    "language": "en",
    "words": "Artisan products"
},
{
    "language": "fr",
    "words": "Produits artisanaux"
}

But I want (for each group):

[
    {
        "language": "en",
        "words": "Artisan products"
    },
    {
        "language": "fr",
        "words": "Produits artisanaux"
    }
]

I also tried this, but it does not group the lines either, and I get a "\n" at the beginning of some lines.

category_line = "< "? w:language_and_words "\n" {return w;}

category = c:category_line+ "\n" {return c;}

2 answers:

Answer 0: (score: 0)

I found a partial solution:

start = category+

word = c:[^,\n]+ {return c.join('');}

words = w:word [,]? {return w.trim();}

parent = p:"< "? {return (p !== null);}

line = p:parent w:words+ "\n" {return {parent: p, words: w};}

category = l:line+ "\n"? {return l;}

With it I can parse this...

< fr:a,b
fr:aa,bb

en:d,e,f
fr:dd,ee, ffff

and get the groups:

[
    [ {...}, {...} ],
    [ {...}, {...} ]
]

But there is still a problem with the "lang:" prefix at the beginning of each line: if I try to parse "lang:", my categories are no longer grouped...
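
A likely reason the grouping breaks once "lang:" is parsed: in the question's grammar the language is matched with [^:]+, and that class also matches "\n", so a line rule can start on the blank line and swallow the group boundary (which would also explain the "\n" at the start of some lines). The following is a minimal sketch of one way around this, not a tested drop-in solution. It assumes "\n" line endings and PEG.js 0.8 or later (where an unmatched optional yields null), and it only covers category lines, like the grammar above. The rule names language, more_word, eol and blank are made up for this illustration.

```pegjs
// Sketch only: handles category lines, not the stopwords/synonyms header.
start
  = blank* c:category* { return c; }

// A category is a run of consecutive lines; the blank lines after it are skipped.
category
  = l:line+ blank* { return l; }

// One "lang:word, word" line, optionally prefixed by "< " (parent marker).
line
  = p:parent lang:language ":" w:words eol
    { return {parent: p, language: lang, words: w}; }

parent
  = p:"< "? { return p !== null; }

// Excluding "\n" keeps the language from starting on a blank line.
language
  = c:[^:\n]+ { return c.join(''); }

// Comma-separated words; empty entries and trailing commas are not accepted.
words
  = first:word rest:more_word* { return [first].concat(rest); }

more_word
  = "," w:word { return w; }

word
  = c:[^,\n]+ { return c.join('').trim(); }

// A line ends at a newline or at the end of the input.
eol
  = "\n" / !.

blank
  = "\n"
```

Because line cannot begin with "\n", line+ stops at the first blank line, and that is what closes a group. For the two-group sample above, this should give an array of two arrays, each holding two {parent, language, words} objects.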

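Answer 1: (score: 0)

I find it useful to break the parsing problem down iteratively (problem decomposition, old-school Wirth style). Here is a partial solution that I think can point you in the right direction (I did not parse the Line elements of a category).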

start = 
  stopwords 
  synonyms 
  category+

category "category"
  = category:(Line)+ categorySeparator { return category }

stopwords "stopwords"
  = stopwordLine*

stopwordLine "stopword line"
  = stopwordLine:StopWordMatch EndOfLine* { return stopwordLine }

StopWordMatch 
  = "stopwords:" match:Text { return match }

synonyms "synonyms"
  = synonymLine*

synonymLine "synonym line"
  = synonymLine:SynonymMatch EndOfLine* { return synonymLine }

SynonymMatch 
  = "synonyms:" match:Text { return match }

Line "line"
  = line:Text [\n] { return line }

Text "text"
  = [^\n]+ { return text() }

EndOfLine "(end of line)"
  = '\n'

EndOfFile 
  = !. { return "EOF"; }

categorySeparator "separator"
  = EndOfLine EndOfLine* / EndOfLine? EndOfFile

My use of mixed casing is arbitrary and not very stylish. The solution is also saved online: http://peg.arcanis.fr/2WQ7CZ/
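
The grammar above leaves the Line elements unparsed. As a hedged sketch of how that step might be filled in, the Line rule could be replaced with something like the following. ParentMarker and Language are hypothetical rule names introduced here, and the comma splitting is done in the action rather than in the grammar. It reuses the Text rule from the grammar above.

```pegjs
// Sketch only: intended as a replacement for the Line rule above.
Line "line"
  = parent:ParentMarker? language:Language ":" words:Text [\n]
    {
      return {
        parent: parent !== null,   // true when the line starts with "< "
        language: language,
        // Split the rest of the line on commas and trim each entry.
        words: words.split(",").map(function (w) { return w.trim(); })
      };
    }

ParentMarker
  = "< "

// Excludes "\n" so a language can never start on a blank line.
Language "language"
  = [^:\n]+ { return text(); }
```

With this rule each category would be returned as an array of objects instead of an array of raw strings. The separator handling in categorySeparator stays unchanged.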