TXR: Removing commas relative to the header row

Date: 2017-04-11 16:34:03

Tags: text-processing txr

I have a lot of data that looks like this:

There are many ways data could be missing.,,,,,,,,,
,,,,,,,,,,,
An entire interior column could be missing.,,,,,,,,,

[missing/data/inside],,,,,,,,,
a,b,c,,,,,,,
1,,3,,,,,,,
1,,4,,,,,,,
3,,2,,,,,,,
,,,,,,,,,
An indented data with 2 completely missing columns.,,,,,,,,,

,,,,,,,[missing/data/outside],,
,,,,,,,a,b,c
,,,,,,,,3,
,,,,,,,,4,,,,,,,,
,,,,,,,,2,,,,,,,,

I would like to clean it up into this:

There are many ways data could be missing.

An entire interior column could be missing.

[missing/data/inside]
a,b,c
1,,3
1,,4
3,,2

An indented data with 2 completely missing columns.

[missing/data/outside]
a,b,c
,3,
,4,
,2,

The challenges are:

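  • Retain all non-tabular text annotations (cleaning up any leading or trailing commas)
  • Retain the appropriate number of commas in the data tables, according to the header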

If it were not for the second challenge, I would simply pipe my output through sed:

... | output | sed 's/,*$//g' | sed 's/^,*//g'

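I can trust that the number of commas to the left of the data is the same in the header row and in the data rows. The trailing commas, however, cannot be trusted.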
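To make the problem with the sed pipeline concrete, here is a small sketch in Python (not TXR, purely for illustration). The helper name `strip_outer_commas` is made up for this example; it just mimics the two sed substitutions on a single line.

```python
import re

def strip_outer_commas(line: str) -> str:
    """Mimic sed 's/,*$//' followed by sed 's/^,*//' on one line."""
    without_trailing = re.sub(r",*$", "", line)
    return re.sub(r"^,*", "", without_trailing)

# An annotation line and a data row, both taken from the sample input above.
annotation = "An entire interior column could be missing.,,,,,,,,,"
data_row = ",,,,,,,,3,"

print(strip_outer_commas(annotation))  # the annotation comes out clean
print(strip_outer_commas(data_row))    # prints "3"; the wanted result is ",3,"
```

The annotation is handled correctly, but the data row loses the empty first and third columns, which is why the trimming has to be driven by the header.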

I wrote the following TXR code:

@(define empty_line)@\
@  (cases)@\
@/,*/@(eol)@\
@  (or)@\
@/[ ]*/@(eol)@\
@  (or)@\
@(eol)@\
@  (end)@\
@(end)
@(define heading)@/[a-z]+(:[^,]*)?/@(end)
@(define header)@\
@  (cases)@\
@    (heading),@(header)@\
@  (or)@\
@    (heading)@\
@  (end)@\
@(end)
@(define content (hdr))@/.*/@(end)
@(define table (loc head data))
@/,*/[@loc]@(skip)
@{lead /,*/}@{head (header)}@(skip)
@  (collect)
@lead@{data (content head)}@(skip)
@  (until)
@(empty_line)
@  (end)
@(end)
@(collect)
@annotation
@(empty_line)
@(table loc head data)
@(end)
@(output)
@  (repeat)
@annotation

[@loc]
@head
@    (repeat)
@data
@    (end)

@  (end)
@(end)

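How can I write the content function so that it extracts the appropriate number of columns from the input data? I thought it might be as simple as using the coll or rep directive: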

@(define content (hdr))@\
@  (coll :gap 0 :times (length (split-str hdr ",")))@{x /[^,]/}@(end)@\
@(end)
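The idea behind that attempt (count the header fields, then take that many fields from each row) can be sketched outside TXR. The following Python function is only an illustration; the name `content` is borrowed from the TXR function, and the `lead` parameter is an assumption standing for the run of commas matched in front of the header.

```python
def content(row: str, lead: str, header: str) -> str:
    """Return the first len(header fields) fields of row that follow lead."""
    ncols = len(header.split(","))  # same count as (length (split-str hdr ","))
    if not row.startswith(lead):
        raise ValueError("row is not indented like its header")
    fields = row[len(lead):].split(",")
    return ",".join(fields[:ncols])

print(content("1,,3,,,,,,,", "", "a,b,c"))
print(content(",,,,,,,,4,,,,,,,,", ",,,,,,,", "a,b,c"))
```

Note that a row with fewer fields than the header comes back short in this sketch; it only cuts off surplus trailing commas.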

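This code also fails to capture or clean up the annotations reliably, because annotations can appear anywhere that is not a table. How can I extract them and clean them up? I tried a few approaches using @(maybe) and another nested @(collect), but with no luck.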

@  (maybe)
@    (collect)
@/,*/@annotation@/,*/
@    (until)
@(empty_line)
@/,*/[@loc]@(skip)
@    (end)
@  (end)
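For reference, the line classification that this fragment is aiming for can be written down as a rough Python sketch (not TXR). The names `TABLE_START` and `classify` are hypothetical, and the sketch assumes it is only applied to lines outside a table body, since a data row would otherwise be reported as an annotation.

```python
import re

# A table starts with a line whose only content is "[name]" padded by commas.
TABLE_START = re.compile(r"^,*\[([^\]]+)\],*$")

def classify(line: str):
    """Return (kind, text) for a line that is not inside a table body."""
    if line.strip(", ") == "":
        return ("empty", "")
    match = TABLE_START.match(line)
    if match:
        return ("table", match.group(1))
    # Anything else is an annotation; drop leading and trailing commas.
    return ("annotation", line.strip(","))

for sample in [",,,,,,,,,,,",
               ",,,,,,,[missing/data/outside],,",
               "An entire interior column could be missing.,,,,,,,,,"]:
    print(classify(sample))
```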

Update

I tried to solve the table data collection part on its own, and wrote the following code for it:

@(define heading)@/[^,]+/@(end)
@(define header)@\
@  (cases)@\
@    (heading),@(header)@\
@  (or)@\
@    (heading)@\
@  (end)@\
@(end)
@(define content (hdr))@\
@  (coll :gap 1 :mintimes 1 :maxtimes (length (split-str hdr ",")))@\
@/[^,]*/@\
@  (end)@\
@(end)
@{lead /,*/}@{head (header)}@(skip)
@(collect :gap 0 :vars (data))
@lead@{data (content head)}@/,*/
@(end)
@(output)
@head
@  (repeat)
@data
@  (end)
@(end)

Here is my sample data:

,,alpha,foxtrot: m,tango: b,,
,,1,a,3,,
,,1,b,,,
,,whisky,c,foxtrot,,
,,,d,,,
,,1,,,,
,,,c,,,,,,

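The code gives the correct result in every case except the second-to-last row. It seems to me that the trick is to write a regular expression for coll that correctly extracts blank data. Is there another way to achieve this? For example, by appending the necessary remaining commas?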
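The "append the remaining commas" idea can be sketched in Python (not TXR) as follows. The function name `pad_row` is made up, and the sketch assumes that nothing but commas follows the table on a data row and that every row starts with the same lead as the header.

```python
def pad_row(row: str, lead: str, header: str) -> str:
    """Strip all trailing commas, then re-append as many as the header needs."""
    ncols = header.count(",") + 1
    body = row[len(lead):].rstrip(",")       # drop every trailing comma first
    missing = (ncols - 1) - body.count(",")  # separators still owed
    return body + "," * max(missing, 0)

header = "alpha,foxtrot: m,tango: b"
for row in [",,1,a,3,,", ",,1,b,,,", ",,,d,,,", ",,1,,,,", ",,,c,,,,,,"]:
    print(pad_row(row, ",,", header))
```

With the sample data above, the second-to-last row `,,1,,,,` becomes `1,,` under these assumptions.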

2 Answers:

Answer 0 (score: 1)

Here is the code that seems to work:

@(define empty_line)@\
@  (cases)@\
@/,*/@(eol)@\
@  (or)@\
@/[ ]*/@(eol)@\
@  (or)@\
@(eol)@\
@  (end)@\
@(end)
@(define heading)@/[^,]+/@(end)
@(define header)@\
@  (cases)@\
@    (heading),@(header)@\
@  (or)@\
@    (heading)@\
@  (end)@\
@(end)
@(define content (hdr))@\
@/[^,]*/@\
@  (coll :gap 0 :times (- (length (split-str hdr ",")) 1))@\
,@/[^,]*/@\
@  (end)@\
@(end)
@(define table (loc head data))
@/,*/[@loc]@(skip)
@{lead /,*/}@{head (header)}@(skip)
@  (collect)
@lead@{data (content head)}@(skip)
@  (until)
@(empty_line)
@  (end)
@(end)
@(collect)
@  (collect)
@/,*/@{annotation /[A-Za-z0-9]+.*[^,]+/}@/,*/
@  (until)
@    (cases)
@(empty_line)
@/,*/[@loc]@(skip)
@    (or)
@(eof)
@    (end)
@  (end)
@(empty_line)
@(table loc head data)
@(end)
@(output)
@  (repeat)
@    (repeat)
@annotation

@    (end)
[@loc]
@head
@    (repeat)
@data
@    (end)

@  (end)
@(end)

Answer 1 (score: 1)

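FYI, here is something I hacked up using a somewhat different approach. The input is split into fields early on, and everything proceeds from there.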

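It works on the sample data, but it does not capture the data in the right way (following a grammar of annotation lines, blank lines, and tables). It also does not check that the data rows of a table have only blank fields before the indentation position.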

Anyway, this may be useful.

@(define get-fields (f line))
@  (bind f @(split-str line ","))
@(end)
@(define is-empty (f line))
@  (require (or [all f empty]
                [all line (op eql #\space)]))
@(end)
@(define is-table-start (f loc pos))
@  (next :list f)
@  (skip)
@  (line pos)
[@loc]
@  (rebind pos @(pred pos))
@  (require (and [all [f 0..pos] empty]
                 [all [f (succ pos)..:] empty]))
@(end)
@(define is-headings (f pos))
@  (require (and [all [f 0..pos] empty]
                 (empty [drop-while empty
                                    (drop-while (f^$ #/[a-z]+(:[^,]*)?/)
                                                [f pos..:])])))
@(end)
@(define out-fields (f))
@  (do (put-line `@{f ","}`))
@(end)
@(repeat)
@line
@  (get-fields f line)
@  (cases)
@    (is-empty f line)
@    (do (put-line))
@  (or)
@    (is-table-start f loc pos)
@    hline
@    (get-fields hf hline)
@    (is-headings hf pos)
@    (collect :gap 0)
@      dline
@      (get-fields df dline)
@    (until)
@      (is-empty df dline)
@    (end)
@    (do (put-line `[@loc]`))
@    (bind headings @(take-while [notf empty] (drop pos hf)))
@    (bind endpos @(+ pos (length headings)))
@    (merge tbl hf df)
@    (output)
@      (repeat)
@        {tbl [pos..endpos] ","}
@      (end)
@    (end)
@  (or)
@    (bind trim-f @[take-while [notf empty] [drop-while empty f]])
@    (do (put-line `@{trim-f ","}`))
@  (end)
@(end)
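For readers who do not know TXR, the field-based approach above can be approximated in Python. This is a loose sketch and not a translation of the TXR code: the function name `clean` is made up, annotations are trimmed with a plain comma strip, and, like the original, it does not check that data rows are blank before the indentation position.

```python
def clean(lines):
    """Return cleaned lines: blank lines, trimmed annotations, sliced tables."""
    out, i = [], 0
    while i < len(lines):
        fields = lines[i].split(",")
        filled = [k for k, f in enumerate(fields) if f.strip()]
        if not filled:                          # blank or comma-only line
            out.append("")
            i += 1
            continue
        tag = fields[filled[0]]
        is_table = len(filled) == 1 and tag.startswith("[") and tag.endswith("]")
        if not is_table or i + 1 >= len(lines):  # annotation line
            out.append(lines[i].strip(","))
            i += 1
            continue
        pos = filled[0]                         # column where the table starts
        header = lines[i + 1].split(",")[pos:]
        # The table is as wide as the run of non-empty headings.
        width = next((k for k, h in enumerate(header) if not h), len(header))
        out.append(tag)
        out.append(",".join(header[:width]))
        i += 2
        while i < len(lines) and lines[i].strip(", "):  # rows until an empty line
            row = lines[i].split(",")[pos:pos + width]
            row += [""] * (width - len(row))    # pad rows that are too short
            out.append(",".join(row))
            i += 1
    return out

sample = [
    "An indented data with 2 completely missing columns.,,,,,,,,,",
    "",
    ",,,,,,,[missing/data/outside],,",
    ",,,,,,,a,b,c",
    ",,,,,,,,3,",
    ",,,,,,,,4,,,,,,,,",
    ",,,,,,,,2,,,,,,,,",
]
print("\n".join(clean(sample)))
```

Splitting into fields first turns the comma counting into list slicing, which is the main design difference from the pattern-matching solution in the other answer.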