I have a lot of data that looks like this:
There are many ways data could be missing.,,,,,,,,,
,,,,,,,,,,,
An entire interior column could be missing.,,,,,,,,,
[missing/data/inside],,,,,,,,,
a,b,c,,,,,,,
1,,3,,,,,,,
1,,4,,,,,,,
3,,2,,,,,,,
,,,,,,,,,
An indented data with 2 completely missing columns.,,,,,,,,,
,,,,,,,[missing/data/outside],,
,,,,,,,a,b,c
,,,,,,,,3,
,,,,,,,,4,,,,,,,,
,,,,,,,,2,,,,,,,,
I would like to tidy it up into this:
There are many ways data could be missing.
An entire interior column could be missing.
[missing/data/inside]
a,b,c
1,,3
1,,4
3,,2
An indented data with 2 completely missing columns.
[missing/data/outside]
a,b,c
,3,
,4,
,2,
The challenge:
If it were not for the second challenge, I could simply pipe my output through sed:
... | output | sed 's/,*$//g' | sed 's/^,*//g'
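I can trust that the number of commas to the left of the data is the same in the header row and in the data rows. I cannot, however, trust the trailing commas.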
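To make the problem concrete, here is a minimal sketch in Python (not TXR, and purely illustrative; `strip_commas` and `slice_by_header` are hypothetical helper names). It mimics the two sed substitutions and compares the result with slicing a row by the header's indentation and width, assuming the leading commas of header and data rows match as described above.

```python
def strip_commas(line):
    # Same effect as: sed 's/,*$//g' | sed 's/^,*//g'
    return line.strip(",")

def slice_by_header(header, row):
    # Number of leading commas, trusted to be equal in header and data rows.
    lead = len(header) - len(header.lstrip(","))
    # Number of headings (assumes no heading is empty).
    width = len(header.strip(",").split(","))
    # Keep `width` fields starting at the indentation; no padding is done here.
    return ",".join(row.split(",")[lead:lead + width])

header = "," * 7 + "a,b,c"   # header of [missing/data/outside]
row = "," * 8 + "3,"         # its first data row

print(strip_commas(row))             # 3
print(slice_by_header(header, row))  # ,3,
```

Stripping commas from both ends collapses the row to `3`, so the information that the value belongs to column `b` is lost; slicing by the header keeps the empty first and last fields.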
I wrote the following TXR code:
@(define empty_line)@\
@ (cases)@\
@/,*/@(eol)@\
@ (or)@\
@/[ ]*/@(eol)@\
@ (or)@\
@(eol)@\
@ (end)@\
@(end)
@(define heading)@/[a-z]+(:[^,]*)?/@(end)
@(define header)@\
@ (cases)@\
@ (heading),@(header)@\
@ (or)@\
@ (heading)@\
@ (end)@\
@(end)
@(define content (hdr))@/.*/@(end)
@(define table (loc head data))
@/,*/[@loc]@(skip)
@{lead /,*/}@{head (header)}@(skip)
@ (collect)
@lead@{data (content head)}@(skip)
@ (until)
@(empty_line)
@ (end)
@(end)
@(collect)
@annotation
@(empty_line)
@(table loc head data)
@(end)
@(output)
@ (repeat)
@annotation
[@loc]
@head
@ (repeat)
@data
@ (end)
@ (end)
@(end)
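How do I write the content function so that it extracts the appropriate number of columns from the input data? I thought it might be as simple as using the coll or rep directive: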
@(define content (hdr))@\
@ (coll :gap 0 :times (length (split-str hdr ",")))@{x /[^,]/}@(end)@\
@(end)
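This code also fails to capture or clean up the annotations reliably, since annotations can appear anywhere that is not a table. How can I extract them and clean them up? I tried a few approaches using @(maybe) and another nested @(collect), but without luck.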
@ (maybe)
@ (collect)
@/,*/@annotation@/,*/
@ (until)
@(empty_line)
@/,*/[@loc]@(skip)
@ (end)
@ (end)
Update
I tried to solve the table data collection part on its own, and wrote the following code for it:
@(define heading)@/[^,]+/@(end)
@(define header)@\
@ (cases)@\
@ (heading),@(header)@\
@ (or)@\
@ (heading)@\
@ (end)@\
@(end)
@(define content (hdr))@\
@ (coll :gap 1 :mintimes 1 :maxtimes (length (split-str hdr ",")))@\
@/[^,]*/@\
@ (end)@\
@(end)
@{lead /,*/}@{head (header)}@(skip)
@(collect :gap 0 :vars (data))
@lead@{data (content head)}@/,*/
@(end)
@(output)
@head
@ (repeat)
@data
@ (end)
@(end)
Here is my sample data:
,,alpha,foxtrot: m,tango: b,,
,,1,a,3,,
,,1,b,,,
,,whisky,c,foxtrot,,
,,,d,,,
,,1,,,,
,,,c,,,,,,
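The code gives the correct result for every line except the second-to-last one. It seems to me that the trick is to write a regular expression for coll that correctly extracts blank fields. Is there another way to achieve this? For example, by appending the necessary remaining commas?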
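The idea of appending the remaining commas can be sketched as follows. This is Python rather than TXR, offered only as an illustration of the logic; `normalize_row` is a hypothetical helper, and the sketch assumes that no heading is empty and that the indentation of the header applies to every row.

```python
def normalize_row(row, lead, width):
    # Append `width` commas so that even a truncated row has enough fields,
    # then keep exactly `width` fields starting at the indentation.
    fields = (row + "," * width).split(",")
    return ",".join(fields[lead:lead + width])

sample = [
    ",,alpha,foxtrot: m,tango: b,,",
    ",,1,a,3,,",
    ",,1,b,,,",
    ",,whisky,c,foxtrot,,",
    ",,,d,,,",
    ",,1,,,,",
    ",,,c,,,,,,",
]

header = sample[0]
lead = len(header) - len(header.lstrip(","))   # leading commas
width = len(header.strip(",").split(","))      # number of headings

cleaned = [normalize_row(line, lead, width) for line in sample]
for line in cleaned:
    print(line)
```

Because every row is padded before it is split, the number of trailing commas in the input no longer matters, and the second-to-last row comes out as `1,,`.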
Answer 0 (score: 1)
Here is code that seems to work:
@(define empty_line)@\
@ (cases)@\
@/,*/@(eol)@\
@ (or)@\
@/[ ]*/@(eol)@\
@ (or)@\
@(eol)@\
@ (end)@\
@(end)
@(define heading)@/[^,]+/@(end)
@(define header)@\
@ (cases)@\
@ (heading),@(header)@\
@ (or)@\
@ (heading)@\
@ (end)@\
@(end)
@(define content (hdr))@\
@/[^,]*/@\
@ (coll :gap 0 :times (- (length (split-str hdr ",")) 1))@\
,@/[^,]*/@\
@ (end)@\
@(end)
@(define table (loc head data))
@/,*/[@loc]@(skip)
@{lead /,*/}@{head (header)}@(skip)
@ (collect)
@lead@{data (content head)}@(skip)
@ (until)
@(empty_line)
@ (end)
@(end)
@(collect)
@ (collect)
@/,*/@{annotation /[A-Za-z0-9]+.*[^,]+/}@/,*/
@ (until)
@ (cases)
@(empty_line)
@/,*/[@loc]@(skip)
@ (or)
@(eof)
@ (end)
@ (end)
@(empty_line)
@(table loc head data)
@(end)
@(output)
@ (repeat)
@ (repeat)
@annotation
@ (end)
[@loc]
@head
@ (repeat)
@data
@ (end)
@ (end)
@(end)
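As I read it, the content function above matches one field and then exactly one fewer `,field` groups than there are headings. The same shape can be written as an ordinary regular expression; the following Python sketch is only an analogy (it is not how TXR evaluates the pattern, and `content_pattern` is a hypothetical name).

```python
import re

def content_pattern(head):
    n = len(head.split(","))
    # One field, then exactly n - 1 repetitions of ",field".
    # Fields may be empty, which is what lets blank cells match.
    return re.compile(r"[^,]*(?:,[^,]*){%d}" % (n - 1))

head = "alpha,foxtrot: m,tango: b"
lead = ",,"
line = ",,1,,,,"   # the second-to-last row of the sample data

# Start matching right after the leading commas.
m = content_pattern(head).match(line, len(lead))
print(m.group(0))  # 1,,

# A row with too few fields does not match at all.
print(content_pattern(head).match(",,1", len(lead)))  # None
```

Anchoring each repetition on the comma, rather than on the field text, is what allows empty fields to be counted; the trade-off is that a row with fewer fields than the header is rejected instead of padded.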
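Answer 1 (score: 1)
FYI, here is something I hacked up using a somewhat different approach. The input is split into fields early on, and everything proceeds from there.
It works on the sample data, but it does not capture it in the proper way (following the grammar of annotation line, blank line, table). It also does not check that the data rows of a table contain only blank fields before the indentation position.
Anyway, this may be useful.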
@(define get-fields (f line))
@ (bind f @(split-str line ","))
@(end)
@(define is-empty (f line))
@ (require (or [all f empty]
[all line (op eql #\space)]))
@(end)
@(define is-table-start (f loc pos))
@ (next :list f)
@ (skip)
@ (line pos)
[@loc]
@ (rebind pos @(pred pos))
@ (require (and [all [f 0..pos] empty]
[all [f (succ pos)..:] empty]))
@(end)
@(define is-headings (f pos))
@ (require (and [all [f 0..pos] empty]
(empty [drop-while empty
(drop-while (f^$ #/[a-z]+(:[^,]*)?/)
[f pos..:])])))
@(end)
@(define out-fields (f))
@ (do (put-line `@{f ","}`))
@(end)
@(repeat)
@line
@ (get-fields f line)
@ (cases)
@ (is-empty f line)
@ (do (put-line))
@ (or)
@ (is-table-start f loc pos)
@ hline
@ (get-fields hf hline)
@ (is-headings hf pos)
@ (collect :gap 0)
@ dline
@ (get-fields df dline)
@ (until)
@ (is-empty df dline)
@ (end)
@ (do (put-line `[@loc]`))
@ (bind headings @(take-while [notf empty] (drop pos hf)))
@ (bind endpos @(+ pos (length headings)))
@ (merge tbl hf df)
@ (output)
@ (repeat)
@ {tbl [pos..endpos] ","}
@ (end)
@ (end)
@ (or)
@ (bind trim-f @[take-while [notf empty] [drop-while empty f]])
@ (do (put-line `@{trim-f ","}`))
@ (end)
@(end)
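For readers who do not know TXR, the field-splitting approach can be sketched in Python. This is a loose illustration, not a translation of the code above: the names `LOC`, `is_empty` and `clean` are hypothetical, blank lines are dropped (to match the desired output in the question) rather than echoed, and, like the original, it does not verify that data rows are blank before the indentation position.

```python
import re

LOC = re.compile(r"\[[^\]]+\]")  # a table marker such as [missing/data/inside]

def is_empty(fields):
    return not any(f.strip() for f in fields)

def clean(lines):
    out, i = [], 0
    while i < len(lines):
        fields = lines[i].split(",")
        i += 1
        if is_empty(fields):
            continue  # blank or all-comma line
        pos = next((k for k, f in enumerate(fields) if LOC.fullmatch(f)), None)
        if pos is None or i == len(lines):
            out.append(lines[i - 1].strip(","))  # annotation line
            continue
        # The line after the marker holds the headings, indented by `pos` fields.
        heads = lines[i].split(",")[pos:]
        i += 1
        width = next((k for k, h in enumerate(heads) if not h), len(heads))
        out.append(fields[pos])
        out.append(",".join(heads[:width]))
        # Data rows run until an empty line or the end of the input.
        while i < len(lines) and not is_empty(lines[i].split(",")):
            row = (lines[i] + "," * width).split(",")
            out.append(",".join(row[pos:pos + width]))
            i += 1
    return out

raw = [
    "There are many ways data could be missing.,,,,,,,,,",
    ",,,,,,,,,,,",
    "An entire interior column could be missing.,,,,,,,,,",
    "[missing/data/inside],,,,,,,,,",
    "a,b,c,,,,,,,",
    "1,,3,,,,,,,",
    "1,,4,,,,,,,",
    "3,,2,,,,,,,",
    ",,,,,,,,,",
    "An indented data with 2 completely missing columns.,,,,,,,,,",
    "," * 7 + "[missing/data/outside],,",
    "," * 7 + "a,b,c",
    "," * 8 + "3,",
    "," * 8 + "4" + "," * 8,
    "," * 8 + "2" + "," * 8,
]

for line in clean(raw):
    print(line)
```

One limitation carried over from the approach: a data row whose fields are all blank is indistinguishable from the empty line that ends a table.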