What is a short way to extract a pattern string and a subsequent pattern string in Linux?

Asked: 2016-11-19 13:41:27

Tags: regex linux grep

Suppose we have a line of text stored in a file:

// In the actual file this will be one line
{unrelated_text1,ID:13, unrelated_text2,TIMESTAMP:1476280500,unrelated_text3},
{other_unrelated_text1,other_unrelated_text2,ID:25,TIMESTAMP:1476280600},
{ID:30,more_unrelated_text1,TIMESTAMP:1476280700},
{ID:40,final_unrelated_text}

For this particular input, I want to extract 3 entries:

// The details, such as whether to put { character in front or not do not matter.
// Any form of output which extracts only these 3 entries and groups them in a 
// visually nice way will do the job.
{ID:13, TIMESTAMP:1476280500}
{ID:25, TIMESTAMP:1476280600}
{ID:30, TIMESTAMP:1476280700}
// I do not want the last entry, because it does not contain a timestamp field.

The closest command I have found so far is

grep -Po '{ID:[0-9]+(.+?)}' input_file

which gives the output

{unrelated_text1,ID:13,unrelated_text2,TIMESTAMP:1476280500,unrelated_text3}
{other_unrelated_text1,other_unrelated_text2,ID:25,TIMESTAMP:1476280600}
{ID:30,more_unrelated_text1,TIMESTAMP:1476280700}
{ID:40,final_unrelated_text}

The next improvement I am looking for is how to remove the unrelated_text from every entry and drop the last entry.

Question: what is the simplest way to do this in Linux?

1 answer:

Answer 0 (score: 1)

Using GNU awk for multi-char RS, RT, and word boundaries:

$ awk -v RS='\\<(ID|TIMESTAMP):[0-9]+' 'NR%2{id=RT;next} RT{printf "{%s, %s}\n", id, RT}' file
{ID:13, TIMESTAMP:1476280500}
{ID:25, TIMESTAMP:1476280600}
{ID:30, TIMESTAMP:1476280700}

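The above will work whether the input is on one line or several, and regardless of what other text is in the file. All it relies on is each ID appearing before its related TIMESTAMP, and that is not hard to change if necessary.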
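One caveat may be worth sketching. The NR%2 test in the awk command pairs matches by position, so an ID with no TIMESTAMP in the middle of the input (rather than at the end, as in the sample) would shift the pairing. Below is a minimal sketch of a variant that pairs by match prefix instead. It assumes a grep that supports -o and -w (GNU and BSD grep do) plus any POSIX awk. The function name extract_pairs and the sample data are illustrative and not part of the original answer.

```shell
#!/bin/sh
# extract_pairs FILE (hypothetical helper): print "{ID:n, TIMESTAMP:n}" for
# every ID that is followed by a TIMESTAMP before the next ID appears.
extract_pairs() {
    # Step 1: put each whole-word ID:n / TIMESTAMP:n token on its own line.
    grep -owE '(ID|TIMESTAMP):[0-9]+' "$1" |
    # Step 2: remember the most recent ID; when a TIMESTAMP follows it,
    # print the pair and forget the ID so it cannot be reused.
    awk '/^ID:/   { id = $0; next }
         id != "" { printf "{%s, %s}\n", id, $0; id = "" }'
}

# Illustrative input: one line, with the timestamp-less entry in the middle.
sample=$(mktemp)
printf '%s\n' '{a,ID:13, b,TIMESTAMP:1476280500,c},{ID:40,d},{e,ID:25,TIMESTAMP:1476280600}' > "$sample"
extract_pairs "$sample"
# prints:
# {ID:13, TIMESTAMP:1476280500}
# {ID:25, TIMESTAMP:1476280600}
rm -f "$sample"
```

Because the ID is matched by its text rather than by record parity, the ID:40 entry is simply overwritten by the next ID and never printed. This sketch still assumes that each ID precedes its TIMESTAMP.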