AWK: how to extract a block of text between two "\\" separators, ignoring line breaks

Time: 2015-02-18 05:53:07

Tags: linux awk

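I want to extract certain regions from a large block of text by setting the field separator to "\\". However, I keep running into a problem: my text also contains single "\" characters, and they seem to disrupt the extraction.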

INPUT:

1\1\GINC-R1430\FOpt\RB3LYP\6-31G(d,p)\C11H8\ROOT\22-Jan-2015\0\\#N b3l
 yp/6-31G** opt freq=noraman test Maxdisk=1Gb\\3\\0,1\C,-2.6997011275,0
 .2415237678,0.5867242856\C,-0.844160292,1.6395735777,-0.4268479833\C,-
 1.9760161741,1.2551936894,0.1361541401\C,-2.3923087914,-1.0358860734,-
 0.0557643955\C,0.3235980425,0.7875682734,-0.1356859882\C,-1.1093142432
 ,-1.3685423936,-0.3602591004\C,0.1496925203,-0.6332454104,-0.151244509
 2\H,-3.3806331312,0.2996137801,1.4332335206\H,-0.7633170455,2.45988827
 32,-1.1373018124\H,1.7187287121,2.4104501712,0.0387394407\H,-3.1756548
 236,-1.7742599934,-0.224548871\H,-0.9560852099,-2.3752668104,-0.747558
 6451\C,1.6076580336,1.3296735593,0.0442342156\C,2.5669578833,-0.875832
 9525,0.1864536297\H,3.4305876714,-1.5230597241,0.3068386649\C,1.309289
 0866,-1.4290100931,-0.0026907826\H,1.2013201753,-2.5103156986,-0.02627
 39389\C,2.7201916294,0.5158561201,0.2083031485\H,3.7045180838,0.956653
 9373,0.3361669809\\Version=ES64L-G09RevD.01\State=1-A\HF=-423.9087698\
 RMSD=8.508e-09\RMSF=5.945e-05\Dipole=0.3132737,-0.297812,-0.0202519\Qu
 adrupole=2.0644665,1.7222772,-3.7867437,1.9108337,-0.4477432,-0.303338
 1\PG=C01 [X(C11H8)]\\@

The output I am looking for:

0,1\C,-2.6997011275,0
 .2415237678,0.5867242856\C,-0.844160292,1.6395735777,-0.4268479833\C,-
 1.9760161741,1.2551936894,0.1361541401\C,-2.3923087914,-1.0358860734,-
 0.0557643955\C,0.3235980425,0.7875682734,-0.1356859882\C,-1.1093142432
 ,-1.3685423936,-0.3602591004\C,0.1496925203,-0.6332454104,-0.151244509
 2\H,-3.3806331312,0.2996137801,1.4332335206\H,-0.7633170455,2.45988827
 32,-1.1373018124\H,1.7187287121,2.4104501712,0.0387394407\H,-3.1756548
 236,-1.7742599934,-0.224548871\H,-0.9560852099,-2.3752668104,-0.747558
 6451\C,1.6076580336,1.3296735593,0.0442342156\C,2.5669578833,-0.875832
 9525,0.1864536297\H,3.4305876714,-1.5230597241,0.3068386649\C,1.309289
 0866,-1.4290100931,-0.0026907826\H,1.2013201753,-2.5103156986,-0.02627
 39389\C,2.7201916294,0.5158561201,0.2083031485\H,3.7045180838,0.956653
 9373,0.3361669809

The best I have managed so far is a simple:

awk 'BEGIN { FS = "\\\\" } ; {print $SELECTED AREA}'

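If the field separator could be set to "\\" without the single "\" being taken into account, the selected area would be $4.

Does anyone know how to do this?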

3 answers:

Answer 0: (score: 1)

You need all eight backslashes to get what you want.

awk -F '\\\\\\\\' '{print $4}'

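That is because you double them once to get a literal backslash in the string, and double them again to get a literal backslash in the regular expression. Two literal backslashes in the separator therefore take 2 × 2 × 2 = 8 backslashes.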
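The two layers of escaping can be checked on a tiny record. This is only an illustrative sketch: the sample string `a\b\\c` and the shell variable names are made up for the demo, and it relies on awk treating any multi-character FS as a regular expression, as POSIX specifies.

```shell
# Hypothetical sample record: a\b\\c (one single backslash, then one double backslash)
sample='a\b\\c'

# Four backslashes in the awk string -> regex \\ -> matches ONE literal backslash
one=$(printf '%s\n' "$sample" | awk 'BEGIN { FS = "\\\\" } { print NF }')

# Eight backslashes -> regex \\\\ -> matches TWO consecutive literal backslashes
two=$(printf '%s\n' "$sample" | awk -F '\\\\\\\\' '{ print NF }')

echo "FS with 4 backslashes: $one fields"
echo "FS with 8 backslashes: $two fields"
```

The four-backslash form should report 4 fields, because every single backslash splits the record (leaving an empty field inside the double one), while the eight-backslash form should report 2.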

By the way, this is a very poor choice of field separator.

Answer 1: (score: 0)

To get the correct output, you also need to set the record separator to empty, like this:

awk -F'\\\\\\\\' '{print $4}' RS= file
0,1\C,-2.6997011275,0
 .2415237678,0.5867242856\C,-0.844160292,1.6395735777,-0.4268479833\C,-
 1.9760161741,1.2551936894,0.1361541401\C,-2.3923087914,-1.0358860734,-
 0.0557643955\C,0.3235980425,0.7875682734,-0.1356859882\C,-1.1093142432
 ,-1.3685423936,-0.3602591004\C,0.1496925203,-0.6332454104,-0.151244509
 2\H,-3.3806331312,0.2996137801,1.4332335206\H,-0.7633170455,2.45988827
 32,-1.1373018124\H,1.7187287121,2.4104501712,0.0387394407\H,-3.1756548
 236,-1.7742599934,-0.224548871\H,-0.9560852099,-2.3752668104,-0.747558
 6451\C,1.6076580336,1.3296735593,0.0442342156\C,2.5669578833,-0.875832
 9525,0.1864536297\H,3.4305876714,-1.5230597241,0.3068386649\C,1.309289
 0866,-1.4290100931,-0.0026907826\H,1.2013201753,-2.5103156986,-0.02627
 39389\C,2.7201916294,0.5158561201,0.2083031485\H,3.7045180838,0.956653
 9373,0.3361669809

You may need GNU awk to be able to set the record separator to empty.
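To see why the record separator matters, here is a small sketch on a made-up three-line sample shaped like the input in the question (the sample text and the variable names are hypothetical). By default awk reads one line per record, so no single line contains four "\\"-separated fields. With an empty RS, everything up to the next blank line becomes one record and `$4` keeps its embedded line break. The sketch passes the setting with `-v RS=`, which has the same effect as the trailing `RS=` assignment used above.

```shell
# Hypothetical miniature of the wrapped input: sections are separated by \\
sample=$(printf '%s\n' \
    '1\1\HEAD\\#N opt fr' \
    ' eq\\title\\0,1\C,1.0,2' \
    ' .0,3.0\H,4.0,5.0,6.0\\Version=X\\@')

# Default RS: each line is its own record, so $4 is empty on every line
per_line=$(printf '%s\n' "$sample" | awk -F '\\\\\\\\' '{ print "[" $4 "]" }')

# Empty RS (paragraph mode): the three lines form one record, $4 spans the line break
block=$(printf '%s\n' "$sample" | awk -v RS= -F '\\\\\\\\' '{ print $4 }')

printf 'line by line:\n%s\n' "$per_line"
printf 'paragraph mode:\n%s\n' "$block"
```

In the line-by-line run every record has fewer than four fields, so only empty brackets are printed. In paragraph mode the fourth field is the two-line coordinate block.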

Answer 2: (score: 0)

OK, thanks to Ed Morton, Jotne and tripleee. By setting RS I now get the correct output, using:

awk 'BEGIN {FS="\\\\\\\\"; RS="\n\n";} {print $4}'

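Since my text never contains two consecutive newlines, I think my whole block of text is now treated as a single record. I had never considered RS before, because I mostly deal with table parsing. Thank you.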
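A side note on portability, offered as a hedged sketch and not as a replacement for the command above: POSIX leaves the behaviour of a multi-character RS such as "\n\n" unspecified, although gawk and mawk treat it as a regular expression. If that is a concern, one alternative is to leave RS alone, rebuild the input in a buffer and call split() with a regex literal. A regex literal is unescaped only once, so four backslashes are enough there. The sample text and the names `buf`, `part` and `fourth` are illustrative only.

```shell
# Hypothetical miniature of the wrapped input: sections are separated by \\
sample=$(printf '%s\n' \
    '1\1\HEAD\\#N opt fr' \
    ' eq\\title\\0,1\C,1.0,2' \
    ' .0,3.0\H,4.0,5.0,6.0\\Version=X\\@')

fourth=$(printf '%s\n' "$sample" | awk '
    # Rebuild the whole input in one string, keeping the original line breaks
    { buf = (NR == 1) ? $0 : buf "\n" $0 }
    END {
        # Regex literal with four backslashes: matches two consecutive backslashes
        n = split(buf, part, /\\\\/)
        if (n >= 4) print part[4]
    }')

printf '%s\n' "$fourth"
```

This reads the entire input into memory before splitting, which is fine for a single archive block like the one in the question but would need adapting for files that hold several blocks.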