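I am trying to write a text-processing script for what seemed to be a fairly simple task. I have a file that contains the following repeating pattern: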
111 0 1000 other stuff #<- here a new element begins
some text & #<- "&" or white spaces increment -
some more #<- signal continue on next line
last line
221 1 1.22E22 # new element $2!=0 must be followed by float
text &
contiuned text
c comment line in between
more text &
last line
2221 88 -12.123 &
line1
line2
c comment line
last line
223 0 lll -111 $ element given by line
22 22 -3.14 $ element given by new line
I would like to get:
111 0 1000 other stuff #<- here a new element begins
some text & #<- "&" or white spaces increment -
some more #<- signal continue on next line
last line &
xyz=1
221 1 1.22E22 # new element $2!=0 must be followed by float
text &
contiuned text
c comment line in between
more text &
last line &
xyz=1
2221 88 -12.123 &
line1
line2
c comment line
last line &
xyz=1
223 0 lll -111 & $ element given by line
xyz=1
22 22 -3.14 & $ element given by new line
xyz=1
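I want to develop an awk script that appends a string to the last line of each element. To do that, my script looks for the new-element pattern and keeps reading until it finds one of the next element indicators. Unfortunately it does not work properly: it prints the last line twice, and it cannot append to the last line of the file.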
function newelement(line) {
    split(line, s, " ")
    if (s[1] ~ /^[0-9]+$/ && ((s[2] ~ /^[0-9]+$/ && s[3] ~ /\./) || (s[2] == 0 && s[3] !~ /\./))) {
        return 1
    } else {
        return -1
    }
}

function contline(line) {
    if (line ~ /&/ || line ~ /^[cC]/ || line ~ /^\s{3,10}[^\s]./) {
        return 1
    } else {
        return -1
    }
}

BEGIN {
    subs = " xyz=1 "
}

# increment to have the next line in store
FNR == 1 {
    getline nextline < FILENAME
}

{
    # get the next line
    getline nextline < FILENAME
    if (newelement($0) == 1 && NR < 3673) {
        if (length(old) > 0 || $0 ~ /^$/) {
            printf("%s &\n%20s\n", old, subs)
            print $0
        }
        # to capture one line elements with no following continuation
        # i.e.
        # 221 91 0.5 33333
        # 22 0 11
        # look at the next line
        else if (($0 !~ /&/ && contline(nextline) == -1)) {
            printf("%s &\n%20s\n", $0, subs)
        }
    }
    else {
        print "-" $0
    }
    # store last not - commented line
    if ($0 !~ /^\s{0,20}[cC]/) old = $0
}
Comment lines start with c or C followed by whitespace. Comment lines should be kept, but no string should be appended after them.
Answer 0 (score: 2)
Please check the code below and let me know whether it works for you:
$ cat 3.1.awk
BEGIN {
    subs = " xyz=1 "
    threshold = 3673
}

# return boolean if the current line is a new element
function is_new_element() {
    return ($1 ~ /^[0-9]+$/) && (($2 ~ /^[0-9]+$/ && $3 ~ /\./) || ($2 == 0 && $3 !~ /\./))
}

# return boolean if the current line is a comment or empty line
function is_comment() {
    return /^\s*[cC] / || /^\s*$/
}

# function to append extra text to line
# and followed by comments if applicable
function get_extra_text(    extra_text) {
    extra_text = sprintf("%s &\n%20s", prev, subs)
    text = (text ? text ORS : "") extra_text
    if (prev_is_comment) {
        text = text ORS comment
        prev_is_comment = 0
        comment = ""
    }
    return text
}

NR < threshold {
# replace the above line with the following one if
# you want to process up to the first EMPTY line
#NR==1,/^\s*$/ {
    # if the current line is a new element
    if (is_new_element()) {
        # save the last line and preceding comments
        # into the variable 'text', skip the first new element
        if (has_hit_first_new_element) text = get_extra_text()
        has_hit_first_new_element = 1
        prev_is_new = 1
    # before hitting the first new_element line, all lines
    # should be printed as-is
    } else if (!has_hit_first_new_element) {
        print
        next
    # if current line is a comment
    } else if (is_comment()) {
        comment = (comment ? comment ORS : "") $0
        prev_is_comment = 1
        next
    # if the current line is neither new nor comment
    } else {
        # if previous line a new element
        if (prev_is_new) {
            print (text ? text ORS : "") prev
            text = ""
        # if previous line is comment
        } else if (prev_is_comment) {
            print prev ORS comment
            prev_is_comment = 0
            comment = ""
        } else {
            print prev
        }
        prev_is_new = 0
    }
    # prev saves the last non-comment line
    prev = $0
    next
}

# print the last block if NR >= threshold
!is_last_block_printed {
    print get_extra_text()
    is_last_block_printed = 1
}

# print lines when NR > threshold or after the first EMPTY line
{ print "-" $0 }
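Where:

The lines are split into 3 categories, and each category is handled differently:
1. is_new_element() is true: the current line is a new element. The flag prev_is_new records that the previous line was a new element.
2. is_comment() is true: the current line is a comment. The flag prev_is_comment records that the previous line was a comment.
3. All other lines.

Other notes:
1. Use NR < threshold (3673 in your code) or the range pattern NR==1,/^\s*$/ to process only a certain range of lines.
2. The is_last_block_printed flag and its related code make sure that the last processed block is printed at the end of that range or in the END{} block.
3. For lines with &: if they are followed by a comment or a new element, you must define the logic, i.e. which one should take precedence.
4. If there are other lines before the first is_new_element() line, the code will not work properly. This can be fixed by adding another flag instead of using if (NR > 1) to update text.

Test sample: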
$ cat 3.1.txt
111 0 1000 other stuff #<- here a new element begins
some text & #<- "&" or white spaces increment -
some more #<- signal continue on next line
last line
221 1 1.22E22 # new element $2!=0 must be followed by float
text &
contiuned text
c comment line in between
more text &
last line
2221 88 -12.123 &
line1
line2
c comment line 1
last line
c comment line 2
c comment line 3
c comment line 4
c comment line 5
223 0 lll -111
223 0 22 -111
223 0 22 -111
c comment line in between 1
c comment line in between 2
22 22 -3.14
c comment line at the end
Output:
$ awk -f 3.1.awk 3.1.txt
111 0 1000 other stuff #<- here a new element begins
some text & #<- "&" or white spaces increment -
some more #<- signal continue on next line
last line &
xyz=1
221 1 1.22E22 # new element $2!=0 must be followed by float
text &
contiuned text
c comment line in between
more text &
last line &
xyz=1
2221 88 -12.123 &
line1
line2
c comment line 1
last line &
xyz=1
c comment line 2
c comment line 3
c comment line 4
c comment line 5
223 0 lll -111 &
xyz=1
223 0 22 -111 &
xyz=1
223 0 22 -111 &
xyz=1
c comment line in between 1
c comment line in between 2
22 22 -3.14 &
xyz=1
c comment line at the end
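Some additional notes:

1. When appending subs to the prev line, the trailing newline "\n" is the tricky part of the text processing. This matters most when several new_element lines appear in a row.

2. It is important that the variable prev is defined as the previous non-comment line (categories 1 and 3 above). There may be zero or more comment lines (category 2) between the prev line and the current line. That is why, when printing regular comments (as opposed to comments that sit right before a new_element line), we use
       print prev ORS comment
   rather than
       print comment ORS prev

3. A block of comment lines (1 or more consecutive comment lines) is saved into the variable comment. If the block sits right before a new_element line, it is appended to the variable text. Every other comment block is printed by the print prev ORS comment statement mentioned above.

4. The function get_extra_text() builds the extra_text in the order: prev subs ORS comments, where comments is appended only when the prev_is_comment flag is 1. Note that when new_element lines follow each other, the same variable text may already hold several prev subs ORS comments blocks.

5. We only print on category 3 lines (neither new_element nor comment). That is the safe place, because there we do not have to worry about the trailing newline or the extra_text:
   - if the previous line is a new_element: print the cached text, then the variable prev (which is a new_element)
   - if the previous line is a comment: print prev ORS comment. Note again that prev holds the last non-comment line before the current line; it need not be the line directly above the current line.
   - otherwise: print the prev line

6. Since lines are concatenated into the variables text and comment, we use the following syntax to avoid a leading ORS ("\n" by default):
       text = (text ? text ORS : "") prev
   If a leading ORS is not a concern, this is enough:
       text = text ORS prev
   Because lines are appended to these variables, they must be reset after every use (i.e. text = ""); otherwise the concatenated variable will contain all previously processed lines.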
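The concatenate-then-reset idiom described above can be tried in isolation. The following is only a minimal sketch, not part of the answer's script: the "--" separator line and the square brackets are made up for the demo, and the awk program is wrapped in a POSIX shell command substitution so it can be run directly.

```shell
# Hypothetical input: two groups of lines separated by a "--" marker line.
out=$(printf 'a\nb\n--\nc\n' | awk '
  # On the marker: flush the buffer, then reset it so the next group starts empty.
  $0 == "--" { print "[" text "]"; text = ""; next }
  # Append the current line; ORS is inserted only when the buffer is non-empty.
  { text = (text ? text ORS : "") $0 }
  # Flush whatever is still buffered at end of input.
  END { print "[" text "]" }
')
printf '%s\n' "$out"
```

This prints the first group as "[a", "b]" on two lines, followed by "[c]". One caveat of the (text ? ... : "") test: an empty first line leaves the buffer empty, so that line is silently dropped; a separate counter or flag would be needed if empty lines matter.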
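Final notes:
- The flag has_hit_first_new_element was added in case there are lines before the first new_element line; those lines are printed as-is. In this code the first new_element line has to be treated differently, and relying on NR == 1 is not a safe guard.
- Redundant code in the block was removed.

Answer 1 (score: 1)
Try this: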
function newelement(line) {
    split(line, s, " ")
    if (s[1] ~ /^[0-9]+$/ && ((s[2] ~ /^[0-9]+$/ && s[3] ~ /\./) || (s[2] == 0 && s[3] !~ /\./))) { return 1 }
    else { return -1 }
}

BEGIN {
    subs = " xyz=1 "
}

{
    if (length($0) == 0) next # Skip empty lines, remove or change it according to your needs.
    if (newelement($0) == 1) {
        if (length(last_data) > 0) {
            printf("%s &\n%20s\n", last_data, subs)
            if (last_type == "c") {
                print comments
            }
        }
        last_data = $0
        last_type = "i"
    } else if ($0 ~ /^\s*[cC] /) {
        if (last_type == "c") comments = comments ORS $0
        else comments = $0
        last_type = "c"
    } else {
        if (last_type == "c") print comments
        else if (length(last_data) > 0) print last_data
        last_data = $0
        last_type = "d"
    }
}

END {
    printf("%s &\n%20s\n", last_data, subs)
    if (last_type == "c") print comments
}
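Three variables:
last_data holds the last data line.
last_type holds the type of the last line: i for an indicator (new element) line, c for a comment; the code also sets d for any other data line.
comments holds the buffered comment lines.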