I have an undelimited text file containing about a million lines.
Sample lines
1YBL LOYALTY EXT 1000101172019001
2000100101000011512753184907301010614199100919699034659 VIDYA.SAGAR1@bank.IN VIDYA SAGAR CROSS BANDRA WM DELHI 456471
3000000027
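Every line starts with the digit "2", "1", or "3" (the line type). In each line I have to insert delimiters based on character counts (i.e. at the end of positions 0-1, 1-20, 21-25, and so on).
How can I do this with a Linux script?
Desired output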
1|YBL LOYALTY EXT |10001|01172019|001
2|00010010100001151|2753|184907301010614199100919699034659 |VIDYA.SAGAR1@bank.IN |VIDYA SAGAR |CROSS |BANDRA |WM |DELHI |456471
3|000000027
I tried this command
perl -ne ' if(/^2/) { @x=(1,19,6,4,3,8,20,60,40,40,40,40,30); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_"} if(/^1/) { @x=(1,16,5,8); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_" } if(/^3/) { @x=(1); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_" }' filename
Input lines
1YBL LOYALTY EXT 1000112102018001
2000100101000002631653184911501010111199100919323739251 VIJAYPANDEY1191@GMAIL.COM VIJAY PANDEY PART OF GROUND FLOOR & BASEMENT SHOPPER STOP SV ROAD ANDHERI WEST LANDMARK-ERSTWHILE CRASSWORD BOOK STORE MUMBAI 400058
2000100101000019920453184964321010513199000919878857482 MAKSUDMASTER7775@GMAIL.COM MOHAMAD MAQSHUD MASTER H COLLECTION NEW SHIVPURI GALI NO 1 NEAR MAKHAN SINGH CHOWK LUDHIANA 141008
2000100101000023500853184923441010913197300919375580888 JAYNTITALA@GMAIL.COM JAYANTIBHAI TADA 44 KHODIYAR NAGAR B S ABHISHEK SUDAMA CHOWK KHODIYARNAGAR MOTA VARACHHA SURAT 395006
3000000066
Expected output
1|YBL LOYALTY EXT |10001|12102018|001
2|0001001010000026316|531849|1150|101|01111991|00919323739251 |VIJAYPANDEY1191@GMAIL.COM |VIJAY PANDEY |PART OF GROUND FLOOR & BASEMENT |SHOPPER STOP SV ROAD ANDHERI WEST |LANDMARK-ERSTWHILE CRASSWORD BOOK STORE |MUMBAI |400058
2|0001001010000199204|531849|6432|101|05131990|00919878857482 |MAKSUDMASTER7775@GMAIL.COM |MOHAMAD MAQSHUD MASTER |H COLLECTION NEW SHIVPURI |GALI NO 1 |NEAR MAKHAN SINGH CHOWK |LUDHIANA |141008
2|0001001010000235008|531849|2344|101|09131973|00919375580888 |JAYNTITALA@GMAIL.COM |JAYANTIBHAI TADA |44 KHODIYAR NAGAR B S ABHISHEK |SUDAMA CHOWK |KHODIYARNAGAR MOTA VARACHHA |SURAT |395006
3|000000066
What I am getting
1|YBL LOYALTY EXT |10001|12102018|001
2|0001001010000026316|531849|1150|101|01111991|00919323739251 |VIJAYPANDEY1191@GMAIL.COM |VIJAY PANDEY |PART OF GROUND FLOOR & BASEMENT |SHOPPER STOP SV ROAD ANDHERI WEST |LANDMARK-ERSTWHILE CRASSWORD BOOK STORE |MUMBAI |400058
2|0001001010000199204|531849|6432|101|05131990|00919878857482 |MAKSUDMASTER7775@GMAIL.COM |MOHAMAD MAQSHUD MASTER |H COLLECTION NEW SHIVPURI |GALI NO 1 |NEAR MAKHAN SINGH CHOWK |LUDHIANA |141008
1|41008|
2|0001001010000235008|531849|2344|101|09131973|00919375580888 |JAYNTITALA@GMAIL.COM |JAYANTIBHAI TADA |44 KHODIYAR NAGAR B S ABHISHEK |SUDAMA CHOWK |KHODIYARNAGAR MOTA VARACHHA |SURAT |395006
3|95006
3|000000066
Answer 0 (score: 4)
Using GNU awk's FIELDWIDTHS:
$ awk -v FIELDWIDTHS='1 17 4 *' -v OFS='|' '/^2/{$1=$1; gsub(/\s+/,"&"OFS)} 1' file
1YBL LOYALTY EXT 1000101172019001
2|00010010100001151|2753|184907301010614199100919699034659 |VIDYA.SAGAR1@bank.IN |VIDYA |SAGAR |CROSS |BANDRA |WM |DELHI |456471
3000000027
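The above use of FIELDWIDTHS says that the input should be split into four fields: widths of 1 character, 17 characters, and 4 characters, and then the rest of the line.
When you assign a value to a field, awk rebuilds the record, replacing the input field separators with the value of OFS, so $1=$1 causes a | to be inserted between each of the fields described by FIELDWIDTHS.
Once that is done, field separators still need to be added to all the remaining space-separated text, so gsub() adds an OFS after each run of whitespace.
Older versions of gawk do not support * meaning "the rest of the line". If you run into that, just replace * with a large value such as 99999.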
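Where FIELDWIDTHS is not available at all, or when all three record types need handling in one pass, the same idea can be sketched with substr() in any POSIX awk. This is only a sketch under assumptions: split_fixed is a hypothetical helper name, the widths for types 1 and 3 are taken from the command in the question, and the widths for type 2 are the ones used in the FIELDWIDTHS command above.

```shell
# Hypothetical helper: reads lines on stdin, writes "|"-delimited lines.
# The width lists are assumptions copied from the thread, not a confirmed layout.
split_fixed() {
  awk '
    BEGIN {
      widths["1"] = "1 16 5 8"   # from the command in the question
      widths["2"] = "1 17 4"     # same widths as the FIELDWIDTHS example
      widths["3"] = "1"          # from the command in the question
    }
    {
      rt = substr($0, 1, 1)                      # record type is the first character
      if (!(rt in widths)) { print; next }       # unknown type: print unchanged
      n = split(widths[rt], w, " ")
      out = ""; pos = 1
      for (i = 1; i <= n; i++) {                 # cut each fixed-width field in turn
        out = out substr($0, pos, w[i]) "|"
        pos += w[i]
      }
      rest = substr($0, pos)                     # everything after the fixed fields
      if (rt == "2") gsub(/[ \t]+/, "&|", rest)  # type 2: delimiter after each whitespace run
      print out rest
    }
  '
}
```

It would be used as `split_fixed < file`. As with the gsub() in the command above, every run of whitespace in a type-2 line gets a delimiter, so multi-word values are split into separate fields.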
Answer 1 (score: 1)
You can also try Perl
perl -lpe ' if(/^2/) { @x=(1,17,4);
for $i (@x) { s/(.{$i})//; printf("%s|",$1) } }' input_file
With the given input
$ cat rahman.txt
1YBL LOYALTY EXT 1000101172019001
2000100101000011512753184907301010614199100919699034659 VIDYA.SAGAR1@bank.IN VIDYA SAGAR CROSS BANDRA WM DELHI 456471
3000000027
$ perl -lpe ' if(/^2/) { @x=(1,17,4);
for $i (@x) { s/(.{$i})//; printf("%s|",$1) } }' rahman.txt
1YBL LOYALTY EXT 1000101172019001
2|00010010100001151|2753|184907301010614199100919699034659 VIDYA.SAGAR1@bank.IN VIDYA SAGAR CROSS BANDRA WM DELHI 456471
3000000027
$
To split off more fields, just add entries to the array, e.g. change @x=(1,17,4) to @x=(1,17,4,10,20).
EDIT1:
To add delimiters to the fields that can be split on whitespace, use the following
$ perl -lpe ' if(/^2/) { @x=(1,17,4);
for $i (@x) { s/(.{$i})//; printf("%s|",$1) } s/\S+\s+\K/|/g }' rahman.txt
1YBL LOYALTY EXT 1000101172019001
2|00010010100001151|2753|184907301010614199100919699034659 |VIDYA.SAGAR1@bank.IN |VIDYA |SAGAR |CROSS |BANDRA |WM |DELHI |456471
3000000027
$
Explanation
perl -lpe # -p prints $_ by default at the end of each iteration of the one-liner
# so a line that does not start with 2 is still printed after the if statement.
' if(/^2/) # if - select line that starts with 2. $_ will have the current line
{
@x=(1,17,4); # x is an array to hold the widths of fields. - 1, 17, 4
for $i (@x) # open for loop to loop through the array x
{
s/(.{$i})//; # no variable is specified, so the substitution acts on the $_ i.e current line
# first instance is s/(.{1})// => match one character and store it in $1 capturing variable
# replace the captured part with nothing and update $_
# e.g if the line is "200010010100001151" .. loop one will capture "2" and $_ becomes "00010010100001151"
# loop 2 => s/(.{17})// matches 17 character and $1 stores "00010010100001151"
printf("%s|",$1) # print $1 along with delimiter pipe
} # end of for loop
} # end of if
# here is default print statement in perl that will print the $_ after all modification
' input_file
EDIT2
With your input I get the result below. It works fine.. what problem are you seeing?
$ perl -ne ' if(/^2/) { @x=(1,19,6,4,3,8,20,60,40,40,40,40,30); $i=0;
> while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
> print "$_"} if(/^1/) { @x=(1,16,5,8); $i=0;
> while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
> print "$_" } if(/^3/) { @x=(1); $i=0;
> while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
> print "$_" }' rahman.txt
1|YBL LOYALTY EXT |10001|01172019|001
2|0001001010000115127|531849|0730|101|06141991|00919699034659 |VIDYA.SAGAR1@bank.IN VID|YA SAGAR CRO|SS BAN|DRA WM | DEL|HI 456|471
3|000000027
$
EDIT3:
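Figured out the problem... $_ gets modified, so at the end of the /^2/ if block $_ holds the value "141008", which then satisfies the next if(/^1/). To avoid this, just copy $_ into a $line variable at the start, and check $line against /^2/, /^3/, /^1/ in the separate if blocks.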
$ perl -lne '$line=$_; if($line=~/^2/) { @x=(1,19,6,4,3,8,20,60,40,40,40,40,30); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_" }
if($line=~/^1/) { @x=(1,16,5,8); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_" }
if($line=~/^3/) { @x=(1); $i=0;
while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
print "$_" }' rahman2.txt
1|YBL LOYALTY EXT |10001|12102018|001
2|0001001010000026316|531849|1150|101|01111991|00919323739251 |VIJAYPANDEY1191@GMAIL.COM |VIJAY PANDEY |PART OF GROUND FLOOR & BASEMENT |SHOPPER STOP SV ROAD ANDHERI WEST |LANDMARK-ERSTWHILE CRASSWORD BOOK STORE |MUMBAI |400058
2|0001001010000199204|531849|6432|101|05131990|00919878857482 |MAKSUDMASTER7775@GMAIL.COM |MOHAMAD MAQSHUD MASTER |H COLLECTION NEW SHIVPURI |GALI NO 1 |NEAR MAKHAN SINGH CHOWK |LUDHIANA |141008
2|0001001010000235008|531849|2344|101|09131973|00919375580888 |JAYNTITALA@GMAIL.COM |JAYANTIBHAI TADA |44 KHODIYAR NAGAR B S ABHISHEK |SUDAMA CHOWK |KHODIYARNAGAR MOTA VARACHHA |SURAT |395006
3|000000066
$
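Answer 2 (score: 0)
The file does have delimiters, you just cannot see them: spaces/tabs. So you only need to replace them with a sed 's/xxx/|/g' command (xxx stands for the space or TAB character). If you are not sure whether the character is a space or a tab, you can open the file in a hex editor (a space is ASCII code 32 (hex: 20), a TAB is 9 (hex: 09)).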
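A minimal sketch of that suggestion, assuming a POSIX sed and od are available; spaces_to_pipes is a hypothetical helper name. Note that it replaces every run of spaces or tabs, so a multi-word value such as "VIDYA SAGAR", which is a single field in the desired output, ends up as two fields, and it does nothing for the leading digits that are not separated by whitespace.

```shell
# Hypothetical helper: replace each run of spaces/tabs with a single "|".
spaces_to_pipes() {
  sed 's/[[:space:]][[:space:]]*/|/g'
}

# Dump the bytes of a sample line in hex to tell tabs (09) from spaces (20)
# without opening a hex editor.
printf 'A\tB C\n' | od -An -tx1
```

It would be used as `spaces_to_pipes < file`.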
Answer 3 (score: 0)
You can try with GNU sed:
sed -E '/^2/{s//&|/;s/(.{19})(....)(\S+\s+)/\1|\2|\3|/}' infile
Answer 4 (score: 0)
If you do not have FIELDWIDTHS, try the following.
awk -v var="1,17,4" -v OFS="|" '
BEGIN{
  num=split(var,width,",")            # widths of the leading fixed-width fields
}
/^2/{
  pos=1; val=""
  for(i=1;i<=num;i++){
    val=val substr($0,pos,width[i]) OFS   # cut one field and append the delimiter
    pos+=width[i]                         # advance to the start of the next field
  }
  rest=substr($0,pos)                 # everything after the fixed-width fields
  gsub(/[[:space:]]+/,"&"OFS,rest)    # add a delimiter after each run of whitespace
  print val rest
  next
}
1                                     # print all other lines unchanged
' Input_file