I just want the last 3 characters of a column written back to the original file

Time: 2016-04-17 22:56:55

Tags: csv awk sed

The first two lines of my data:

"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","123427","456060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"

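I only want the last 3 characters of columns 2 and 3, and I don't want the header row affected. I'm happy with a solution that handles column 2 first and then column 3.

I've been fiddling with sed and awk, but no joy yet.

This is what I want: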

"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"

edit1 This gives me the last 3 digits (plus the closing "); now I just need to write them back to the original file?

$ awk -F"," 'NR>1{ print $2}' head_test_real.csv | sed 's/.*\(....\)/\1/'
427"
592"
007"
592"
409"
742"
387"
731"
556"

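edit2 This works, but I lose the double quotes: "123427" becomes 427, and I want to keep the double quotes.
 * NR>1 applies the action only to lines after line 1.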

$ awk -F, 'NR>1{$2=substr($2,length($2)-3,3)}1' OFS=, head_test_real.csv
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06",427,"456060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"

edit3 @Mark has the correct answer; this is here only as my own reference on the quoting options.

$ ####csv.QUOTE_ALL

$ cat out.csv
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"



$ ####csv.QUOTE_MINIMAL

$ cat out.csv
Rec_Open_Date,MSISDN,IMEI,Data_Volume_Bytes,Device_Manufacturer,Device_Model,Product_Description
2015-10-06,427,060,137765,Samsung Korea,Samsung SM-G900I,$39 Plan

$ ###csv.QUOTE_NONNUMERIC

$ cat out.csv
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"



$ ###csv.QUOTE_NONE

$ cat out.csv
Rec_Open_Date,MSISDN,IMEI,Data_Volume_Bytes,Device_Manufacturer,Device_Model,Product_Description
2015-10-06,427,060,137765,Samsung Korea,Samsung SM-G900I,$39 Plan

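3 answers:

Answer 0: (score: 2)

While awk seems like a great fit for comma-separated data, it doesn't cope well with quoted fields. I recommend a dedicated CSV-handling library such as the one that ships with Python (2 or 3):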

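import csv

# On Python 3, add newline='' to both open() calls, as the csv module docs recommend.
with open('in.csv','r') as infile:
  reader = csv.reader(infile)
  with open('out.csv','w') as outfile:
    writer = csv.writer(outfile,delimiter=',',quotechar='"',quoting=csv.QUOTE_ALL)
    writer.writerow(next(reader))  # copy the header row unchanged

    for row in reader:
      row[1] = row[1][-3:]  # column 2: keep the last 3 characters
      row[2] = row[2][-3:]  # column 3: keep the last 3 characters
      writer.writerow(row)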

Put the code above in a file named e.g. fixcsv.py, change the file names to match the file you have and the file you want, then run it with python fixcsv.py (or python3 fixcsv.py).

I set it to quote everything in the output (QUOTE_ALL); if you don't want that, you can set it to QUOTE_MINIMAL, QUOTE_NONNUMERIC, or QUOTE_NONE.
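
For reference, a minimal sketch of what each quoting constant does to a single row. It assumes Python 3; the sample row and the render helper are made up for illustration.

```python
import csv
import io

# Sample row; every field is a string, as csv.reader would return it.
row = ["2015-10-06", "427", "060", "$39 Plan"]

def render(quoting):
    """Write `row` with the given quoting mode and return the resulting text."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator="\n").writerow(row)
    return buf.getvalue()

for name in ("QUOTE_ALL", "QUOTE_MINIMAL", "QUOTE_NONNUMERIC", "QUOTE_NONE"):
    print(name, render(getattr(csv, name)), end="")
```

Because csv.reader returns every field as a string, QUOTE_NONNUMERIC quotes everything here, just like QUOTE_ALL.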

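The row assignments replace the second and third fields (row[1] and row[2], since the first field is row[0]) with their last three characters ([-3:]). You could also do it arithmetically, e.g. row[1] = int(row[1]) % 1000, although that drops leading zeros (456060 becomes 60 rather than 060).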
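
A quick comparison of the two approaches, using the values from the question. The variable names are only for illustration, and the zero-padding line is one possible way to make the arithmetic variant match the slice, not something the answer prescribes.

```python
msisdn, imei = "123427", "456060"

# String slicing keeps the characters exactly as they are.
print(msisdn[-3:])
print(imei[-3:])

# Arithmetic yields an int, so a leading zero disappears...
remainder = int(imei) % 1000
print(remainder)

# ...unless the result is padded back to three digits.
padded = "%03d" % remainder
print(padded)

# Slicing a value shorter than three characters returns it unchanged.
short = "42"[-3:]
print(short)
```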

Answer 1: (score: 1)

Perl to the rescue!

perl -pe 's/",".*?(...",")/","$1/ if $. > 1' < input > output
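  • -p reads the input line by line and prints each line after processing
  • s/regex/replacement/ is substitution
  • .*? matches anything (like .*), but the question mark makes it "frugal" (non-greedy), i.e. it matches the shortest possible string
  • (...",") creates a capture group that starts three characters before "," and can be referenced as $1
  • $. is the line number, so no substitution happens on line 1.

Make sure the first two columns are always quoted and that the second column is never shorter than 3 characters.

To modify the third column, change the regular expression to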

perl -pe 's/^("(?:.*?","){2}).*?(...",")/$1$2/ if $. > 1'
#                         ~

Change the indicated number to handle whichever column you like.
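
For comparison, the same substitution can be sketched with Python's re module. keep_last3 is a hypothetical helper, not part of the Perl above. Like the Perl version it assumes every column is quoted and at least 3 characters long, and it does not handle the last column, which has no "," after it.

```python
import re

def keep_last3(line, column):
    """Trim the given 1-based column of a fully quoted CSV line to its last 3 characters."""
    # Skip (column - 1) quoted fields, then lazily drop everything
    # up to the three characters that precede the next ",".
    pattern = r'^("(?:.*?","){%d}).*?(...",")' % (column - 1)
    return re.sub(pattern, r'\1\2', line, count=1)

line = '"2015-10-06","123427","456060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"'
result = keep_last3(keep_last3(line, 2), 3)
print(result)
```

Skipping the header line (the $. > 1 condition in Perl) is left to the caller in this sketch.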

Answer 2: (score: 1)

$ awk 'BEGIN{FS=OFS="\",\""} NR>1{for (i=2;i<=3;i++) $i=substr($i,length($i)-2)} 1' file
"Rec_Open_Date","MSISDN","IMEI","Data_Volume_Bytes","Device_Manufacturer","Device_Model","Product_Description"
"2015-10-06","427","060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"

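The trick in the awk command is that the field separator is the three-character string "," rather than a bare comma, so the interior fields carry no quotes and the separator puts them back on output. A rough Python sketch of the same idea follows; trim_fields is a hypothetical name, and it assumes every field is quoted and that no field contains "," itself.

```python
def trim_fields(line, columns=(2, 3), keep=3):
    """Keep only the last `keep` characters of the given 1-based interior columns."""
    # Splitting on '","' leaves the opening quote on the first field and the
    # closing quote on the last one, so only interior columns are safe to trim.
    fields = line.rstrip("\n").split('","')
    for col in columns:
        fields[col - 1] = fields[col - 1][-keep:]
    return '","'.join(fields)

line = '"2015-10-06","123427","456060","137765","Samsung Korea","Samsung SM-G900I","$39 Plan"'
trimmed = trim_fields(line)
print(trimmed)
```
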
As with any command, writing back to the original file is just:

command file > tmp && mv tmp file
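
The same write-to-a-temporary-file-then-rename pattern can be sketched in Python, here combined with the csv-based approach. This assumes Python 3; rewrite_in_place is a hypothetical helper name and the 0-based column indexes are its own convention.

```python
import csv
import os
import tempfile

def rewrite_in_place(path, columns=(1, 2), keep=3):
    """Trim the given 0-based columns to their last `keep` characters, in place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the final
    # rename stays on one filesystem.
    with open(path, newline="") as src, tempfile.NamedTemporaryFile(
            "w", newline="", dir=directory, delete=False) as tmp:
        reader = csv.reader(src)
        writer = csv.writer(tmp, quoting=csv.QUOTE_ALL)
        writer.writerow(next(reader))  # header row is copied unchanged
        for row in reader:
            for i in columns:
                row[i] = row[i][-keep:]
            writer.writerow(row)
    # Replace the original only after the new file was written completely.
    os.replace(tmp.name, path)
```

As with the shell version, the original file is only replaced if the rewrite succeeded, so a failure partway through leaves it untouched.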