Delimiter problem in a source file

Time: 2013-11-26 15:12:16

Tags: unix awk delimiter cut

I have a problem with my source file. Suppose the file contains the following data:

"dfjsdlfkj,fsdkfj,werkj",234234,234234,,"dfsd,etwetr"

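Here the delimiter is a comma, but some fields contain commas as part of their data. Those fields are enclosed in double quotes. I want to extract a few columns from the file.

If I use cut -d "," -f 1,3, cut splits on every comma, including the ones inside the quotes, so my output looks like this: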

"dfjsdlfkj,werkj"

2 answers:

Answer 0 (score: 1)

I suggest you use a CSV parser. Python, for example, ships one as a built-in module, so you only need to import it:

import sys 
import csv 

with open(sys.argv[1], newline='') as csvfile:
    csvreader = csv.reader(csvfile)
    csvwriter = csv.writer(sys.stdout)
    for row in csvreader:
        csvwriter.writerow([row[e] for e in (0,2)])

Assuming your example line is in an input file named infile, run the script like this:

python3 script.py infile

That yields:

"dfjsdlfkj,fsdkfj,werkj",234234
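To see why the parser matters, here is a minimal sketch that compares naive comma splitting (roughly what cut does) with the csv module on the sample line from the question. The helper name select_columns is hypothetical and only used for illustration; it is not part of the csv module. The sketch also assumes "\n" line endings are wanted, whereas csv.writer defaults to "\r\n".

```python
import csv
import io


def select_columns(text, indexes):
    """Parse CSV text and return the chosen columns, re-encoded as CSV.

    Hypothetical helper for illustration; not part of the csv module.
    """
    out = io.StringIO()
    # Assumption: "\n" line endings are wanted (the csv default is "\r\n").
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        writer.writerow([row[i] for i in indexes])
    return out.getvalue()


sample = '"dfjsdlfkj,fsdkfj,werkj",234234,234234,,"dfsd,etwetr"\n'

# Naive splitting cuts the quoted fields apart, much like cut -d ",".
naive = sample.rstrip("\n").split(",")
print(len(naive))

# The csv reader keeps each quoted field in one piece.
fields = next(csv.reader(io.StringIO(sample)))
print(len(fields))

# Columns 0 and 2 correspond to fields 1 and 3 in cut's numbering.
print(select_columns(sample, (0, 2)), end="")
```

The naive split produces more pieces than there are real fields, so any column numbers passed to cut end up pointing at fragments of a quoted field. The writer adds quotes back only where a field needs them, which is why the second column is printed bare.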

Answer 1 (score: 0)

You can try the following:

awk -f getFields.awk input.txt

where input.txt is your input file and getFields.awk is:

{
    split("",a)
    splitLine()
    print a[1],a[3]
}

function splitLine(s,indq,t,r,len) {
# Assumptions: 
#  * spaces before or after commas are ignored
#  * spaces at beginning or end of line are ignored

# definition of a quoted parameter:
# - starts with: (^ and $ are regexp characters)
#   a) ^"
#   b) ,"
# - ends with:
#   a) "$
#   b) ",

    s=$0; k=1
    s=removeBlanks(s)
    while (s) {
        if (substr(s,1,1)=="\"")
            indq=2
        else {
            sub(/[[:blank:]]*,[[:blank:]]*"/,",\"",s)
            indq=index(s,",\"")
            if (indq) {
                t=substr(s,1,indq-1)
                splitCommaString(t)
                indq=indq+2
            }
        }
        if (indq) {
            s=substr(s,indq)
            sub(/"[[:blank:]]*,/,"\",",s)
            len=index(s,"\",")  #find closing quote
            if (!len) {
                if (match(s,/"$/)) {
                    len=RSTART-1
                }
                else 
                    len=length(s)
                r=substr(s,1,len)
                s=""
            } else {
                r=substr(s,1,len-1)
                s=substr(s,len+2)
            }
            a[k++]=r
        } else  {
            splitCommaString(s)
            s=""
        }
    }
    k=k-1
}

function splitCommaString(t,b,i) {
    n=split(t,b,",")
    for (i=1; i<=n; i++)
        a[k++]=removeBlanks(b[i])
}       

function removeBlanks(r) {
    sub(/^[[:blank:]]*/,"",r)
    sub(/[[:blank:]]*$/,"",r)
    return r
}