I have big_file.csv, which contains a large amount of company information. Here is a snippet:
CompanyName, CompanyNumber,RegAddress.CareOf,...
"! # 1 AVAILABLE LOCKSMITH LTD","05905727","",...
"!NSPIRED LIMITED","06019953",""...
"CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD","07981734",""...
I only need the CompanyName and CompanyNumber fields, so I did the following:
cut -d, -f 1,2 big_file.csv > big_file_names_codes_only.csv
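As you can see (and I understand why), the third entry in big_file.csv gets cut after its first comma, which is actually part of CompanyName. I know how to remove the first comma with sed (but that would break the whole CSV structure), so I was wondering whether any of you know how to remove the commas from the first field only (it is always at position 1), which looks like "string, with, commas, or not and non alphanum chars!".
So basically, the intermediate output I want is: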
CompanyName, CompanyNumber
"! # 1 AVAILABLE LOCKSMITH LTD","05905727"
"!NSPIRED LIMITED","06019953"
"CENTRE FOR COUNSELLING PSYCHOTHERAPY AND TRAINING LTD","07981734"
But the last line becomes:
"CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD"
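Once I have the intermediate output, I need to strip all non-alphanumeric characters from the company name, as well as any leading whitespace. This works well for that: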
sed -i 's/[^a-zA-Z0-9 ,]//g; s/^[ \t]*//' big_file_names_codes_only.csv
In the end, my file should be:
CompanyName, CompanyNumber,RegAddress.CareOf,...
AVAILABLE LOCKSMITH LTD,05905727
NSPIRED LIMITED,06019953
CENTRE FOR COUNSELLING PSYCHOTHERAPY AND TRAINING LTD,07981734
Answer 0 (score: 3)
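It is best to process structured data such as CSV files, with commas embedded in fields, using a tool that actually understands the format, rather than hacking something together with regular expressions (the same goes for XML, JSON, and so on). In the long run it is a lot easier, and it saves you a lot of trouble with edge cases and odd data that does not exactly match your expectations.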
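To make the difference concrete, here is a minimal Python sketch (Python is not part of the original answer; it is used here only for illustration). It compares splitting on every comma, which is effectively what `cut -d, -f1,2` does, with the standard library's `csv` reader on one line of the sample data.

```python
import csv

line = '"CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD","07981734",""'

# Naive approach: treat every comma as a field separator.
naive = line.split(",")[:2]

# CSV-aware approach: a comma inside double quotes stays part of the field.
parsed = next(csv.reader([line]))[:2]

print(naive)   # the company name is split in two
print(parsed)  # company name and company number, quotes removed
```

The naive split returns the two halves of the company name, while the CSV reader returns the name and the number as intended.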
The csvkit suite of utilities has a number of useful command-line tools and is usually available through OS package managers:
$ csvcut -c CompanyName,CompanyNumber blah.csv
CompanyName,CompanyNumber
! # 1 AVAILABLE LOCKSMITH LTD,05905727
!NSPIRED LIMITED,06019953
"CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD",07981734
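You can then go on to use sed to remove the characters you do not want.
(Note: I had to remove the extra space in the header line of the sample data for this to work.)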
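If csvkit is not available, roughly the same column selection can be sketched with Python's standard `csv` module. This is an illustration under assumptions, not part of the original answer: the function name `select_columns` is made up, and `skipinitialspace=True` is used so that the header ` CompanyNumber` (with its leading space) is read as `CompanyNumber` without editing the file.

```python
import csv
import io

def select_columns(src, dst, columns=("CompanyName", "CompanyNumber")):
    """Copy only the named columns from the CSV stream src to dst."""
    # skipinitialspace=True ignores whitespace right after a delimiter,
    # so " CompanyNumber" in the header becomes "CompanyNumber".
    reader = csv.DictReader(src, skipinitialspace=True)
    writer = csv.DictWriter(dst, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Columns not listed in `columns` are dropped by extrasaction="ignore".
        writer.writerow(row)

sample = (
    'CompanyName, CompanyNumber,RegAddress.CareOf\n'
    '"! # 1 AVAILABLE LOCKSMITH LTD","05905727",""\n'
    '"CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD","07981734",""\n'
)
out = io.StringIO()
select_columns(io.StringIO(sample), out)
print(out.getvalue(), end="")
```

For a real file, the streams would be something like `open("big_file.csv", newline="")` and an output file opened for writing. Rows are processed one at a time, so the whole file is never held in memory.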
Edit: Alternatively, here is a perl version using the handy Text::AutoCSV module that also strips the characters:
$ perl -MText::AutoCSV -e 'Text::AutoCSV->new(out_fields => [ "COMPANYNAME", "COMPANYNUMBER" ],
read_post_update_hr => sub {
my $hr = shift;
$hr->{"COMPANYNAME"} =~ s/[^[:alnum:]\s]+//g;
$hr->{"COMPANYNAME"} =~ s/^\s+//;
})->write();' < blah.csv | sed -e 's/"//g'
CompanyName,CompanyNumber
1 AVAILABLE LOCKSMITH LTD,05905727
NSPIRED LIMITED,06019953
CENTRE FOR COUNSELLING PSYCHOTHERAPY AND TRAINING LTD,07981734
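The two substitutions in the callback can be sketched in Python as follows. This is only an illustration: the helper name `clean_name` is hypothetical, and it assumes ASCII letters and digits are the only characters worth keeping, which may be narrower than Perl's `[:alnum:]` on non-ASCII input.

```python
import re

def clean_name(name):
    """Remove everything except ASCII letters, digits and whitespace,
    then trim leading whitespace."""
    # Same idea as s/[^[:alnum:]\s]+//g followed by s/^\s+//.
    return re.sub(r"[^A-Za-z0-9\s]+", "", name).lstrip()

for raw in ("! # 1 AVAILABLE LOCKSMITH LTD",
            "!NSPIRED LIMITED",
            "CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD"):
    print(clean_name(raw))
```

Applied to the already parsed CompanyName field, this removes the embedded comma as well, so the cleaned name no longer needs quoting when it is written back out.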
Answer 1 (score: 1)
Using awk:
$ awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$2}' big_file.csv
CompanyName, CompanyNumber
"! # 1 AVAILABLE LOCKSMITH LTD","05905727"
"!NSPIRED LIMITED","06019953"
"CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD","07981734"
My suggestion is to use a programming language such as R, Python, or Perl for this kind of task.
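Answer 2 (score: 1)
Similar to @Sonny's solution, but it uses gsub in GNU awk (FPAT is a gawk extension) to trim the quotes and commas from the output as in your expected output, and it lists the quoted-field pattern before the unquoted one so that quoted fields take precedence: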
awk -vFPAT='("[^"]+")|([^,]*)' -vOFS=, '{for(n=1;n<3;++n)gsub(/^"|"$|,/,"",$n);print$1,$2}' big_file.csv
This outputs:
CompanyName, CompanyNumber
! # 1 AVAILABLE LOCKSMITH LTD,05905727
!NSPIRED LIMITED,06019953
CENTRE FOR COUNSELLING PSYCHOTHERAPY AND TRAINING LTD,07981734
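For readers more comfortable with Python, the same "match fields by pattern, then strip quotes and commas" idea can be sketched as below. This is not a translation of gawk's FPAT semantics, only an approximation under stated assumptions: the wanted fields are never empty, and quoted fields contain no escaped quotes. The name `first_two_fields` is made up for this example.

```python
import re

# A field is either a double-quoted string or a run of non-comma characters.
# The quoted alternative comes first because Python's re tries alternatives
# from left to right.
FIELD = re.compile(r'"[^"]+"|[^,]+')

def first_two_fields(line):
    """Return the first two fields with surrounding quotes and commas removed."""
    fields = FIELD.findall(line)[:2]
    # Same substitution as gsub(/^"|"$|,/, "") in the awk command.
    return [re.sub(r'^"|"$|,', "", field) for field in fields]

line = '"CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD","07981734",""'
print(",".join(first_two_fields(line)))
```

Because empty fields are skipped by this pattern, field positions shift if an empty unquoted field appears before the ones you want; a real CSV parser does not have that limitation.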
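Answer 3 (score: 1)
I don't know how many commas the first line contains, but if only the company name and company number are needed, this is probably the shortest approach when using bash.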
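The easiest way to get rid of the unwanted quote characters is to run the file through xargs; with -L1 the result looks better, because each input line stays on its own line:
xargs -L1 < big_file.csv
Output: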
CompanyName, CompanyNumber,RegAddress.CareOf,...
! # 1 AVAILABLE LOCKSMITH LTD,05905727,,...
!NSPIRED LIMITED,06019953,...
CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD,07981734,...
Now we can add cut with -f1,2,3, which I guess is what you tried:
xargs -L1 < big_file.csv | cut -d, -f1,2,3
Output:
CompanyName, CompanyNumber,RegAddress.CareOf
! # 1 AVAILABLE LOCKSMITH LTD,05905727,
!NSPIRED LIMITED,06019953,...
CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD,07981734
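OK, now I have the same problem as in your example. Because field 3 was added to the cut, we also get the number after LTD, but the extra characters at the end are still there. A sed pass removes them: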
sed 's/,...$//;s/,$//;s/, / /g' big_file.csv
Putting the pieces together:
sed 's/,...$//;s/,$//;s/, / /g' big_file.csv|xargs -L1|cut -d, -f1,2
CompanyName CompanyNumber,RegAddress.CareOf
! # 1 AVAILABLE LOCKSMITH LTD,05905727
!NSPIRED LIMITED,06019953
CENTRE FOR COUNSELLING PSYCHOTHERAPY AND TRAINING LTD,07981734
Since I had forgotten about the comma before this edit, I found a better solution:
sed 's/,\ / /g' big_file.csv|xargs -L1|cut -d, -f1,2
Answer 4 (score: 1)
Using Perl:
$ perl -lne ' if($.>1) { /^"(.+?)","(.+?)"/ ;$x=$1;$y=$2; $x=~s/[,]//g; print "$x,$y" }
else { print } ' big_file.csv
CompanyName, CompanyNumber,RegAddress.CareOf,...
! # 1 AVAILABLE LOCKSMITH LTD,05905727
!NSPIRED LIMITED,06019953
CENTRE FOR COUNSELLING PSYCHOTHERAPY AND TRAINING LTD,07981734
$
Answer 5 (score: 1)
You can try this sed:
sed -E '
:A
s/^("[^,"]*),(.*)/\1\2/
# label A if CompanyName can have more than 1 comma
tA
s/"//g;s/([^,]*,[^,]*).*/\1/
' big_file.csv
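The logic of this sed script may be easier to follow as a step-by-step Python sketch. The function name `sed_like_first_two` is hypothetical and the code is only meant to mirror the three sed steps, not to replace a proper CSV parser.

```python
import re

def sed_like_first_two(line):
    """Mimic the sed script: drop commas inside the leading quoted field,
    remove all double quotes, then keep only the first two fields."""
    # Like the `:A ... tA` loop: delete one comma from inside the leading
    # quoted field per pass, until the substitution no longer matches.
    while True:
        stripped = re.sub(r'^("[^,"]*),(.*)', r"\1\2", line, count=1)
        if stripped == line:
            break
        line = stripped
    # Like s/"//g: remove every double quote.
    line = line.replace('"', "")
    # Like s/([^,]*,[^,]*).*/\1/: keep the first two comma-separated fields.
    return re.sub(r"([^,]*,[^,]*).*", r"\1", line, count=1)

line = '"CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD","07981734",""'
print(sed_like_first_two(line))
```

The loop stops once the leading quoted field contains no comma, because the pattern then has to match a comma where the closing quote is.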
Answer 6 (score: 0)
awk is your friend. Maybe this helps:
➜ ~ awk 'BEGIN {FS="\",\""} { printf "%s, %s \n",$1,$2 }' big_file.csv | tr -d '\"'
CompanyName, CompanyNumber,RegAddress.CareOf,...,
! # 1 AVAILABLE LOCKSMITH LTD, 05905727
!NSPIRED LIMITED, 06019953
CENTRE FOR COUNSELLING, PSYCHOTHERAPY AND TRAINING LTD, 07981734
Answer 7 (score: 0)
Using GNU awk for FPAT:
$ cat tst.awk
BEGIN { FPAT="\"[^\"]+\"|[^,]*"; OFS="," }
NR == 1 { print; next }
{
for (i=1; i<=NF; i++) {
gsub(/[^[:alnum:]]+/," ",$i)
gsub(/^ | $/,"",$i)
}
print $1, $2
}
$ awk -f tst.awk file
CompanyName, CompanyNumber,RegAddress.CareOf,...
1 AVAILABLE LOCKSMITH LTD,05905727
NSPIRED LIMITED,06019953
CENTRE FOR COUNSELLING PSYCHOTHERAPY AND TRAINING LTD,07981734
If you need to handle other, more unusual CSV files, see What's the most robust way to efficiently parse CSV using awk?.
Answer 8 (score: 0)
Based on the input from the answers, I tried several approaches:
sed -i '0,/ CompanyNumber/ s//CompanyNumber/' big_file.csv
This works, but it is very slow.
sed 's/,\ / /g' big_file.csv | xargs -L1 | csvcut -c CompanyName,CompanyNumber > big_file_cleaned.csv
perl -lne ' if($.>1) { /^"(.+?)","(.+?)"/ ;$x=$1;$y=$2; $x=~s/[,]//g; print "$x,$y" } else { print } ' big_file.csv > big_file_clean.csv
Thanks
Answer 9 (score: 0)
awk 'NR>1{gsub(/"/,"")sub(/.{4}$/,"")gsub(/!|,$/,"")sub(/, /," ")sub(/.{5}A/,"A")}1' file
CompanyName, CompanyNumber,RegAddress.CareOf,...
AVAILABLE LOCKSMITH LTD,05905727
NSPIRED LIMITED,06019953
CENTRE FOR COUNSELLING PSYCHAPY AND TRAINING LTD,07981734