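I have large .csv files (~40MB) that I would like to split into smaller files under several conditions, naming them according to the data:
Here comes the tricky part:
I have something like this in VBA, but it is too slow for large files, and Excel sometimes crashes. With many files like this, it takes forever to cut them up manually and then put them to work.
Is it possible to cut files under this many conditions?
Thanks in advance for your help.
Example (the header row shows the column numbers):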
1 2 3 4 11 15 17
Date Time COUNTRY CITY CHECK TEST TEST2
2015-08-20 11:54 ENGLAND ABINGDON 1 1
2015-08-21 12:54 ENGLAND BATLEY 2 5
2015-08-22 13:54 ENGLAND FROME 2 6
2015-08-23 14:54 ENGLAND FROME 2 1
2015-08-24 15:54 USA CALIFORNIA 4 8
2015-08-25 16:54 USA CONNECTICUT 4 9
2015-08-26 17:54 USA DELAWARE 1 8
2015-08-27 18:54 GERMANY SAXONY 6 9
2015-08-28 19:54 GERMANY SAXONY 6 10
2015-08-27 18:54 GERMANY SAXONY 4 11
2015-08-28 19:54 GERMANY SAXONY 4 14
2015-08-29 20:54 GERMANY HESSE 8
2015-08-29 20:54 GERMANY HESSE 1 8
File1
2015-08-20 11:54 ENGLAND ABINGDON 1 1
File2
2015-08-21 12:54 ENGLAND BATLEY 2 5
File3
2015-08-22 13:54 ENGLAND FROME 2 6
File4
2015-08-23 14:54 ENGLAND FROME 2 1
File5
2015-08-24 15:54 USA CALIFORNIA 4 8
File6
2015-08-25 16:54 USA CONNECTICUT 4 9
File7
2015-08-26 17:54 USA DELAWARE 1 8
File8
2015-08-27 18:54 GERMANY SAXONY 4 9
2015-08-28 19:54 GERMANY SAXONY 4 10
File9
2015-08-27 18:54 GERMANY SAXONY 6 11
2015-08-28 19:54 GERMANY SAXONY 6 14
File10
2015-08-29 20:54 GERMANY HESSE 8
File11
2015-08-29 20:54 GERMANY HESSE 1 8
Answer 0 (score: 0)
Your data is all over the place! It is not in the columns you describe, and it is not tab-separated either. You don't make life easy!
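A quick way to check whether a file really is tab-separated is to count the fields awk sees on each line. This is only a sketch on a made-up three-line sample (the file and its contents are hypothetical); on your real file you would run the same awk command against file.csv.

```shell
# Hypothetical sample: two tab-separated lines and one space-separated line.
sample=$(mktemp)
printf 'a\tb\tc\n'  > "$sample"
printf 'd\te\tf\n' >> "$sample"
printf 'g h i\n'   >> "$sample"

# For each field count, print how many lines have that many tab-separated fields.
field_counts=$(awk -F'\t' '{ n[NF]++ } END { for (k in n) print k, n[k] }' "$sample" | sort -n)
echo "$field_counts"
```

If every line reports the same field count (17 or more for the layout in the question), the tab separator is probably right. A count of 1 means awk found no tabs on that line.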
Try this awk with your real data to see if it generates output filenames you can use.
awk -F'\t' '{
f=$3 "_" $4 # filename = field3 _ field4
if(length($11)){ # if f11 not null
f=f "_A_" $11 "_" $17 # filename = filename _A_ field11 _ field17
}else{ # else
f=f "_B_" $15 "_" $17 # filename = filename _B_ field15 _ field17
}
print f}' file.csv
You should get something like this
ENGLAND_ABINGDON_A_3_1
ENGLAND_ABINGDON_A_4_2
GERMANY_SAXONY_B_5_3
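Basically it uses awk and tells it that your field separator is a tab. Then it looks at each line and builds up the output filename in the variable f from the fields you describe.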
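To see the naming rule in action without touching the real file, here is a small sketch that runs the same logic on a two-row sample. The rows and their values are made up for illustration; only the field numbers (3, 4, 11, 15, 17) come from the question.

```shell
# Hypothetical two-row sample with 17 tab-separated columns (values are made up).
sample=$(mktemp)
# Row 1: column 11 = 1, column 15 empty, column 17 = 1.
printf '2015-08-20\t11:54\tENGLAND\tABINGDON\t\t\t\t\t\t\t1\t\t\t\t\t\t1\n' > "$sample"
# Row 2: column 11 empty, column 15 = 8, column 17 = 3.
printf '2015-08-29\t20:54\tGERMANY\tHESSE\t\t\t\t\t\t\t\t\t\t\t8\t\t3\n' >> "$sample"

# Build the candidate output filename for each row and print it.
names=$(awk -F'\t' '{
  f = $3 "_" $4                      # country _ city
  if (length($11)) {
    f = f "_A_" $11 "_" $17          # column 11 has data
  } else {
    f = f "_B_" $15 "_" $17          # fall back to column 15
  }
  print f
}' "$sample")
echo "$names"
```

On this sample it prints ENGLAND_ABINGDON_A_1_1 and GERMANY_HESSE_B_8_3, one per line.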
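If your data is separated the way you say it is, you can write the current line to the correspondingly named file by simply changing the last line: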
awk -F'\t' '{
f=$3 "_" $4 # filename = field3 _ field4
if(length($11)){ # if f11 not null
f=f "_A_" $11 "_" $17 # filename = filename _A_ field11 _ field17
}else{ # else
f=f "_B_" $15 "_" $17 # filename = filename _B_ field15 _ field17
}
print > f}' file.csv
Basically, it prints each line to the file, rather than printing the file's name, if you change
print f
to
print > f
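Adding headers
If you want a header in each output file, we need to work a bit harder...
First, we need to save the header from the original file, so if we assume it is record number 1, we would do this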
awk -F'\t' '
NR==1 {header=$0; next} # save first line as header and do not process it as data
{f=$3 "_" $4 # filename = field3 _ field4
...
... as before
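Now we need to output the header line when we start writing to a new file, which is "fun" because we are creating the output filename dynamically for each line! So we need to "remember" which files we have already written to, and only emit the header when we write to a new one. I don't have a decent set of data here, so I am guessing at this!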
awk -F'\t' '
NR==1 {header=$0; next} # save first line as header and do not process it as data
{f=$3 "_" $4 # filename = field3 _ field4
if(length($11)){ # if f11 not null
f=f "_A_" $11 "_" $17 # filename = filename _A_ field11 _ field17
}else{ # else
f=f "_B_" $15 "_" $17 # filename = filename _B_ field15 _ field17
}
# Emit header if first write to this filename
if(!(f in fileswritten)){
fileswritten[f]++ # note that we have written to this file
print header > f # emit header
}
print > f}' file.csv
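One thing to watch for with many distinct output names: awk keeps every file it writes to open, and some awk implementations give up with a "too many open files" error. A possible workaround, sketched below on a made-up sample, is to close each file after writing and reopen it in append mode. The sample data, the seen array and the workdir variable are my own assumptions, and opening and closing on every line will be slower.

```shell
# Hypothetical sample: header plus three rows, 17 tab-separated columns.
workdir=$(mktemp -d)
printf 'Date\tTime\tCOUNTRY\tCITY\t\t\t\t\t\t\tCHECK\t\t\t\tTEST\t\tTEST2\n' > "$workdir/file.csv"
printf '2015-08-22\t13:54\tENGLAND\tFROME\t\t\t\t\t\t\t2\t\t\t\t\t\t6\n' >> "$workdir/file.csv"
printf '2015-08-23\t14:54\tENGLAND\tFROME\t\t\t\t\t\t\t2\t\t\t\t\t\t6\n' >> "$workdir/file.csv"
printf '2015-08-29\t20:54\tGERMANY\tHESSE\t\t\t\t\t\t\t\t\t\t\t8\t\t3\n' >> "$workdir/file.csv"

# Run in a subshell so the output files land in the sample directory.
(
  cd "$workdir" && awk -F'\t' '
  NR == 1 { header = $0; next }   # assumption: line 1 is the header, skip it as data
  {
    f = $3 "_" $4
    if (length($11)) {
      f = f "_A_" $11 "_" $17
    } else {
      f = f "_B_" $15 "_" $17
    }
    if (!(f in seen)) {           # first write: start the file with the header
      seen[f] = 1
      print header > f
      close(f)
    }
    print >> f                    # append the current line
    close(f)                      # release the file handle right away
  }' file.csv
)
ls "$workdir"
```

The first write to each name uses > so a leftover file from an earlier run is overwritten, and every later write uses >> so nothing is lost when the file is reopened.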
Answer 1 (score: 0)
This answer is incomplete, but it roughly shows what you need to do:
#!/bin/bash
# Get list of countries:
countries=`cat file1.csv | cut -f 3 -d$'\t'| grep -v 3 | grep -v COUNTRY | uniq`
for country in ${countries}; do
# Get list of cities per country:
cities=`cat file1.csv | grep ${country} | cut -f 4 -d$'\t' | uniq`
# Get data per country:
cat file1.csv | grep ${country} > file1-${country}.csv
# Get data per city per country:
for city in ${cities}; do
echo ${country}:${city}
cat file1.csv | grep ${country} | grep ${city} > file1-${country}-${city}.csv
done
# Created output by 2 previous operations check if there is any data in 11th column,
# if yes then separate this data accordingly to content and after that separate that
# by content of 17th column -> then save outputs /OR / AND /
# Column 11 is at position 5 in your data.
checks=`cat file1.csv | grep ${country} | cut -f 5 -d$'\t' | uniq`
for check in ${checks}; do
echo ${country}:${check}
# Keep only this country's rows whose 5th field equals the current check value.
cat file1.csv | grep ${country} | awk -F'\t' -v c="${check}" '$5 == c' > file1-${country}-${check}.csv
# TODO: Further split this, I assume you get the drift by now.
done
# If there is no data in column 11 check column 15th and separate accordingly.
# Next check 17 column and separate this data by 17th column -> save outputs.
# TODO: Further split this, I assume you get the drift by now.
done
Answer 2 (score: 0)
I suggest writing a small program using the Java library Apache Commons CSV (its CSVFormat and CSVParser classes):
private static final String[] FILE_HEADER_MAPPING = {"Date", "Time" ,"COUNTRY", .... };
csvFileParser = new CSVParser(fileReader, csvFileFormat);
List<CSVRecord> csvRecords = csvFileParser.getRecords();
Then, to access column 11, you have to do this:
for (int i = 1; i < csvRecords.size(); i++) {
boolean publishAccount = true;
CSVRecord record = csvRecords.get(i);
/// here how to access
record.get("Fiel column 11");
}