Question

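sh noob here, so please be gentle. This is a preprocessing exercise using the command line (I'm on a Mac).

I have a large CSV file (original.csv) with ~1M rows and 4 columns. I'd like to create a processing script that pulls out all rows based on a column value, i.e. gets all the rows for each distinct value. There are 138393 distinct values in column 1. I'm doing the above via awk.

From here, I'd like to take about half of the rows found for each value, shuffle the rows (or select them randomly), and then split the two groups into two CSV files (file1.csv and file2.csv). FWIW, it's a machine learning exercise, so I'm splitting the data into test/train.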

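Is this an efficient way to do it? The biggest bottlenecks I have right now (there are probably more that I don't see):

Pulling rows by column value via awk

I/O: copying rows into separate CSVs, then going through each CSV and appending half of its rows to train.csv and the other half to test.csv

Shuffling each of the files above

... BONUS: any multithreaded solution that speeds up the whole process!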

My CSV data is basic (and already sorted by the column 1 value):

1,2,3.5,1112486027
1,29,3.5,1112484676
1,32,3.5,1112484819
1,47,3.5,1112484727

CODE：

#!/bin/bash

DATA_FILE=noheader.csv

awk -F "," '{ print >> ("r"$1".csv"); close("r"$1".csv") }' $DATA_FILE # Creates separate CSV file for each userID

ID_FILE=unique_ids.txt
if [ -e $ID_FILE ]
then
    IDX=$(wc -l unique_ids.txt | awk '{print $1}') # Get count of total rows in CSV
    printf "Found %d userIDs \n" $IDX
else
    printf "File %s Not Found! \n" "$ID_FILE"
    printf "Creating Unique IDs File \n"
    cut -d , -f1 $DATA_FILE | sort | uniq > unique_ids.txt
fi

COUNT=0
START=$(date +%s)
for ((i=1; i <= $IDX; i++)) # Iterate through each user CSV file
{
    FILE=r${i}.csv

    TOT_LNO=$(wc -l $FILE | awk -v FILE="$FILE" '{ print $1; close(FILE) }') # Calc total number of rows in file
    SPLT_NO=$(($TOT_LNO / 2)) # ~50% split of user row count for test/train split

    gshuf -n $TOT_LNO $FILE # Randomly shuffle rows in csv file

    head -n $SPLT_NO $FILE >> train_data.csv # Appends first line# rows of user{n} ratings to training data
    OFFSET=$(($SPLT_NO + 1))
    tail -n +$OFFSET $FILE >> test_data.csv # Appends rows nums > line# of user{n} ratings to test data

    # awk 'FNR==NR{a[$1];next}($1 in a){print}' file2 file1 # Prints out similarities btwn files (make sure no train/test spillage)

    rm $FILE # Deletes temp user rating files before proceeding

    ((COUNT++))
    if ! ((COUNT % 10000))
    then
        printf "processed %d files!\n" $COUNT
    fi
}

END=$(date +%s)
TIME=$((END-START))
printf "processing runtime: %d:\n" $TIME

OUTPUT (assuming it was shuffled):

train.csv
1,2,3.5,1112486027
1,47,3.5,1112484727

test.csv
1,32,3.5,1112484819
1,29,3.5,1112484676

Answer 1

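I'm guessing here, since you didn't provide sample input and expected output that we could test against, but it sounds like all you need is: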

shuf infile.csv | awk -F, '$1==1{ print > ("outfile" (NR%2)+1 ".csv") }'

If that's not what you want, edit your question to include concise, testable sample input and expected output.
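The one-liner above only handles the rows where column 1 equals 1. As a rough sketch of how the same idea could be extended to every user ID in a single pass, the snippet below tags each row with a random key, sorts by user ID and then by that key (which shuffles rows within each user), and alternates the rows of each user between two files. The function name split_by_user, the file name sample.csv, and the fixed seed are assumptions made for illustration; they do not come from the question. It uses only awk and sort, so it should not need GNU shuf on a Mac.

```shell
# Hypothetical helper (not from the original post):
#   split_by_user INPUT TRAIN TEST [SEED]
split_by_user() {
  : > "$2"; : > "$3"                      # start with empty output files
  # Decorate: prefix each row with "userID,randomKey,".
  awk -F, -v seed="${4:-42}" '
    BEGIN { srand(seed) }
    { printf "%s,%.10f,%s\n", $1, rand(), $0 }' "$1" |
  # Sort by user ID, then by the random key: a shuffle within each user.
  sort -t, -k1,1n -k2,2n |
  # Undecorate and send alternate rows of each user to train/test.
  awk -F, -v train="$2" -v test_file="$3" '
    {
      row = $0
      sub(/^[^,]*,[^,]*,/, "", row)       # drop the two helper columns
      n[$1]++
      if (n[$1] % 2)
        print row > train
      else
        print row > test_file
    }'
}

# Sample rows from the question, plus an assumed second user for illustration.
printf '%s\n' \
  '1,2,3.5,1112486027' \
  '1,29,3.5,1112484676' \
  '1,32,3.5,1112484819' \
  '1,47,3.5,1112484727' \
  '2,5,4.0,1112484000' \
  '2,7,2.5,1112484100' \
  '2,9,3.0,1112484200' > sample.csv

split_by_user sample.csv train.csv test.csv
```

With an odd number of rows for a user, the extra row goes to the training file in this sketch. Note that awk's rand() is fine for an exercise like this but is not a high-quality random source.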

Answer 2

The approach below is slightly faster than the accepted awk answer.

Using shuf, GNU split's -n option, and mv:

grep '^1,' noheader.csv | shuf | split -n r/2 ; mv xaa train.csv ; mv xab test.csv

This won't work on a Mac as-is, because Macs ship BSD split, which has no -n option.
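For a stock Mac without GNU coreutils, one possible workaround is to imitate both shuf and the round-robin behaviour of split -n r/2 with awk and sort. This is only a sketch under that assumption, not a tested drop-in replacement for the pipeline above; the sample noheader.csv created here reuses the rows from the question plus one assumed row for a second user, to show that the grep filter still applies.

```shell
# Sample input: the four rows from the question plus an assumed user-2 row.
printf '%s\n' \
  '1,2,3.5,1112486027' \
  '1,29,3.5,1112484676' \
  '1,32,3.5,1112484819' \
  '1,47,3.5,1112484727' \
  '2,5,4.0,1112484000' > noheader.csv

rm -f train.csv test.csv

grep '^1,' noheader.csv |
  # Stand-in for shuf: prefix a random key, sort on it, strip it again.
  awk 'BEGIN { srand() } { printf "%.10f\t%s\n", rand(), $0 }' |
  sort -k1,1n |
  cut -f2- |
  # Stand-in for split -n r/2: odd rows to train.csv, even rows to test.csv.
  awk '{ print > (NR % 2 ? "train.csv" : "test.csv") }'
```

Because srand() without an argument seeds from the current time, two runs within the same second may produce the same order; passing an explicit seed is an option if reproducibility matters.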

Copy rows from a CSV based on column value; then split into separate shuffled CSVs
