Question

I have a CSV file with 500k rows that I need to split into two sets of 400k and 100k rows respectively. However, I can't simply do something like awk 'NR < 100000' file.csv > subset1.csv, because the rows are sorted and need to be distributed randomly.
How can I randomize the two sets? By the way, the sizes don't need to be exact;
a split that leaves, say, 398111 rows in the larger set is also acceptable if a perfect split isn't possible in awk. Also, I need the header row in both output files.

Answer 1

split -l 400000 <(shuf file.csv)

Hope this helps.
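Note that the one-liner above shuffles the header row along with the data. A minimal sketch of a header-aware variant follows, assuming GNU shuf is available; the part_ prefix, the scratch directory, and the 10-row sample (cut at 7 instead of 400000) are illustrative choices, not part of the original answer.

```shell
# Work in a scratch directory so the sample files don't clutter anything
cd "$(mktemp -d)" || exit 1

# Sample data: a header plus 10 rows
{ echo header; seq 1 10; } > file.csv

# Shuffle everything except the header, then cut into chunks of 7 lines.
# With 10 rows this yields part_aa (7 rows) and part_ab (3 rows).
tail -n +2 file.csv | shuf | split -l 7 - part_

# Put the header back on top of each chunk
for f in part_??; do
    { head -n 1 file.csv; cat "$f"; } > "$f.csv"
done

wc -l part_aa.csv part_ab.csv
```

For the real file, the 7 would become 400000.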

Answer 2

With awk. First, a sample file:

$ seq 1 100 > file

Then the script:

$ awk '{print > (rand()<=0.2?"first":"second")}' file

And the result:

$ wc -l first second
 19 first
 81 second
100 total

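From the GNU awk documentation: "Caution: In most awk implementations, including gawk, rand() starts generating numbers from the same starting number, or seed, each time you run awk. [...] if you want a program to do different things each time it is used, you must change the seed to a value that is different in each run. To do this, use srand()." In other words, you probably want to add BEGIN{srand()} to the script.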
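srand() also accepts an explicit seed, which can help when the same split has to be reproduced later. A minimal sketch, assuming an arbitrary seed of 42 passed in through a hypothetical awk variable named seed:

```shell
# Work in a scratch directory
cd "$(mktemp -d)" || exit 1
seq 1 100 > file

# First run with a fixed seed (42 is an arbitrary choice)
awk -v seed=42 'BEGIN { srand(seed) } { print > (rand() <= 0.2 ? "first" : "second") }' file
cp first first.run1

# Second run with the same seed assigns every line to the same file again
awk -v seed=42 'BEGIN { srand(seed) } { print > (rand() <= 0.2 ? "first" : "second") }' file
cmp first first.run1 && echo "same split both times"
```

Called without an argument, srand() seeds from the time of day, so two runs started within the same second may still produce the same split.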

Edit: to gather everything into one script:

awk '
BEGIN {
    srand()                                # change the random seed 
}
NR==1 {
    print > "first"; print > "second"      # write the header to both files
    next                                   # skip to next record
}
{
    print > (rand()<=0.2?"first":"second") # print about every fifth record to first file
}' file
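For the file in the question, the fraction would be roughly 100k / 500k = 0.2. The sketch below passes that fraction in as a variable and checks the resulting sizes; the names p, small.csv and big.csv are illustrative assumptions, and a 1000-row sample stands in for the real file.

```shell
# Work in a scratch directory
cd "$(mktemp -d)" || exit 1

# Sample data: a header plus 1000 rows
{ echo header; seq 1 1000; } > file.csv

awk -v p=0.2 '
BEGIN   { srand() }                                         # new seed on each run
NR == 1 { print > "small.csv"; print > "big.csv"; next }    # header goes to both files
        { print > (rand() < p ? "small.csv" : "big.csv") }  # about p of the rows go to small.csv
' file.csv

# Sizes vary from run to run but should be near 200 and 800 (plus one header each)
wc -l small.csv big.csv
```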

Answer 3

$ cat file.csv
header
1
2
3
4
5
6
7
8
9
10

$ awk 'NR==1{print > "big"; print > "small"; next} 1' file.csv |
shuf |
awk '{print >> (NR<=7 ? "big" : "small")}'

$ cat big
header
10
5
9
2
8
1
3

$ cat small
header
4
6
7

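Just change 7 to 400000. The above assumes you don't need the lines in the output to be in the same order as in the input. If you do care about the output order, a small tweak is enough: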

$ awk -v OFS='\t' 'NR==1{print NR,$0 > "big"; print NR,$0 > "small"; next} {print NR,$0}' file.csv |
shuf |
awk '{print >> (NR<=7 ? "big" : "small")}'

$ sort -n big | cut -f2-
header
1
4
5
6
8
9
10

$ sort -n small | cut -f2-
header
2
3
7
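If shuf is not available, exact sizes and the original row order can also be had from a single awk pass using selection sampling: each row is taken with probability (rows still wanted) / (rows not yet seen). This is a different technique from the pipelines above, sketched here under the assumption that the row count can be obtained first with wc -l; the variable names want and rows are illustrative.

```shell
# Work in a scratch directory
cd "$(mktemp -d)" || exit 1

# Sample data: a header plus 10 rows
{ echo header; seq 1 10; } > file.csv

# Number of data rows, excluding the header
rows=$(( $(wc -l < file.csv) - 1 ))

awk -v want=7 -v rows="$rows" '
BEGIN   { srand() }
NR == 1 { print > "big"; print > "small"; next }   # header goes to both files
{
    # Take this row with probability want / rows, then update both counters.
    # When want reaches 0 nothing more is taken; when want equals rows,
    # every remaining row is taken, so "big" ends up with exactly 7 rows.
    if (rand() * rows < want) { print > "big"; want-- }
    else                      { print > "small" }
    rows--
}' file.csv

wc -l big small
```

For the real file, want would be 400000. Because rows are written in input order, no sort step is needed afterwards.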

Split into random subsets with awk
