I need to generate a 50-million-row CSV file with random data: how can I optimize this program?

Asked: 2010-05-04 19:59:18

Tags: rebol

The program below generates random data according to some specs (two columns in this example).

It works on my machine up to a few hundred thousand rows (presumably limited by RAM). I need to scale it to 50 million rows.

How can I optimize the program to write directly to disk? And how can I "cache" the execution of the parse rule, since the same pattern is repeated 50 million times?

Note: to use the program below, just call generate-blocks with the number of rows you want, then save-blocks; the output is db.txt (see the usage sketch right after the listing).

Rebol[]

specs: [
    [3 digits 4 digits 4 letters]
    [2 letters 2 digits]
]

;====================================================================================================================


digits: charset "0123456789"
letters: charset "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
separator: charset ";"

block-letters: [A B C D E F G H I J K L M N O P Q R S T U V W X Y Z]

blocks: copy []

generate-row: func [] [
    foreach spec specs [

        ; the rule reads the spec as <count> digits | letters pairs and
        ; appends that many random characters to the field string 'block
        rule: [
            any [
                set times integer! [
                    'digits (
                        ; <times> random digits 0-9
                        repeat n times [block: rejoin [block (random 10) - 1]]
                    )
                    |
                    'letters (
                        ; <times> random letters, one of the 26 in block-letters
                        repeat n times [block: rejoin [block to-string pick block-letters random 26]]
                    )
                ]
                |
                {"} any separator {"}
            ]
            to end
        ]

        block: copy ""
        parse spec rule
        append blocks block
    ]
]

generate-blocks: func [m] [
  loop m [
    generate-row
  ]
]

quote: func[string][
    rejoin [{"} string {"}]
]

save-blocks: func [file /local target answer] [
    target: to-rebol-file file
    if exists? target [
        answer: ask rejoin ["delete " file "? (Y/N): "]
        if answer = "Y" [
            delete target
        ]
    ]
    foreach [field1 field2] blocks [
        write/lines/append target rejoin [quote field1 ";" quote field2]
    ]
]
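
For reference, a minimal usage sketch of the calls mentioned above (the row count 1000 and the file name "db.txt" are only examples):

generate-blocks 1000
save-blocks "db.txt"

As for the second question (not re-running the parse rule 50 million times), one possible direction, shown here only as a sketch, is to skip parse entirely and walk each spec's <count> <kind> pairs directly; the name generate-row-plain is hypothetical:

generate-row-plain: func [] [
    foreach spec specs [
        block: copy ""
        foreach [times kind] spec [
            repeat n times [
                append block either kind = 'digits [
                    (random 10) - 1                  ; a random digit 0-9
                ][
                    pick block-letters random 26     ; one of the 26 letters
                ]
            ]
        ]
        append blocks block
    ]
]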

1 Answer:

Answer 0 (score: 2):

Use open with the /direct and /lines refinements to write to the file directly, without buffering the contents:

file: open/direct/lines/write %myfile.txt
loop 1000 [
  t: random "abcdefghi"
  append file t
]
close file

This writes 1000 random lines without buffering. You can also prepare a block of lines (say 10,000 at a time) and write that block to the file in one go, which is faster than writing line by line.

file: open/direct/lines/write %myfile.txt
loop 100 [
  b: copy []
  loop 1000 [append b random "abcdef"]
  append file b
]
close file

This is much faster: 100,000 lines take less than a second. Hope this helps.

Note that you can adjust the numbers 100 and 1000 to fit your memory, and use b: make block! 1000 instead of b: copy [] to make it a bit faster still.
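
Putting the answer's batched /direct writing and the make block! suggestion together with the generate-row, quote and blocks definitions from the question gives a rough sketch like the one below; the batch sizes and the helper word lines are assumptions, not part of the original answer:

file: open/direct/lines/write %db.txt
loop 50000 [                              ; 50000 batches of 1000 rows = 50 million rows
    blocks: make block! 2000              ; preallocated; generate-row adds 2 fields per row
    loop 1000 [generate-row]
    lines: make block! 1000
    foreach [field1 field2] blocks [
        append lines rejoin [quote field1 ";" quote field2]
    ]
    append file lines                     ; one write per batch instead of one per row
]
close file

Reassigning blocks for each batch also keeps memory bounded instead of accumulating all 50 million rows before saving.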