I have large tsv files (each > 1 GB) in a directory, and I need to split each one into an 80/20 split. With my limited knowledge of PowerShell I wrote the code below, but it is hellishly slow. I know I could do this with cygwin/bash in milliseconds, but I need to automate the process from a batch file. I believe there is a better and faster solution.
$DataSourceFolder = "D:\Data"
$files = Get-ChildItem $DataSourceFolder -Filter "*.tsv"
foreach ($file in $files)
{
    $outputTrainfile = "$DataSourceFolder\partitions\" + $file.BaseName + "-train.tsv"
    $outputTestfile  = "$DataSourceFolder\partitions\" + $file.BaseName + "-test.tsv"
    $filepath = $file.FullName
    # Get the number of rows in the file
    $sourcelinecount = (Get-Content $filepath | Measure-Object).Count
    # Head (train) gets 80% of the lines, tail (test) gets the remainder
    $headlinecount = [math]::Floor(($sourcelinecount * 80) / 100)
    $taillinecount = $sourcelinecount - $headlinecount
    # Create the output files
    New-Item -ItemType File $outputTrainfile -Force
    New-Item -ItemType File $outputTestfile -Force
    # Write the content to the files
    Get-Content $filepath -TotalCount $headlinecount | Set-Content $outputTrainfile
    Get-Content $filepath -Tail $taillinecount | Set-Content $outputTestfile
}
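For comparison, if a Unix toolchain is available anyway (WSL, Git Bash, or cygwin), the same 80/20 split can be done with one line count plus a single `awk` pass, instead of re-reading the whole file for both `-TotalCount` and `-Tail`. A minimal sketch; the `split_tsv` function and the sample file names are illustrative, not from the original post:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Split a tsv into the first 80% of lines (train) and the rest (test).
# wc counts the lines once; awk then routes each line to the right file
# in a single pass over the source.
split_tsv() {
  local src="$1" train="$2" test="$3"
  local total head
  total=$(wc -l < "$src")
  head=$(( total * 80 / 100 ))
  awk -v n="$head" -v tr="$train" -v te="$test" \
      'NR <= n { print > tr; next } { print > te }' "$src"
}

# Demo on a tiny generated file: 10 lines -> 8 train, 2 test
seq 1 10 > sample.tsv
split_tsv sample.tsv sample-train.tsv sample-test.tsv
```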
Answer (score: 0)
Sorry for posting the answer late; hope it saves someone else the effort.
I used bash.exe from PowerShell to split the files. It is very fast.
Create a bash script, then call it from PowerShell to split the files into the desired partitions.
Bash script (name it, for example, "Partition.sh"):
foldername=$1
filenamePrefix=$2
echo "$foldername"
echo "$filenamePrefix"
for filename in "$foldername/$filenamePrefix"*.tsv
do
    echo "Partitioning $filename"
    # Shuffle the rows so the 80/20 split is random
    shuf "$filename" > tmp
    lines=$(wc -l < tmp)
    echo "Read file successfully"
    # First 80% of the shuffled rows -> train; the remainder -> test
    trainlines=$(echo "$lines*0.8/1" | bc)
    head -n "$trainlines" tmp > "$filename.train.tsv"
    tail -n "$(( lines - trainlines ))" tmp > "$filename.test.tsv"
    rm tmp
done
Call it from PowerShell:
bash.exe /mnt/c/Partition.sh /mnt/c/trainingData/ "FilePrefix"
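As a side note, the `bc` pipeline in Partition.sh can be replaced with plain shell integer arithmetic, which is one process fewer per file and makes it easy to guarantee the two partitions add up to the original line count (compute the test size as the remainder rather than as a separate 20% calculation). A small illustration with a made-up line count:

```shell
#!/usr/bin/env bash
# Integer 80/20 split sizes without bc. Because testlines is the
# remainder, trainlines + testlines always equals lines exactly.
lines=1005
trainlines=$(( lines * 80 / 100 ))   # integer floor of 80%
testlines=$(( lines - trainlines ))  # everything that is left
echo "$trainlines $testlines"
```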