I have large tsv files (each > 1 GB) in a directory, and I need to split each one into an 80/20 split. With my limited knowledge of PowerShell I wrote the code below, but it is hellishly slow. I know I could do this with cygwin/bash in milliseconds, but I need to automate the process from a batch file. I believe there is a better and faster solution.
$DataSourceFolder = "D:\Data"
$files = Get-ChildItem $DataSourceFolder -Filter "*.tsv"
foreach ($file in $files)
{
    $outputTrainfile = "$DataSourceFolder\partitions\" + $file.BaseName + "-train.tsv"
    $outputTestfile  = "$DataSourceFolder\partitions\" + $file.BaseName + "-test.tsv"
    $filepath = $file.FullName
    # Get the number of rows in the file
    $sourcelinecount = (Get-Content $filepath | Measure-Object).Count
    # Head (train) gets 80% of the lines, tail (test) gets the remainder
    $headlinecount = [math]::Floor(($sourcelinecount * 80) / 100)
    $taillinecount = $sourcelinecount - $headlinecount
    # Create the output files
    New-Item -ItemType File $outputTrainfile -Force
    New-Item -ItemType File $outputTestfile -Force
    # Write the content to the files
    Get-Content $filepath -TotalCount $headlinecount | Set-Content $outputTrainfile
    Get-Content $filepath -Tail $taillinecount | Set-Content $outputTestfile
}
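For comparison, if a Unix toolchain is available anyway (WSL, Git Bash, or cygwin), the same 80/20 split can be done with one line count plus a single `awk` pass, instead of re-reading the whole file for both `-TotalCount` and `-Tail`. A minimal sketch; the `split_tsv` function and the sample file names are illustrative, not from the original post:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Split a tsv into the first 80% of lines (train) and the rest (test).
# wc counts the lines once; awk then routes each line to the right file
# in a single pass over the source.
split_tsv() {
  local src="$1" train="$2" test="$3"
  local total head
  total=$(wc -l < "$src")
  head=$(( total * 80 / 100 ))
  awk -v n="$head" -v tr="$train" -v te="$test" \
      'NR <= n { print > tr; next } { print > te }' "$src"
}

# Demo on a tiny generated file: 10 lines -> 8 train, 2 test
seq 1 10 > sample.tsv
split_tsv sample.tsv sample-train.tsv sample-test.tsv
```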
Answer (score: 0)
Sorry for posting the answer late; hope it saves someone else the effort.
I used bash.exe from PowerShell to split the files. It is very fast.
Create a bash script, then call it from PowerShell to split the files into the desired partitions.
Bash script (name it, for example, "Partition.sh"):
foldername=$1
filenamePrefix=$2
echo "$foldername"
echo "$filenamePrefix"
for filename in "$foldername/$filenamePrefix"*.tsv
do
    echo "Partitioning $filename"
    # Shuffle the rows so the 80/20 split is random
    shuf "$filename" > tmp
    lines=$(wc -l < tmp)
    echo "Read file successfully"
    # First 80% of the shuffled rows -> train; the remainder -> test
    trainlines=$(echo "$lines*0.8/1" | bc)
    head -n "$trainlines" tmp > "$filename.train.tsv"
    tail -n "$(( lines - trainlines ))" tmp > "$filename.test.tsv"
    rm tmp
done
Call it from PowerShell:
bash.exe /mnt/c/Partition.sh /mnt/c/trainingData/ "FilePrefix"
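As a side note, the `bc` pipeline in Partition.sh can be replaced with plain shell integer arithmetic, which is one process fewer per file and makes it easy to guarantee the two partitions add up to the original line count (compute the test size as the remainder rather than as a separate 20% calculation). A small illustration with a made-up line count:

```shell
#!/usr/bin/env bash
# Integer 80/20 split sizes without bc. Because testlines is the
# remainder, trainlines + testlines always equals lines exactly.
lines=1005
trainlines=$(( lines * 80 / 100 ))   # integer floor of 80%
testlines=$(( lines - trainlines ))  # everything that is left
echo "$trainlines $testlines"
```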