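Sorting files by keywords, need a more database-like solution

Time: 2017-07-19 16:38:50

Tags: sorting batch-file batch-processing

I'm creating a script that sorts video files into folders by checking each file name for known keywords. As the number of keywords has grown out of control, the script has become very slow, taking several seconds to process each file.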

@echo off    
cd /d d:\videos\shorts
if /i not "%cd%"=="d:\videos\shorts" echo invalid shorts dir. && exit /b

:: auto detect folder name via anchor file
for /r %%i in (*spirit*science*chakras*) do set conspiracies=%%~dpi
if not exist "%conspiracies%" echo conspiracies dir missing. && pause && exit /b
for /r %%i in (*modeselektor*evil*) do set musicvideos=%%~dpi
if not exist "%musicvideos%" echo musicvideos dir missing. && pause && exit /b

for %%s in (*) do set "file=%%~nxs" & set "full=%%s" & call :count
for %%v in (*) do echo can't sort "%%~nv"
exit /b

:count
set oldfile="%file%"
set newfile=%oldfile:&=and%
if not %oldfile%==%newfile% ren "%full%" %newfile%

set count=0
set words= & rem
echo "%~n1" | findstr /i /c:"music" >nul && set words=%words%, music&& set /a count+=1
echo "%~n1" | findstr /i /c:"official video" >nul && set words=%words%, official video&& set /a count+=2
set words=%words:has, =has %
set words=%words: , =%
if not %count%==0 echo "%file%" has "%words%" %count%p for music videos
set musicvideoscount=%count%

set count=0
set words= & rem
echo "%~n1" | findstr /i /c:"misinform" >nul && set words=%words%, misinform&& set /a count+=1
echo "%~n1" | findstr /i /c:"antikythera" >nul && set words=%words%, antikythera&& set /a count+=2
set words=%words:has, =has %
set words=%words: , =%
if not %count%==0 echo "%file%" has "%words%" %count%p for conspiracies
set conspiraciescount=%count%

set wanted=3
set winner=none

:loop
:: count points and set winner (in case of tie lowest in this list wins, sort accordingly)
if %conspiraciescount%==%wanted% set winner=%conspiracies%
if %musicvideoscount%==%wanted% set winner=%musicvideos%
set /a wanted+=1
if not %wanted%==15 goto loop

if not "%winner%"=="none" move "%full%" "%winner%" >nul && echo "%winner%%file%" && echo.

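Note the "weight value" of each keyword. The script totals the points for each category, finds the highest total and moves the file to the folder assigned to that category. It also shows the words it found, and at the end lists the files it could not sort, so that I can add keywords or adjust the weights.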
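For readers who find the batch hard to follow, here is a minimal Python sketch of the scoring and winner selection described above. It is not the asker's script: the names KEYWORDS, score and pick_winner are made up for illustration, and the threshold of 3 and the upper bound of 15 are simply read off the :loop section of the reduced example.

```python
# Hypothetical keyword tables mirroring the reduced example in the question.
KEYWORDS = {
    "musicvideos": {"music": 1, "official video": 2},
    "conspiracies": {"misinform": 1, "antikythera": 2},
}

# Order in which the :loop section tests the categories; a later entry
# overwrites an earlier one when both have the same score.
ORDER = ["conspiracies", "musicvideos"]


def score(name, table):
    """Sum the weights of every keyword found in the name (case-insensitive)."""
    lowered = name.lower()
    return sum(weight for keyword, weight in table.items() if keyword in lowered)


def pick_winner(name, threshold=3, limit=15):
    """Walk wanted = threshold .. limit-1 so that the highest score wins."""
    scores = {category: score(name, KEYWORDS[category]) for category in ORDER}
    winner = None
    for wanted in range(threshold, limit):
        for category in ORDER:
            if scores[category] == wanted:
                winner = category
    return winner
```

A name that scores below the threshold in every category returns None, which corresponds to the "can't sort" list at the end of the batch script.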

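I have cut the number of folders and keywords in this example down to the minimum. The full script has six folders and is 64k in size with all the keywords (and still growing).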

1 answer:

Answer 0 (score: 3):

@ECHO OFF
SETLOCAL
SET "sourcedir=U:\sourcedir"
SET "tempfile=%temp%\somename"
SET "categories=music conspiracies"
REM SET "categories=conspiracies music"
(
 FOR /f "tokens=1,2,*delims=," %%s IN (q45196316.txt) DO (
 FOR /f "delims=" %%a IN (
  'dir /b /a-d "%sourcedir%\*%%u*" 2^>nul'
  ) DO (
   ECHO %%a^|%%s^|%%t
 )
)
)>"%tempfile%"

SET "lastname="

FOR /f "tokens=1,2,*delims=|" %%a IN ('sort "%tempfile%"') DO (
 CALL :resolve %%b %%c "%%a"
)
:: and the last entry...
CALL :resolve dummy 0 

GOTO :EOF

:resolve
IF "%~3" equ "%lastname%" GOTO accum
:: report and reset accumulators
IF NOT DEFINED lastname GOTO RESET
SET "winner="
SET /a maxfound=0
FOR %%v IN (%categories%) DO (
 FOR /f "tokens=1,2delims=$=" %%w IN ('set $%%v') DO CALL :compare %%w %%x
)
IF DEFINED winner ECHO %winner% %lastname:&=and%
:RESET
FOR %%v IN (%categories%) DO SET /a $%%v=0
SET "lastname=%~3"
:accum
SET /a $%1+=%2

GOTO :eof

:compare
IF %2 lss %maxfound% GOTO :EOF 
IF %2 gtr %maxfound% GOTO setwinner
:: equal scores use categories to determine
IF DEFINED winner GOTO :eof
:Setwinner
SET "winner=%1"
SET maxfound=%2
GOTO :eof

You will need to change the setting of sourcedir to suit your circumstances.

I used a file named q45196316.txt containing this category data for my testing.

music,6,music
music,8,Official video
conspiracies,3,misinform
conspiracies,6,antikythera
missing,0,not appearing in this directory

I believe your problem is that executing findstr over and over is time-consuming.

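This approach uses a data file containing lines of category,weight,mask. The categories variable contains the list of categories in priority order (used when scores are equal).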
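To make the data-file layout concrete, here is a small Python sketch that parses category,weight,mask lines. Rule and parse_rules are hypothetical names; splitting on the first two commas only is meant to mirror "tokens=1,2,*" in the batch, where the rest of the line becomes the mask.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    category: str
    weight: int
    mask: str


def parse_rules(text):
    """Parse category,weight,mask lines; the mask is the rest of the line."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        category, weight, mask = line.split(",", 2)
        rules.append(Rule(category, int(weight), mask))
    return rules


# The test data shown in the answer.
SAMPLE = """music,6,music
music,8,Official video
conspiracies,3,misinform
conspiracies,6,antikythera
missing,0,not appearing in this directory
"""
```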

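The data file is read, assigning the category to %%s, the weight to %%t and the mask to %%u, and a directory scan is then performed using the mask. For each match found, a line in the format name|category|weight is echoed to the temporary file. After the first scan, dir appears to be very fast.

The resulting temporary file therefore holds one line per file name and matching keyword, carrying the category and the weight, so a file name that fits more than one category produces multiple entries.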
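The record-generation step can be sketched in Python as below. This is an illustration under assumptions, not a port: build_records is a hypothetical name, the file names are passed in as a list rather than read from a directory, and fnmatchcase on lower-cased strings stands in for the case-insensitive wildcard matching of dir (fnmatch also treats [ and ] specially, which dir does not).

```python
from fnmatch import fnmatchcase


def build_records(filenames, rules):
    """Return name|category|weight lines, one per (file, matching rule)."""
    records = []
    for category, weight, mask in rules:
        pattern = f"*{mask}*".lower()  # same shape as "*%%u*" in the batch
        for name in filenames:
            if fnmatchcase(name.lower(), pattern):
                records.append(f"{name}|{category}|{weight}")
    return records


FILES = [
    "Band - Official Video.mp4",
    "Music of the spheres misinformation.avi",
    "cat.mp4",
]
RULES = [
    ("music", 6, "music"),
    ("music", 8, "Official video"),
    ("conspiracies", 3, "misinform"),
]
```

The outer loop runs once per rule, so the number of directory scans grows with the number of keywords rather than with files times keywords.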

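We then scan a sorted version of that file and resolve the scores.

First, if the file name has changed, we can report on the previous file name. This is done by comparing the values in the $categoryname variables. Since these are scanned in the order given by %categories%, the first category in the list is chosen when scores are equal. The scores are then reset and lastname is initialised to the new file name.

Then we accumulate the score into $categoryname.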
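The sort, accumulate and compare steps can be sketched in Python as follows. resolve is a hypothetical name, and the inner comparison is written to follow the :compare routine as closely as I can read it: a strictly higher total always wins, and an equal total only wins if no winner has been chosen yet.

```python
from itertools import groupby


def resolve(records, categories):
    """Map each file name to its winning category (or None)."""
    results = {}
    parsed = (line.split("|")[:3] for line in sorted(records))
    # Sorting puts all lines for one name next to each other.
    for name, group in groupby(parsed, key=lambda parts: parts[0]):
        totals = {}
        for _, category, weight in group:
            totals[category] = totals.get(category, 0) + int(weight)
        winner, maxfound = None, 0
        for category in categories:  # priority order
            total = totals.get(category, 0)
            if total > maxfound or (total == maxfound and winner is None):
                winner, maxfound = category, total
        results[name] = winner
    return results


RECORDS = [
    "b.mp4|conspiracies|6",
    "a.mp4|music|6",
    "a.mp4|conspiracies|3",
    "b.mp4|music|6",
    "a.mp4|conspiracies|6",
]
```

Swapping the order of the categories list changes only the outcome of ties, which is the role the answer gives to the categories variable.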

So - I believe this will be somewhat faster.

Revision

@ECHO OFF
SETLOCAL ENABLEDELAYEDEXPANSION
SET "sourcedir=U:\sourcedir"
SET "tempfile=%temp%\somename"
SET "categories="rock music" music conspiracies"
REM SET "categories=conspiracies music"
:: set up sorting categories
SET "sortingcategories="
FOR %%a IN (%categories%) DO SET "sortingcategories=!sortingcategories!,%%~a"
SET "sortingcategories=%sortingcategories: =_%"
:: Create "tempfile" containing lines of name|sortingcategory|weight
(
 FOR /f "tokens=1,2,*delims=," %%s IN (q45196316.txt) DO (
 SET "sortingcategory=%%s"
 SET "sortingcategory=!sortingcategory: =_!"
 FOR /f "delims=" %%a IN (
  'dir /b /a-d "%sourcedir%\*%%u*" 2^>nul'
  ) DO (
   ECHO %%a^|!sortingcategory!^|%%t^|%%s^|%%u
 )
)
)>"%tempfile%"

SET "lastname="

SORT "%tempfile%">"%tempfile%.s"

FOR /f "usebackqtokens=1,2,3delims=|" %%a IN ("%tempfile%.s") DO (

 CALL :resolve %%b %%c "%%a"
)
:: and the last entry...
CALL :resolve dummy 0 

GOTO :EOF
:: resolve by totalling weights (%2) in sortingcategories (%1) 
:: for each name (%3)
:resolve
IF "%~3" equ "%lastname%" GOTO accum
:: report and reset accumulators
IF NOT DEFINED lastname GOTO RESET
SET "winner=none"
SET /a maxfound=0
FOR %%v IN (%sortingcategories%) DO (
 FOR /f "tokens=1,2delims=$=" %%w IN ('set $%%v') DO IF %%x gtr !maxfound! (SET "winner=%%v"&SET /a maxfound=%%x)
)
ECHO %winner:_= % %lastname:&=and%
:RESET
FOR %%v IN (%sortingcategories%) DO SET /a $%%v=0
SET "lastname=%~3"
:accum
SET /a $%1+=%2

GOTO :eof

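I've added some significant comments.

You can now include spaces in category names - you need to quote such names in the set categories... statement (for reporting purposes).

sortingcategories is derived automatically - it is used only for sorting and is simply the categories list with any spaces in the names replaced by underscores.

When the temporary file is created, the category is converted to its underscore form (sortingcategory); when the final placement is resolved, the underscores are removed again to give back the category name.

Negative weights should now be handled appropriately.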
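A Python sketch of how I read the revised :resolve step: the winner starts as "none", and a category only takes over when its total is strictly greater than the best so far, so totals of zero or below never win. The function names are hypothetical. One caveat worth noting as an assumption on my part: a category whose real name contains an underscore would be reported with a space.

```python
def sorting_name(category):
    """Spaces become underscores for the internal name."""
    return category.replace(" ", "_")


def display_name(sorting_category):
    """Underscores become spaces again for reporting."""
    return sorting_category.replace("_", " ")


def pick(totals, categories):
    """Return the reported winner for one file, or "none"."""
    winner, maxfound = "none", 0
    for category in categories:  # priority order
        key = sorting_name(category)
        total = totals.get(key, 0)
        if total > maxfound:  # strict, so ties keep the earlier category
            winner, maxfound = key, total
    return display_name(winner)


CATEGORIES = ["rock music", "music", "conspiracies"]
```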

Further revision for "don't append *"

 FOR /f "tokens=1-5delims=," %%s IN (q45196316.txt) DO (
 SET "sortingcategory=%%s"
 SET "sortingcategory=!sortingcategory: =_!"
 FOR %%z IN ("!sortingcategory!") DO (
  SETLOCAL disabledelayedexpansion
  FOR /f "delims=" %%a IN (
   'dir /b /a-d "%sourcedir%\%%~v%%u%%~w" 2^>nul'

Add 2 extra columns to the q45196316 file

music,6,music,*,*
music,8,Official video,"",*
conspiracies,3,misinform,*,*
conspiracies,6,kythera,*anti,*
missing,0,not appearing in this directory,*,*
rock music,2,metal,*,*
conspiracies,-5,negative,*,*

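The for /f ... %%s now also produces %%v and %%w, which contain the last two columns (tokens is now 1-5 as well).

These are applied as a prefix and a suffix to %%u in the dir command. Note that "" should be used to mean nothing, because two consecutive , characters are parsed as a single separator. The ~ in %%~v and %%~w means "remove the quotes".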
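To illustrate the effect of the two extra columns, here is a hedged Python sketch that builds the final mask as prefix + keyword + suffix. load_masks is a hypothetical name. Unlike FOR /F, Python's csv module preserves empty fields, so the "" placeholder is not strictly needed here, but it is parsed to an empty string all the same.

```python
import csv
import io
from fnmatch import fnmatchcase


def load_masks(text):
    """Return (category, weight, mask) with prefix and suffix applied."""
    masks = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        category, weight, keyword, prefix, suffix = row
        masks.append((category, int(weight), f"{prefix}{keyword}{suffix}"))
    return masks


# Three of the rows from the extended data file in the answer.
SAMPLE = '''music,6,music,*,*
music,8,Official video,"",*
conspiracies,6,kythera,*anti,*
'''
```

With an empty prefix the mask becomes Official video*, so only names that start with the keyword match; for example, fnmatchcase("official video - band.mp4", "official video*") is true, while a name with the keyword in the middle is not matched.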