Cannot parse CSV with delimiters because of extra commas

Date: 2019-06-04 14:04:22

Tags: csv parsing batch-file

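I am currently trying to parse a CSV file in batch, but I cannot because of the extra commas inside the leading "------,----" field. In addition, some of the CSV files do not contain this field, so I cannot simply shift the tokens. Here is a sample of the CSV file: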

Datasheets,Image,Digi-Key Part Number,Manufacturer Part Number,Manufacturer,Description,Quantity Available,Factory Stock,Unit Price (USD),@ qty,Minimum Quantity,"Packaging","Series","Part Status","Capacitance","Tolerance","Voltage - Rated","Dielectric Material","Number of Capacitors","Circuit Type","Temperature Coefficient","Ratings","Mounting Type","Package / Case","Size / Dimension","Height - Seated (Max)"
"//media.digikey.com/pdf/Data%20Sheets/Panasonic%20Capacitors%20PDFs/ECJ-R,ECJ-T_4-Array.pdf",//media.digikey.com/photos/Panasonic%20Photos/ECJ-R%201206%20SERIES.jpg,P10582TR-ND,ECJ-RVC1H150K,Panasonic Electronic Components,CAP ARRAY 15PF 50V NP0 1206,0,0,"Obsolete","0","4000","Tape & Reel (TR)","ECJ-R","Obsolete","15pF","±10%","50V","Ceramic","4","Isolated","C0G, NP0","-","Surface Mount","1206 (3216 Metric)","0.126"" L x 0.063"" W (3.20mm x 1.60mm)","0.037"" (0.95mm)"
"//media.digikey.com/pdf/Data%20Sheets/Panasonic%20Capacitors%20PDFs/ECJ-R,ECJ-T_4-Array.pdf",//media.digikey.com/photos/Panasonic%20Photos/ECJ-R%201206%20SERIES.jpg,P10582CT-ND,ECJ-RVC1H150K,Panasonic Electronic Components,CAP ARRAY 15PF 50V NP0 1206,1801,0,"0.45000","0","1","Cut Tape (CT)","ECJ-R","Obsolete","15pF","±10%","50V","Ceramic","4","Isolated","C0G, NP0","-","Surface Mount","1206 (3216 Metric)","0.126"" L x 0.063"" W (3.20mm x 1.60mm)","0.037"" (0.95mm)"
"//media.digikey.com/pdf/Data%20Sheets/Panasonic%20Capacitors%20PDFs/ECJ-R,ECJ-T_4-Array.pdf",//media.digikey.com/photos/Panasonic%20Photos/ECJ-R%201206%20SERIES.jpg,P10582DKR-ND,ECJ-RVC1H150K,Panasonic Electronic Components,CAP ARRAY 15PF 50V NP0 1206,1801,0,"Digi-Reel","0","1","Digi-Reel®","ECJ-R","Obsolete","15pF","±10%","50V","Ceramic","4","Isolated","C0G, NP0","-","Surface Mount","1206 (3216 Metric)","0.126"" L x 0.063"" W (3.20mm x 1.60mm)","0.037"" (0.95mm)"
"//media.digikey.com/pdf/Data%20Sheets/Panasonic%20Capacitors%20PDFs/ECJ-R,ECJ-T_4-Array.pdf",//media.digikey.com/photos/Panasonic%20Photos/ECJ-R%201206%20SERIES.jpg,P10580TR-ND,ECJ-RVC1H100F,Panasonic Electronic Components,CAP ARRAY 10PF 50V NP0 1206,0,0,"Obsolete","0","4000","Tape & Reel (TR)","ECJ-R","Obsolete","10pF","±1pF","50V","Ceramic","4","Isolated","C0G, NP0","-","Surface Mount","1206 (3216 Metric)","0.126"" L x 0.063"" W (3.20mm x 1.60mm)","0.037"" (0.95mm)"
"//media.digikey.com/pdf/Data%20Sheets/Panasonic%20Capacitors%20PDFs/ECJ-R,ECJ-T_4-Array.pdf",//media.digikey.com/photos/Panasonic%20Photos/ECJ-R%201206%20SERIES.jpg,P10580CT-ND,ECJ-RVC1H100F,Panasonic Electronic Components,CAP ARRAY 10PF 50V NP0 1206,0,0,"Obsolete","0","1","Cut Tape (CT)","ECJ-R","Obsolete","10pF","±1pF","50V","Ceramic","4","Isolated","C0G, NP0","-","Surface Mount","1206 (3216 Metric)","0.126"" L x 0.063"" W (3.20mm x 1.60mm)","0.037"" (0.95mm)"

Here is a sample of my code:

FOR /F "skip=1 tokens=3-6 delims=, " %%A IN (File.csv) DO (

ECHO %%A,%%B,%%D,%%C
)

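2 answers:

Answer 0: (score: 2)

This is an interesting problem. A few weeks ago I solved a very similar problem where a FOR /F needed to parse a CSV with commas in the values. My answer included a pure batch solution. In that answer I also explain many of the issues that make pure batch CSV parsing difficult.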

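I have refactored that technique into the reusable :processLine and :decodeToken routines below. The routines require delayed expansion to be enabled before the main processing loop. The technique is designed to put each FOR /F token value into a similarly named environment variable. Enclosing quotes are removed, and any doubled "" inside a value is reduced to a single ".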

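The outer loop at the top calls the routines, doubles every ", reorders the fields, and encloses each field in quotes. The outer loop can easily be restructured to do whatever is needed. The :processLine and :decodeToken routines at the bottom never need to change.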

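The code below is about 5 times faster than aschipfl's answer. The output is identical, except that my code encloses every field in quotes, even those that do not need them. That is perfectly acceptable CSV.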

@echo off
setlocal enableDelayedExpansion
for /f usebackq^ delims^=^ eol^= %%A in ("test.csv") do (
  call :processLine A ln
  for /f "tokens=3-6 delims=," %%A in ("!ln!") do (
    for %%v in (A B C D) do call :decodeToken %%v
    echo "!A:"=""!","!B:"=""!","!D:"=""!","!C:"=""!"
  )
)
exit /b


::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: The following routines will work for any CSV as long as no field contains \n
:: and no line approaches the 8191 character limit.

:processLine  forVarCharIn  envVarOut
::
:: Prepares CSV line stored in FOR variable %%forVarCharIn to be safely parsed by
:: FOR /F with delayed expansion enabled. The result is stored in environment
:: variable envVarOut.
:: 
:: All "" become "
:: All @ become @a
:: All quoted , become @c
:: All ^ become ^^
:: All ! become ^!
:: All fields are enclosed within quotes
:: 
setlocal
setlocal disableDelayedExpansion
for %%. in (.) do set "ln=%%%1"
set "ln=,%ln:"=""%,"
set "ln=%ln:^=^^^^%"
set "ln=%ln:&=^&%"
set "ln=%ln:|=^|%"
set "ln=%ln:<=^<%"
set "ln=%ln:>=^>%"
set "ln=%ln:!=^^!%"
set "ln=%ln:,=^,^,%"
set ^"ln=%ln:""="%^"
set "ln=%ln:"=""%"
set "ln=%ln:@=@a%"
set "ln=%ln:^,^,=@c%"
endlocal & set "ln=%ln:""="%" !
set "ln=!ln:,,"=,,!"
set "ln=!ln:",,=,,!"
set "ln=!ln:~2,-2!"
set "ln=!ln:^=^^^^!"
endlocal&set "%2=%ln:!=^^^!%"
set "%2=!%2:""="!"
set "%2="!%2:,,=","!"" !
exit /b

:decodeToken  V
::
:: Decodes field in %%V and stores in environment variable V
:: All @c become ,
:: All @a become @
::
for %%. in (.) do set "%1=%%~%1" !
if defined %1 (
  set "%1=!%1:@c=,!"
  set "%1=!%1:@a=@!"
)
exit /b
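The substitution scheme used by the routines above can be tried in isolation. The following is only a minimal sketch: the variable name `field` and the sample value are made up for illustration and are not part of the answer's code. It decodes an already encoded value in the same order as :decodeToken, first @c to a comma, then @a to @.

```batch
@echo off
setlocal enableDelayedExpansion
rem Hypothetical field as it would look after encoding. The original
rem value is:   ECJ-R,ECJ-T @ 50V
set "field=ECJ-R@cECJ-T @a 50V"
rem Decode @c first, then @a, mirroring the order used in :decodeToken.
set "field=!field:@c=,!"
set "field=!field:@a=@!"
echo !field!
```

The order matters. A literal @c in the data is stored as @ac once every @ has become @a, so it no longer contains the sequence @c and cannot be mistaken for an encoded comma. Decoding @a last then restores it to @c.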

If you are certain that no value contains a " literal, the top loop can be simplified to:

@echo off
setlocal enableDelayedExpansion
for /f usebackq^ delims^=^ eol^= %%A in ("test.csv") do (
  call :processLine A ln
  for /f "tokens=3-6 delims=," %%A in ("!ln!") do (
    for %%v in (A B C D) do call :decodeToken %%v
    echo "!A!","!B!","!D!","!C!"
  )
)
exit /b

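Better yet, since none of the columns you want to keep contain @, a comma, or ", you can greatly simplify the top loop without using {{1}}, which doubles performance (10 times faster than aschipfl's answer):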

:parseToken

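These routines will work with any CSV as long as no CSV value contains a newline and no processed line exceeds the 8191-character limit imposed by batch.

Furthermore, all simple FOR /F techniques are limited to parsing at most 32 tokens. On DosTips I demonstrate how to parse and process hundreds of CSV fields. It requires some complex batch coding, but those routines are again reusable, so the outer loop remains manageable.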
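One common way to reach fields beyond what a single FOR /F can address is to capture the unparsed remainder of the line with `*` and pass it to a second FOR /F. The sketch below only illustrates that general idea with a made-up six-field line. It is not the DosTips technique mentioned above, and it assumes the fields are already safe to split on commas, for example after :processLine.

```batch
@echo off
rem Hypothetical six-field line. Real input would come from a file.
set "line=a,b,c,d,e,f"
rem The outer loop takes three tokens. The trailing asterisk stores the
rem unparsed remainder of the line in the next variable, here D.
rem The inner loop parses that remainder with fresh variables E, F and G.
for /f "tokens=1-3* delims=," %%A in ("%line%") do (
  for /f "tokens=1-3 delims=," %%E in ("%%D") do (
    echo first=%%A fourth=%%E sixth=%%G
  )
)
```

This should print `first=a fourth=d sixth=f`. Note that FOR /F treats consecutive delimiters as one, so empty fields would shift the token positions unless every field is quoted first.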

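Answer 1: (score: 1)

Here is a pure batch-file approach that allows extracting and rearranging specified columns of a CSV file. The column indexes and their order need to be defined in the constant _LIST at the top of the script: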

@echo off
setlocal EnableExtensions DisableDelayedExpansion

rem // Define constants here:
set "_FILE=%~1"     & rem // (input CSV file; `%~1` is first argument)
set "_LIST=3 4 6 5" & rem // (list of one-based column indexes to return)

rem // Define temporary replacements into pseudo-array `$REPL[]`:
call :SUBSTARR $REPL

rem // Read input CSV file line by line:
for /F "delims=" %%L in ('findstr /N "^" "%_FILE%"') do (
    set "LINE=%%L"
    set /A "INUM=0, LNUM=LINE"
    setlocal EnableDelayedExpansion
    set "LINE=!LINE:*:=!"
    rem // Temporarily substitute standard token delimiters but `,`:
    if defined LINE set "LINE=!LINE:\=\b!"
    call :REPLCHAR LINE LINE "^!" "\m"
    for /F "tokens=2* delims=[=]" %%M in ('set $REPL') do (
        if "%%N" == "" (
            call :REPLCHAR LINE LINE "=" "%%M"
        ) else if "%%N" == "*" (
            call :REPLCHAR LINE LINE "*" "%%M"
        ) else (
            if defined LINE set "LINE=!LINE:%%N=%%M!"
        )
    )
    rem // Split line (row) into comma-separated items (fields, cells):
    for %%I in ('!LINE:^,^='^,'!') do (
        endlocal
        set /A "INUM+=1"
        set "ITEM=%%I"
        setlocal EnableDelayedExpansion
        set "ITEM=!ITEM:','=,!"
        for /F "delims=" %%J in ("$ITEM[!INUM!]=!ITEM:~1,-1!") do (
            endlocal & set "%%J"
            setlocal EnableDelayedExpansion
        )
    )
    rem // Rebuild line (row) as per specified list of column indexes:
    set "LINE=," & for %%I in (%_LIST%) do (
        if %%I gtr 0 if %%I leq !INUM! (
            set "LINE=!LINE!!$ITEM[%%I]!,"
        ) else set "LINE=!LINE!,"
    )
    rem // Revert substitution of standard token delimiters but `,`:
    for /F "tokens=2* delims=[=]" %%M in ('set $REPL') do (
        if "%%N" == "" (
            set "LINE=!LINE:%%M==!"
        ) else (
            set "LINE=!LINE:%%M=%%N!"
        )
    )
    call :REPLCHAR LINE LINE "\m" "^!"
    set "LINE=!LINE:\b=\!"
    rem // Return modified line (row):
    >&2 < nul set /P ="!LNUM!:"
    echo(!LINE:~1^,-1!
    endlocal
)

endlocal
exit /B


:NONPRINT
    rem // Obtain several non-printable characters:
    for /F "tokens=1-8 delims=#" %%S in ('
        forfiles /P "%~dp0." /M "%~nx0" /C ^
            "cmd /C echo/0x08#0x09#0x0B#0x0C#0x1A#0x1B#0x7F#0xFF"
    ') do (
        rem // Get back-space, horizontal & vertical tabulators and form-feed:
        set "_BS=%%S" & set "_HT=%%T" & set "_VT=%%U" & set "_FF=%%V"
        rem // Get substitute (end-of-file), escape, delete and fixed space:
        set "_SS=%%W" & set "_ES=%%X" & set "_DE=%%Y" & set "_XX=%%Z"
    )
    exit /B


:SUBSTARR  <rtn_array>
    rem // Obtain non-printable token delimiters:
    call :NONPRINT
    rem // Define substitutions by a pseudo-array:
    for %%R in (
        "[\i]=;"
        "[\e]=="
        "[\s]= "
        "[\t]=%_HT%"
        "[\v]=%_VT%"
        "[\f]=%_FF%"
        "[\x]=%_XX%"
    ) do set "%~1%%~R"
    rem // Define wildcards as substitutions too:
    set "%~1[\a]=*"
    set "%~1[\q]=?"
    set "%~1[\l]=<"
    set "%~1[\g]=>"
    rem set "%~1[\m]=!"
    rem set "%~1[\b]=\"
    rem set "%~1[\c]=,"
    exit /B


:LENGTH  <rtn_length>  <ref_string>
    rem // Determine length of a string:
    setlocal EnableDelayedExpansion
    set "STR=!%~2!"
    if not defined STR (set /A "LEN=0") else (set /A "LEN=1")
    for %%L in (4096 2048 1024 512 256 128 64 32 16 8 4 2 1) do (
        if defined STR (
            set "INT=!STR:~%%L!"
            if not "!INT!" == "" set /A "LEN+=%%L" & set "STR=!INT!"
        )
    )
    endlocal & set "%~1=%LEN%"
    exit /B


:REPLCHAR  <rtn_string>  <ref_string>  <val_char>  <val_replace>
    rem // Replace given character in a string by another string:
    setlocal
    set "DXF=!"
    setlocal DisableDelayedExpansion
    set "CHR=%~3"
    set "RPL=%~4"
    setlocal EnableDelayedExpansion
    set "STR=!%~2!"
    if defined CHR (
        call :LENGTH LEN STR
        call :LENGTH LCH CHR
        set /A "LEN-=1" & for /L %%P in (!LEN!,-1,0) do (
            for %%O in (!LCH!) do (
                if "!STR:~%%P,%%O!" == "!CHR!" (
                    set /A "INC=%%P+%%O" & for %%Q in (!INC!) do (
                        set "STR=!STR:~,%%P!!RPL!!STR:~%%Q!"
                    )
                )
            )
        )
    )
    if not defined DXF if defined STR set "STR=!STR:"=""!"
    if not defined DXF if defined STR set "STR=!STR:^=^^^^!"
    if not defined DXF if defined STR set "STR=%STR:!=^^^!%" !
    if not defined DXF if defined STR set "STR=!STR:""="!"
    for /F "delims=" %%E in (^""!STR!"^") do (
        endlocal & endlocal & endlocal & set "%~1=%%~E" !
    )
    exit /B

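The complicated part is handling unquoted and quoted delimiters (,) correctly, which explains the size of this script.

Given that the script is named reconstruct-csv.bat and the input CSV file is named File.csv, run it with the following command line: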
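The splitting step in the script relies on how a plain FOR loop (without /F) tokenizes its set: commas separate items, but a double-quoted item is kept whole. A minimal sketch with made-up values shows this behavior on its own:

```batch
@echo off
rem A plain FOR splits its set on commas, but a double-quoted item stays whole.
for %%I in ("a,b",c,"d e") do echo [%%~I]
```

This prints `[a,b]`, `[c]` and `[d e]` on separate lines. A plain FOR also splits on spaces, semicolons and equals signs and expands the wildcards * and ?, which appears to be why the script temporarily substitutes those characters before splitting each line.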

reconstruct-csv.bat "File.csv"

To write the output to another CSV file, say File_NEW.csv, instead of displaying it, use this:

reconstruct-csv.bat "File.csv" > "File_NEW.csv"