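CSV formatting - stripping qualifiers from specific fields

Time: 2015-03-05 16:47:55

Tags: csv export-to-csv

I apologize if this has been asked before, but I couldn't find anything similar.

I'm receiving CSV output that uses " as a text qualifier around every field. I'm looking for an elegant way to reformat these files so that only specific (alphanumeric) fields keep the qualifiers.

An example of what I receive: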

"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","A","10/16/14",3,"1","Recruit-Navy,XL#28-75","13.25","13.25"

The output I want looks like this:

"TRI-MOUNTAIN/MOUNTAI","F258273",41016053,"A",10/16/14,3,1,"Recruit-Navy,XL#28-75",13.25,13.25

Any suggestions or help would be greatly appreciated!

As requested, here are the first five lines of a sample file:

"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","","10/16/14","","1","Recruit-Navy,XL#28-75","13.25","13.25"
"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","","10/16/14","","1","High Peak-Navy,XL#21-18","36.75","36.75"
"TRI-MOUNTAIN/MOUNTAI","F257186","Z1023384","","10/15/14","","1","Patriot-Red,L#26-35","25.50","25.50"
"TRI-MOUNTAIN/MOUNTAI","F260780","Z1023658","","10/20/14","","1","Exeter-Red/Gray,S#23-52","19.75","19.75"
"TRI-MOUNTAIN/MOUNTAI","F260780","Z1023658","","10/20/14","","1","Exeter-White/Gray,XL#23-56","19.75","19.75"

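Note that this is only a sample; not all files will be for Tri-Mountain.

2 Answers:

Answer 0 (score: 0)

The difficulty here is separating quoted, comma-delimited fields when the fields themselves contain embedded commas (for example "Recruit-Navy,XL#28-75"). There are a number of ways to approach this from the shell (while read, awk, etc.), but most of them end up tripping over the embedded commas.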
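To make the embedded-comma problem concrete, here is a minimal bash sketch using the sample line from the question. It splits on every comma, which is what a naive IFS-based approach does; the array name naive is only illustrative.

```shell
#!/bin/bash
# Naive split on every comma: the quoted field "Recruit-Navy,XL#28-75"
# is cut in two, so the field count is off by one.
line='"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","A","10/16/14",3,"1","Recruit-Navy,XL#28-75","13.25","13.25"'

IFS=',' read -r -a naive <<< "$line"

echo "fields found: ${#naive[@]}"   # prints 11, although the record has 10 fields
echo "field 8: ${naive[7]}"         # "Recruit-Navy   (closing quote is missing)
echo "field 9: ${naive[8]}"         # XL#28-75"       (opening quote is missing)
```

Any approach that is going to work has to know whether a given comma sits inside or outside a pair of quotes.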

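One approach that does work is brute-force, character-by-character parsing of the line (below). It is not an elegant solution, but it will get you started. An alternative to a shell script is a compiled language such as C, where character handling is more robust. If you have any questions, leave a comment.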

#!/bin/bash

declare -a arr
declare -i ct=0

## fill array with separated fields (preserving commas inside fields)
#  A comma is treated as a separator only when a double quote sits directly
#  before or after it. Each field is emitted on its own line and read back
#  with mapfile (bash 4+), so fields that contain spaces stay intact.
line='"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","A","10/16/14",3,"1","Recruit-Navy,XL#28-75","13.25","13.25"'
mapfile -t arr < <(
    for ((i=0; i < ${#line}; i++)); do
        if [[ "${line:i:1}" == ',' ]]; then
            if [[ "${line:i+1:1}" == '"' || "${line:i-1:1}" == '"' ]]; then
                printf "\n"
            else
                printf "%s" "${line:i:1}"
            fi
        else
            printf "%s" "${line:i:1}"
        fi
    done
    printf "\n"
)

## remove quotes from numeric fields (first character after the quote is a digit)
for i in "${arr[@]}"; do 
    if [[ "${i:0:1}" == '"' ]] && [[ ${i:1:1} == [0123456789] ]]; then
        arr[$ct]="${i//\"/}"
    else
        arr[$ct]="$i"
    fi
    if test "$ct" -eq 0 ; then
        printf "%s" "${arr[ct]}"
    else
        printf ",%s" "${arr[ct]}"
    fi
    ((ct++))
done

printf "\n"

exit 0

Output

$ bash sepquoted.sh
"TRI-MOUNTAIN/MOUNTAI","F258273",41016053,"A",10/16/14,3,1,"Recruit-Navy,XL#28-75",13.25,13.25

Original

"TRI-MOUNTAIN/MOUNTAI","F258273","41016053","A","10/16/14",3,"1","Recruit-Navy,XL#28-75","13.25","13.25"

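As a variation on the character-by-character idea, the sketch below tracks whether the parser is currently inside quotes instead of looking at the characters next to each comma. This is not the code from the answer above. The function names (split_csv_line, strip_numeric_quotes, reformat_line) are hypothetical, and the rule used to decide which fields lose their quotes (any quoted field that contains no letters) is an assumption about what "alphanumeric" means here.

```shell
#!/bin/bash
# Split one CSV line on commas that are outside double quotes.
# Fills the global array "fields"; quotes are kept as part of each field.
split_csv_line() {
    local line=$1 field='' ch
    local -i i in_quotes=0
    fields=()
    for ((i = 0; i < ${#line}; i++)); do
        ch=${line:i:1}
        if [[ $ch == '"' ]]; then
            in_quotes=$((1 - in_quotes))   # toggle the quoted state
            field+=$ch
        elif [[ $ch == ',' ]] && ((in_quotes == 0)); then
            fields+=("$field")             # an unquoted comma ends the field
            field=''
        else
            field+=$ch
        fi
    done
    fields+=("$field")                     # the last field has no trailing comma
}

# Assumption: a field loses its quotes only if it is quoted and has no letters.
strip_numeric_quotes() {
    local f=$1
    if [[ $f == \"*\" && $f != *[[:alpha:]]* ]]; then
        f=${f#\"}
        f=${f%\"}
    fi
    printf '%s' "$f"
}

# Reformat a single line and print it.
reformat_line() {
    local out='' f
    split_csv_line "$1"
    for f in "${fields[@]}"; do
        out+="$(strip_numeric_quotes "$f"),"
    done
    printf '%s\n' "${out%,}"               # drop the one extra trailing comma
}

reformat_line '"TRI-MOUNTAIN/MOUNTAI","F260780","Z1023658","","10/20/14","","1","Exeter-White/Gray,XL#23-56","19.75","19.75"'
```

For that sample line this prints "TRI-MOUNTAIN/MOUNTAI","F260780","Z1023658",,10/20/14,,1,"Exeter-White/Gray,XL#23-56",19.75,19.75. Note that empty "" fields come out as empty unquoted fields under this rule. A whole file could be processed with a while IFS= read -r loop that calls reformat_line on each line.
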
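Answer 1 (score: 0)

Since you didn't specify an OS or language, here is a PowerShell version.

Because your CSV files are non-standard, I dropped my earlier attempt that used Import-CSV and switched to raw file processing. It should also be noticeably faster.

The regex used to split the CSV comes from this question: How to split a string by comma ignoring comma in double quotes. Its lookahead accepts a comma as a separator only when the double quotes in the rest of the line are balanced, so commas inside a quoted field are skipped.

Save this script as StripQuotes.ps1. It accepts the following arguments:

  • InPath: folder to read the CSVs from. If not specified, the current directory is used.
  • OutPath: folder to save the processed CSVs to. It will be created if it doesn't exist.
  • Encoding: if not specified, the script uses the system's current ANSI code page to read the files. You can list the other valid encodings for your system in the PowerShell console like this: [System.Text.Encoding]::GetEncodings()
  • Verbose: the script will tell you what's going on via Write-Verbose messages.

Example (run from the PowerShell console).

Process all CSVs in the folder C:\CSVs_are_here, save the processed CSVs to the folder C:\Processed_CSVs, and be verbose: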

.\StripQuotes.ps1 -InPath 'C:\CSVs_are_here' -OutPath 'C:\Processed_CSVs' -Verbose

The StripQuotes.ps1 script:

Param
(
    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            throw "Input folder doesn't exist: $_"
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$InPath = (Get-Location -PSProvider FileSystem).Path,

    [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
    [ValidateScript({
        if(!(Test-Path -LiteralPath $_ -PathType Container))
        {
            try
            {
                New-Item -ItemType Directory -Path $_ -Force
            }
            catch
            {
                throw "Can't create output folder: $_"
            }
        }
        $true
    })]
    [ValidateNotNullOrEmpty()]
    [string]$OutPath,

    [Parameter(ValueFromPipelineByPropertyName = $true)]
    [string]$Encoding = 'Default'
)


if($Encoding -eq 'Default')
{
    # Set default encoding
    $FileEncoding = [System.Text.Encoding]::Default
}
else
{
    # Try to set user-specified encoding
    try
    {
        $FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
    }
    catch
    {
        throw "Not valid encoding: $Encoding"
    }
}

$DQuotes = '"'
$Separator = ','
# https://stackoverflow.com/questions/15927291/how-to-split-a-string-by-comma-ignoring-comma-in-double-quotes
$SplitRegex = "$Separator(?=(?:[^$DQuotes]|$DQuotes[^$DQuotes]*$DQuotes)*$)"
# Matches a single code point in the category "letter".
$AlphaNumRegex = '\p{L}'

Write-Verbose "Input folder: $InPath"
Write-Verbose "Output folder: $OutPath"

# Iterate over each CSV file in the $InPath
Get-ChildItem -LiteralPath $InPath -Filter '*.csv' |
    ForEach-Object {
        Write-Verbose "Current file: $($_.FullName)"
        $InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
            $_.FullName,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamReader'

        $OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
            (Join-Path -Path $OutPath -ChildPath $_.Name),
            $false,
            $FileEncoding
        ) -ErrorAction Stop
        Write-Verbose 'Created new StreamWriter'

        Write-Verbose 'Processing file...'
        while(($line = $InFile.ReadLine()) -ne $null)
        {
            $tmp = $line -split $SplitRegex |
                        ForEach-Object {
                            # Strip double quotes, if any
                            $item = $_.Trim($DQuotes)

                            if($_ -match $AlphaNumRegex)
                            {
                                # If field has at least one letter - wrap in quotes
                                $DQuotes + $item + $DQuotes
                            }
                            else
                            {
                                # Else, pass it as is
                                $item
                            }
                        }
            # Write line to the new CSV file
            $OutFile.WriteLine($tmp -join $Separator)
        }

        Write-Verbose "Finished processing file: $($_.FullName)"
        Write-Verbose "Processed file is saved as: $($OutFile.BaseStream.Name)"

        # Close open files and cleanup objects
        $OutFile.Flush()
        $OutFile.Close()
        $OutFile.Dispose()

        $InFile.Close()
        $InFile.Dispose()
    }