Question

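I am trying to read a CSV file into a DataFrame using readtable(). The CSV file has an unfortunate quirk: if the last x columns of a given row are empty, the row simply ends instead of containing that number of commas. For example, I may have: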

Col1,Col2,Col3,Col4
item1,item2,,item4
item5

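Note that the third line contains only one entry. Ideally, I would like readtable to fill the values of Col2, Col3, and Col4 with NA, NA, and NA; however, because the commas (and therefore the empty strings) are missing, readtable() simply treats this as a row that does not match the number of columns. If I run readtable() in Julia on the sample CSV above, I get the error "Saw 2 rows, 2 columns and 5 fields, * Line 1 has 6 columns". If I add 3 commas after item5, it works fine.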

Is there a workaround, or do I have to fix the CSV file?

Answer 1

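If the CSV parsing does not need much quoting logic, it is easy to write a special-purpose parser that handles the missing-columns case. Something like this: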

using DataFrames, DataStructures  # OrderedDict is provided by DataStructures

function bespokeread(s)
  # The first line holds the column names
  headers = split(strip(readline(s)),',')
  ncols = length(headers)
  data = [String[] for i=1:ncols]
  while !eof(s)
    newline = split(strip(readline(s)),',')
    # Pad short rows with empty strings so every row has ncols fields
    length(newline)<ncols && append!(newline,["" for i=1:ncols-length(newline)])
    for i=1:ncols
      push!(data[i],newline[i])
    end
  end
  return DataFrame(;OrderedDict(Symbol(headers[i])=>data[i] for i=1:ncols)...)
end

Then the file:

Col1,Col2,Col3,Col4
item1,item2,,item4
item5

would give:

julia> df = bespokeread(f)
2×4 DataFrames.DataFrame
│ Row │ Col1    │ Col2    │ Col3 │ Col4    │
├─────┼─────────┼─────────┼──────┼─────────┤
│ 1   │ "item1" │ "item2" │ ""   │ "item4" │
│ 2   │ "item5" │ ""      │ ""   │ ""      │
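The question asks for missing values rather than empty strings in the padded cells. The following is a minimal sketch of that variant using only the standard library. The name `read_padded` is hypothetical, and the sketch assumes (as above) that no quoting logic is needed; it returns plain column vectors rather than a DataFrame.

```julia
# Sketch: pad short rows with `missing` rather than "" (standard library only).
# `read_padded` is a hypothetical name, not part of DataFrames.
function read_padded(io::IO; delim=",")
    headers = String.(split(strip(readline(io)), delim))
    ncols = length(headers)
    # One column per header; each cell is either a String or `missing`.
    columns = [Union{String,Missing}[] for _ in 1:ncols]
    for line in eachline(io)
        fields = split(strip(line), delim)
        for i in 1:ncols
            # Cells past the end of a short row, and empty cells, become `missing`.
            value = i <= length(fields) && !isempty(fields[i]) ? String(fields[i]) : missing
            push!(columns[i], value)
        end
    end
    return headers, columns
end

io = IOBuffer("Col1,Col2,Col3,Col4\nitem1,item2,,item4\nitem5\n")
headers, columns = read_padded(io)
```

Note that this sketch also turns explicitly empty fields (such as Col3 in the first data row) into `missing`, which may or may not be what you want.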

Answer 2

Dan Getz's answer is good, but it converts everything to strings.

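The following solution instead "fills in" the gaps and writes a new file (in a memory-efficient way), which can then be imported normally with readtable():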

function fillAll(iF,oF,d=",")
    open(iF, "r") do i
        open(oF, "w") do o # "w" for writing
            headerRow = strip(readline(i))
            headers = split(headerRow,d)
            nCols = length(headers)
            write(o, headerRow*"\n")
            for ln in eachline(i)
                row = strip(ln)
                nFields = length(split(row,d))
                write(o, row)
                # write delimiters to match headers (none if the row is already full)
                for y in 1:nCols-nFields
                    write(o,d)
                end
                write(o,"\n")
            end
        end
    end
end

fillAll("data.csv","data_out.csv",";")
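To try the padding idea without touching the file system, here is a small sketch of the same logic applied to in-memory streams, using the comma-delimited sample from the question. The helper name `pad_rows` is hypothetical and only illustrates the approach.

```julia
# Sketch: the same padding idea as fillAll, but on IO streams so it can be
# tried without files. `pad_rows` is a hypothetical helper name.
function pad_rows(input::IO, output::IO; delim=",")
    header = strip(readline(input))
    ncols = length(split(header, delim))
    println(output, header)
    for line in eachline(input)
        row = strip(line)
        nfields = length(split(row, delim))
        # Append one delimiter per missing trailing field (none if the row is full).
        print(output, row, delim^max(0, ncols - nfields))
        println(output)
    end
end

input = IOBuffer("Col1,Col2,Col3,Col4\nitem1,item2,,item4\nitem5\n")
output = IOBuffer()
pad_rows(input, output)
padded = String(take!(output))
```

Here `padded` holds the repaired CSV text, in which the short row `item5` has gained three trailing commas.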

Answer 3

Even better: just use CSV.jl.

julia> f = IOBuffer("Col1,Col2,Col3,Col4\nitem1,item2,,item4\nitem5"); # or the filename

julia> CSV.read(f)
2×4 DataFrames.DataFrame
│ Row │ Col1    │ Col2    │ Col3  │ Col4    │
├─────┼─────────┼─────────┼───────┼─────────┤
│ 1   │ "item1" │ "item2" │ #NULL │ "item4" │
│ 2   │ "item5" │ #NULL   │ #NULL │ #NULL   │
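The session above reflects an older CSV.jl release. As far as I know, recent CSV.jl versions expect the sink type as a second argument and represent absent values as `missing` rather than `#NULL`. A hedged sketch, assuming current CSV.jl and DataFrames.jl are installed:

```julia
using CSV, DataFrames  # third-party packages; assumed to be installed

io = IOBuffer("Col1,Col2,Col3,Col4\nitem1,item2,,item4\nitem5\n")
# Recent CSV.jl versions take the sink type (here DataFrame) as the second argument.
df = CSV.read(io, DataFrame)
```

With such versions, the short row is expected to be filled with `missing`, typically accompanied by a warning about the row having too few columns.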

Readtable() with a varying number of columns - Julia
