How to convert a mixed-type Matrix to a DataFrame in Julia so that column types are recognized

Time: 2017-09-29 10:49:26

Tags: matrix dataframe type-conversion julia

A nice feature of DataFrames is that it can store columns of different types and "auto-recognize" them, e.g.:

using DataFrames, DataStructures

df1 = wsv"""
parName region  forType             value
vol     AL      broadL_highF        3.3055628012
vol     AL      con_highF           2.1360975151
vol     AQ      broadL_highF        5.81984502
vol     AQ      con_highF           8.1462998309
"""
typeof(df1[:parName])
DataArrays.DataArray{String,1}
typeof(df1[:value])
DataArrays.DataArray{Float64,1}

When I try to reach the same result starting from a matrix (imported from a spreadsheet), I lose the automatic conversion:

dataMatrix = [
    "parName"   "region"    "forType"       "value";
    "vol"       "AL"        "broadL_highF"  3.3055628012;
    "vol"       "AL"        "con_highF"     2.1360975151;
    "vol"       "AQ"        "broadL_highF"  5.81984502;
    "vol"       "AQ"        "con_highF"     8.1462998309;
]
h    = [Symbol(c) for c in dataMatrix[1,:]]
vals = dataMatrix[2:end, :]
df2  = convert(DataFrame,OrderedDict(zip(h,[vals[:,i] for i in 1:size(vals,2)])))

typeof(df2[:parName])  
DataArrays.DataArray{Any,1}
typeof(df2[:value])  
DataArrays.DataArray{Any,1}
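
The `Any` columns above come from the matrix itself rather than from the DataFrame constructor: a literal that mixes strings and floats gets element type `Any`, and a column slice keeps that element type even when every value in it is a `Float64`. A minimal sketch in plain Julia, assuming Julia 1.x and no packages (the code in this question targets the older 0.6-era DataFrames API), using a shortened two-column version of the data:

```julia
# A matrix literal that mixes String and Float64 gets element type Any.
m = ["parName" "value";
     "vol"     3.3055628012;
     "vol"     2.1360975151]

# Slicing a column keeps the element type of the whole matrix.
value_col = m[2:end, 2]

println(eltype(m))             # Any
println(eltype(value_col))     # Any
println(typeof(value_col[1]))  # Float64
```

So the per-column type information is still there in the individual elements; it only has to be recovered column by column.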

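There are several S.O. questions on how to convert a matrix to a DataFrame (e.g. "DataFrame from Array with Header" and "Convert Julia array to dataframe"), but none of the answers deals with converting a mixed-type matrix.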

How can I create a DataFrame from a matrix so that the column types are recognized automatically?

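Edit: I did benchmark the three solutions: (1) convert to a df (using either a dictionary or the matrix constructor; the former is faster) and then apply a try-catch for the type conversion (my original answer); (2) convert to a string and then use DataFrames.inlinetable (Dan Getz's answer); (3) check the type of each element and its column-wise consistency (Alexander Morley's answer).

These are the results: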

# timings from the second call (the first one includes compilation); further calls give similar results
@time toDf1(m) # 0.000946 seconds (336 allocations: 19.811 KiB)
@time toDf2(m) # 0.000194 seconds (306 allocations: 17.406 KiB)
@time toDf3(m) # 0.001820 seconds (445 allocations: 35.297 KiB)

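So, crazily enough, the most efficient solution seems to be to "pour out the water" and reduce the problem to one that has already been solved ;-)

Thank you for all the answers.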

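4 Answers:

Answer 0 (score: 2)

Another approach is to reuse the solution that already works, i.e. convert the matrix into a string in the form DataFrames expects. In code: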

using DataFrames

dataMatrix = [
    "parName"   "region"    "forType"       "value";
    "vol"       "AL"        "broadL_highF"  3.3055628012;
    "vol"       "AL"        "con_highF"     2.1360975151;
    "vol"       "AQ"        "broadL_highF"  5.81984502;
    "vol"       "AQ"        "con_highF"     8.1462998309;
]

s = join(
  [join([dataMatrix[i,j] for j in indices(dataMatrix, 2)]
  , '\t') for i in indices(dataMatrix, 1)], '\n')

df = DataFrames.inlinetable(s; separator='\t', header=true)

The column types of the resulting df are guessed by DataFrames.
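
The serialization step on its own can be sketched in plain Julia. This is only an illustration under some assumptions: it targets Julia 1.1 or later (for `eachrow`), and `to_tsv` is a hypothetical helper name, not part of DataFrames:

```julia
# Hypothetical helper: serialize a matrix to tab-separated text, one line per row.
# Each element is written the way `print` would show it.
to_tsv(mat) = join((join(row, '\t') for row in eachrow(mat)), '\n')

m = ["name" "value";
     "a"    1.5;
     "b"    2.5]

s = to_tsv(m)
println(s)
```

The resulting text can then be handed to any delimited-text reader that infers column types. One caveat of this round trip: it breaks if a cell itself contains a tab or a newline, and numbers are reparsed from their printed form.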

Unrelated, but this answer reminds me of the "how a mathematician boils water" joke.

Answer 1 (score: 1)

While I think there are probably better ways to go about the whole thing, this should do what you want.

df = DataFrame()
for (ind,s) in enumerate(Symbol.(dataMatrix[1,:])) # convert first row to symbols and iterate through them.
    # check all types the same else assign to Any
    T = typeof(dataMatrix[2,ind])
    T = all(typeof.(dataMatrix[2:end,ind]).==T) ? T : Any
    # convert to type of second element then add to data frame
    df[s] = T.(dataMatrix[2:end,ind])
end
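
The same column-wise check can be written without DataFrames, which makes the rule easier to see in isolation. A minimal sketch, assuming Julia 1.x; `narrow_column` is a hypothetical helper name introduced here for illustration:

```julia
# Hypothetical helper: return a copy of `col` typed as T when every element
# has exactly the type of the first element; otherwise return a Vector{Any}.
function narrow_column(col::AbstractVector)
    isempty(col) && return collect(Any, col)
    T = typeof(first(col))
    return all(x -> typeof(x) === T, col) ? convert(Vector{T}, col) : collect(Any, col)
end

m = ["parName" "value";
     "vol"     3.3055628012;
     "vol"     2.1360975151]

# Skip the header row and narrow each column independently.
cols = [narrow_column(m[2:end, j]) for j in axes(m, 2)]

println(eltype(cols[1]))  # String
println(eltype(cols[2]))  # Float64
```

Note that this rule is strict: a column holding both `Int` and `Float64` values stays `Any` instead of being promoted to `Float64`.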

Answer 2 (score: 1)

mat2df(mat) = 
    DataFrame([[mat[2:end,i]...] for i in 1:size(mat,2)], Symbol.(mat[1,:]))

This seems to work, and it is faster than @dan-getz's answer (at least for this data matrix) :)

using DataFrames, BenchmarkTools

dataMatrix = [
    "parName"   "region"    "forType"       "value";
    "vol"       "AL"        "broadL_highF"  3.3055628012;
    "vol"       "AL"        "con_highF"     2.1360975151;
    "vol"       "AQ"        "broadL_highF"  5.81984502;
    "vol"       "AQ"        "con_highF"     8.1462998309;
]

mat2df(mat) = 
    DataFrame([[mat[2:end,i]...] for i in 1:size(mat,2)], Symbol.(mat[1,:]))

function mat2dfDan(mat)
    # use the `mat` argument rather than the global `dataMatrix`
    s = join([join([mat[i,j] for j in indices(mat, 2)], '\t') 
                for i in indices(mat, 1)],'\n')

    DataFrames.inlinetable(s; separator='\t', header=true)
end


julia> @benchmark mat2df(dataMatrix)

BenchmarkTools.Trial: 
  memory estimate:  5.05 KiB
  allocs estimate:  75
  --------------
  minimum time:     18.601 μs (0.00% GC)
  median time:      21.318 μs (0.00% GC)
  mean time:        31.773 μs (2.50% GC)
  maximum time:     4.287 ms (95.32% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark mat2dfDan(dataMatrix)

BenchmarkTools.Trial: 
  memory estimate:  17.55 KiB
  allocs estimate:  318
  --------------
  minimum time:     69.183 μs (0.00% GC)
  median time:      81.326 μs (0.00% GC)
  mean time:        90.284 μs (2.97% GC)
  maximum time:     5.565 ms (93.72% GC)
  --------------
  samples:          10000
  evals/sample:     1
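
What makes `mat2df` recover the column types is the splat in `[mat[2:end,i]...]`: it rebuilds each column as a fresh array literal, so Julia picks the element type from the actual values. A small sketch of that behavior in plain Julia (assuming Julia 1.x, no packages):

```julia
# What a column slice of a Matrix{Any} looks like.
col = Any[3.3055628012, 2.1360975151, 5.81984502]

# Splatting is equivalent to writing the literal [3.3055628012, 2.1360975151, 5.81984502].
narrowed = [col...]

# Array literals promote numeric types: this is the same as [1, 2.5].
mixed = Any[1, 2.5]
promoted = [mixed...]

println(eltype(narrowed))  # Float64
println(eltype(promoted))  # Float64
```

Unlike a strict same-type check, this promotes mixed numeric columns, and a column mixing strings and numbers still ends up as `Any`. Splatting a very long column can be costly, so this is probably best suited to small tables like the one above.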

Answer 3 (score: -2)

Although I haven't found a complete solution, a partial one is to try converting the individual columns after the fact:

"""
    convertDf!(df)

Try to convert each column of the converted df from Any to Int64, Float64 or String (in that order).
"""
function convertDf!(df)
    for c in names(df)
        try
          df[c] = convert(DataArrays.DataArray{Int64,1},df[c])
        catch
            try
              df[c] = convert(DataArrays.DataArray{Float64,1},df[c])
            catch
                try
                  df[c] = convert(DataArrays.DataArray{String,1},df[c])
                catch
                end
            end
        end
    end
end 

Although admittedly incomplete, this is enough for my needs.
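
The nested try/catch can also be expressed as a loop over candidate types. A minimal sketch on plain vectors rather than DataArrays, assuming Julia 1.x; `try_narrow` is a hypothetical helper name, and the candidate order mirrors the one used in `convertDf!`:

```julia
# Hypothetical helper: try each candidate type in order and return the first
# conversion that succeeds; if none does, return the column unchanged.
function try_narrow(col::AbstractVector, candidates = (Int64, Float64, String))
    for T in candidates
        try
            return convert(Vector{T}, col)
        catch
            # at least one element could not be converted; try the next type
        end
    end
    return col
end

println(eltype(try_narrow(Any[1, 2, 3])))      # Int64
println(eltype(try_narrow(Any[3.3, 2.1])))     # Float64
println(eltype(try_narrow(Any["vol", "AL"])))  # String
```

One consequence of trying `Int64` first: a column of integer-valued floats such as `Any[1.0, 2.0]` converts without error and becomes `Int64`, which may or may not be what is wanted.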