A nice feature of DataFrames is that it can store columns of different types and "auto-recognise" them, e.g.:
using DataFrames, DataStructures
df1 = wsv"""
parName region forType value
vol AL broadL_highF 3.3055628012
vol AL con_highF 2.1360975151
vol AQ broadL_highF 5.81984502
vol AQ con_highF 8.1462998309
"""
typeof(df1[:parName])
DataArrays.DataArray{String,1}
typeof(df1[:value])
DataArrays.DataArray{Float64,1}
When I try to reach the same result starting from a matrix (imported from a spreadsheet), I lose the automatic conversion:
dataMatrix = [
"parName" "region" "forType" "value";
"vol" "AL" "broadL_highF" 3.3055628012;
"vol" "AL" "con_highF" 2.1360975151;
"vol" "AQ" "broadL_highF" 5.81984502;
"vol" "AQ" "con_highF" 8.1462998309;
]
h = [Symbol(c) for c in dataMatrix[1,:]]
vals = dataMatrix[2:end, :]
df2 = convert(DataFrame,OrderedDict(zip(h,[vals[:,i] for i in 1:size(vals,2)])))
typeof(df2[:parName])
DataArrays.DataArray{Any,1}
typeof(df2[:value])
DataArrays.DataArray{Any,1}
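A minimal sketch of where the column types get lost, in plain Julia with no packages (assuming a Julia 1.x session; the two-row matrix is a shortened copy of the data above): a matrix literal that mixes strings and numbers gets element type Any, and every column slice inherits it, even though the individual values still carry their concrete types.

```julia
# Shortened copy of the data, without the header row.
vals = ["vol" "AL" "broadL_highF" 3.3055628012
        "vol" "AQ" "con_highF"    8.1462998309]

eltype(vals)          # Any: the only common supertype of String and Float64
eltype(vals[:, 4])    # Any: a column slice keeps the element type of the matrix
typeof(vals[1, 4])    # Float64: each individual value still knows its own type
```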
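There are several questions on S.O. about how to convert a matrix to a DataFrame (e.g. DataFrame from Array with Header, Convert Julia array to dataframe), but none of the answers deals with converting a mixed-type matrix.
How can I create a DataFrame from a matrix with the column types recognised automatically?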
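EDIT: I did benchmark the three solutions: (1) convert to a df (using the dictionary or the matrix constructor; the former is faster) and then apply try-catch for the type conversion (my original answer); (2) convert to a string and then use DataFrames.inlinetable (Dan Getz's answer); (3) check the type of each element and its column-wise consistency (Alexander Morley's answer).
These are the results: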
# timings from the second call (the first one includes compilation); further calls give similar results
@time toDf1(m) # 0.000946 seconds (336 allocations: 19.811 KiB)
@time toDf2(m) # 0.000194 seconds (306 allocations: 17.406 KiB)
@time toDf3(m) # 0.001820 seconds (445 allocations: 35.297 KiB)
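So, crazy as it sounds, the most efficient solution seems to be to "pour the water out" and reduce the problem to one that has already been solved ;-)
Thank you for all the answers.
Answer 0 (score: 2)
Another approach is to reuse the working solution, i.e. convert the matrix into a string suitable for DataFrames to consume. In code, this is: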
using DataFrames
dataMatrix = [
"parName" "region" "forType" "value";
"vol" "AL" "broadL_highF" 3.3055628012;
"vol" "AL" "con_highF" 2.1360975151;
"vol" "AQ" "broadL_highF" 5.81984502;
"vol" "AQ" "con_highF" 8.1462998309;
]
s = join(
[join([dataMatrix[i,j] for j in indices(dataMatrix, 2)]
, '\t') for i in indices(dataMatrix, 1)], '\n')
df = DataFrames.inlinetable(s; separator='\t', header=true)
The column types of the resulting df are guessed by DataFrames.
Unrelated, but this answer reminds me of the how a mathematician boils water joke.
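The serialisation step on its own can be sketched in plain Julia. The helper name matrix2tsv is hypothetical (it is not part of DataFrames), and the sketch assumes a Julia 1.x session, where 1:size(mat, 1) replaces the older indices(mat, 1):

```julia
# Hypothetical helper: serialise a matrix to tab-separated text, one line per row.
matrix2tsv(mat) = join((join(mat[i, :], '\t') for i in 1:size(mat, 1)), '\n')

m = ["parName" "value"
     "vol"     3.5]
s = matrix2tsv(m)   # "parName\tvalue\nvol\t3.5"
```

Note that every value goes through its printed form, so a cell that itself contains a tab or a newline would corrupt the table, and the parser on the other side has to guess the types again from the text.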
Answer 1 (score: 1)
Although I think there may be a better way to go about the whole thing, this should do what you want.
df = DataFrame()
for (ind,s) in enumerate(Symbol.(dataMatrix[1,:])) # convert first row to symbols and iterate through them.
# check all types the same else assign to Any
T = typeof(dataMatrix[2,ind])
T = all(typeof.(dataMatrix[2:end,ind]).==T) ? T : Any
# convert to type of second element then add to data frame
df[s] = T.(dataMatrix[2:end,ind])
end
Answer 2 (score: 1)
mat2df(mat) =
DataFrame([[mat[2:end,i]...] for i in 1:size(mat,2)], Symbol.(mat[1,:]))
seems to work and is faster than @dan-getz's answer (at least for this data matrix) :)
using DataFrames, BenchmarkTools
dataMatrix = [
"parName" "region" "forType" "value";
"vol" "AL" "broadL_highF" 3.3055628012;
"vol" "AL" "con_highF" 2.1360975151;
"vol" "AQ" "broadL_highF" 5.81984502;
"vol" "AQ" "con_highF" 8.1462998309;
]
mat2df(mat) =
DataFrame([[mat[2:end,i]...] for i in 1:size(mat,2)], Symbol.(mat[1,:]))
function mat2dfDan(mat)
# use the argument `mat` rather than the global `dataMatrix`
s = join([join([mat[i,j] for j in indices(mat, 2)], '\t')
for i in indices(mat, 1)],'\n')
DataFrames.inlinetable(s; separator='\t', header=true)
end
julia> @benchmark mat2df(dataMatrix)
BenchmarkTools.Trial:
memory estimate: 5.05 KiB
allocs estimate: 75
--------------
minimum time: 18.601 μs (0.00% GC)
median time: 21.318 μs (0.00% GC)
mean time: 31.773 μs (2.50% GC)
maximum time: 4.287 ms (95.32% GC)
--------------
samples: 10000
evals/sample: 1
julia> @benchmark mat2dfDan(dataMatrix)
BenchmarkTools.Trial:
memory estimate: 17.55 KiB
allocs estimate: 318
--------------
minimum time: 69.183 μs (0.00% GC)
median time: 81.326 μs (0.00% GC)
mean time: 90.284 μs (2.97% GC)
maximum time: 5.565 ms (93.72% GC)
--------------
samples: 10000
evals/sample: 1
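A likely reason why mat2df recovers the column types: splatting a column into a new array literal, [col...], makes Julia compute the element type from the values themselves rather than from the Any-typed matrix. A minimal sketch in plain Julia (assuming Julia 1.x; the variable names are only for illustration) also shows identity.(col) as an alternative that narrows the type without splatting, which may matter for long columns:

```julia
num = Any[3.3055628012, 2.1360975151]   # what a numeric column slice looks like
str = Any["vol", "vol"]                 # what a string column slice looks like

typeof([num...])          # Vector{Float64}: literal built from the values themselves
typeof(identity.(num))    # Vector{Float64}: broadcasting narrows the element type too
typeof([str...])          # Vector{String}
typeof([Any[1, "a"]...])  # Vector{Any}: a genuinely mixed column stays Any
```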
Answer 3 (score: -2)
Although I haven't found a complete solution, a partial one is to try to convert the individual columns after the fact:
"""
convertDf!(df)
Try to convert each column of the converted df from Any to Int64, Float64 or String (in that order).
"""
function convertDf!(df)
for c in names(df)
try
df[c] = convert(DataArrays.DataArray{Int64,1},df[c])
catch
try
df[c] = convert(DataArrays.DataArray{Float64,1},df[c])
catch
try
df[c] = convert(DataArrays.DataArray{String,1},df[c])
catch
end
end
end
end
end
Although it is indeed incomplete, this is enough for my needs.
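The same try-in-order idea can be sketched on plain vectors, without DataArrays, as a loop over the candidate types. The helper name narrow is hypothetical and the sketch assumes Julia 1.x:

```julia
# Hypothetical helper: try Int64, then Float64, then String; otherwise return the input.
function narrow(col::AbstractVector)
    for T in (Int64, Float64, String)
        try
            return convert(Vector{T}, col)
        catch
            # this element type does not fit; try the next one
        end
    end
    return col
end

narrow(Any[1, 2])          # Vector{Int64}
narrow(Any[3.3, 2.1])      # Vector{Float64}
narrow(Any["AL", "AQ"])    # Vector{String}
narrow(Any[1.0, 2.0])      # Vector{Int64}: whole-valued floats pass the first attempt
```

The last call shows one way in which this approach is incomplete: a Float64 column that happens to hold only whole numbers is silently turned into Int64 because Int64 is tried first.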