I am evaluating Erlang ETS for storing a large in-memory dataset. My test data source is a CSV file that takes only 350 MB on disk.
My parser reads the file line by line, splits each line into a list, builds a tuple from it, and stores the tuple in ETS with the "bag" option.
After loading all the data into ETS, I noticed that my computer's 8 GB of RAM was completely exhausted and the OS started swapping, pushing total usage to around 16 GB. The Erlang BEAM process seems to consume roughly 10 times more memory than the data occupies on disk.
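As a first diagnostic (my addition, assuming the table from the code below is already loaded), it helps to compare the ETS table's own footprint against the whole VM's allocation; `ets:info/2` reports the table size in machine words:

```erlang
%% Sketch: run in the shell after loading the data.
WordSize   = erlang:system_info(wordsize),             %% 8 on a 64-bit VM
TableBytes = ets:info(memdatabase, memory) * WordSize, %% table size in bytes
TotalBytes = erlang:memory(total),                     %% whole-VM allocation
io:format("ETS table: ~p MB, VM total: ~p MB~n",
          [TableBytes div (1024*1024), TotalBytes div (1024*1024)]).
```

If the table itself accounts for most of `erlang:memory(total)`, the blow-up comes from how the rows are represented, not from a leak elsewhere.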
Here is the test code:
-module(load_test_data).
-author("gextra").

%% API
-export([test/0]).

init_ets() ->
    ets:new(memdatabase, [bag, named_table]).

parse(File) ->
    {ok, F} = file:open(File, [read, raw]),
    parse(F, file:read_line(F), []).

parse(F, eof, Done) ->
    file:close(F),
    lists:reverse(Done);
parse(F, Line, Done) ->
    parse(F, file:read_line(F), [parse_row_commodity_data(Line) | Done]).

parse_row_commodity_data(Line) ->
    {ok, Data} = Line,
    %%io:fwrite(Data),
    LineList = re:split(Data, ",", [{return, list}]),
    ReportingCountry = lists:nth(1, LineList),
    YearPeriod       = lists:nth(2, LineList),
    Year             = lists:nth(3, LineList),
    Period           = lists:nth(4, LineList),
    TradeFlow        = lists:nth(5, LineList),
    Commodity        = lists:nth(6, LineList),
    PartnerCountry   = lists:nth(7, LineList),
    NetWeight        = lists:nth(8, LineList),
    Value            = lists:nth(9, LineList),
    IsReported       = lists:nth(10, LineList),
    ets:insert(memdatabase, {YearPeriod ++ ReportingCountry ++ Commodity,
                             {ReportingCountry, Year, Period, TradeFlow, Commodity,
                              PartnerCountry, NetWeight, Value, IsReported}}).

test() ->
    init_ets(),
    parse("/data/000-2010-1.csv").
Answer (score: 4)
It depends heavily on what you mean by "splits it into a list, then creates a tuple". Splitting the input into lists in particular consumes a lot of memory: when a string is stored as a list of characters, each byte of input can take 16 B on a 64-bit system (two machine words per list cell). For 350 MB of input, that alone is 5.6 GB.
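The 16 B-per-byte figure can be checked in the shell (my addition, not from the answer) with `erts_debug:size/1`, which reports a term's heap footprint in machine words; each character of a list string costs one two-word cons cell on a 64-bit VM:

```erlang
%% In an Erlang shell on a 64-bit VM (8-byte words):
1> erts_debug:size("0123456789").      %% 10-char list: 2 words per character
20
2> 20 * erlang:system_info(wordsize).  %% 160 B of heap for 10 bytes of input
160
3> erts_debug:size(<<"0123456789">>).  %% the same data as a binary: a few words total
```

This is why the answer below switches the parser to binaries: a binary stores one byte per byte of input plus a small fixed header, instead of 16 B per byte.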
EDIT:

Try this:
parse(File) ->
    {ok, F} = file:open(File, [read, raw, binary]),
    ok = parse(F, binary:compile_pattern([<<$,>>, <<$\n>>])),
    ok = file:close(F).

parse(F, CP) ->
    case file:read_line(F) of
        {ok, Line} ->
            parse_row_commodity_data(Line, CP),
            parse(F, CP);
        eof -> ok
    end.

parse_row_commodity_data(Line, CP) ->
    [ReportingCountry, YearPeriod, Year, Period, TradeFlow, Commodity,
     PartnerCountry, NetWeight, Value, IsReported]
        = binary:split(Line, CP, [global, trim]),
    true = ets:insert(memdatabase, {
        {YearPeriod, ReportingCountry, Commodity},
        {ReportingCountry, Year, Period, TradeFlow, Commodity,
         PartnerCountry, NetWeight, Value, IsReported}
    }).
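One usage note (my addition, not from the answer): this revised `parse/1` no longer creates the table, so it must still be paired with something like the question's `init_ets/0`. A minimal wrapper, mirroring the original `test/0`:

```erlang
%% Hypothetical wrapper; assumes the revised parse/1 above is in scope.
test() ->
    ets:new(memdatabase, [bag, named_table]),  %% table must exist before inserts
    parse("/data/000-2010-1.csv").
```

Also note the key changed from a concatenated string to the tuple `{YearPeriod, ReportingCountry, Commodity}` of binaries, which avoids building new list strings per row and still works for ETS lookups.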