I have a very large CSV file with 42 variables and 200,000 records. I want to process it with mapreduce (local backend), but I keep getting the following error:
Error: cannot allocate vector of size 15.6 Gb
In addition: Warning messages:
1: closing unused connection 3 (C:\Users\LSZL~1\AppData\Local\Temp\RtmpgJ2FXm\filea302f8a7363)
2: In paste(rep(l, length(lvs)), rep(lvs, each = length(l)), sep = sep) :
Reached total allocation of 8051Mb: see help(memory.size)
3: In paste(rep(l, length(lvs)), rep(lvs, each = length(l)), sep = sep) :
Reached total allocation of 8051Mb: see help(memory.size)
4: In paste(rep(l, length(lvs)), rep(lvs, each = length(l)), sep = sep) :
Reached total allocation of 8051Mb: see help(memory.size)
5: In paste(rep(l, length(lvs)), rep(lvs, each = length(l)), sep = sep) :
Reached total allocation of 8051Mb: see help(memory.size)
My code:
inputformat <- make.input.format("csv", sep = ",", col.names = column_names)
a <- mapreduce(
  input = "X:/BigData/working_dir/census-income.data",
  input.format = inputformat,
  map = function(k, v) {
    key = v
    return(keyval(key, v[1, 1]))
  },
  reduce = function(k, v) {
    key = k[1, 1]
    val = sum(k[, 2])
    return(keyval(key, val))
  }
)()
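Is it possible to avoid passing the unnecessary columns (and their data) to mapreduce, and to select only the columns whose data is actually needed?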
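A note on the warnings: the expression `paste(rep(l, length(lvs)), rep(lvs, each = length(l)), sep = sep)` appears to match a line inside base R's `interaction()`, which builds one factor level for every combination of levels of its inputs. A plausible reading (an assumption, not something confirmed against the rmr2 source) is that using the whole data frame `v` as the key makes the grouping step combine all 42 columns this way. The sketch below uses three small made-up factor columns to show how quickly the number of combined levels grows.

```r
# Three small hypothetical factor columns standing in for the 42 census variables.
df <- data.frame(
  g1 = factor(c("x", "y", "z")),
  g2 = factor(c("p", "q", "r")),
  g3 = factor(c("u", "v", "w"))
)

# interaction() creates one level per combination of levels, used or not.
combined <- interaction(df)
all_levels <- nlevels(combined)

# The same count, computed directly as the product of the per-column level counts.
expected_levels <- prod(sapply(df, nlevels))

# drop = TRUE keeps only the combinations that actually occur in the data.
used_levels <- nlevels(interaction(df, drop = TRUE))

print(c(all = all_levels, expected = expected_levels, used = used_levels))
```

With many columns that each have many distinct values, the product of level counts can exceed available memory long before the data itself does, which would be consistent with the 15.6 Gb allocation request.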
Answer 0 (score: 0)
I finally figured it out.
I don't know whether it is efficient, but it works.
library(rmr2)
rmr.options(backend = "local")  # run on the local backend, as in the question

# Names for all 42 columns of the census file, in file order.
column_names <- c("age","class_of_worker", "industry_code", "occupation_code", "education",
  "wage_per_hour", "enrolled_in_edu_inst_last_wk", "marital_status", "major_industry_code",
  "major_occupation_code", "race", "hispanic_origin", "sex", "member_of_a_labor_union",
  "reason_for_unemployment","full_or_part_time_employment_stat", "capital_gains", "capital_losses",
  "divdends_from_stocks", "tax_filer_status", "region_of_previous_residence",
  "state_of_previous_residence", "detailed_household_and_family_stat",
  "detailed_household_summary_in_household", "instance_weight", "migration_code-change_in_msa",
  "migration_code-change_in_reg","migration_code-move_within_reg","live_in_this_house_1_year_ago",
  "migration_prev_res_in_sunbelt", "num_persons_worked_for_employer", "total_person_earnings",
  "country_of_birth_father", "country_of_birth_mother", "country_of_birth_self", "citizenship",
  "own_business_or_self_employed", "fill_inc_questionnaire_for_veteran's_admin",
  "veterans_benefits", "weeks_worked_in_year", "year", "CLASS")
# Only these columns are needed downstream.
important_columns = c("age", "education", "wage_per_hour", "weeks_worked_in_year")

input_file_format =
  make.input.format(
    "csv",
    sep = ",",
    col.names = column_names)

# The map step keeps only the important columns of each chunk it receives.
input_subset =
  mapreduce(
    input = "X:/BigData/working_dir/census-income.data",
    input.format = input_file_format,
    map =
      function(k, v)
        subset(v, select = important_columns))

# Pull the reduced result back into memory and keep only the values.
input_dataframe = from.dfs(input_subset)
input_dataframe = values(input_dataframe)
input_dataframe
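If the selected columns fit in memory, an alternative outside rmr2 is to drop the unwanted columns while reading the file. This is only a sketch of that idea in base R, not part of the approach above: `read.csv` skips any column whose entry in `colClasses` is the string "NULL". The four-column file and its rows are made up for illustration; the real file would use the full `column_names` vector.

```r
# Hypothetical miniature of the census file: no header row, comma-separated.
column_names <- c("age", "class_of_worker", "education", "wage_per_hour")
important_columns <- c("age", "wage_per_hour")

tmp <- tempfile(fileext = ".data")
writeLines(c("73,Private,Bachelors,0",
             "58,Self-employed,Masters,1200",
             "18,Never worked,10th grade,0"), tmp)

# Start by skipping every column ("NULL"), then set the wanted ones to NA
# so that read.csv keeps them and guesses their types.
col_classes <- rep("NULL", length(column_names))
col_classes[column_names %in% important_columns] <- NA

small <- read.csv(tmp, header = FALSE, col.names = column_names,
                  colClasses = col_classes)
unlink(tmp)

str(small)
```

The skipped columns never become part of the resulting data frame, so memory use scales with the columns that are kept rather than with all 42.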