Linear regression with Apache Beam

Time: 2018-08-21 23:06:01

Tags: apache-beam

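How can I fit a large number of linear regressions in a Beam pipeline? I have a large CSV, and I want to normalize every column against two of its columns, A and B. In other words, for each column X in the CSV, I want the standardized residuals of X ~ A + B.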

1 answer:

Answer 0: (score: 0)

This is an interesting use case. You could do something like this:

import apache_beam as beam

INDEX_A = ...  # Something
INDEX_B = ...  # Something else

# pipeline, my_csv and parse_each_line are assumed to be defined elsewhere;
# parse_each_line is expected to turn one CSV line into a list of values.
parsed_rows = (pipeline
               | beam.io.ReadFromText(my_csv)
               | beam.Map(parse_each_line))

def column_paired_rows(row):
  # enumerate() yields the column index together with each value.
  for idx, val in enumerate(row):
    if idx in (INDEX_A, INDEX_B): continue
    # Yield the values keyed with the independent + dependent variable indices
    yield ((INDEX_A, idx), {'independent_var_value': row[INDEX_A],
                            'independent_var_idx': INDEX_A,
                            'dependent_var_value': val,
                            'dependent_var_idx': idx})
    yield ((INDEX_B, idx), {'independent_var_value': row[INDEX_B],
                            'independent_var_idx': INDEX_B,
                            'dependent_var_value': val,
                            'dependent_var_idx': idx})

column_pairs = parsed_rows | beam.FlatMap(column_paired_rows) | beam.GroupByKey()

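The column_pairs PCollection groups all elements by (independent, dependent) variable pair, so each group holds the data points for one pair of columns. You can then run your analysis on each group.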
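
To make the shape of the grouped data concrete, here is a small pure-Python sketch that mimics what the FlatMap followed by GroupByKey produces, without running Beam. The two sample rows, the choice of column 0 for A and column 1 for B, and the helper name simulate_group_by_key are all made up for illustration. The values are simplified to plain (independent, dependent) tuples instead of the dicts used above.

```python
from collections import defaultdict

INDEX_A = 0  # assumption for this sketch: A is column 0
INDEX_B = 1  # assumption for this sketch: B is column 1


def column_paired_rows(row):
    """Yield ((independent_idx, dependent_idx), (x, y)) for every other column."""
    for idx, val in enumerate(row):
        if idx in (INDEX_A, INDEX_B):
            continue
        yield (INDEX_A, idx), (row[INDEX_A], val)
        yield (INDEX_B, idx), (row[INDEX_B], val)


def simulate_group_by_key(rows):
    """In-memory stand-in for beam.FlatMap(...) | beam.GroupByKey()."""
    grouped = defaultdict(list)
    for row in rows:
        for key, value in column_paired_rows(row):
            grouped[key].append(value)
    return dict(grouped)


# Hypothetical parsed rows with four columns: A, B and two dependent columns.
rows = [
    [1.0, 10.0, 5.0, 7.0],
    [2.0, 20.0, 6.0, 9.0],
]

grouped = simulate_group_by_key(rows)
for key in sorted(grouped):
    print(key, grouped[key])
```

With four columns this gives four groups, one per (A or B, other column) pair. In a real Beam pipeline the order of values inside a group is not guaranteed, which does not matter for a least-squares fit.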

def perform_linear_regression(elm):
  # Each element is a (key, values) pair produced by GroupByKey.
  key, values = elm  # key: (independent variable index, dependent variable index)
  # values is an iterable with the data points that you need.
  pairs = [(v['independent_var_value'], v['dependent_var_value']) for v in values]
  # linear_regression is a placeholder for whatever fitting routine you use.
  model = linear_regression(pairs)
  return (key, model)

models = column_pairs | beam.Map(perform_linear_regression)
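
The linear_regression function is not defined in the snippet above. One possible way to fill it in, assuming the values are already numeric and that each group fits in memory, is an ordinary least-squares fit of a single predictor in plain Python. The returned dict layout and the 'scaled_residuals' field are choices made for this sketch. The scaling simply divides each residual by the residual standard error and ignores leverage, so it is only an approximation of fully standardized residuals.

```python
import math


def linear_regression(pairs):
    """Fit y = intercept + slope * x by ordinary least squares.

    pairs is a list of (x, y) numeric tuples. Returns a dict with the
    coefficients, the raw residuals and the residuals divided by the
    residual standard error.
    """
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three points")

    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("independent variable is constant")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in pairs]

    # Residual standard error with n - 2 degrees of freedom.
    rse = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    # A perfect fit has rse == 0; report zeros instead of dividing by zero.
    scaled = [r / rse if rse > 0 else 0.0 for r in residuals]

    return {
        'slope': slope,
        'intercept': intercept,
        'residuals': residuals,
        'scaled_residuals': scaled,
    }


# Points that lie exactly on y = 1 + 2x.
model = linear_regression([(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)])
print(model['slope'], model['intercept'])
```

Note that this fits each dependent column against A and against B separately, matching the keying used above. It is not a joint fit of X ~ A + B.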

Let me know if you would like me to add more details.