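How can I fit a large number of linear regressions in a Beam pipeline? I have a large CSV, and I want to normalize each column against two of its columns, A and B. That is, for every column X in the CSV, I want the standardized residuals of X ~ A + B.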
Answer 0 (score: 0)
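This is an interesting use case. You could do something like this:

    import apache_beam as beam

    INDEX_A = ...  # Something
    INDEX_B = ...  # Something else

    # Read the CSV line by line and parse each line into a row of values.
    parsed_rows = (pipeline
                   | beam.io.ReadFromText(my_csv)
                   | beam.Map(parse_each_line))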
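`parse_each_line` is not defined above. A minimal sketch of what it might look like, assuming every field in the CSV is numeric and that `column_paired_rows` should receive a plain list of floats (both are assumptions, not something the question states):

```python
import csv


def parse_each_line(line):
    """Parse one line of CSV text into a list of floats.

    Hypothetical helper: assumes every field is numeric and that the
    header row (if any) has already been skipped by the reader.
    """
    # csv.reader handles quoting and embedded commas for a single line.
    fields = next(csv.reader([line]))
    return [float(field) for field in fields]
```

If the file has a header row, `beam.io.ReadFromText` accepts `skip_header_lines=1` so that the header never reaches this function.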
    def column_paired_rows(row):
        # `row` is a sequence of parsed values; enumerate() supplies each column index.
        for idx, val in enumerate(row):
            if idx in (INDEX_A, INDEX_B):
                continue
            # Yield the values keyed with the independent + dependent variable indices
            yield ((INDEX_A, idx), {'independent_var_value': row[INDEX_A],
                                    'independent_var_idx': INDEX_A,
                                    'dependent_var_value': val,
                                    'dependent_var_idx': idx})
            yield ((INDEX_B, idx), {'independent_var_value': row[INDEX_B],
                                    'independent_var_idx': INDEX_B,
                                    'dependent_var_value': val,
                                    'dependent_var_idx': idx})

    column_pairs = parsed_rows | beam.FlatMap(column_paired_rows) | beam.GroupByKey()
The `column_pairs` PCollection groups all elements by (independent, dependent) variable pair, so you can then run your analysis on each group:

    def perform_linear_regression(elm):
        key = elm[0]  # KEY is a tuple with (independent variable index, dependent variable index)
        values = elm[1]  # This is an iterable with the data points that you need.
        pairs = [(v['independent_var_value'], v['dependent_var_value']) for v in values]
        # linear_regression is a placeholder for whatever fitting routine you use.
        model = linear_regression(pairs)
        return (key, model)

    models = column_pairs | beam.Map(perform_linear_regression)
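`linear_regression` is left undefined in the code above. A minimal sketch of what it might look like, assuming ordinary least squares with a single predictor in pure Python, and assuming "standardized" means each residual divided by the residual standard error (no leverage adjustment). The keys of the returned dict are illustrative names, not part of any library:

```python
import math


def linear_regression(pairs):
    """Fit y = intercept + slope * x by ordinary least squares.

    `pairs` is an iterable of (x, y) tuples. Returns the coefficients
    together with raw and standardized residuals, in input order.
    """
    pairs = list(pairs)
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least 3 points to estimate residual spread")

    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("independent variable is constant")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in pairs]

    # Residual standard error with n - 2 degrees of freedom.
    rse = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    standardized = [r / rse if rse > 0 else 0.0 for r in residuals]

    return {
        "slope": slope,
        "intercept": intercept,
        "residuals": residuals,
        "standardized_residuals": standardized,
    }
```

Because `GroupByKey` hands each group to a single worker, this assumes the values for one column pair fit in memory on that worker.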
Let me know if you'd like me to add more details.
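Note that keying by (independent, dependent) pair fits X against A and X against B separately. If you need the two-predictor model X ~ A + B from the question, one option is to key only by the dependent column and carry both predictors in each value. The sketch below is one possible way to do that, not a tested pipeline: `rows_keyed_by_column` and `fit_two_predictors` are hypothetical names, the fit solves the centered normal equations directly, and "standardized" again means residual divided by residual standard error.

```python
import math


def rows_keyed_by_column(row, index_a, index_b):
    """Yield (column index, (a, b, x)) for every column other than A and B."""
    for idx, val in enumerate(row):
        if idx in (index_a, index_b):
            continue
        yield (idx, (row[index_a], row[index_b], val))


def fit_two_predictors(triples):
    """Fit x = intercept + coef_a * a + coef_b * b by ordinary least squares."""
    triples = list(triples)
    n = len(triples)
    if n < 4:
        raise ValueError("need at least 4 rows to estimate residual spread")

    mean_a = sum(a for a, _, _ in triples) / n
    mean_b = sum(b for _, b, _ in triples) / n
    mean_x = sum(x for _, _, x in triples) / n

    # Centered sums of squares and cross-products.
    saa = sum((a - mean_a) ** 2 for a, _, _ in triples)
    sbb = sum((b - mean_b) ** 2 for _, b, _ in triples)
    sab = sum((a - mean_a) * (b - mean_b) for a, b, _ in triples)
    sax = sum((a - mean_a) * (x - mean_x) for a, _, x in triples)
    sbx = sum((b - mean_b) * (x - mean_x) for _, b, x in triples)

    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("A and B are constant or perfectly collinear")

    # Solve the 2x2 normal equations by Cramer's rule.
    coef_a = (sax * sbb - sab * sbx) / det
    coef_b = (saa * sbx - sab * sax) / det
    intercept = mean_x - coef_a * mean_a - coef_b * mean_b

    residuals = [x - (intercept + coef_a * a + coef_b * b) for a, b, x in triples]
    # Residual standard error with n - 3 degrees of freedom.
    rse = math.sqrt(sum(r * r for r in residuals) / (n - 3))
    standardized = [r / rse if rse > 0 else 0.0 for r in residuals]

    return {
        "coef_a": coef_a,
        "coef_b": coef_b,
        "intercept": intercept,
        "standardized_residuals": standardized,
    }


# Possible wiring, assuming parsed_rows, INDEX_A and INDEX_B from the answer:
#   models = (parsed_rows
#             | beam.FlatMap(rows_keyed_by_column, INDEX_A, INDEX_B)
#             | beam.GroupByKey()
#             | beam.MapTuple(lambda idx, triples: (idx, fit_two_predictors(triples))))
```

This emits one element per row per column instead of two, at the cost of each value carrying both predictor values.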