Question

I'm using PySpark 1.2.1 with Hive. (An upgrade won't happen any time soon.)

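My problem is that when I select from a Hive table and add an index, PySpark changes long values to ints, so I end up with a temp table whose column has type Long but whose values have type Integer. (See the code below.)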

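My question is: how can I (a) merge the index into the rows (see the code) without the longs being changed to ints; (b) add the index in some other way that avoids the problem; or (c) randomise the table columns without needing a join?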

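The underlying problem I'm trying to solve is that I want to randomise the order of certain columns in a Hive table and write the result to a new table, so that the data is no longer personally identifiable. I do this by adding an incrementing index to both the original table and the randomised columns, then joining on that index.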

The table looks like:

primary | longcolumn | randomisecolumn
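
The index-and-join idea described above can be sketched in plain Python, without Spark, to make the three steps explicit. This is only an illustration under assumptions: `shuffle_column` and its `seed` parameter are hypothetical names, the rows are made-up sample data using the column names from the table above, and `enumerate` stands in for what `zipWithIndex` does on an RDD.

```python
import random


def shuffle_column(rows, column, seed=None):
    """Return new row dicts with the values of `column` permuted across rows.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    rng = random.Random(seed)

    # Step 1: index the original rows (the role zipWithIndex plays in Spark).
    indexed_rows = list(enumerate(rows))

    # Step 2: extract the column, shuffle it, and give it a fresh index.
    values = [row[column] for row in rows]
    rng.shuffle(values)
    shuffled_by_index = dict(enumerate(values))

    # Step 3: join the shuffled values back onto the rows by index.
    result = []
    for idx, row in indexed_rows:
        new_row = dict(row)  # copy, so the input rows are left untouched
        new_row[column] = shuffled_by_index[idx]
        result.append(new_row)
    return result


# Made-up sample data; longcolumn holds values beyond the 32-bit int range.
rows = [
    {"primary": 1, "longcolumn": 3000000000, "randomisecolumn": "a"},
    {"primary": 2, "longcolumn": 3000000001, "randomisecolumn": "b"},
    {"primary": 3, "longcolumn": 3000000002, "randomisecolumn": "c"},
    {"primary": 4, "longcolumn": 3000000003, "randomisecolumn": "d"},
]
shuffled = shuffle_column(rows, "randomisecolumn", seed=0)
```

Only the chosen column is permuted; the other columns keep their original row positions, which is the property the join on the index is meant to guarantee.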

My current code is:

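from pyspark.sql import HiveContext

hc = HiveContext(sc)  # sc is the existing SparkContext
orig = hc.sql('select * from mytable')
# Pair each row with its position, then fold that index into the row
widx = orig.zipWithIndex().map(merge_index_on_row)
# Parentheses let the chained call span two lines
(hc.applySchema(widx, add_index_schema(orig.schema()))
   .registerTempTable('sani_first'))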

# At this point sani_first has a column longcolumn with type long,
# but (many of) the values are ints

from pyspark.sql import Row, StructType, StructField, IntegerType

# INDEX_COL is a module-level constant holding the index column's name
# (defined elsewhere, not shown here)

def merge_index_on_row((row, idx), idx_name=INDEX_COL):
    """
    row is a SchemaRDD Row object; idx is an integer.
    Returns a new Row holding the original fields plus the index
    under idx_name.
    """
    as_dict = row.asDict()
    as_dict[idx_name] = idx
    return Row(**as_dict)

def add_index_schema(schema):
    """
    Take a schema, add a column for the index, and return the new schema.
    """
    # Fields are sorted by name, which is the order Row(**kwargs) uses
    fields = schema.fields + [StructField(INDEX_COL, IntegerType(), False)]
    return StructType(sorted(fields, key=lambda x: x.name))

If there's no better solution, I'll force the affected columns to long type in the Python code. Which is... not great.
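
For reference, the fallback mentioned above might look roughly like the sketch below. It is a minimal illustration, not a confirmed fix: `force_long` and `long_columns` are hypothetical names, and whether a Python `long` actually arrives as a JVM Long after serialization in Spark 1.2.1 is an assumption that has not been verified here. The `try`/`except` shim exists only so the snippet also runs on Python 3, where `int` is unbounded and `long` no longer exists.

```python
try:
    long_type = long  # Python 2, which PySpark 1.2.1 runs on
except NameError:
    long_type = int   # Python 3 has a single unbounded int type


def force_long(row_dict, long_columns):
    """Return a copy of row_dict with the named columns coerced to long.

    Hypothetical helper for illustration; None values are left alone so
    that nullable columns are not broken.
    """
    coerced = dict(row_dict)
    for name in long_columns:
        if coerced.get(name) is not None:
            coerced[name] = long_type(coerced[name])
    return coerced


# Made-up example row using the column names from the table above.
example = force_long(
    {"primary": 1, "longcolumn": 42, "randomisecolumn": "x"},
    ["longcolumn"],
)
```

In the code above, a call like this could sit inside `merge_index_on_row`, applied to `as_dict` just before `Row(**as_dict)` is built. The list of long columns would have to be kept in step with the Hive schema by hand, which is part of why this feels unsatisfying.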

Pyspark changing longs into ints
