Question

I have a pyspark app. I copied a Hive table to my HDFS directory, and in Python I run a query on this table with sqlContext.sql. The resulting variable is a DataFrame I call rows. I need to randomly shuffle the rows, so I had to convert them to a list of rows:

rows_list = rows.collect()

Then I call shuffle(rows_list), which shuffles the list in place. I take the number of random rows I need, x:

for r in range(x):
    allrows2add.append(rows_list[r])

Now I want to save allrows2add as a Hive table, or append it to an existing Hive table (whichever is easier). The problem is that I can't do this:

all_df = sc.parallelize(allrows2add).toDF()

It fails because the schema can't be inferred:

ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling

It only works if I put in the whole schema. The schema of rows has 117 columns, so I don't want to type them out. Is there a way to extract the schema of rows to help me make allrows2add a DataFrame, or somehow save it as a Hive table? I can call rows.printSchema(), but I'm not sure how to get it into a schema format as a variable to pass to toDF() without having to parse all of that text.

Thanks

Adding for loop info

Answer 1

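When a schema can't be inferred, there's usually a reason. toDF is syntactic sugar for the createDataFrame function, which by default uses only the first 100 rows (despite the docs saying it uses only the first row) to determine what the schema should be. To change this, you can increase the sampling ratio so that a larger fraction of your data is examined: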

df = rdd.toDF(sampleRatio=0.2)
# or...
df = sqlContext.createDataFrame(rdd, samplingRatio=0.2)

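It's also possible that your random sample happened to contain only rows with null values in some particular columns. If that is the case, you can create a schema from scratch like so: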

from pyspark.sql.types import *
# all DataFrame rows are StructType
# can create a new StructType with combinations of StructField
schema = StructType([
    StructField("column_1", StringType(), True),
    StructField("column_2", IntegerType(), True),
    # etc.
])
df = sqlContext.createDataFrame(rdd, schema=schema)

Alternatively, you can take the schema from the DataFrame you created earlier by accessing its schema attribute:

df2 = sqlContext.createDataFrame(rdd, schema=df1.schema)
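
Applied to the question, reusing the schema could look roughly like the sketch below. The helper names pick_random_rows and append_rows_to_hive are made up for illustration, and the sketch assumes that sqlContext is a HiveContext, that rows is the original DataFrame, and that the Spark version is recent enough to provide DataFrame.write (1.4 or later). It swaps the shuffle-then-slice loop for random.sample, which draws the same kind of sample without modifying the list.

```python
import random


def pick_random_rows(rows_list, x, seed=None):
    """Return x distinct rows chosen at random, leaving rows_list untouched."""
    rng = random.Random(seed)
    # Sampling without replacement is equivalent to shuffling the list
    # and keeping its first x elements.
    return rng.sample(rows_list, x)


def append_rows_to_hive(sqlContext, rows, picked, table_name):
    """Rebuild a DataFrame from the picked rows and append it to a Hive table.

    Assumes sqlContext is a HiveContext and rows is the original DataFrame.
    Its schema is reused, so no column has to be typed out or inferred.
    """
    all_df = sqlContext.createDataFrame(picked, schema=rows.schema)
    # mode("append") adds to an existing table; saveAsTable creates the
    # table when it does not exist yet.
    all_df.write.mode("append").saveAsTable(table_name)
    return all_df
```

With these helpers, the question's steps would reduce to calling pick_random_rows(rows.collect(), x) and passing the result to append_rows_to_hive together with a table name of your choosing.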

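Note that the rows of your RDD should be StructType (a.k.a. Row) objects rather than dictionaries or lists; otherwise you may not be able to create a DataFrame from them. If your RDD rows are dictionaries, you can convert them to Row objects like this: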

from pyspark.sql import Row
rdd = rdd.map(lambda x: Row(**x))
# ** unpacks each dictionary into keyword arguments,
# so its keys become the Row's field names

Saving a list of rows to a Hive table in pyspark
