Question

I am trying to save a DataFrame to a Hive table (Parquet) with .saveAsTable() in pySpark, but I keep running into memory issues like the one below:

org.apache.hadoop.hive.ql.metadata.HiveException: parquet.hadoop.MemoryManager$1:
New Memory allocation 1034931 bytes is smaller than the minimum allocation size of 1048576 bytes.

The first number (1034931) generally changes from run to run. I recognize that the second number (1048576) is 1024^2, but I have little idea what it means here.
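As a quick sanity check on the two numbers, here is a minimal sketch that uses only the values quoted in the error message (the variable names are illustrative, not part of any API). It confirms that the minimum is exactly 1 MiB and shows that the allocation offered to the Parquet writer fell roughly 13 KiB short of it:

```python
# Values copied from the error message in the question.
requested_bytes = 1034931   # "New Memory allocation" (varies between runs)
minimum_bytes = 1048576     # "minimum allocation size"

one_mib = 1024 ** 2                          # 1 MiB expressed in bytes
shortfall = minimum_bytes - requested_bytes  # how far below the minimum we are

print(minimum_bytes == one_mib)  # True
print(shortfall)                 # 13645
```

So the writer was refused because the memory it was offered was slightly below a fixed 1 MiB floor, not because a huge allocation failed.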

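I have been using exactly the same technique for a few of my other projects (with much larger DataFrames), and it has worked without issue. Here I have essentially copy-pasted the structure of the process and the configuration, but I run into the memory issue! It must be something trivial that I am missing.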

The Spark DataFrame (let's call it sdf) has the following structure (~10 columns and ~300k rows, but possibly more once this runs correctly):

+----------+----------+----------+---------------+---------------+
| col_a_str| col_b_num| col_c_num|partition_d_str|partition_e_str|
+----------+----------+----------+---------------+---------------+
|val_a1_str|val_b1_num|val_c1_num|     val_d1_str|     val_e1_str|
|val_a2_str|val_b2_num|val_c2_num|     val_d2_str|     val_e2_str|
|       ...|       ...|       ...|            ...|            ...|
+----------+----------+----------+---------------+---------------+

The Hive table was created like this:

sqlContext.sql("""
                    CREATE TABLE IF NOT EXISTS my_hive_table (
                        col_a_str string,
                        col_b_num double,
                        col_c_num double
                    ) 
                    PARTITIONED BY (partition_d_str string,
                                    partition_e_str string)
                    STORED AS PARQUETFILE
               """)

I attempt to insert data into this table with the following command:

sdf.write \
   .mode('append') \
   .partitionBy('partition_d_str', 'partition_e_str') \
   .saveAsTable('my_hive_table')

The Spark/Hive configuration is as follows:

spark_conf = pyspark.SparkConf()
spark_conf.setAppName('my_project')

spark_conf.set('spark.executor.memory', '16g')
spark_conf.set('spark.python.worker.memory', '8g')
spark_conf.set('spark.yarn.executor.memoryOverhead', '15000')
spark_conf.set('spark.dynamicAllocation.maxExecutors', '64')
spark_conf.set('spark.executor.cores', '4')

sc = pyspark.SparkContext(conf=spark_conf)

sqlContext = pyspark.sql.HiveContext(sc)
sqlContext.setConf('hive.exec.dynamic.partition', 'true')
sqlContext.setConf('hive.exec.max.dynamic.partitions', '5000')
sqlContext.setConf('hive.exec.dynamic.partition.mode', 'nonstrict')
sqlContext.setConf('hive.exec.compress.output', 'true')

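I have tried changing .partitionBy('partition_d_str', 'partition_e_str') to .partitionBy(['partition_d_str', 'partition_e_str']), increasing memory, splitting the DataFrame into smaller chunks, and re-creating the table and the DataFrame, but nothing seems to work. I can't find any solutions online either. What would be causing the memory error (I don't fully understand where it comes from), and how can I change my code to write to the Hive table? Thanks.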

Answer 1

It turns out I was partitioning on a nullable field, which was throwing .saveAsTable() off. When I converted the RDD to a Spark DataFrame, the schema I provided was generated like this:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Define schema
my_schema = StructType(
    [StructField('col_a_str', StringType(), False),
     StructField('col_b_num', DoubleType(), True),
     StructField('col_c_num', DoubleType(), True),
     StructField('partition_d_str', StringType(), False),
     StructField('partition_e_str', StringType(), True)])

# Convert RDD to Spark DataFrame
sdf = sqlContext.createDataFrame(my_rdd, schema=my_schema)

Since partition_e_str was declared as nullable=True (the third argument of that StructField), writing to the Hive table ran into problems, because that column is used as one of the partition fields. I changed its declaration to:

StructField('partition_e_str', StringType(), False)

and all was well!

Lesson: make sure your partition fields are not nullable!
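The check behind that lesson can be automated before calling .saveAsTable(). The following is a minimal sketch, not part of the original answer: the helper name nullable_partition_columns is hypothetical, and it only assumes that the schema object exposes .fields whose items carry .name and .nullable, as pyspark's StructType and StructField do. To keep the example runnable without a Spark installation, two small namedtuple stand-ins mimic those attributes.

```python
from collections import namedtuple


def nullable_partition_columns(schema, partition_cols):
    """Return the partition columns that the schema declares as nullable.

    `schema` is any object with a `.fields` list whose items have
    `.name` and `.nullable` attributes (hypothetical helper, duck-typed).
    """
    nullable_by_name = {field.name: field.nullable for field in schema.fields}
    missing = [col for col in partition_cols if col not in nullable_by_name]
    if missing:
        raise KeyError("partition columns not in schema: %s" % missing)
    return [col for col in partition_cols if nullable_by_name[col]]


# Stand-ins for StructField / StructType so the sketch runs without Spark.
Field = namedtuple('Field', ['name', 'nullable'])
Schema = namedtuple('Schema', ['fields'])

# Mirrors the nullable flags of the schema that triggered the error.
demo_schema = Schema(fields=[
    Field('col_a_str', False),
    Field('col_b_num', True),
    Field('col_c_num', True),
    Field('partition_d_str', False),
    Field('partition_e_str', True),
])

offenders = nullable_partition_columns(
    demo_schema, ['partition_d_str', 'partition_e_str'])
print(offenders)  # ['partition_e_str']
```

With a real DataFrame the same call would presumably be nullable_partition_columns(sdf.schema, ['partition_d_str', 'partition_e_str']), and a non-empty result would be the signal to fix the schema before writing.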

Memory allocation issue when writing a Spark DataFrame to a Hive table
