How to check the size of each record in a DataFrame

Date: 2018-12-05 06:42:15

Tags: apache-spark apache-spark-sql databricks

I am trying to repartition my DataFrame to achieve parallelism. The recommendation is that each partition be smaller than 128 MB, and to get there I need to calculate the size of each row in my DataFrame. So how do I calculate/find the size of each row in a DataFrame?

Thanks.
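For a rough estimate that needs no agent setup, Spark also ships org.apache.spark.util.SizeEstimator. Below is a minimal sketch of the 128 MB arithmetic; the PartitionSizing class, the repartitionTo128Mb helper, and the 1000-row sample are illustrative assumptions, and SizeEstimator reports in-memory size, which usually overstates on-disk size:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.util.SizeEstimator;

public class PartitionSizing {
    // Illustrative helper: sample rows, estimate bytes per row in memory,
    // then pick a partition count that targets roughly 128 MB each.
    public static Dataset<Row> repartitionTo128Mb(Dataset<Row> ds) {
        long rowCount = ds.count();
        int sampleSize = (int) Math.min(rowCount, 1000L);
        // SizeEstimator measures JVM in-memory size, which usually
        // overstates serialized/on-disk size; treat it as an upper bound.
        long sampleBytes = SizeEstimator.estimate(ds.limit(sampleSize).collectAsList());
        long bytesPerRow = sampleBytes / Math.max(1, sampleSize);
        long targetPartitions = Math.max(1L, (bytesPerRow * rowCount) / (128L * 1024 * 1024));
        return ds.repartition((int) targetPartitions);
    }
}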

1 answer:

Answer 0 (score: 0)

As discussed in the link I mentioned in my first comment, you can use java.lang.instrument.

The solution I propose uses Java, Maven, and Spark 2.4.0.

You must have the following project structure; otherwise, adjust the paths in pom.xml to fit your structure:

src
--main
----java
------size
--------Sizeof.java
------spark
--------SparkJavaTest.java
----resources
------META-INF
--------MANIFEST.MF

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.formation.SizeOf</groupId>
    <artifactId>SizeOf</artifactId>
    <version>1.0-SNAPSHOT</version>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifestFile>src/main/resources/META-INF/MANIFEST.MF</manifestFile>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <mainClass>spark.SparkJavaTest</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id> <!-- this is used for inheritance merges -->
                        <phase>package</phase> <!-- bind to the packaging phase -->
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
    </dependencies>
</project>

Sizeof

package size;

import java.lang.instrument.Instrumentation;

final public class Sizeof {
    private static Instrumentation instrumentation;

    // The JVM calls premain before main() because this class is declared
    // as Premain-Class in the manifest and loaded with -javaagent.
    public static void premain(String args, Instrumentation inst) {
        instrumentation = inst;
    }

    // Returns the shallow size of the given object, in bytes.
    public static long sizeof(Object o) {
        return instrumentation.getObjectSize(o);
    }
}

SparkJavaTest

package spark;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import size.Sizeof;

public class SparkJavaTest {
    public static SparkSession spark = SparkSession
            .builder()
            .appName("JavaSparkTest")
            .master("local")
            .getOrCreate();

    public static void main(String[] args) {

        Dataset<Row> ds = spark.read().option("header", true).csv("sample.csv");

        ds.show(false);

        // Get the size of the Dataset wrapper object
        System.out.println("size of ds " + Sizeof.sizeof(ds));

        JavaRDD<Row> dsToJavaRDD = ds.toJavaRDD();

        // Get the size of the JavaRDD wrapper object
        System.out.println("size of rdd " + Sizeof.sizeof(dsToJavaRDD));
    }
}
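Note that Sizeof.sizeof(ds) measures only the Dataset wrapper object, not the rows behind it. To get per-row numbers, which is what the question asks for, one option is to collect a sample of rows on the driver and measure each one. A minimal sketch to drop into main above (the 100-row sample is an illustrative assumption, and getObjectSize is shallow, so nested values are not fully counted):

        // Illustrative: collect a small sample and size each Row on the
        // driver, where the agent's Instrumentation is available.
        for (Row row : ds.limit(100).collectAsList()) {
            System.out.println("row size in bytes: " + Sizeof.sizeof(row));
        }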

MANIFEST.MF

Manifest-Version: 1.0
Premain-Class: size.Sizeof
Main-Class: spark.SparkJavaTest

Then, clean and package:

mvn clean package

Then you can run it and get the size of your objects:

java -javaagent:target/SizeOf-1.0-SNAPSHOT-jar-with-dependencies.jar -jar target/SizeOf-1.0-SNAPSHOT-jar-with-dependencies.jar
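The -javaagent flag is what makes this work: before main runs, the JVM calls Sizeof.premain (registered via Premain-Class in MANIFEST.MF) and hands it the Instrumentation instance that getObjectSize needs. If you run the jar without -javaagent, instrumentation is never set and sizeof fails with a NullPointerException.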