I am trying to repartition my dataframe to achieve parallelism. The recommendation is that each partition should be under 128 MB, and to do that I need to calculate the size of each row in my dataframe. So how do I calculate/find the size of each row in a dataframe?
Thanks.
Answer 0 (score: 0)
As discussed in the link I mentioned in my first comment, you can use java.lang.instrument.
The solution I suggest is in Java, with Maven, and Spark 2.4.0.
You must have the following structure; otherwise, adjust the pom.xml to fit your own structure:
src
--main
----java
------size
--------Sizeof.java
------spark
--------SparkJavaTest.java
----resources
------META-INF
--------MANIFEST.MF
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.formation.SizeOf</groupId>
    <artifactId>SizeOf</artifactId>
    <version>1.0-SNAPSHOT</version>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifestFile>src/main/resources/META-INF/MANIFEST.MF</manifestFile>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <mainClass>spark.SparkJavaTest</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id> <!-- this is used for inheritance merges -->
                        <phase>package</phase> <!-- bind to the packaging phase -->
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
    </dependencies>
</project>
Sizeof
package size;

import java.lang.instrument.Instrumentation;

final public class Sizeof {

    private static Instrumentation instrumentation;

    // Called by the JVM before main() when this jar is attached with
    // -javaagent; the JVM hands us the Instrumentation instance here.
    public static void premain(String args, Instrumentation inst) {
        instrumentation = inst;
    }

    // Returns an implementation-specific approximation of the storage
    // consumed by the given object, in bytes.
    public static long sizeof(Object o) {
        return instrumentation.getObjectSize(o);
    }
}
SparkJavaTest
package spark;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import size.Sizeof;

public class SparkJavaTest {

    public static SparkSession spark = SparkSession
            .builder()
            .appName("JavaSparkTest")
            .master("local")
            .getOrCreate();

    public static void main(String[] args) {
        Dataset<Row> ds = spark.read().option("header", true).csv("sample.csv");
        ds.show(false);

        // Get the size of a Dataset
        System.out.println("size of ds " + Sizeof.sizeof(ds));

        JavaRDD<Row> dsToJavaRDD = ds.toJavaRDD();

        // Get the size of a JavaRDD
        System.out.println("size of rdd " + Sizeof.sizeof(dsToJavaRDD));
    }
}
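One caveat worth noting: on typical JVMs, Instrumentation.getObjectSize reports only the shallow size of the object it is handed (its header and fields, not the objects it references). So sizeof(ds) and sizeof(dsToJavaRDD) measure the Dataset and JavaRDD wrapper objects themselves rather than the data they hold; treat the printed numbers as rough indicators, not exact data sizes.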
MANIFEST.MF
Manifest-Version: 1.0
Premain-Class: size.Sizeof
Main-Class: spark.SparkJavaTest
Then, you clean and package:
mvn clean package
Then you can run it and get the size of your objects:
java -javaagent:target/SizeOf-1.0-SNAPSHOT-jar-with-dependencies.jar -jar target/SizeOf-1.0-SNAPSHOT-jar-with-dependencies.jar
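To tie this back to the original question (keeping each partition under roughly 128 MB), here is a minimal sketch of one possible approach, not a definitive implementation: sample a few rows, estimate an average per-row size with Sizeof, and derive a partition count from it. The class RepartitionSketch, the 100-row sample, and the 128 MB target are my own assumptions, and since getObjectSize is only an approximation, the result is a heuristic.

package spark;

import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import size.Sizeof;

public class RepartitionSketch {

    // Hypothetical helper: estimate how many partitions are needed so that
    // each partition stays under roughly targetBytes (e.g. 128 MB).
    public static int estimateNumPartitions(Dataset<Row> ds, long targetBytes) {
        // Sample a handful of rows; 100 is an arbitrary choice.
        List<Row> sample = ds.limit(100).collectAsList();
        if (sample.isEmpty()) {
            return 1;
        }

        long sampledBytes = 0L;
        for (Row r : sample) {
            // Approximate (shallow) size of each sampled row.
            sampledBytes += Sizeof.sizeof(r);
        }

        long avgRowBytes = Math.max(1L, sampledBytes / sample.size());
        long estimatedTotal = avgRowBytes * ds.count();

        // Round up so no partition should exceed the target.
        return (int) Math.max(1L, (estimatedTotal + targetBytes - 1) / targetBytes);
    }
}

You could then repartition with something like ds.repartition(RepartitionSketch.estimateNumPartitions(ds, 128L * 1024 * 1024)). Note that this must also run under the -javaagent flag shown above, since Sizeof relies on the Instrumentation instance set up in premain.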