Following on from this question, I am now running the following code:
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("A",DataTypes.LongType,true));
fields.add(DataTypes.createStructField("B",DataTypes.DoubleType,true));
StructType schema1 = DataTypes.createStructType(fields);
Dataset<Row> df1 = spark.sql("select 1 as A, 2.2 as B");
Dataset<Row> finalDf1 = spark.createDataFrame(df1.javaRDD(), schema1);
fields = new ArrayList<>();
fields.add(DataTypes.createStructField("B",DataTypes.DoubleType,true));
fields.add(DataTypes.createStructField("A",DataTypes.LongType,true));
StructType schema2 = DataTypes.createStructType(fields);
Dataset<Row> df2 = spark.sql("select 2.2 as B, 1 as A");
Dataset<Row> finalDf2 = spark.createDataFrame(df2.javaRDD(), schema2);
finalDf1.printSchema();
finalDf2.printSchema();
System.out.println(finalDf1.schema());
System.out.println(finalDf2.schema());
System.out.println(finalDf1.schema().equals(finalDf2.schema()));
Here is the output:
root
|-- A: long (nullable = true)
|-- B: double (nullable = true)
root
|-- B: double (nullable = true)
|-- A: long (nullable = true)
StructType(StructField(A,LongType,true), StructField(B,DoubleType,true))
StructType(StructField(B,DoubleType,true), StructField(A,LongType,true))
false
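Although the columns are in a different order, the two datasets have exactly the same columns and column types. What comparison do I need to make here in order to get true?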
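Answer 0: (score: 1)
This assumes that the column order may not match, that columns with the same name have the same semantics, and that both DataFrames must have the same number of columns.
An example in Scala, which you should be able to adapt to Java: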
import spark.implicits._
val df = sc.parallelize(Seq(
("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
)).toDF("c1", "c2", "Val1", "Val2")
val names = df.columns
val df2 = sc.parallelize(Seq(
("A", "X", 2, 1))).toDF("c1", "c2", "Val1", "Val2")
val names2 = df2.columns
names.sortWith(_ < _) sameElements names2.sortWith(_ < _)
The last line returns true or false; try it with different inputs.
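Since the question is in Java, the same name-only check can be sketched with plain arrays. This is a minimal sketch, assuming the arrays come from `Dataset.columns()`, which returns a `String[]`; the class and method names (`SchemaNames`, `sameColumnNames`) are hypothetical, and the literal arrays in `main` are stand-ins for real DataFrames.

```java
import java.util.Arrays;

final class SchemaNames {
    // Returns true when both arrays hold the same column names, ignoring order.
    // Copies are sorted so the callers' arrays are left untouched.
    static boolean sameColumnNames(String[] first, String[] second) {
        String[] a = first.clone();
        String[] b = second.clone();
        Arrays.sort(a);
        Arrays.sort(b);
        return Arrays.equals(a, b);
    }

    public static void main(String[] args) {
        // Stand-ins for df.columns() and df2.columns() (assumed inputs)
        String[] names = {"c1", "c2", "Val1", "Val2"};
        String[] names2 = {"Val2", "c1", "Val1", "c2"};
        System.out.println(sameColumnNames(names, names2)); // prints true
    }
}
```

Like the Scala version, this compares names only, so two columns with the same name but different types would still count as matching.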
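Answer 1: (score: 0)
If the columns are in a different order, the schemas are not the same, even if both have the same number of columns and the same names. If you want to check whether two schemas have the same column names, get the field names of both DataFrames' schemas as lists and write code to compare them. See the Java example below.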
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.types.DataTypes;

public static void main(String[] args)
{
    // ConfigConstants is not shown here; its fields are assumed to hold List<StructField> values
    List<String> firstSchema = Arrays.asList(DataTypes.createStructType(ConfigConstants.firstSchemaFields).fieldNames());
    List<String> secondSchema = Arrays.asList(DataTypes.createStructType(ConfigConstants.secondSchemaFields).fieldNames());
    if (schemasHaveTheSameColumnNames(firstSchema, secondSchema))
    {
        System.out.println("Yes, schemas have the same column names");
    }
    else
    {
        System.out.println("No, schemas do not have the same column names");
    }
}

// Compares column names only (not types), ignoring order
private static boolean schemasHaveTheSameColumnNames(List<String> firstSchema, List<String> secondSchema)
{
    if (firstSchema.size() != secondSchema.size())
    {
        return false;
    }
    // Every name in the second schema must also appear in the first
    for (String column : secondSchema)
    {
        if (!firstSchema.contains(column))
        {
            return false;
        }
    }
    return true;
}
Answer 2: (score: 0)
Following on from the previous answers, the fastest approach seems to be to compare the StructFields themselves (columns and types), rather than just the names.
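One way to picture an order-insensitive comparison of fields, rather than names, is to put each schema's fields into a set and compare the sets. The sketch below is not the answerer's code: it uses plain Java collections and a hypothetical `Field` class standing in for Spark's `StructField` (name, type, nullability), and it assumes field names are unique within a schema. With Spark itself, the same idea should apply to the arrays returned by `schema().fields()`, since `StructField` is a Scala case class with structural equality.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class SchemaFields {
    // Hypothetical stand-in for Spark's StructField: name, type name, nullability.
    static final class Field {
        final String name;
        final String type;
        final boolean nullable;

        Field(String name, String type, boolean nullable) {
            this.name = name;
            this.type = type;
            this.nullable = nullable;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Field)) return false;
            Field other = (Field) o;
            return nullable == other.nullable
                    && name.equals(other.name)
                    && type.equals(other.type);
        }

        @Override
        public int hashCode() {
            return Objects.hash(name, type, nullable);
        }
    }

    // True when both schemas contain the same fields (name, type, nullability),
    // regardless of order. Assumes field names are unique within each schema.
    static boolean sameFields(Field[] first, Field[] second) {
        if (first.length != second.length) {
            return false;
        }
        Set<Field> a = new HashSet<>(Arrays.asList(first));
        Set<Field> b = new HashSet<>(Arrays.asList(second));
        return a.equals(b);
    }

    public static void main(String[] args) {
        // Mirrors schema1 (A, B) and schema2 (B, A) from the question
        Field[] schema1 = {new Field("A", "long", true), new Field("B", "double", true)};
        Field[] schema2 = {new Field("B", "double", true), new Field("A", "long", true)};
        System.out.println(sameFields(schema1, schema2)); // prints true
    }
}
```

Unlike the name-only checks, this treats a column whose type or nullability differs as a mismatch.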