Second left outer join does not return the correct row count with Spark

Time: 2019-02-26 21:11:18

Tags: sql apache-spark pyspark apache-spark-sql pyspark-sql

Currently, I am working with 3 dataframes and joining them together. I start with the network dataframe and join the organization dataframe to it using the OrgID column. I then take the resulting dataframe and join the asn dataframe to it, again using OrgID, as a left outer join.

Dataframes:

network.show(5)
+--------------------+---------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+
|           NetHandle|    OrgID|          Parent|             NetName|            NetRange|     NetType|Comment|            RegDate|            Updated|Source|
+--------------------+---------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+
|NET-69-150-149-184-1|C00868859|NET-69-148-0-0-1|SBC06915014918429...|69.150.149.184 - ...|reassignment|   null|2004-07-23 00:00:00|2004-07-23 00:00:00|  ARIN|
| NET-69-224-242-40-1|C00868860|NET-69-224-0-0-1|SBC06922424204029...|69.224.242.40 - 6...|reassignment|   null|2004-07-23 00:00:00|2004-07-23 00:00:00|  ARIN|
| NET-170-55-30-176-1|  CC-3105|NET-170-55-0-0-1|FPLFI-CROWNSCFSW-...|170.55.30.176 - 1...|reassignment|   null|2018-03-26 00:00:00|2018-03-26 00:00:00|  ARIN|
| NET-69-224-249-24-1|C00868862|NET-69-224-0-0-1|SBC06922424902429...|69.224.249.24 - 6...|reassignment|   null|2004-07-23 00:00:00|2004-07-23 00:00:00|  ARIN|
| NET-69-29-107-152-1|C02309164| NET-69-29-0-0-1|   CTEL-CITZENS-BANK|69.29.107.152 - 6...|reassignment|   null|2009-09-03 00:00:00|2009-09-03 00:00:00|  ARIN|
+--------------------+---------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+
only showing top 5 rows

organization.show(5)
+---------+--------------------+-----------+--------------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
|    OrgID|             OrgName|CanAllocate|              Street|        City|State/Prov|Country|PostalCode|            RegDate|            Updated|OrgAdminHandle|OrgTechHandle|OrgAbuseHandle|Source|
+---------+--------------------+-----------+--------------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
|C05709929|      Allen Matthews|       null|           PO Box399|Simpsonville|        MD|     US|     21150|2015-05-02 00:00:00|2015-05-02 00:00:00|          null|         null|          null|  ARIN|
|C07025896|BIANCA BIANCA-180...|       null|     Private Address|       Plano|        TX|     US|     75075|2018-07-19 00:00:00|2018-07-19 00:00:00|          null|         null|          null|  ARIN|
|  TBL-353|TEST BVOIP COMPAN...|       null|225 W RANDOLPH UN...|        CHGO|        IL|     US|     99774|2015-05-02 00:00:00|2015-05-02 00:00:00|  SHRES56-ARIN| SHRES56-ARIN|  SHRES56-ARIN|  ARIN|
|  AIM-109|ASHLEY INDUSTRIAL...|       null|      951 2ND AVE SE|     OELWEIN|        IA|     US|     50662|2015-05-02 00:00:00|2015-05-02 00:00:00|  MARTZ16-ARIN| MARTZ16-ARIN|  MARTZ16-ARIN|  ARIN|
|C07025664|Brodynt Global Se...|       null|2500 William Park...|    Brampton|        ON|     CA|   L6S 5M9|2018-07-19 00:00:00|2018-07-19 00:00:00|          null|         null|          null|  ARIN|
+---------+--------------------+-----------+--------------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
only showing top 5 rows

asn.show(5)
+--------+---------+------------+--------+-------------------+--------------------+-------------------+------+
|ASHandle|    OrgID|      ASName|ASNumber|            RegDate|             Comment|            Updated|Source|
+--------+---------+------------+--------+-------------------+--------------------+-------------------+------+
|     AS0|     IANA| IANA-RSVD-0|       0|2002-09-13 00:00:00|Reserved - May be...|2002-09-13 00:00:00|  ARIN|
|     AS1|  LPL-141|      LVLT-1|       1|2001-09-20 00:00:00|                null|2018-02-20 00:00:00|  ARIN|
|     AS2|UNIVER-19|    UDEL-DCN|       2|1991-01-10 00:00:00|                null|2012-06-21 00:00:00|  ARIN|
|     AS3|    MIT-2|MIT-GATEWAYS|       3|1970-01-01 00:00:00|                null|2010-09-27 00:00:00|  ARIN|
|     AS4|   USC-32|      ISI-AS|       4|1984-02-22 00:00:00|                null|2012-03-13 00:00:00|  ARIN|
+--------+---------+------------+--------+-------------------+--------------------+-------------------+------+
only showing top 5 rows

These are the counts for the 3 Spark dataframes:

network.count()
3418057
organization.count()
3660886
asn.count()
27745

You can see that my first join seems to work, returning the same count as the network dataframe, 3418057:

df = network.join(organization, ["OrgID"], 'leftouter')
df.show(2)
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
|  OrgID|           NetHandle|          Parent|             NetName|            NetRange|     NetType|Comment|            RegDate|            Updated|Source|             OrgName|CanAllocate|           Street|        City|State/Prov|Country|PostalCode|            RegDate|            Updated|OrgAdminHandle|OrgTechHandle|OrgAbuseHandle|Source|
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
| 3DIMEN|  NET-199-33-182-0-1| NET-199-0-0-0-0|             NET-3DP|199.33.182.0 - 19...|  assignment|   null|1994-01-11 00:00:00|1994-06-21 00:00:00|  ARIN|                null|       null|             null|Philadelphia|        PA|     US|     19104|1994-01-11 00:00:00|2011-09-24 00:00:00|    JM143-ARIN|   JM143-ARIN|    JM143-ARIN|  ARIN|
|AA-1166|NET6-2001-1890-13...|NET6-2001-1890-1|ATTW-2001-1890-13...|2001:1890:131E:6D00:|reallocation|   null|2016-02-29 00:00:00|2016-02-29 00:00:00|  ARIN|AMERICAN ACCESSORIES|       null|3100 BANDINI BLVD|      VERNON|        CA|     US|     40456|2016-02-29 00:00:00|2016-02-29 00:00:00|   DURAZ5-ARIN|  DURAZ5-ARIN|   DURAZ5-ARIN|  ARIN|
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
only showing top 2 rows

print(df.count())
3418057

But when I take the new dataframe and perform a left outer join with the asn dataframe, I should get a count of 3418057, but I get a count of 1661797:

df1 = df.join(asn, ["OrgID"], 'leftouter')
df1.show(2)
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+--------+------+--------+-------+-------+-------+------+
|  OrgID|           NetHandle|          Parent|             NetName|            NetRange|     NetType|Comment|            RegDate|            Updated|Source|             OrgName|CanAllocate|           Street|        City|State/Prov|Country|PostalCode|            RegDate|            Updated|OrgAdminHandle|OrgTechHandle|OrgAbuseHandle|Source|ASHandle|ASName|ASNumber|RegDate|Comment|Updated|Source|
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+--------+------+--------+-------+-------+-------+------+
| 3DIMEN|  NET-199-33-182-0-1| NET-199-0-0-0-0|             NET-3DP|199.33.182.0 - 19...|  assignment|   null|1994-01-11 00:00:00|1994-06-21 00:00:00|  ARIN|                null|       null|             null|Philadelphia|        PA|     US|     19104|1994-01-11 00:00:00|2011-09-24 00:00:00|    JM143-ARIN|   JM143-ARIN|    JM143-ARIN|  ARIN|    null|  null|    null|   null|   null|   null|  null|
|AA-1166|NET6-2001-1890-13...|NET6-2001-1890-1|ATTW-2001-1890-13...|2001:1890:131E:6D00:|reallocation|   null|2016-02-29 00:00:00|2016-02-29 00:00:00|  ARIN|AMERICAN ACCESSORIES|       null|3100 BANDINI BLVD|      VERNON|        CA|     US|     40456|2016-02-29 00:00:00|2016-02-29 00:00:00|   DURAZ5-ARIN|  DURAZ5-ARIN|   DURAZ5-ARIN|  ARIN|    null|  null|    null|   null|   null|   null|  null|
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+--------+------+--------+-------+-------+-------+------+
only showing top 2 rows

print(df1.count())
4987448

The count for this dataframe should be 3418057, not 4987448. What am I doing wrong?

Edit: corrected typo
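One thing that may be worth checking here, offered as an assumption rather than a confirmed diagnosis: a left outer join keeps every row of the left dataframe, but it emits one output row per matching row on the right. If asn holds several rows for the same OrgID (for example, an organization with more than one AS number), the joined row count can exceed the left-hand count. The plain-Python sketch below uses hypothetical toy keys, not the real data, and a hypothetical helper named left_join_count to compute how many rows such a join would produce.

```python
from collections import Counter


def left_join_count(left_keys, right_keys):
    """Row count of a left outer join on a single key column.

    Illustrative helper (not a Spark API). None keys never match,
    mirroring how null join keys behave in SQL equality joins.
    """
    right_freq = Counter(k for k in right_keys if k is not None)
    # Each left row yields one output row per matching right row,
    # or exactly one null-padded row when nothing matches.
    return sum(max(right_freq[k], 1) for k in left_keys)


# Hypothetical toy keys standing in for the OrgID columns.
network_org_ids = ["A", "A", "B", "C"]
asn_org_ids = ["A", "A", "A", "D"]  # "A" appears three times on the right

# Duplicated right-hand keys: more output rows than left rows.
print(left_join_count(network_org_ids, asn_org_ids))

# Unique right-hand keys: output row count equals the left row count.
print(left_join_count(network_org_ids, ["A", "B", "D"]))
```

In PySpark, a comparable check would be to look for repeated keys on the right side, for instance with asn.groupBy("OrgID").count().filter("count > 1").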

0 answers:

No answers