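Currently I am working with 3 data frames. I start with the network data frame and join the organization data frame to it on the OrgID column. Then I take the resulting data frame and join the asn data frame to it, also on OrgID, as a left outer join.
Data frames: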
network.show(5)
+--------------------+---------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+
| NetHandle| OrgID| Parent| NetName| NetRange| NetType|Comment| RegDate| Updated|Source|
+--------------------+---------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+
|NET-69-150-149-184-1|C00868859|NET-69-148-0-0-1|SBC06915014918429...|69.150.149.184 - ...|reassignment| null|2004-07-23 00:00:00|2004-07-23 00:00:00| ARIN|
| NET-69-224-242-40-1|C00868860|NET-69-224-0-0-1|SBC06922424204029...|69.224.242.40 - 6...|reassignment| null|2004-07-23 00:00:00|2004-07-23 00:00:00| ARIN|
| NET-170-55-30-176-1| CC-3105|NET-170-55-0-0-1|FPLFI-CROWNSCFSW-...|170.55.30.176 - 1...|reassignment| null|2018-03-26 00:00:00|2018-03-26 00:00:00| ARIN|
| NET-69-224-249-24-1|C00868862|NET-69-224-0-0-1|SBC06922424902429...|69.224.249.24 - 6...|reassignment| null|2004-07-23 00:00:00|2004-07-23 00:00:00| ARIN|
| NET-69-29-107-152-1|C02309164| NET-69-29-0-0-1| CTEL-CITZENS-BANK|69.29.107.152 - 6...|reassignment| null|2009-09-03 00:00:00|2009-09-03 00:00:00| ARIN|
+--------------------+---------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+
only showing top 5 rows
organization.show(5)
+---------+--------------------+-----------+--------------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
| OrgID| OrgName|CanAllocate| Street| City|State/Prov|Country|PostalCode| RegDate| Updated|OrgAdminHandle|OrgTechHandle|OrgAbuseHandle|Source|
+---------+--------------------+-----------+--------------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
|C05709929| Allen Matthews| null| PO Box399|Simpsonville| MD| US| 21150|2015-05-02 00:00:00|2015-05-02 00:00:00| null| null| null| ARIN|
|C07025896|BIANCA BIANCA-180...| null| Private Address| Plano| TX| US| 75075|2018-07-19 00:00:00|2018-07-19 00:00:00| null| null| null| ARIN|
| TBL-353|TEST BVOIP COMPAN...| null|225 W RANDOLPH UN...| CHGO| IL| US| 99774|2015-05-02 00:00:00|2015-05-02 00:00:00| SHRES56-ARIN| SHRES56-ARIN| SHRES56-ARIN| ARIN|
| AIM-109|ASHLEY INDUSTRIAL...| null| 951 2ND AVE SE| OELWEIN| IA| US| 50662|2015-05-02 00:00:00|2015-05-02 00:00:00| MARTZ16-ARIN| MARTZ16-ARIN| MARTZ16-ARIN| ARIN|
|C07025664|Brodynt Global Se...| null|2500 William Park...| Brampton| ON| CA| L6S 5M9|2018-07-19 00:00:00|2018-07-19 00:00:00| null| null| null| ARIN|
+---------+--------------------+-----------+--------------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
only showing top 5 rows
asn.show(5)
+--------+---------+------------+--------+-------------------+--------------------+-------------------+------+
|ASHandle| OrgID| ASName|ASNumber| RegDate| Comment| Updated|Source|
+--------+---------+------------+--------+-------------------+--------------------+-------------------+------+
| AS0| IANA| IANA-RSVD-0| 0|2002-09-13 00:00:00|Reserved - May be...|2002-09-13 00:00:00| ARIN|
| AS1| LPL-141| LVLT-1| 1|2001-09-20 00:00:00| null|2018-02-20 00:00:00| ARIN|
| AS2|UNIVER-19| UDEL-DCN| 2|1991-01-10 00:00:00| null|2012-06-21 00:00:00| ARIN|
| AS3| MIT-2|MIT-GATEWAYS| 3|1970-01-01 00:00:00| null|2010-09-27 00:00:00| ARIN|
| AS4| USC-32| ISI-AS| 4|1984-02-22 00:00:00| null|2012-03-13 00:00:00| ARIN|
+--------+---------+------------+--------+-------------------+--------------------+-------------------+------+
only showing top 5 rows
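These are the counts of the 3 Spark data frames:
network.count()
3418057
organization.count()
3660886
asn.count()
27745
(Edit: corrected a typo.)
You can see that my first join appears to return the same count as the network data frame, 3418057: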
df = network.join(organization, ["OrgID"], 'leftouter')
df.show(2)
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
| OrgID| NetHandle| Parent| NetName| NetRange| NetType|Comment| RegDate| Updated|Source| OrgName|CanAllocate| Street| City|State/Prov|Country|PostalCode| RegDate| Updated|OrgAdminHandle|OrgTechHandle|OrgAbuseHandle|Source|
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
| 3DIMEN| NET-199-33-182-0-1| NET-199-0-0-0-0| NET-3DP|199.33.182.0 - 19...| assignment| null|1994-01-11 00:00:00|1994-06-21 00:00:00| ARIN| null| null| null|Philadelphia| PA| US| 19104|1994-01-11 00:00:00|2011-09-24 00:00:00| JM143-ARIN| JM143-ARIN| JM143-ARIN| ARIN|
|AA-1166|NET6-2001-1890-13...|NET6-2001-1890-1|ATTW-2001-1890-13...|2001:1890:131E:6D00:|reallocation| null|2016-02-29 00:00:00|2016-02-29 00:00:00| ARIN|AMERICAN ACCESSORIES| null|3100 BANDINI BLVD| VERNON| CA| US| 40456|2016-02-29 00:00:00|2016-02-29 00:00:00| DURAZ5-ARIN| DURAZ5-ARIN| DURAZ5-ARIN| ARIN|
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+
only showing top 2 rows
print(df.count())
[Stage 52:====================================================> (193 + 7) / 200]3418057
But when I take the new data frame and left outer join the asn data frame to it, I should get a count of 3418057, but I get a count of 1661797:
df1 = df.join(asn, ["OrgID"], 'leftouter')
df1.show(2)
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+--------+------+--------+-------+-------+-------+------+
| OrgID| NetHandle| Parent| NetName| NetRange| NetType|Comment| RegDate| Updated|Source| OrgName|CanAllocate| Street| City|State/Prov|Country|PostalCode| RegDate| Updated|OrgAdminHandle|OrgTechHandle|OrgAbuseHandle|Source|ASHandle|ASName|ASNumber|RegDate|Comment|Updated|Source|
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+--------+------+--------+-------+-------+-------+------+
| 3DIMEN| NET-199-33-182-0-1| NET-199-0-0-0-0| NET-3DP|199.33.182.0 - 19...| assignment| null|1994-01-11 00:00:00|1994-06-21 00:00:00| ARIN| null| null| null|Philadelphia| PA| US| 19104|1994-01-11 00:00:00|2011-09-24 00:00:00| JM143-ARIN| JM143-ARIN| JM143-ARIN| ARIN| null| null| null| null| null| null| null|
|AA-1166|NET6-2001-1890-13...|NET6-2001-1890-1|ATTW-2001-1890-13...|2001:1890:131E:6D00:|reallocation| null|2016-02-29 00:00:00|2016-02-29 00:00:00| ARIN|AMERICAN ACCESSORIES| null|3100 BANDINI BLVD| VERNON| CA| US| 40456|2016-02-29 00:00:00|2016-02-29 00:00:00| DURAZ5-ARIN| DURAZ5-ARIN| DURAZ5-ARIN| ARIN| null| null| null| null| null| null| null|
+-------+--------------------+----------------+--------------------+--------------------+------------+-------+-------------------+-------------------+------+--------------------+-----------+-----------------+------------+----------+-------+----------+-------------------+-------------------+--------------+-------------+--------------+------+--------+------+--------+-------+-------+-------+------+
only showing top 2 rows
print(df1.count())
[Stage 70:==================================================> (187 + 7) / 200]4987448
The count of this data frame should be 3418057, not 4987448. What am I doing wrong?
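One thing that may be worth checking (an assumption, not something the output above confirms): a left outer join returns one row per matching right-hand row, so if an OrgID appears more than once in asn, every matching row on the left is repeated that many times. The pure-Python sketch below only illustrates that arithmetic with made-up keys; `left_join_count` and the toy lists are hypothetical names, not part of the Spark code.

```python
from collections import Counter


def left_join_count(left_keys, right_keys):
    """Row count a left outer join on a single key would produce."""
    right_counts = Counter(right_keys)
    # Each left row yields one output row per matching right row,
    # or exactly one null-padded row when there is no match.
    return sum(max(1, right_counts[k]) for k in left_keys)


# Toy data, made up for illustration (not taken from the ARIN tables).
network_org_ids = ["A", "A", "B", "C"]  # 4 rows on the left
asn_org_ids = ["A", "A", "A", "B"]      # "A" appears 3 times on the right

result = left_join_count(network_org_ids, asn_org_ids)
print(result)  # 3 + 3 (two "A" rows) + 1 ("B") + 1 ("C", no match) = 8
```

On the real data, something like asn.groupBy("OrgID").count().filter("count > 1").show() should list any OrgID values that occur more than once in asn.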