Question

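I have a very large DataFrame in which a column named location contains only a handful of cities, for example: ["New York", "London", "Paris", "Berlin"...].

I want to print all the distinct values in that column so I can tell, for example, whether a particular city is missing. The .describe('location') method doesn't help, so how can I do this?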

Answer 1

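You can print the distinct values of the location column like this:

from pyspark.sql import functions as F
# distinct() is lazy; show() runs the job and prints the first 20 rows by default
df.select(F.col('location')).distinct().show()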

Answer 2

The describe method reports basic predefined statistics such as count, mean, stddev, min, and max. To find the distinct values of a column, use the distinct() method instead.
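Since the goal in the question is to spot a missing city, here is a minimal sketch that compares the distinct values against a list of cities you expect. The helper name missing_values and the expected list are hypothetical, not part of PySpark; the sketch also assumes the column has few enough distinct values that collecting them to the driver is safe.

```python
def missing_values(df, column, expected):
    """Return the expected values that never occur in df[column], sorted.

    Hypothetical helper for illustration; df is assumed to be a PySpark DataFrame.
    """
    # distinct() deduplicates the column; collect() brings the (assumed small)
    # result back to the driver as a list of Row objects.
    rows = df.select(column).distinct().collect()
    # Each Row has a single field, so row[0] is the column value.
    present = {row[0] for row in rows}
    # Anything expected but not present is missing from the data.
    return sorted(set(expected) - present)
```

Calling missing_values(df, 'location', ['New York', 'London', 'Paris', 'Berlin']) would return the cities from that list that do not appear anywhere in the column, or an empty list if all of them are present.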

Hope this helps.

Regards,

Neeraj

Answer 3

I found it:

"${PODS_ROOT}/Fabric/run"