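How can I see the count of each data type in a Spark DataFrame, the way I can with a pandas DataFrame?
For example, assuming df is a pandas DataFrame: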
>>> df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
int_col 5 non-null int64
text_col 5 non-null object
float_col 5 non-null float64
**dtypes: float64(1), int64(1), object(1)**
memory usage: 200.0+ bytes
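We can see the count of each data type very clearly. How can I do something similar with a Spark DataFrame? That is, how can I see how many columns are float, how many are int, and how many are object?
Thanks!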
Answer 0 (score: 2)
The code below should give you the desired result:
# create data frame
df = sqlContext.createDataFrame(
    [(1,'Y','Y',0,0,0,2,'Y','N','Y','Y'),
     (2,'N','Y',2,1,2,3,'N','Y','Y','N'),
     (3,'Y','N',3,1,0,0,'N','N','N','N'),
     (4,'N','Y',5,0,1,0,'N','N','N','Y'),
     (5,'Y','N',2,2,0,1,'Y','N','N','Y'),
     (6,'Y','Y',0,0,3,6,'Y','N','Y','N'),
     (7,'N','N',1,1,3,4,'N','Y','N','Y'),
     (8,'Y','Y',1,1,2,0,'Y','Y','N','N')
    ],
    ('id', 'compatible', 'product', 'ios', 'pc', 'other', 'devices', 'customer', 'subscriber', 'circle', 'smb')
)
# Find data types of data frame
datatypes_List = df.dtypes
# Querying datatypes_List gives you column and its data type as a tuple
datatypes_List
[('id', 'bigint'), ('compatible', 'string'), ('product', 'string'), ('ios', 'bigint'), ('pc', 'bigint'), ('other', 'bigint'), ('devices', 'bigint'), ('customer', 'string'), ('subscriber', 'string'), ('circle', 'string'), ('smb', 'string')]
# create an empty dictionary to store the output values
dict_count = {}
# Loop over the (column, type) tuples and count how often each data type occurs
for x, y in datatypes_List:
    dict_count[y] = dict_count.get(y, 0) + 1
# query dict_count to find the number of times a data type is present in data frame
dict_count
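If it also helps to know which columns fall under each type, the same loop can collect column names instead of bare counts. This is a minimal sketch: the dtypes list shown above is pasted in as a literal so it runs without a Spark session, and columns_by_type and counts are names made up for this illustration.

```python
from collections import defaultdict

# Literal copy of the df.dtypes output shown above (stand-in for df.dtypes)
datatypes_List = [('id', 'bigint'), ('compatible', 'string'), ('product', 'string'),
                  ('ios', 'bigint'), ('pc', 'bigint'), ('other', 'bigint'),
                  ('devices', 'bigint'), ('customer', 'string'),
                  ('subscriber', 'string'), ('circle', 'string'), ('smb', 'string')]

# Hypothetical name: maps each data type to the list of columns that have it
columns_by_type = defaultdict(list)
for column, dtype in datatypes_List:
    columns_by_type[dtype].append(column)

# The per-type counts follow directly from the list lengths
counts = {dtype: len(columns) for dtype, columns in columns_by_type.items()}
print(counts)
print(dict(columns_by_type))
```

With a live DataFrame, replacing the literal with df.dtypes should give the same result.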
Answer 1 (score: 2)
I think the easiest way is to use collections.Counter:
df = spark.createDataFrame(
    [(1, 1.2, 'foo'), (2, 2.3, 'bar'), (None, 3.4, 'baz')],
    ["int_col", "float_col", "string_col"]
)
from collections import Counter
print(Counter((x[1] for x in df.dtypes)))
#Counter({'double': 1, 'bigint': 1, 'string': 1})
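To get something closer to the dtypes line that pandas prints in info(), the Counter result can be formatted into a single string. A rough sketch, not a Spark API: dtype_summary is a hypothetical helper, and example_dtypes is a hand-written stand-in for df.dtypes of the three-column example above.

```python
from collections import Counter

def dtype_summary(dtypes):
    """Return a pandas-style 'dtypes: ...' line from (column, type) pairs."""
    counts = Counter(dtype for _, dtype in dtypes)
    # Sort by type name so the output is stable across runs
    parts = ", ".join(f"{name}({n})" for name, n in sorted(counts.items()))
    return f"dtypes: {parts}"

# Stand-in for df.dtypes of the example DataFrame (assumed, not queried from Spark)
example_dtypes = [("int_col", "bigint"), ("float_col", "double"), ("string_col", "string")]
print(dtype_summary(example_dtypes))
# dtypes: bigint(1), double(1), string(1)
```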
There is also the pyspark.sql.DataFrame.describe() method:
df.describe().show()
+-------+------------------+------------------+----------+
|summary| int_col| float_col|string_col|
+-------+------------------+------------------+----------+
| count| 2| 3| 3|
| mean| 1.5| 2.3| null|
| stddev|0.7071067811865476|1.0999999999999999| null|
| min| 1| 1.2| bar|
| max| 2| 3.4| foo|
+-------+------------------+------------------+----------+
Note that the count for int_col is 2, because one of its values is None (null) in this example.
Answer 2 (score: 1)
Use printSchema:
import datetime
df = spark.createDataFrame([("", 1.0, 1, True, datetime.datetime.now())])
df.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: double (nullable = true)
|-- _3: long (nullable = true)
|-- _4: boolean (nullable = true)
|-- _5: timestamp (nullable = true)
Or check dtypes:
df.dtypes
[('_1', 'string'),
('_2', 'double'),
('_3', 'bigint'),
('_4', 'boolean'),
('_5', 'timestamp')]
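One detail in the two listings above: printSchema reports the third column as long, while dtypes reports it as bigint. Both names refer to the same 64-bit integer type. If type counts are built from names copied out of a printSchema listing, it may help to normalize them first. A rough sketch: the alias table is an assumption that covers only the integer family, and ALIASES and normalize are illustrative names, not part of PySpark.

```python
from collections import Counter

# Assumed alias table: printSchema-style name -> dtypes-style name (integer family only)
ALIASES = {"long": "bigint", "integer": "int", "short": "smallint", "byte": "tinyint"}

def normalize(type_name):
    """Map a printSchema-style type name to the name df.dtypes would use."""
    return ALIASES.get(type_name, type_name)

# Type names copied from the printSchema output above
schema_names = ["string", "double", "long", "boolean", "timestamp"]
# Type names copied from the df.dtypes output above
dtypes_names = ["string", "double", "bigint", "boolean", "timestamp"]

normalized = Counter(normalize(name) for name in schema_names)
print(normalized == Counter(dtypes_names))  # True
```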