I am a PhD student from Poland, and I have a question about Apache Spark / PySpark 2. How can I get the 3 smallest unique rows (unique by text, not by length) from a big CSV file (> 10 million rows) with Apache Spark / PySpark 2?
Example of the dat.csv file:
name,id
abc,1
abcd,2
abcde,3
ab,4
ab,4
1. Get a list with the length of every unique row in the data frame:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = SparkConf().setMaster("local[*]")
sc = SparkContext(conf=conf)
sql_context = SQLContext(sc)
df = sql_context.read.csv(
path="/home/rkorniichuk/data.csv", sep=',', encoding="UTF-8",
quote='"', escape='"', header=True, inferSchema=True,
ignoreLeadingWhiteSpace=True, ignoreTrailingWhiteSpace=False,
mode="FAILFAST")
def get_row_length(row):
    length = 0
    for column in row:
        length += len(str(column))
    return length

rows_length_list = [df.foreach(get_row_length)]
>>> rows_length_list
>>> [None]
We have a problem here, because I want to fill rows_length_list with the values [4, 5, 6, 3, 3].
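As an aside, df.foreach() is an action that returns None (it only runs the function on the executors for its side effects), so wrapping it in a list can only ever give [None]. Below is a minimal sketch of one way to actually collect the per-row lengths back to the driver (an illustration, not part of the original post; collecting is only reasonable while the result is small):

# Sketch: compute each row's combined string length on the executors,
# then collect() the resulting list of ints back to the driver.
rows_length_list = (
    df.rdd
      .map(lambda row: sum(len(str(column)) for column in row))
      .collect()
)
# Expected for the sample dat.csv: [4, 5, 6, 3, 3]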
2. Sort rows_length_list:
rows_length_list.sort()
>>> rows_length_list
>>> [3, 4, 5, 6]
3. Get the maximum length value for the sample rows of the CSV file:
>>> rows_length_list[3-1]
>>> 5
4. Get 3 sample rows with length <= 5 characters:
abc,1 # TRUE
abcd,2 # TRUE
abcde,3 # FALSE
ab,4 # TRUE and BREAK
ab,4
Can I implement this with Data Frames only (without SQL requests)?
Answer 0 (score: 1)
You can use concat() to concatenate all columns into one string, wrapped inside length() to calculate the length of the resulting new variable:
from pyspark.sql.functions import concat, length, col
df.withColumn("row_len", length(concat(*df.columns))) \
.filter(col("row_len") <= 5) \
.dropDuplicates() \
.sort("row_len") \
.show()
+----+---+-------+
|name| id|row_len|
+----+---+-------+
| ab| 4| 3|
| abc| 1| 4|
|abcd| 2| 5|
+----+---+-------+
If you have more than 3 rows, you could use .take(3) instead of .show() to get the 3 unique rows with the smallest row_len.
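As a follow-up sketch (my own addition, not part of the original answer), the hard-coded threshold of 5 can be avoided entirely by dropping duplicate rows, ordering by the combined length, and taking the first 3:

from pyspark.sql.functions import concat, length

# Sketch: dedupe the rows, add the combined column length,
# order ascending, and pull the 3 smallest back to the driver
# as a list of Row objects.
smallest_three = (
    df.dropDuplicates()
      .withColumn("row_len", length(concat(*df.columns)))
      .orderBy("row_len")
      .take(3)
)
# Expected for the sample dat.csv:
# [Row(name='ab', id=4, row_len=3), Row(name='abc', id=1, row_len=4),
#  Row(name='abcd', id=2, row_len=5)]

Spark can often execute a sort followed by a small take/limit as a top-k retrieval rather than a full global sort, which helps on a file with more than 10 million rows.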