How to get the 3 smallest unique rows from a large csv file (>10 million rows) with Apache Spark / PySpark?

Time: 2017-04-24 16:24:24

Tags: python apache-spark pyspark apache-spark-sql spark-dataframe

I am a PhD student from Poland and I have a question about Apache Spark / PySpark 2. How can I get the 3 smallest unique rows (unique by text, not by length) from a large csv file (>10 million rows) with Apache Spark / PySpark 2?

Example of the dat.csv csv file:

name,id
abc,1
abcd,2
abcde,3
ab,4
ab,4

1. Get a list with the length of each unique row in the DataFrame:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setMaster("local[*]")
sc = SparkContext(conf=conf)
sql_context = SQLContext(sc)
df = sql_context.read.csv(
        path="/home/rkorniichuk/data.csv", sep=',', encoding="UTF-8",
        quote='"', escape='"', header=True, inferSchema=True,
        ignoreLeadingWhiteSpace=True, ignoreTrailingWhiteSpace=False,
        mode="FAILFAST")

def get_row_length(row):
    # Sum the string length of every column value in the row.
    length = 0
    for column in row:
        length += len(str(column))
    return length

rows_length_list = [df.foreach(get_row_length)]

>>> rows_length_list
>>> [None]

Here we have a problem, because I want rows_length_list to be filled with the values [4, 5, 6, 3, 3].
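The reason is that foreach() runs the function on the executors purely for its side effects and always returns None to the driver, hence the [None]. A minimal sketch of collecting the lengths instead, reusing the get_row_length helper above (note that collect() pulls every value back to the driver, which can be expensive for a file with >10 million rows):

# map() computes one length per row; collect() brings the results back
# to the driver as a plain Python list.
rows_length_list = df.rdd.map(get_row_length).collect()

>>> rows_length_list
>>> [4, 5, 6, 3, 3]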

2. Sort rows_length_list:

rows_length_list.sort()

>>> rows_length_list
>>> [3, 4, 5, 6]
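Note that sorting [4, 5, 6, 3, 3] in place actually yields [3, 3, 4, 5, 6]; the output above only matches if duplicate lengths are dropped as well. A minimal sketch under that assumption:

# Deduplicate the lengths before sorting (assumed from the expected output).
rows_length_list = sorted(set(rows_length_list))

>>> rows_length_list
>>> [3, 4, 5, 6]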

3. Get the maximum length value for the csv file sample rows:

>>> rows_length_list[3-1]
>>> 5

4. Get 3 samples with length <= 5 characters:

abc,1 # TRUE
abcd,2 # TRUE
abcde,3 # FALSE
ab,4 # TRUE and BREAK
ab,4
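This step can also be expressed directly on the RDD. A minimal sketch, assuming the threshold of 5 from step 3 and the get_row_length helper defined earlier (row order after distinct() is not guaranteed, so the 3 rows returned may differ from the manual trace above):

# Keep rows whose total text length is at most 5 characters,
# drop exact duplicates, and take 3 of the matching rows.
samples = (df.rdd
             .filter(lambda row: get_row_length(row) <= 5)
             .distinct()
             .take(3))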

Can I implement this using only DataFrames (without SQL queries)?

1 Answer:

Answer 0 (score: 1):

You can use concat() to concatenate all columns into one string, wrapped inside length() to calculate the length of the resulting string:

from pyspark.sql.functions import concat, length, col

df.withColumn("row_len", length(concat(*df.columns))) \
  .filter(col("row_len") <= 5) \
  .dropDuplicates() \
  .sort("row_len") \
  .show()
+----+---+-------+
|name| id|row_len|
+----+---+-------+
|  ab|  4|      3|
| abc|  1|      4|
|abcd|  2|      5|
+----+---+-------+

If you have more than 3 matching rows, you could use .take(3) instead of .show() to get the 3 unique rows with the smallest row_len.
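For example, a sketch of the same pipeline that returns the 3 smallest unique rows as Row objects instead of printing a table (the column names follow the example above):

# Same pipeline as above, but take(3) returns a list of Row objects
# to the driver instead of printing them.
smallest_rows = (df.withColumn("row_len", length(concat(*df.columns)))
                   .filter(col("row_len") <= 5)
                   .dropDuplicates()
                   .sort("row_len")
                   .take(3))

>>> smallest_rows
>>> [Row(name='ab', id=4, row_len=3), Row(name='abc', id=1, row_len=4), Row(name='abcd', id=2, row_len=5)]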