Count the number of words in a Spark DataFrame column

Date: 2018-02-22 12:20:06

Tags: python apache-spark pyspark apache-spark-sql

How can I find the number of words in a Spark DataFrame column without using SQL's REPLACE() function? Below are the code and input I am working with, but the replace() function doesn't work.

from pyspark.sql import SparkSession
my_spark = SparkSession \
    .builder \
    .appName("Python Spark SQL example") \
    .enableHiveSupport() \
    .getOrCreate()

parqFileName = 'gs://caserta-pyspark-eval/train.pqt'
tuesdayDF = my_spark.read.parquet(parqFileName)

tuesdayDF.createOrReplaceTempView("parquetFile")
tuesdaycrimes = my_spark.sql("SELECT LENGTH(Address) - LENGTH(REPLACE(Address, ' ', '')) + 1 FROM parquetFile")

tuesdaycrimes.show()


+-------------------+--------------+--------------------+---------+----------+--------------+--------------------+-----------+---------+
|              Dates|      Category|            Descript|DayOfWeek|PdDistrict|    Resolution|             Address|          X|        Y|
+-------------------+--------------+--------------------+---------+----------+--------------+--------------------+-----------+---------+
|2015-05-14 03:53:00|      WARRANTS|      WARRANT ARREST|Wednesday|  NORTHERN|ARREST, BOOKED|  OAK ST / LAGUNA ST| -122.42589|37.774597|
|2015-05-14 03:53:00|OTHER OFFENSES|TRAFFIC VIOLATION...|Wednesday|  NORTHERN|ARREST, BOOKED|  OAK ST / LAGUNA ST| -122.42589|37.774597|
|2015-05-14 03:33:00|OTHER OFFENSES|TRAFFIC VIOLATION...|Wednesday|  NORTHERN|ARREST, BOOKED|VANNESS AV / GREE...| -122.42436|37.800415|
+-------------------+--------------+--------------------+---------+----------+--------------+--------------------+-----------+---------+

4 Answers:

Answer 0 (Score: 10)

There are a number of ways to count words using pyspark DataFrame functions, depending on what you are looking for.

Create example data

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

# The original answer used a SQLContext (sqlCtx); a SparkSession works the same way.
spark = SparkSession.builder.getOrCreate()

data = [
    ("2015-05-14 03:53:00", "WARRANT ARREST"),
    ("2015-05-14 03:53:00", "TRAFFIC VIOLATION"),
    ("2015-05-14 03:33:00", "TRAFFIC VIOLATION")
]

df = spark.createDataFrame(data, ["Dates", "Description"])
df.show()
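
For reference, df.show() should print the example data as:

#+-------------------+-----------------+
#|              Dates|      Description|
#+-------------------+-----------------+
#|2015-05-14 03:53:00|   WARRANT ARREST|
#|2015-05-14 03:53:00|TRAFFIC VIOLATION|
#|2015-05-14 03:33:00|TRAFFIC VIOLATION|
#+-------------------+-----------------+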

In this example, we will count the words in the Description column.

Count the words in each row

If you want the count of words in the column for each row, you can create a new column using withColumn() and do the following:

df = df.withColumn('wordCount', f.size(f.split(f.col('Description'), ' ')))
df.show()
#+-------------------+-----------------+---------+
#|              Dates|      Description|wordCount|
#+-------------------+-----------------+---------+
#|2015-05-14 03:53:00|   WARRANT ARREST|        2|
#|2015-05-14 03:53:00|TRAFFIC VIOLATION|        2|
#|2015-05-14 03:33:00|TRAFFIC VIOLATION|        2|
#+-------------------+-----------------+---------+
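
Note that splitting on a single space will overcount if a value contains consecutive spaces. If your data may have irregular whitespace, a regex split pattern is a safer sketch:

# Split on runs of whitespace instead of a single literal space
df = df.withColumn('wordCount', f.size(f.split(f.trim(f.col('Description')), '\\s+')))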

Count words across all rows

If you want to count the total number of words in the column across the entire DataFrame, you can use pyspark.sql.functions.sum():

df.select(f.sum('wordCount')).collect() 
#[Row(sum(wordCount)=6)]
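
The same total can also be computed in one pass, without the intermediate wordCount column; a minimal sketch (the totalWords alias is just for readability):

df.select(f.sum(f.size(f.split(f.col('Description'), ' '))).alias('totalWords')).show()
#+----------+
#|totalWords|
#+----------+
#|         6|
#+----------+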

Count the occurrences of each word

If you want a count of each word across the entire DataFrame, you can use split() and pyspark.sql.functions.explode(), followed by groupBy() and count().

df.withColumn('word', f.explode(f.split(f.col('Description'), ' ')))\
    .groupBy('word')\
    .count()\
    .sort('count', ascending=False)\
    .show()
#+---------+-----+
#|     word|count|
#+---------+-----+
#|  TRAFFIC|    2|
#|VIOLATION|    2|
#|  WARRANT|    1|
#|   ARREST|    1|
#+---------+-----+
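
If the column were mixed-case or could contain empty tokens, you might normalize before grouping; a sketch, assuming case-insensitive counts are wanted:

df.withColumn('word', f.explode(f.split(f.lower(f.col('Description')), '\\s+')))\
    .filter(f.col('word') != '')\
    .groupBy('word')\
    .count()\
    .sort('count', ascending=False)\
    .show()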

Answer 1 (Score: 1)

You can do it simply with the split and size pyspark API functions (example below):

import pyspark.sql.functions as F
# assumes an active SparkSession named `spark` (the original used an undefined sqlContext)
spark.createDataFrame([['this is a sample address'], ['another address']])\
    .select(F.size(F.split(F.col("_1"), " "))).show()

Output:
+------------------+
|size(split(_1,  ))|
+------------------+
|                 5|
|                 2|
+------------------+
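
To give the computed column a friendlier header, you could attach an alias (a small variation on the answer above):

spark.createDataFrame([['this is a sample address'], ['another address']])\
    .select(F.size(F.split(F.col("_1"), " ")).alias("wordCount")).show()
#+---------+
#|wordCount|
#+---------+
#|        5|
#|        2|
#+---------+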

Answer 2 (Score: 0)

# Count the total number of words in the Address column using the RDD API
tuesdaycrimes.select("Address").rdd.flatMap(lambda row: row[0].split(" ")).count()

Answer 3 (Score: 0)

You can define a udf function as

from pyspark.sql import functions as F

def splitAndCountUdf(x):
    return len(x.split(" "))

countWords = F.udf(splitAndCountUdf, 'int')

and call it using the .withColumn function:

tuesdayDF.withColumn("wordCount", countWords(tuesdayDF.Address))

And if you want a count of distinct words, you can change the udf to include a set:

from pyspark.sql import functions as F

def splitAndCountUdf(x):
    return len(set(x.split(" ")))

countWords = F.udf(splitAndCountUdf, 'int')
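
As with the first udf, you would attach it with .withColumn; a usage sketch, assuming the same tuesdayDF (the distinctWordCount column name is just illustrative):

tuesdayDF.withColumn("distinctWordCount", countWords(tuesdayDF.Address)).show()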