How can I find the word count of a Spark dataframe column without using SQL's REPLACE() function? Below is the code and input I am working with, but the replace() function doesn't work.
from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("Python Spark SQL example") \
    .enableHiveSupport() \
    .getOrCreate()

parqFileName = 'gs://caserta-pyspark-eval/train.pqt'
tuesdayDF = my_spark.read.parquet(parqFileName)
tuesdayDF.createOrReplaceTempView("parquetFile")

tuesdaycrimes = my_spark.sql("SELECT LENGTH(Address) - LENGTH(REPLACE(Address, ' ', ''))+1 FROM parquetFile")
tuesdaycrimes.show()
+-------------------+--------------+--------------------+---------+----------+--------------+--------------------+-----------+---------+
| Dates| Category| Descript|DayOfWeek|PdDistrict| Resolution| Address| X| Y|
+-------------------+--------------+--------------------+---------+----------+--------------+--------------------+-----------+---------+
|2015-05-14 03:53:00| WARRANTS| WARRANT ARREST|Wednesday| NORTHERN|ARREST, BOOKED| OAK ST / LAGUNA ST| -122.42589|37.774597|
|2015-05-14 03:53:00|OTHER OFFENSES|TRAFFIC VIOLATION...|Wednesday| NORTHERN|ARREST, BOOKED| OAK ST / LAGUNA ST| -122.42589|37.774597|
|2015-05-14 03:33:00|OTHER OFFENSES|TRAFFIC VIOLATION...|Wednesday|  NORTHERN|ARREST, BOOKED|VANNESS AV / GREE...| -122.42436|37.800415|
+-------------------+--------------+--------------------+---------+----------+--------------+--------------------+-----------+---------+
Answer 0 (score: 10)
There are several ways to count words with pyspark DataFrame functions, depending on what you are looking for.

Create example data
import pyspark.sql.functions as f

data = [
    ("2015-05-14 03:53:00", "WARRANT ARREST"),
    ("2015-05-14 03:53:00", "TRAFFIC VIOLATION"),
    ("2015-05-14 03:33:00", "TRAFFIC VIOLATION")
]
df = my_spark.createDataFrame(data, ["Dates", "Description"])
df.show()
In this example, we will count the words in the Description column.
Count words in each row
If you want a word count for the column in each row, you can create a new column using withColumn() and:

use pyspark.sql.functions.split() to break the string into a list
use pyspark.sql.functions.size() to count the length of the list

For example:
df = df.withColumn('wordCount', f.size(f.split(f.col('Description'), ' ')))
df.show()
#+-------------------+-----------------+---------+
#| Dates| Description|wordCount|
#+-------------------+-----------------+---------+
#|2015-05-14 03:53:00| WARRANT ARREST| 2|
#|2015-05-14 03:53:00|TRAFFIC VIOLATION| 2|
#|2015-05-14 03:33:00|TRAFFIC VIOLATION| 2|
#+-------------------+-----------------+---------+
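One caveat: split(col, ' ') yields empty tokens when a value has leading or consecutive spaces (as in the padded Address strings in the question), which inflates the count. A sketch that guards against this by trimming first and splitting on a whitespace regex (split accepts a Java regex pattern):

# Trim leading/trailing spaces, then split on runs of whitespace
# so consecutive spaces don't produce empty tokens.
df = df.withColumn('wordCount', f.size(f.split(f.trim(f.col('Description')), '\\s+')))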
Count total words across all rows
If you want to count the total number of words in the column across the entire DataFrame, you can use pyspark.sql.functions.sum():
df.select(f.sum('wordCount')).collect()
#[Row(sum(wordCount)=6)]
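If you want the total as a plain Python number instead of a Row, an equivalent form (a small sketch reusing the wordCount column from above) is:

total = df.agg(f.sum('wordCount')).first()[0]
#6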
Count the occurrences of each word
If you want the count of each word across the entire DataFrame, you can use split() and pyspark.sql.functions.explode(), followed by a groupBy and count().
df.withColumn('word', f.explode(f.split(f.col('Description'), ' ')))\
.groupBy('word')\
.count()\
.sort('count', ascending=False)\
.show()
#+---------+-----+
#| word|count|
#+---------+-----+
#| TRAFFIC| 2|
#|VIOLATION| 2|
#| WARRANT| 1|
#| ARREST| 1|
#+---------+-----+
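The same explode pattern also gives the number of distinct words across the whole DataFrame; a minimal sketch reusing df from above:

# One word per row, then count the distinct values
df.select(f.explode(f.split(f.col('Description'), ' ')).alias('word')) \
    .distinct() \
    .count()
#4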
Answer 1 (score: 1)
You can simply use the split and size pyspark API functions (example below):
import pyspark.sql.functions as F

my_spark.createDataFrame([['this is a sample address'], ['another address']]) \
    .select(F.size(F.split(F.col("_1"), " "))).show()
Output:
+------------------+
|size(split(_1,  ))|
+------------------+
|                 5|
|                 2|
+------------------+
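Since the original attempt went through spark.sql, the same pair of functions is also available as SQL built-ins, so the REPLACE() trick can be dropped entirely; a sketch assuming the parquetFile temp view from the question:

my_spark.sql("SELECT size(split(Address, ' ')) AS wordCount FROM parquetFile").show()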
Answer 2 (score: 0)
tuesdaycrimes.select("Address").map(x->x.split(" ")).flatmap().count()
Answer 3 (score: 0)
You can define a udf function as
def splitAndCountUdf(x):
    return len(x.split(" "))

from pyspark.sql import functions as F
countWords = F.udf(splitAndCountUdf, 'int')
and call it using the .withColumn function as
tuesdayDF.withColumn("wordCount", countWords(tuesdayDF.address))
If you want the number of distinct words per row, you can change the udf function to include a set as
def splitAndCountUdf(x):
    return len(set(x.split(" ")))

from pyspark.sql import functions as F
countWords = F.udf(splitAndCountUdf, 'int')
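As a side note, on Spark 2.4+ the same distinct-word count can be done without a udf, which usually performs better; a sketch assuming the Address column from the question:

# array_distinct removes duplicate tokens before size counts them
tuesdayDF.withColumn("distinctWordCount", F.size(F.array_distinct(F.split(tuesdayDF.Address, " "))))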