I'm starting my journey with PySpark and I'm a bit stuck on the following. I have this code (taken from https://spark.apache.org/docs/2.1.0/ml-features.html):

from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
sentenceDataFrame = spark.createDataFrame([
(0, "Hi I heard about Spark"),
(1, "I wish Java could use case classes"),
(2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
# alternatively, pattern="\\w+", gaps(False)
countTokens = udf(lambda words: len(words), IntegerType())
tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words")\
.withColumn("tokens", countTokens(col("words"))).show(truncate=False)
regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words") \
.withColumn("tokens", countTokens(col("words"))).show(truncate=False)
and I'm adding something like this:
test = spark.createDataFrame([
(0, "spark"),
(1, "java"),
(2, "i")
], ["id", "word"])
The output is:
+---+-----------------------------------+------------------------------------------+------+
|id |sentence                           |words                                     |tokens|
+---+-----------------------------------+------------------------------------------+------+
|0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|1  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|2  |Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5     |
+---+-----------------------------------+------------------------------------------+------+
Is it possible to achieve something like this: [id from 'test', id from 'regexTokenized']
2, 0
2, 1
1, 1
0, 1
That is, for the list of words in 'test', can I fetch the ids from 'regexTokenized' wherever the tokenized 'words' match between the two datasets? Or should I take a different approach altogether?
Thanks in advance for any help :)
Answer 0 (score: 0)
Use explode and join:
from pyspark.sql.functions import explode

# Both DataFrames are assumed to be already tokenized, i.e. to have
# an array column "words". Explode each array into one row per word,
# then join the two sides on the shared "word" column.
(testTokenized.alias("test")
    .select("id", explode("words").alias("word"))
    .join(
        trainTokenized.select("id", explode("words").alias("word")).alias("train"),
        "word"))