PySpark: mapping words with Tokenizer

Date: 2017-12-27 13:31:59

Tags: python-3.x apache-spark pyspark apache-spark-sql spark-dataframe

I am starting my journey with PySpark and I am a bit stuck on the following. I have this code (taken from https://spark.apache.org/docs/2.1.0/ml-features.html):
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")

regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
# alternatively, pattern="\\w+", gaps(False)

countTokens = udf(lambda words: len(words), IntegerType())

tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words")\
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)

regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words") \
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)

I am adding a second DataFrame like this:

test = spark.createDataFrame([
    (0, "spark"),
    (1, "java"),
    (2, "i")
], ["id", "word"])

The output is:

+---+-----------------------------------+------------------------------------------+------+
|id |sentence                           |words                                     |tokens|
+---+-----------------------------------+------------------------------------------+------+
|0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|1  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|2  |Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5     |
+---+-----------------------------------+------------------------------------------+------+

Is it possible to achieve something like this: [id from 'test', id from 'regexTokenized']

2, 0
2, 1
1, 1
0, 1

That is, given the list of words in 'test', can I look up the ids in 'regexTokenized' whose tokenized 'words' contain those words, i.e. map between the two datasets? Or should I take a different approach?

Thanks in advance for any help :)

1 answer:

Answer 0 (score: 0)

Use `explode` and `join`:

from pyspark.sql.functions import explode

# testTokenized / trainTokenized are the tokenized versions of the 'test'
# and sentence DataFrames; explode each words array into one row per word,
# then join the two DataFrames on that word column
(testTokenized.alias("test")
    .select("id", explode("words").alias("word"))
    .join(
        trainTokenized.select("id", explode("words").alias("word")).alias("train"),
        "word"))
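Conceptually, `explode` turns each `(id, words)` row into one row per word, and the join on `word` then pairs up the ids from both sides. A plain-Python sketch of the same logic, using the sample data from the question (no Spark required):

```python
# tokenized sentences, as produced by regexTokenizer above
regex_tokenized = {
    0: ["hi", "i", "heard", "about", "spark"],
    1: ["i", "wish", "java", "could", "use", "case", "classes"],
    2: ["logistic", "regression", "models", "are", "neat"],
}
# the lookup words from the 'test' DataFrame
test = {0: "spark", 1: "java", 2: "i"}

# "explode": one (sentence_id, word) pair per token
exploded = [(sid, w) for sid, words in regex_tokenized.items() for w in words]

# "join" on word: pair each test id with every sentence id sharing the token
pairs = sorted((tid, sid) for tid, word in test.items()
               for sid, w in exploded if w == word)
print(pairs)  # [(0, 0), (1, 1), (2, 0), (2, 1)]
```

The result matches the pairs asked for in the question (up to row order, which Spark does not guarantee either): "i" (test id 2) appears in sentences 0 and 1, "java" (test id 1) in sentence 1, and "spark" (test id 0) in sentence 0.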