How to create a custom tokenizer in PySpark ML

Asked: 2018-01-16 09:56:30

Tags: python apache-spark pyspark spark-dataframe apache-spark-mllib

from pyspark.ml.feature import Tokenizer

sentenceDataFrame = spark.createDataFrame([
        (0, "Hi I heard about Spark"),
        (1, "I wish Java could use case classes"),
        (2, "Logistic,regression,models,are,neat")
    ], ["id", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(sentenceDataFrame)

If I run the command

tokenized.head()

I would like to get a result like this:

Row(id=0, sentence='Hi I heard about Spark',
    words=['H','i',' ','h','e','a', ...])

However, the result I actually get is:

Row(id=0, sentence='Hi I heard about Spark',
    words=['Hi','I','heard','about','spark'])

Is there any way to achieve this with Tokenizer or RegexTokenizer in PySpark?

A similar question is here: Create a custom Transformer in PySpark ML

1 answer:

Answer 0 (score: 1):

Looking at the pyspark.ml documentation, Tokenizer only splits on whitespace, but RegexTokenizer, as the name suggests, uses a regular expression to find either the split points or the tokens to be extracted (the regular expression is set with the pattern parameter, and the choice between the two behaviours with the gaps parameter).
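
To make the two modes concrete, here is a small sketch reusing the sentenceDataFrame defined in the question (the split_tokenizer / match_tokenizer names are just illustrative):

from pyspark.ml.feature import RegexTokenizer

# gaps=True (the default): the pattern describes the separators to split on
split_tokenizer = RegexTokenizer(
    inputCol="sentence", outputCol="words", pattern=r"\s+", gaps=True)

# gaps=False: the pattern describes the tokens themselves to extract
match_tokenizer = RegexTokenizer(
    inputCol="sentence", outputCol="words", pattern=r"\w+", gaps=False)

split_tokenizer.transform(sentenceDataFrame).head()
match_tokenizer.transform(sentenceDataFrame).head()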

If you pass an empty pattern and leave gaps=True (which is the default), you should get the result you want:
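
A minimal sketch of that approach, again reusing the question's sentenceDataFrame (toLowercase=False is an extra assumption here, added only to preserve the casing of each character, since RegexTokenizer lowercases its output by default):

from pyspark.ml.feature import RegexTokenizer

# An empty pattern with gaps=True splits the sentence at every position,
# so each character (including spaces) becomes its own token.
char_tokenizer = RegexTokenizer(
    inputCol="sentence", outputCol="words",
    pattern="", gaps=True, toLowercase=False)

char_tokenizer.transform(sentenceDataFrame).head()
# should give something like:
# Row(id=0, sentence='Hi I heard about Spark',
#     words=['H', 'i', ' ', 'h', 'e', 'a', ...])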