from pyspark.ml.feature import Tokenizer

sentenceDataFrame = spark.createDataFrame([
(0, "Hi I heard about Spark"),
(1, "I wish Java could use case classes"),
(2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(sentenceDataFrame)
If I run the command
tokenized.head()
I would like to get a result like this:
Row(id=0, sentence='Hi I heard about Spark',
words=['H', 'i', ' ', 'h', 'e', 'a', ...])
However, the result I actually get is
Row(id=0, sentence='Hi I heard about Spark',
words=['hi', 'i', 'heard', 'about', 'spark'])
Is there a way to achieve this with the Tokenizer or RegexTokenizer in PySpark?
Answer (score: 1)
Have a look at the pyspark.ml documentation. The Tokenizer only splits on whitespace, but the RegexTokenizer, as its name suggests, uses a regular expression to find either the split points or the tokens to extract (this behavior is controlled by the gaps parameter). If you pass an empty pattern and leave gaps=True (which is the default), you should get the result you want:
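A minimal sketch of that configuration, reusing the sentenceDataFrame from the question (the variable names here are only illustrative; note that RegexTokenizer also lowercases its output by default):

from pyspark.ml.feature import RegexTokenizer

# An empty pattern with gaps=True splits between every character,
# so each character (including spaces) becomes its own token.
char_tokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="")
char_tokenized = char_tokenizer.transform(sentenceDataFrame)
char_tokenized.head()
# Expected result (lowercased, since toLowercase defaults to True):
# Row(id=0, sentence='Hi I heard about Spark',
#     words=['h', 'i', ' ', 'i', ' ', 'h', 'e', 'a', 'r', 'd', ...])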