Question

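Using Python and Spark:

Say I have a DataFrame whose rows contain sentences. How can I normalize it (in the DBMS sense) into another DataFrame in which each row contains a single word split out of a sentence?

For example, suppose df_sentences looks like this: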

[Row(sentence_id=1, sentence=u'the dog ran the fastest.'),
 Row(sentence_id=2, sentence=u'the cat sat down.')]

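I'm looking for a transformation from df_sentences to df_words that takes these two rows and builds a larger (by row count) DataFrame like the one below. Note that sentence_id is carried over into the new table: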

[Row(sentence_id=1, word=u'the'),
 Row(sentence_id=1, word=u'the'),
 Row(sentence_id=1, word=u'fastest'),
 Row(sentence_id=1, word=u'dog'),
 Row(sentence_id=1, word=u'ran'),
 Row(sentence_id=2, word=u'cat'),
 ...clip...]

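I'm not really interested in row counts or unique words right now, because I want to join other RDDs on sentence_id to pull in other interesting data that I've stored elsewhere.

I suspect Spark is well suited to this kind of intermediate transformation in a pipeline, so I'd like to understand the best way to do it and start collecting my own snippets.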

Answer 1

It's actually pretty simple. Let's start by creating a DataFrame:

from pyspark.sql import Row

# Assumes an existing SparkContext named sc (as in the PySpark shell).
df = sc.parallelize([
    Row(sentence_id=1, sentence=u'the dog ran the fastest.'),
    Row(sentence_id=2, sentence=u'the cat sat down.')
]).toDF()

Next we need a tokenizer:

from pyspark.ml.feature import RegexTokenizer

tokenizer = RegexTokenizer(
    inputCol="sentence", outputCol="words", pattern="\\W+")
tokenized = tokenizer.transform(df)
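
To make the tokenizer's behavior concrete, here is a rough pure-Python approximation of what RegexTokenizer does with pattern="\\W+" and its default settings (treat the pattern as a gap between tokens, lowercase the input, keep tokens of length 1 or more). The tokenize helper is hypothetical and only meant for reasoning about small inputs; it is not part of Spark.

```python
import re

# Runs of non-word characters act as separators between tokens.
# re.ASCII approximates Java's default ASCII-only meaning of \W.
_GAP = re.compile(r"\W+", re.ASCII)


def tokenize(sentence, min_token_length=1):
    """Lowercase the sentence, split on the gap pattern, drop too-short tokens."""
    tokens = _GAP.split(sentence.lower())
    return [t for t in tokens if len(t) >= min_token_length]


print(tokenize(u'the dog ran the fastest.'))
```

Note that the trailing period produces an empty string when splitting, which the length filter removes.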

Finally we drop sentence and explode words:

from pyspark.sql.functions import explode, col

transformed = (tokenized
    .drop("sentence")
    .select(col("sentence_id"), explode(col("words")).alias("word")))

The final result:

transformed.show()

## +-----------+-------+
## |sentence_id|   word|
## +-----------+-------+
## |          1|    the|
## |          1|    dog|
## |          1|    ran|
## |          1|    the|
## |          1|fastest|
## |          2|    the|
## |          2|    cat|
## |          2|    sat|
## |          2|   down|
## +-----------+-------+

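Notes:

Depending on the data, explode can be quite expensive because it duplicates the other columns. Make sure to apply all filters, for example with StopWordsRemover, before applying explode.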
