How to index Spark CoreNLP analysis?

Date: 2017-09-11 14:28:24

Tags: scala apache-spark stanford-nlp

I have been using the Stanford CoreNLP wrapper for Apache Spark for NER analysis and it works well. However, I would like to extend the simple example so that I can map the analysis back to the original dataframe row id. Below, I have added two rows to the simple example.


Starting from the input dataframe below, I can then run it through the Spark CoreNLP wrapper for sentiment and NER analysis.

val input = Seq(
  (1, "<xml>Apple is located in California. It is a great company.</xml>"),
  (2, "<xml>Google is located in California. It is a great company.</xml>"),
  (3, "<xml>Netflix is located in California. It is a great company.</xml>")
).toDF("id", "text")

input.show()

input: org.apache.spark.sql.DataFrame = [id: int, text: string]
+---+--------------------+
| id|                text|
+---+--------------------+
|  1|<xml>Apple is loc...|
|  2|<xml>Google is lo...|
|  3|<xml>Netflix is l...|
+---+--------------------+

However, in the output of the following code I have lost the connection to the original dataframe row id.

val output = input
  .select(cleanxml('text).as('doc))
  .select(explode(ssplit('doc)).as('sen))
  .select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))

This produces the output below, which no longer carries the id column:

+--------------------+--------------------+--------------------+---------+
|                 sen|               words|             nerTags|sentiment|
+--------------------+--------------------+--------------------+---------+
|Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...|        2|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
|Google is located...|[Google, is, loca...|[ORGANIZATION, O,...|        3|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
|Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...|        3|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
+--------------------+--------------------+--------------------+---------+

I have tried creating a UDF but could not get it to work.

1 Answer:

Answer 0 (score: 0)

Using the UDFs defined in the Stanford CoreNLP wrapper for Apache Spark, you can generate the desired output with the following code:

val output = input
  .withColumn("doc", cleanxml('text))
  .withColumn("sen", explode(ssplit('doc)))
  .withColumn("words", tokenize('sen))
  .withColumn("nerTags", ner('sen))
  .withColumn("sentiment", sentiment('sen))
  .drop("text")
  .drop("doc")

output.show()

This will generate the following dataframe:

+--+---------------------+--------------------+--------------------+---------+
|id|                  sen|               words|             nerTags|sentiment|
+--+---------------------+--------------------+--------------------+---------+
| 1| Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...|        2|
| 1| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
| 2| Google is located...|[Google, is, loca...|[ORGANIZATION, O,...|        3|
| 2| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
| 3| Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...|        3|
| 3| It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
+--+---------------------+--------------------+--------------------+---------+
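The id column survives here because `withColumn` appends each new column to the existing dataframe rather than projecting it away, and `explode` duplicates every other column once per exploded element. A minimal plain-Scala analog of that fan-out (no Spark required, illustrative only):

```scala
// Plain-Scala analog of Spark's explode: each (id, sentences) row fans
// out into one (id, sentence) row per sentence, so the id is preserved.
val rows = Seq(
  (1, Seq("Apple is located in California.", "It is a great company.")),
  (2, Seq("Google is located in California.", "It is a great company."))
)
val exploded = rows.flatMap { case (id, sens) => sens.map(sen => (id, sen)) }
exploded.foreach(println)
```

By contrast, each `select` in the question's pipeline keeps only the columns explicitly listed, which is why id disappeared after the first projection.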