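I have been using the Stanford CoreNLP wrapper for Apache Spark for NER analysis and it works well. However, I want to extend the simple example so that I can map the analysis back to the original dataframe id. See below, where I have added two more rows to the simple example.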
Here is the input dataframe, which I can then run through the Spark CoreNLP wrapper for both sentiment and NER analysis.
val input = Seq(
  (1, "<xml>Apple is located in California. It is a great company.</xml>"),
  (2, "<xml>Google is located in California. It is a great company.</xml>"),
  (3, "<xml>Netflix is located in California. It is a great company.</xml>")
).toDF("id", "text")
input.show()
input: org.apache.spark.sql.DataFrame = [id: int, text: string]
+---+--------------------+
| id|                text|
+---+--------------------+
|  1|<xml>Apple is loc...|
|  2|<xml>Google is lo...|
|  3|<xml>Netflix is l...|
+---+--------------------+
However, the pipeline below produces output that has lost the connection to the original dataframe row ids.
val output = input
  .select(cleanxml('text).as('doc))
  .select(explode(ssplit('doc)).as('sen))
  .select('sen, tokenize('sen).as('words), ner('sen).as('nerTags), sentiment('sen).as('sentiment))
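This is the output I currently get. Ideally, I would like the same table with each row still carrying its original id: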
+--------------------+--------------------+--------------------+---------+
|                 sen|               words|             nerTags|sentiment|
+--------------------+--------------------+--------------------+---------+
|Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...|        2|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
|Google is located...|[Google, is, loca...|[ORGANIZATION, O,...|        3|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
|Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...|        3|
|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
+--------------------+--------------------+--------------------+---------+
I have tried to create a UDF but could not get it to work.
Answer 0 (score: 0)
Using the UDFs defined in the Stanford CoreNLP wrapper for Apache Spark, you can generate the desired output with the following code:
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._

// withColumn keeps every existing column, so id survives each step
val output = input
  .withColumn("doc", cleanxml('text))
  .withColumn("sen", explode(ssplit('doc))) // one row per sentence
  .withColumn("words", tokenize('sen))
  .withColumn("nerTags", ner('sen))
  .withColumn("sentiment", sentiment('sen))
  .drop("text")
  .drop("doc")

output.show()
This produces the following DataFrame:
+---+--------------------+--------------------+--------------------+---------+
| id|                 sen|               words|             nerTags|sentiment|
+---+--------------------+--------------------+--------------------+---------+
|  1|Apple is located ...|[Apple, is, locat...|[ORGANIZATION, O,...|        2|
|  1|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
|  2|Google is located...|[Google, is, loca...|[ORGANIZATION, O,...|        3|
|  2|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
|  3|Netflix is locate...|[Netflix, is, loc...|[ORGANIZATION, O,...|        3|
|  3|It is a great com...|[It, is, a, great...|  [O, O, O, O, O, O]|        4|
+---+--------------------+--------------------+--------------------+---------+
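As an alternative, the select chain from the question should also keep the id if the id column is passed through each select explicitly. The following is only a sketch under some assumptions: it expects a spark-shell session (so that spark is defined) started with the spark-corenlp package and the CoreNLP models on the classpath. The per-id aggregation at the end, including the names perId, joined and avgSentiment, is a hypothetical illustration of mapping results back to the original rows and is not part of the wrapper.

```scala
// Assumption: run inside spark-shell with spark-corenlp and the
// CoreNLP models available on the classpath.
import org.apache.spark.sql.functions._
import com.databricks.spark.corenlp.functions._
import spark.implicits._

val input = Seq(
  (1, "<xml>Apple is located in California. It is a great company.</xml>"),
  (2, "<xml>Google is located in California. It is a great company.</xml>"),
  (3, "<xml>Netflix is located in California. It is a great company.</xml>")
).toDF("id", "text")

// Carry id through every select so each sentence row keeps its source id.
val output = input
  .select('id, cleanxml('text).as('doc))
  .select('id, explode(ssplit('doc)).as('sen))
  .select('id, 'sen,
    tokenize('sen).as('words),
    ner('sen).as('nerTags),
    sentiment('sen).as('sentiment))

// Hypothetical follow-up: average sentence sentiment per id,
// joined back onto the original rows by id.
val perId = output
  .groupBy("id")
  .agg(avg("sentiment").as("avgSentiment"))

val joined = input.join(perId, Seq("id"))
joined.show()
```

Compared with the withColumn version, this form makes the retained columns explicit at each step, so nothing has to be dropped at the end.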