Question

I have a text file in CoNLL-U format and need to extract Token_Label pairs from it. A sample of the file:

# newdoc id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713
# sent_id = weblog-blogspot.com_nominations_20041117172713_ENG_20041117_172713-0001
# text = From the AP comes this story :
1   From    from    ADP IN  _   3   case    3:case  _
2   the the DET DT  Definite=Def|PronType=Art   3   det 3:det   _
3   AP  AP  PROPN   NNP Number=Sing 4   obl 4:obl:from  _
4   comes   come    VERB    VBZ Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   0   root    0:root  _
5   this    this    DET DT  Number=Sing|PronType=Dem    6   det 6:det   _
6   story   story   NOUN    NN  Number=Sing 4   nsubj   4:nsubj _
7   :   :   PUNCT   :   _   4   punct   4:punct _

# sent_id = weblog-juancole.com_juancole_20040324065800_ENG_20040324_065800-0005
# text = In Ramadi, there was a big demonstration.
1   In  in  ADP IN  _   2   case    2:case  _
2   Ramadi  Ramadi  PROPN   NNP Number=Sing 5   obl 5:obl:in    SpaceAfter=No
3   ,   ,   PUNCT   ,   _   5   punct   5:punct _
4   there   there   PRON    EX  _   5   expl    5:expl  _
5   was be  VERB    VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin   0   root    0:root  _
6   a   a   DET DT  Definite=Ind|PronType=Art   8   det 8:det   _
7   big big ADJ JJ  Degree=Pos  8   amod    8:amod  _
8   demonstration   demonstration   NOUN    NN  Number=Sing 5   nsubj   5:nsubj SpaceAfter=No
9   .   .   PUNCT   .   _   5   punct   5:punct _

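As you can see, each sentence is tokenized, each token line carries several annotations separated by a tab \t (token, lemma, UD POS, etc.), and each sentence is separated from the next by an empty line.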
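To make the column layout concrete, here is a minimal sketch of turning one line into a Token_POS string. The object name `ConlluLine` and the function `tokenPos` are hypothetical names chosen for illustration; the sketch assumes the columns are tab-separated with the token in column 2 and the UD POS tag in column 4, as in the sample above.

```scala
object ConlluLine {
  // Returns Some("TOKEN_POS") for a token line, None for comment lines and blank lines.
  // Assumption: columns are separated by a tab, token is column 2, UD POS is column 4.
  // Note: multiword-token ranges (IDs such as "1-2") are not filtered out here.
  def tokenPos(line: String): Option[String] = {
    if (line.trim.isEmpty || line.startsWith("#")) None
    else {
      val cols = line.split("\t")
      if (cols.length >= 4) Some(cols(1) + "_" + cols(3)) else None
    }
  }
}
```

For example, `ConlluLine.tokenPos("1\tFrom\tfrom\tADP\tIN\t_\t3\tcase\t3:case\t_")` gives `Some("From_ADP")`, while a `# text = ...` line or an empty line gives `None`.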

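To get the Token_POS pairs of each sentence, I use the code below to turn each sentence into a string such as What_PRON if_SCON..., and then convert the result to a DataFrame so that I can use withColumn to extract the tokens and the labels into separate Array-typed columns for my project.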

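import spark.implicits._ // encoders needed by .as[String], .map and .toDF

val testPath = "en_ewt-ud-test.conllu"
val testInput = spark.read.text(testPath).as[String]

val extractedTokensTags = testInput
  // split each line into its tab-separated columns; a comment line ("# ...") becomes an empty array
  .map(s => s.split("\t").filter(x => !x.startsWith("#")))
  // drop the comment lines
  .filter(x => x.length > 0)
  // token line -> TOKEN_POS; a blank line (a single empty column) -> sentence-end marker
  .map { x => if (x.length > 1) x(1) + "_" + x(3) else "endOfLine" }
  // concatenate everything into one string, then cut it at the markers
  .reduce((s1, s2) => s1 + " " + s2)
  .split(" endOfLine | endOfLine")

spark.sparkContext.parallelize(extractedTokensTags).toDF("arrays").show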

+--------------------+
|              arrays|
+--------------------+
|What_PRON if_SCON...|
|What_PRON if_SCON...|
|[_PUNCT via_ADP M...|
|(_PUNCT And_CCONJ...|
|This_DET BuzzMach...|
|Google_PROPN is_A...|
|Does_AUX anybody_...|
|They_PRON own_VER...|

This code is an absolute hack! It is ugly too, but it does the job and has given me what I wanted so far.

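Problem:

If the file is large, the reduce step is split across multiple tasks, and as a result the order of the lines is not preserved. (I guess I could mess with the shuffles or the number of tasks, but one hack is enough!)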

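Questions:

How can I group the lines based on the empty lines? (I would like to get rid of the endOfLine marker in .map and of the .reduce hack.)
Is it possible to use zipWithIndex so that every line of a section gets the same unique index, so that at the end I can use reduceByKey, or group by that ID in my DataFrame, without caring about the order?
Is there a better way that uses only the Spark SQL API?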
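To illustrate the indexing idea from the second question, here is a minimal sketch on plain Scala collections, not on Spark. Each line gets a line index (via `zipWithIndex`) and a sentence ID equal to the number of blank lines before it; once both are attached, the rows can arrive in any order and still be reassembled. The names `GroupByBlankLine`, `keyLines` and `assemble` are hypothetical. On an RDD, `zipWithIndex` could supply the line index, but the running count of blank lines would still have to be computed across partitions, which this sketch does not show.

```scala
object GroupByBlankLine {
  // Attach (sentenceId, lineIndex, "TOKEN_POS") to every token line.
  // Assumption: tab-separated columns, token in column 2, UD POS in column 4.
  def keyLines(lines: Seq[String]): Seq[(Int, Int, String)] = {
    // ids(i) = number of blank lines strictly before line i
    val ids = lines.scanLeft(0)((id, l) => if (l.trim.isEmpty) id + 1 else id)
    lines.zipWithIndex.zip(ids).collect {
      case ((line, idx), sid) if line.trim.nonEmpty && !line.startsWith("#") =>
        val cols = line.split("\t")
        (sid, idx, cols(1) + "_" + cols(3))
    }
  }

  // Rebuild one string per sentence; the input order of `keyed` does not matter.
  def assemble(keyed: Seq[(Int, Int, String)]): Seq[String] =
    keyed.groupBy(_._1).toSeq.sortBy(_._1).map { case (_, rows) =>
      rows.sortBy(_._2).map(_._3).mkString(" ")
    }
}
```

Calling `assemble(keyLines(lines).reverse)` returns the same sentences as `assemble(keyLines(lines))`, which is the property I am after: the grouping key and the line index carry the order, not the position in the collection.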

Expected result for the given example:

An Array[String], so that I can parallelize it into a DataFrame:

Array[String] = Array(From_ADP the_DET AP_PROPN comes_VERB this_DET story_NOUN :_PUNCT)

Array[String] = Array(In_ADP Ramadi_PROPN ,_PUNCT there_PRON was_VERB a_DET big_ADJ demonstration_NOUN ._PUNCT)

or

A DataFrame with 2 columns:

tokens: Array[String] = (From, the, AP, comes, this, story, :)

labels: Array[String] = (ADP, DET, PROPN, VERB, DET, NOUN, PUNCT)

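If I can get either of these two results, I can handle the rest. My main problem is that I do not know how to group the lines using the empty line as a separator or delimiter of some kind; the second problem is preserving the order, either by an ID or line by line.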
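For the two-column form, here is a minimal sketch that treats the empty line as a record delimiter, again on a plain string rather than on Spark. The names `SentenceRecords`, `tokensAndTags` and `parse` are hypothetical. As an assumption to check against the Spark version in use: the text reader's `lineSep` option (for example `spark.read.option("lineSep", "\n\n").text(path)`) or the Hadoop setting `textinputformat.record.delimiter` might deliver one such record per row, in which case only `tokensAndTags` would be needed per row.

```scala
object SentenceRecords {
  // One record = all lines of one sentence (the text between two empty lines).
  // Returns the tokens and the UD POS tags as two parallel sequences,
  // or None when the record contains no token line.
  def tokensAndTags(record: String): Option[(Seq[String], Seq[String])] = {
    val rows = record.split("\n").toSeq
      .filter(l => l.trim.nonEmpty && !l.startsWith("#")) // skip comments
      .map(_.split("\t"))
      .filter(_.length >= 4)
    if (rows.isEmpty) None else Some((rows.map(_(1)), rows.map(_(3))))
  }

  // Split the whole text at empty lines and parse every record.
  def parse(text: String): Seq[(Seq[String], Seq[String])] =
    text.split("\n\n").toSeq.flatMap(r => tokensAndTags(r))
}
```

Because every record already holds a complete sentence, the order of the tokens inside a sentence no longer depends on how the records are distributed.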

Many thanks.

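UPDATE: Parsing multiline records in Scala

I did see and try the other questions about parsing multi-line text files with \n as the delimiter. I have already replaced \n, which does not occur in my dataset, so I would like to: 1. stay inside Spark (as you can see, this is possible); 2. find a way to keep reduce from reordering the lines, or add a unique ID to each line so that I can preserve the order.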

Collecting multiple lines into an array per empty line in the Spark RDD or SQL API

0 answers: