Given input data from a JDBC source, made up of rows with doc_id, doc_seq, and doc_content fields, I want to group by doc_id and string-concatenate the doc_content, so that my output looks like this:
[{:doc_id 1 :doc_content "this is a very long sentence from a mainfram system that was built before i was born."}
 {:doc_id 2 :doc_content "this is a another very long sentence ... clip..."}
 {:doc_id 3 :doc_content "... clip..."}]
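I was thinking of using group-by; however, that outputs a map, and I need to output something lazy, since the input data set could be very large. Maybe I could run group-by and some combination of reduce-kv to get what I'm looking for ... or maybe something with frequencies, if I can coerce it to be lazy.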
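I can guarantee that it will be sorted; I will put the order by (via SQL) on doc_id and doc_seq, so the only thing this program is responsible for is the aggregate/string-concat part. I will likely have large input data for the whole sequence, but a specific doc_id in that sequence should only have a few dozen doc_seq entries.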
Any tips appreciated.
Answer 0 (score: 4)
partition-by is lazy, and as long as each doc sequence fits in memory, this should work:
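;; Merge the rows of one document: string values are concatenated,
;; any other value is replaced by the one from the latest row.
(defn collapse-docs [docs]
  (apply merge-with
         (fn [l r]
           (if (string? r)
             (str l r)
             r))
         docs))

;; input-data must already be sorted by :doc_id and :doc_seq.
(sequence ;; you may want to use eduction here, depending on use case
 (comp
  (partition-by :doc_id)
  (map collapse-docs))
 input-data)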
=>
({:doc_id 1,
  :doc_seq 4,
  :doc_content "this is a very long sentence from a mainframe system that was built before i was born."}
 {:doc_id 2, :doc_seq 2, :doc_content "this is a another very long sentence from the same mainframe "}
 {:doc_id 3,
  :doc_seq 6,
  :doc_content "Ok here we are again. The mainframe only had 40 char per field sothey broke it into multiple rows which seems to be common for the time. thanks for your help."})
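
The comment in the code above mentions eduction as an alternative to sequence. A minimal sketch of that variant follows, assuming a few hypothetical sample rows; sample-rows, collapse-xf, total-chars, and the row contents are made up for illustration and do not come from the question. An eduction does not cache its results: it reapplies the transducer each time it is reduced, which may suit a single pass over a large result set.

```clojure
;; Merge the rows of one document: strings are concatenated,
;; any other value is replaced by the one from the latest row.
(defn collapse-docs [docs]
  (apply merge-with
         (fn [l r] (if (string? r) (str l r) r))
         docs))

;; Hypothetical sample rows, already sorted by :doc_id and :doc_seq.
(def sample-rows
  [{:doc_id 1 :doc_seq 1 :doc_content "first "}
   {:doc_id 1 :doc_seq 2 :doc_content "second"}
   {:doc_id 2 :doc_seq 1 :doc_content "only"}])

;; The same transducer as in the answer, given a name so it can be reused.
(def collapse-xf
  (comp (partition-by :doc_id)
        (map collapse-docs)))

;; Reduce over the eduction; only one partition of rows is held at a time.
(def total-chars
  (reduce (fn [n doc] (+ n (count (:doc_content doc))))
          0
          (eduction collapse-xf sample-rows)))
;; => 16

(into [] collapse-xf sample-rows)
;; => [{:doc_id 1, :doc_seq 2, :doc_content "first second"}
;;     {:doc_id 2, :doc_seq 1, :doc_content "only"}]
```

If the collapsed documents need to be traversed more than once, sequence (as in the answer) may be the better fit, since it caches what it has already computed.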