Reformatting a text file to CSV using a bash script

Time: 2017-06-06 01:16:46

Tags: regex bash shell csv

I have a file (exOut.txt) containing several thousand lines of text in the following format:

[CV] solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482
[CV] solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482
[CV] solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405
[CV]  solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405, score=0.497312, total=11.0min
[CV]  solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405, score=0.499232, total=11.0min
[Parallel(n_jobs=-2)]: Done   2 out of   6 | elapsed: 11.0min remaining: 22.0min
[CV]  solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405, score=0.499762, total=11.1min
[Parallel(n_jobs=-2)]: Done   3 out of   6 | elapsed: 11.1min remaining: 11.1min
[CV]  solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482, score=0.449309, total=19.6min
[Parallel(n_jobs=-2)]: Done   4 out of   6 | elapsed: 19.6min remaining: 9.8min
[CV]  solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482, score=0.449831, total=19.7min
[CV]  solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482, score=0.451609, total=19.7min
[Parallel(n_jobs=-2)]: Done   6 out of   6 | elapsed: 19.7min remaining:    0.0s
[Parallel(n_jobs=-2)]: Done   6 out of   6 | elapsed: 19.7min finished
...

I am trying to write a shell script that takes this file and reformats it into a new file in CSV format, recording only the lines that have a "score" attribute. The result should look something like this:

solver,penalty,multi_class,max_iter,C,score
sag,l2,multinomial,187.638,0.312,0.497
sag,l2,multinomial,187.638,0.312,0.499
sag,l2,multinomial,187.638,0.312,0.500
newton-cg,l2,ovr,187.638,0.779,0.449
newton-cg,l2,ovr,187.638,0.779,0.450
newton-cg,l2,ovr,187.638,0.779,0.450

If possible, all values should be rounded to the nearest thousandth (three decimal places).

Ultimately I would like to make a condensed version by identifying the records whose fields are all identical except for "score", and replacing them with a single record that holds the average score for those parameters. For example:

solver,penalty,multi_class,max_iter,C,avg_score
sag,l2,multinomial,187.638,0.312,0.499
newton-cg,l2,ovr,187.638,0.779,0.450

Any help is appreciated! I am not a pro with regular expressions, which is mainly why I am asking.

Edit 1: Thanks for the feedback, here is some more information:

So far I have tried various scripts using grep, awk, and sed. One attempt matched only a single large occurrence of the pattern instead of the individual fields, while grep '=.*,' exOut.txt only clears the first part of each line.
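
For illustration, the two steps described in the question (filtering the scored lines into CSV, then averaging duplicate parameter sets) could be sketched with awk roughly as follows. This is only a sketch under assumptions: the function names to_csv and condense are made up for this example, the fields are assumed to always be key=value pairs separated by commas as in the sample, and the average is taken over the already rounded scores.

```shell
#!/usr/bin/env bash

# to_csv [FILE]: keep only the "[CV]" lines that carry a score= field and
# print solver,penalty,multi_class,max_iter,C,score with the numeric values
# rounded to three decimals. (Hypothetical helper name, not from the post.)
to_csv() {
    # LC_ALL=C so that awk uses "." as the decimal separator.
    LC_ALL=C awk '
        BEGIN { print "solver,penalty,multi_class,max_iter,C,score" }
        /^\[CV\]/ && /score=/ {
            line = $0
            sub(/^\[CV\][ \t]*/, "", line)       # drop the "[CV]" prefix
            n = split(line, pairs, /,[ \t]*/)    # one key=value pair per element
            split("", val)                       # portable way to clear the array
            for (i = 1; i <= n; i++) {
                eq = index(pairs[i], "=")
                if (eq > 0)
                    val[substr(pairs[i], 1, eq - 1)] = substr(pairs[i], eq + 1)
            }
            # total= is parsed but intentionally not printed.
            printf "%s,%s,%s,%.3f,%.3f,%.3f\n",
                   val["solver"], val["penalty"], val["multi_class"],
                   val["max_iter"], val["C"], val["score"]
        }
    ' "$@"
}

# condense [FILE]: read the CSV produced by to_csv and emit one row per
# distinct parameter set with the mean of its scores, in first-seen order.
# (Hypothetical helper name, not from the post.)
condense() {
    LC_ALL=C awk -F, '
        NR == 1 { print "solver,penalty,multi_class,max_iter,C,avg_score"; next }
        {
            key = $1 FS $2 FS $3 FS $4 FS $5
            if (!(key in count)) order[++n] = key    # remember first-seen order
            sum[key] += $6
            count[key]++
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s,%.3f\n", order[i], sum[order[i]] / count[order[i]]
        }
    ' "$@"
}
```

With these defined, something like to_csv exOut.txt > scores.csv followed by condense scores.csv > condensed.csv should give both files. Note that printf's rounding turns C=0.778324314482 into 0.778 and score=0.451609 into 0.452, so the output will not match the hand-written sample exactly. If the average should be computed from the unrounded scores, the grouping would have to happen before rounding.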

1 answer:

Answer 0 (score: 0)

1. You can refer to the question about the pipe is being called pipe has been called close event has been emitted callback has been called copyFile has been called function.

2. You can refer to the question about the parseRawDataFile function.

3. Note that the code contains some hard-coded values.

parseCsvDataFile