I have a file (exOut.txt) containing several thousand lines of text. I am trying to write a shell script that takes this file and reformats it into a new file in CSV format, keeping only the lines that have a "score" attribute. The input looks like this:
[CV] solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482
[CV] solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482
[CV] solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405, score=0.497312, total=11.0min
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405, score=0.499232, total=11.0min
[Parallel(n_jobs=-2)]: Done 2 out of 6 | elapsed: 11.0min remaining: 22.0min
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405, score=0.499762, total=11.1min
[Parallel(n_jobs=-2)]: Done 3 out of 6 | elapsed: 11.1min remaining: 11.1min
[CV] solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482, score=0.449309, total=19.6min
[Parallel(n_jobs=-2)]: Done 4 out of 6 | elapsed: 19.6min remaining: 9.8min
[CV] solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482, score=0.449831, total=19.7min
[CV] solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482, score=0.451609, total=19.7min
[Parallel(n_jobs=-2)]: Done 6 out of 6 | elapsed: 19.7min remaining: 0.0s
[Parallel(n_jobs=-2)]: Done 6 out of 6 | elapsed: 19.7min finished
...
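The CSV output should look similar to the following, with all values rounded to the nearest thousandth if possible:
solver,penalty,multi_class,max_iter,C,score
sag,l2,multinomial,187.638,0.312,0.497
sag,l2,multinomial,187.638,0.312,0.499
sag,l2,multinomial,187.638,0.312,0.500
newton-cg,l2,ovr,187.638,0.778,0.449
newton-cg,l2,ovr,187.638,0.778,0.450
newton-cg,l2,ovr,187.638,0.778,0.452
Eventually I would like to make a condensed version by identifying records in which every field except "score" is the same, and replacing them with a single record that carries the average score for those parameters. For example:
solver,penalty,multi_class,max_iter,C,avg_score
sag,l2,multinomial,187.638,0.312,0.499
newton-cg,l2,ovr,187.638,0.778,0.450
Any help is appreciated! I am not a pro with regular expressions, which is mainly why I am asking.
Edit 1: Thanks for the feedback; here is some more information.
So far I have tried various scripts using grep, awk, and sed. One attempt only matched a single large occurrence of the pattern rather than the individual fields, and grep '=.*,' exOut.txt only strips the first part of each line.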
Answer 0 (score: 0)
1. For the first question, you can refer to the parseRawDataFile function.
2. For the second question, you can refer to the parseCsvDataFile function.
3. Some values in the code are hard-coded, so take note of that.
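As a rough illustration of the two steps, here is a minimal sketch in Python (not a shell script, and not the original parseRawDataFile / parseCsvDataFile code). The names PARAMS, PAIR, fmt, parse_scored, to_csv and to_condensed_csv are made up for this example. It assumes every scored line starts with [CV] and holds comma-separated key=value pairs, as in the sample from the question.

```python
import re

# Hypothetical helpers for illustration only; the field list is hard-coded.
PARAMS = ["solver", "penalty", "multi_class", "max_iter", "C"]
# One key=value pair; the value stops at the next comma or whitespace.
PAIR = re.compile(r"(\w+)=([^,\s]+)")


def fmt(value):
    """Round numeric strings to three decimals; return other strings unchanged."""
    try:
        return f"{float(value):.3f}"
    except ValueError:
        return value


def parse_scored(lines):
    """Yield (params, score) for every [CV] line that carries a score."""
    for line in lines:
        line = line.strip()
        if not line.startswith("[CV]") or "score=" not in line:
            continue  # skips [Parallel...] lines and [CV] lines without a score
        pairs = dict(PAIR.findall(line))
        yield tuple(pairs[name] for name in PARAMS), float(pairs["score"])


def to_csv(lines):
    """Return CSV lines: a header plus one row per scored record."""
    out = [",".join(PARAMS + ["score"])]
    for params, score in parse_scored(lines):
        out.append(",".join([fmt(p) for p in params] + [f"{score:.3f}"]))
    return out


def to_condensed_csv(lines):
    """Return CSV lines with one row per parameter set and its mean score."""
    groups = {}  # dicts keep insertion order, so groups appear as first seen
    for params, score in parse_scored(lines):
        groups.setdefault(params, []).append(score)
    out = [",".join(PARAMS + ["avg_score"])]
    for params, scores in groups.items():
        avg = sum(scores) / len(scores)  # average the unrounded scores
        out.append(",".join([fmt(p) for p in params] + [f"{avg:.3f}"]))
    return out
```

Both functions accept any iterable of lines, so an open file object for exOut.txt can be passed in directly and the returned lines joined with newlines and written out. The regular expression matches one key=value pair at a time, which avoids the problem with a greedy pattern such as '=.*,' that swallows everything up to the last comma on the line.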