Question

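I'm working with large text datasets, about 1 GB in size (the smallest file has roughly 2 million lines). Each line is supposed to be split into a number of columns. I say "supposed" because there are exceptions: normal lines end with \r\n, but many have been incorrectly broken across 2 to 3 lines.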

Given that there are 10 columns, each line should have the following format:

col_1 | col_2 | col_3 | ... | col_10\r\n

The exceptions have one of these formats:

1.  col_1 | col_2 | col_3 ...\n
    ... | col_10\r\n

2.  col_1 | col_2 | col_3 ...\n
    ... | col_10\n
    \r\n

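What is the fastest way to correct these exceptions? I did a simple find/replace in a text editor (TextMate, on a Mac) on a 1000-line sample, using the regex (^[^\r\n]*)\n (replaced with $1), and it worked perfectly. But the text editor apparently can't handle the large files (>= 2 million lines). Can this be done with sed or grep (or some other command-line tool, or even Python) using an equivalent regex?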

Answer 1

Your approach:

perl -pe 's/(^[^\r\n]*)\n/\1/' input > output

Or, with a negative lookbehind:

perl -pe 's/(?<!\r)\n//' input > output

Or, remove every \n and then replace \r with \r\n:

perl -pe 's/\n//; s/\r/\r\n/' input > output
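Since the question also mentions Python, here is a minimal sketch of the same lookbehind idea, reading the file in fixed-size chunks so a 1 GB input never has to fit in memory. The function name fix_line_breaks and the chunk size are illustrative choices, not part of the original answer.

```python
import re

# A "\n" that is not preceded by "\r": the same test as the lookbehind above.
LONE_LF = re.compile(rb"(?<!\r)\n")


def fix_line_breaks(src_path, dst_path, chunk_size=1 << 20):
    # Copy src_path to dst_path, dropping every "\n" not preceded by "\r".
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        carry = b""  # a "\r" held back from the end of the previous chunk
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            data = carry + chunk
            # Hold back a trailing "\r" so a "\r\n" pair split across two
            # chunks is always seen together by the regex.
            if data.endswith(b"\r"):
                carry, data = b"\r", data[:-1]
            else:
                carry = b""
            dst.write(LONE_LF.sub(b"", data))
        dst.write(carry)  # flush a "\r" that was the very last byte
```

Calling fix_line_breaks("input", "output") should be roughly equivalent to the second perl command. The files are opened in binary mode so Python does not translate the line endings itself.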

Answer 2

Why not awk?

awk 'BEGIN{RS="\r\n"; FS="\n"; OFS=" "; ORS="\r\n";} {print $1,$2}' input

Or tr + sed:

cat input | tr '\n' ' ' | tr '\r' '\n' | sed 's/^ \(.*\)/\1\r/g'
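The record-based idea behind the awk command (treat \r\n as the record separator, then glue the \n-separated fragments back together) can also be sketched in Python. This is only an illustration under assumptions: iter_records and join_fragments are made-up helper names, and unlike the awk command, which prints only $1 and $2, this version joins any number of fragments.

```python
import io


def iter_records(stream, sep=b"\r\n", chunk_size=1 << 20):
    # Yield each sep-terminated record from a binary stream, without sep.
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        # Everything before the last separator is complete; keep the rest.
        *complete, buf = buf.split(sep)
        yield from complete
    if buf:
        yield buf  # trailing data with no final separator


def join_fragments(record, glue=b""):
    # Rejoin the pieces of a record that was broken by stray "\n" bytes.
    # Empty pieces are skipped so a "\n" right before "\r\n" adds no glue.
    return glue.join(part for part in record.split(b"\n") if part)


# Small in-memory sample covering both exception formats from the question.
sample = io.BytesIO(b"a | b\n | c\r\nd | e\n | f\n\r\n")
fixed = [join_fragments(rec) for rec in iter_records(sample, chunk_size=4)]
```

For a real file, the stream would come from open("input", "rb"), and each joined record would be written out followed by b"\r\n". Passing glue=b" " mimics the space that the awk and tr versions put where each \n used to be.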

Fast multi-line regex find/replace of \r and \n
