sed: regular expression across multiple lines

Time: 2010-12-22 15:41:10

Tags: regex bash csv sed

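I have been stuck on this for several hours now, cycling through a variety of tools to get the job done, without success. It would be fantastic if someone could help me out with this.

Here is the problem:

I have a very large CSV file (400 MB+) that is formatted incorrectly. Right now it looks like this: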

This is a long abstract describing something. What follows is the tile for this sentence."   
,Title1  
This is another sentence that is running on one line. On the next line you can find the title.   
,Title2

As you can see, the titles ",Title1" and ",Title2" should actually be on the same line as the preceding sentence. The result would then look like this:

This is a long abstract describing something. What follows is the tile for this sentence.",Title1  
This is another sentence that is running on one line. On the next line you can find the title.,Title2

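Please note that the end of a sentence can contain a quote. Eventually those quotes should be replaced as well.

Here is what I have come up with so far: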

sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv

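This should actually do the job of matching the expression across multiple lines. Unfortunately, it doesn't :)

The expression looks for the dot at the end of the sentence, an optional quote, and a newline character, which I'm trying to match with .*.

Any help is much appreciated. It doesn't matter which tool gets the job done (awk, perl, sed, tr, etc.).

Thanks, Chris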

2 Answers:

Answer 0: (score: 18)

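Multi-line processing in sed isn't necessarily tricky; it's just that it uses commands most people aren't familiar with, and those commands have certain side effects. For example, when you use 'N' to append the next line to the pattern space, the current line is separated from the next line by an embedded '\n'.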
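A minimal sketch of that side effect, assuming GNU sed (the two-line input and the variable name `joined` are made up for illustration): after N, the pattern space holds both lines with a newline between them, and `\n` in the regex can match it.

```shell
# Hypothetical two-line input; N appends the second line to the pattern space.
# The pattern space then holds "first\nsecond", so \n can be matched and replaced.
joined=$(printf 'first\nsecond\n' | sed 'N;s/\n/<NL>/')
echo "$joined"
```

The visible `<NL>` marker shows where the embedded newline sat before the substitution replaced it.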

Anyway, it's much easier if you match on lines that start with a comma to decide whether to remove the newline, which is what I did here:

sed 'N;/\n,/s/"\? *\n//;P;D' title_csv

Input

$ cat title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence."
,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.
,Title2
also, don't touch this line

Output

$ sed 'N;/\n,/s/"\? *\n//;P;D' title_csv
don't touch this line
don't touch this line either
This is a long abstract describing something. What follows is the tile for this sentence.,Title1
seriously, don't touch this line
This is another sentence that is running on one line. On the next line you can find the title.,Title2
also, don't touch this line

Answer 1: (score: 13)

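Yours works with a few small changes:

sed -n '1h;1!H;${;g;s/\."\?\n,/.,/g;p;}' inputfile

The ? needs to be escaped: in basic regular expressions a bare ? is a literal question mark, and \? is the GNU sed spelling of "optional". The newline should be matched explicitly with \n instead of relying on .*. The replacement is ., rather than empty so that the period and the field-separating comma survive; only the optional quote and the newline are removed.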
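The escaping point can be checked in isolation. A small sketch, assuming GNU sed in its default basic-regex mode (the sample string and the variable names are illustrative only):

```shell
# Illustrative one-line sample: a sentence ending in a double quote.
sample='end."'

# Unescaped ? is a literal question mark in basic regular expressions,
# so the pattern \."?$ does not match and the line is left unchanged.
unescaped=$(printf '%s\n' "$sample" | sed 's/\."?$/./')

# \? (a GNU extension) makes the preceding quote optional, so the quote is removed.
escaped=$(printf '%s\n' "$sample" | sed 's/\."\?$/./')

printf '%s\n%s\n' "$unescaped" "$escaped"
```

The first substitution leaves the input untouched, while the second strips the trailing quote.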

Here is another way that doesn't require using the hold space:

sed -n '${p;q};N;/\n,/{s/"\?\n//p;b};P;D' inputfile

Here is a commented version:

sed -n '
# on the last input line, print it and quit
${
  p
  q
}
# otherwise, append the next line to the pattern space
N
# if the appended line starts with a comma
/\n,/{
  # delete an optional double quote and the newline, then print the result
  s/"\?\n//p
  # branch to the end of the script to start the next cycle
  b
}
# the appended line does not start with a comma, so print the first line
P
# delete the first line of the pair (just printed) and loop to the top
D
' inputfile
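
To try the one-liner without touching the real 400 MB file, it can be run on a small sample first. This is only a sketch, assuming GNU sed; the temporary file, its contents, and the variable names are made up here.

```shell
# Build a small sample mirroring the layout in the question (hypothetical test data).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
keep this line
First abstract."
,Title1
Second abstract.
,Title2
keep this line too
EOF

# Same command as the one-liner above: join any line starting with a comma
# onto the previous line, dropping an optional trailing double quote.
result=$(sed -n '${p;q};N;/\n,/{s/"\?\n//p;b};P;D' "$tmp")
rm -f "$tmp"

printf '%s\n' "$result"
```

Lines that are not followed by a comma line pass through unchanged, and each title ends up on the same line as its abstract.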