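I'm trying to join sentences in a document, but some sentences are split in two, with a blank line in between. For example:
The dog chased after a ball

that was thrown by its owner.
The ball travelled quite far.
should become:
The dog chased after a ball that was thrown by its owner.
The ball travelled quite far.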
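My idea was to search for a blank line and then check whether the next line starts with a lowercase character. If it does, copy that line, delete it together with the blank line above it, and append the copied text to the first half of the broken sentence (sorry if that is confusing).
I'm new to sed and tried this command: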
sed "/$/{:a;N;s/\n\(^[a-z]* .*\)/ \1/;ba}"
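But it only works once, and it only deletes the blank line instead of appending the second half of the broken sentence to the first half.
Please help.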
Answer 0 (score: 1)
This should do the trick:
sed ':a;$!{N;N};s/\n\n\([a-z]\)/ \1/;ta;P;D' sentences
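For readers who find the sed one-liner hard to parse, the substitution at its core (two newlines followed by a lowercase letter become a single space) can be sketched in Python. This is only an illustration applied to the whole text at once, not a line-for-line port of the command. The function name `join_split_sentences` and the sample string are made up for the example, and the sample assumes exactly one blank line between the two halves of a sentence.

```python
import re


def join_split_sentences(text: str) -> str:
    """Join a line onto the previous one when the two are separated by one
    blank line and the second line starts with a lowercase ASCII letter."""
    # "\n\n" is the end of the first half plus the empty line that follows it.
    # The lookahead checks for a lowercase letter without consuming it.
    return re.sub(r"\n\n(?=[a-z])", " ", text)


# Hypothetical sample, modelled on the example in the question.
sample = (
    "The dog chased after a ball\n"
    "\n"
    "that was thrown by its owner.\n"
    "\n"
    "The ball travelled quite far.\n"
)

print(join_split_sentences(sample), end="")
```

Blank lines followed by an uppercase letter are left untouched, so separate sentences stay separated.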
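Answer 1 (score: 0)
This is the first time I've used sed for a substitution this complex. It took me about 2 hours to come up with something :D
I used GNU sed, because I couldn't get branching to work in a one-liner with the sed on my Mac.
Here is the input I used for testing: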
The dog chased after a ball
that was thrown by its owner.
The ball
travelled quite far.
I took me a while to fix this file.
And now it's
working :)
And here is the sed command line I came up with:
$ sed -n '/^$/!bstore;/^$/N;s/\n\([a-z]\)/ \1/;tmerge;h;d;:store;H;b;:merge;H;g;s/\n \([a-z]\)/ \1/;p;s/.*//g;h;d' sentences.txt
Here is the output:
$ sed -n '/^$/!bstore;/^$/N;s/\n\([a-z]\)/ \1/;tmerge;h;d;:store;H;b;:merge;H;g;s/\n \([a-z]\)/ \1/;p;s/.*//g;h;d' sentences.txt
The dog chased after a ball that was thrown by its owner.
The ball travelled quite far.
I took me a while to fix this file.
And now it's working :)
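You will notice that a blank line is inserted at the very beginning, but I think that is acceptable. If you have mastered sed, please leave a comment, since this is only a beginner's attempt.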
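As a rough illustration of the buffering idea (keep the sentence being assembled in a buffer, and glue the next line onto it when a blank line is followed by a lowercase letter), here is a line-by-line sketch in Python. It is not a translation of the sed command above. The name `merge_broken_lines` is hypothetical, and the sample assumes that every line of the test input is separated from the next by one blank line.

```python
from typing import Iterable, Iterator


def merge_broken_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines, gluing a lowercase-initial line onto the previous line
    when the two are separated by exactly one blank line."""
    current = None  # sentence being assembled (similar in spirit to sed's hold space)
    blanks = 0      # blank lines seen since `current` was last extended
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            blanks += 1
            continue
        if current is not None and blanks == 1 and "a" <= line[0] <= "z":
            # Second half of a broken sentence: drop the blank line and append.
            current = current + " " + line
        else:
            # A new sentence starts: flush the buffer and keep the blank lines.
            if current is not None:
                yield current
            yield from [""] * blanks
            current = line
        blanks = 0
    if current is not None:
        yield current
    yield from [""] * blanks


# Assumed layout of the test input: one blank line after every line.
sample = [
    "The dog chased after a ball", "",
    "that was thrown by its owner.", "",
    "The ball", "",
    "travelled quite far.", "",
    "I took me a while to fix this file.", "",
    "And now it's", "",
    "working :)",
]

for out in merge_broken_lines(sample):
    print(out)
```

Because the buffer is only flushed when a new sentence starts, this sketch does not emit an extra blank line at the beginning.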
Answer 2 (score: 0)
If you have Python, you can try this snippet:
import string

found_upper = None  # index of the most recent line starting with an uppercase letter
with open("file") as fh:
    data = fh.readlines()
for n, line in enumerate(data):
    if line[0] in string.ascii_uppercase:
        found_upper = n
    elif found_upper is not None and line[0] in string.ascii_lowercase:
        # Continuation line: append it to that line, then blank it out.
        data[found_upper] = data[found_upper].strip() + " " + line
        data[n] = ""
# Print the merged lines (each kept line still carries its own newline).
print("".join(data), end="")
Output (with a few more file-layout scenarios added):
$ cat file
the start
THE START
The dog chased after a ball
that was thrown by its owner.
My ball travelled quite far
and it smashed the windows
but it didn't cause much damage
THE END
THE FINAL DESTINATION
final
FINAL DESTINATION LAST EPISODE
the final final
$ ./python.py
the start
THE START
The dog chased after a ball that was thrown by its owner.
My ball travelled quite far and it smashed the windows but it didn't cause much damage
THE END
THE FINAL DESTINATION final
FINAL DESTINATION LAST EPISODE the final final