Question

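I have a CSV file in which every row starts with (@) and all fields in a row are separated by (;). One of the fields, the one holding the "Text" (""[ ]""), contains line breaks, which causes errors when the whole CSV file is imported into Excel or Access. The text after a line break is treated as a separate row instead of following the table structure.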

@4627289301; Lima, Peru; 490; 835551022915420161; Sat Feb 25 18:04:22 +0000 2017; ""[OJO!
la premiacin de los #Oscar, nuestros amigos de @cinencuentro revisan las categoras.
+info: co/plHcfSIfn8]""; 0
@624974422; None; 114; 835551038581137416; Sat Feb 25 18:04:26 +0000 2017; ""[Porque nunca dejamos de amar]""; 0

Any help with a Python script? Or any other solution...

As output I would like to have these rows:

@4627289301; Lima, Peru; 490; 835551022915420161; Sat Feb 25 18:04:22 +0000 2017; ""[OJO! la premiacin de los #Oscar, nuestros amigos de @cinencuentro revisan las categoras. +info: co/plHcfSIfn8]""; 0
@624974422; None; 114; 835551038581137416; Sat Feb 25 18:04:26 +0000 2017; ""[Porque nunca dejamos de amar]""; 0

Any help? I have a CSV file (54 MB) with many rows that contain line breaks... some other rows are fine...

Answer 1

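You should also share your expected output.

In any case, I suggest cleaning the file first to remove the line breaks. After that you can read it as a CSV. One solution could be the following (I'm sure someone will suggest something better :-)).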

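Clean the file (on Linux):

sed ':a;N;$!ba;s/\n/ /g' input_file | sed 's/ \(@[0-9][0-9]*;\)/\n\1/g' > output_file

The first sed joins all lines with spaces. The second starts a new line only before an "@" that is followed by digits and ";", so mentions such as @cinencuentro inside the text stay on their row.

Then read the file as CSV (any other reading method works as well).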
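For the reading step, here is a minimal Python sketch using the standard csv module. It assumes the cleaned file has one record per line, that ";" never occurs inside the text field, and that the doubled quotes should be kept as literal text. The function name read_records is made up for illustration.

```python
import csv


def read_records(stream):
    """Parse ';'-separated records from an open text stream.

    Assumption: quote characters are ordinary text (QUOTE_NONE), because
    the sample data wraps the text field in doubled quotes and brackets.
    """
    reader = csv.reader(stream, delimiter=";", quoting=csv.QUOTE_NONE)
    # Strip the space that follows each ';' and skip empty lines.
    return [[field.strip() for field in row] for row in reader if row]
```

It could be called as read_records(open("output_file", encoding="utf-8", newline="")), where output_file is the file written by the cleaning step. If the text field can contain ";", this simple split is not enough.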

Let's see if it helps you :-)

Answer 2

You can search for line breaks that are not followed by a line starting with "@", e.g. \r?\n+(?!@\d+;)

The following was generated from this regex101 demo. It replaces such line ends with a space. You can change that to anything you like.

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"\r?\n+(?!@\d+;)"

test_str = ("@4627289301; Lima, Peru; 490; 835551022915420161; Sat Feb 25 18:04:22 +0000 2017; \"\"[OJO!\n"
    "la premiacin de los #Oscar, nuestros amigos de @cinencuentro revisan las categoras.\n"
    "+info: co/plHcfSIfn8]\"\"; 0\n"
    "@624974422; None; 114; 835551038581137416; Sat Feb 25 18:04:26 +0000 2017; \"\"[Porque nunca dejamos de amar]\"\"; 0")

subst = " "

# You can manually specify the number of replacements by changing the count argument
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

if result:
    print(result)

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
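For a 54 MB file the same idea can also be applied line by line instead of on one big string. The sketch below is only one possible way to do it, under the assumption that every real record starts with "@", digits and ";" (the same pattern as the lookahead in the regex). The names join_continuation_lines and clean_file are invented for this example.

```python
import re

# Assumption: a real record starts with "@", one or more digits, then ";".
RECORD_START = re.compile(r"@\d+;")


def join_continuation_lines(lines):
    """Yield one string per record, merging lines that do not start a record."""
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # blank lines carry no data, skip them
        if current is None or RECORD_START.match(line):
            # A new record begins: emit the previous one, if any.
            if current is not None:
                yield current
            current = line
        else:
            # Continuation of the text field: glue it on with a space.
            current += " " + line
    if current is not None:
        yield current


def clean_file(src_path, dst_path, encoding="utf-8"):
    """Write a copy of src_path with one record per line."""
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        for record in join_continuation_lines(src):
            dst.write(record + "\n")
```

Calling clean_file("input_file", "output_file") would write the joined rows. Because the check requires digits and ";" after the "@", a continuation line that merely begins with a mention such as @cinencuentro is still treated as part of the previous row.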

Remove line breaks in a CSV file
