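I have a large CSV export whose columns are misaligned because some values were accidentally split across several cells instead of being kept in one. Fortunately, these values always sit between two distinctive strings. I would like to use a regular expression to merge them into a single cell. Sample data: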
"apple","NULL","0","0","0",",","1",",","fruit","red","sweet","D$","object"
"horse","NULL","0","0","0",",","1",",","animal","large","tail","D$","object"
"Los Angeles","NULL","0","0","0",",","1",",","city","California","smoggy","entertainment","D$","location"
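The unmerged values start after:
"NULL","0","0","0",",","1",",","
and end before:
","D$"
I am trying to work out a regular expression that removes the "," separators between those values so that they are merged, giving output like this: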
"apple","NULL","0","0","0",",","1",",","fruit,red,sweet","D$","object"
"horse","NULL","0","0","0",",","1",",","animal,large,tail","D$","object"
"Los Angeles","NULL","0","0","0",",","1",",","city,California,smoggy,entertainment","D$","location"
Answer 0 (score: 2)
You can do it like this:
$pattern = '~(?:"NULL","0","0","0",",","1",",","|(?!^)\G)[^"]+\K","(?!D\$)~';
$csv = preg_replace($pattern, ',', $csv);
Pattern details:
~ # delimiter
(?:
"NULL","0","0","0",",","1",",","
|
(?!^)\G # anchor for the end of the last match
)
[^"]+ # content between quotes
\K # removes all on the left from match result
"," # ","
(?!D\$) # not followed by D$
~
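The idea of the pattern is to use the \G anchor, which matches either at the start of the string or at the end of the previous match. I added (?!^) to rule out the first case.
"NULL","0","0","0",",","1",",","
serves as the entry point for the first match. The content between the quotes is then matched. Because \K discards everything to its left from the match result, only the "," is replaced.
Each following match uses \G as its entry point, and the contiguous matches continue until the negative lookahead (?!D\$) fails, that is, until the next cell is D$.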
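As a minimal sketch of the pattern above in use, the snippet below runs it over a few rows. The variable name $fixed is only illustrative, and the rows are assumed to contain the start delimiter exactly as the question describes it.

```php
<?php
// Rows in the layout the question describes: the split values start right after
// "NULL","0","0","0",",","1",","," and end right before ","D$".
$csv = <<<'CSV'
"apple","NULL","0","0","0",",","1",",","fruit","red","sweet","D$","object"
"horse","NULL","0","0","0",",","1",",","animal","large","tail","D$","object"
"Los Angeles","NULL","0","0","0",",","1",",","city","California","smoggy","entertainment","D$","location"
CSV;

// Either start at the fixed prefix or continue from the previous match (\G),
// skip one value, drop it from the match with \K, and replace only the ","
// separator, unless the next cell is D$.
$pattern = '~(?:"NULL","0","0","0",",","1",",","|(?!^)\G)[^"]+\K","(?!D\$)~';

$fixed = preg_replace($pattern, ',', $csv);
echo $fixed, PHP_EOL;
// "apple","NULL","0","0","0",",","1",",","fruit,red,sweet","D$","object"
// "horse","NULL","0","0","0",",","1",",","animal,large,tail","D$","object"
// "Los Angeles","NULL","0","0","0",",","1",",","city,California,smoggy,entertainment","D$","location"
```

The number of values per row does not matter here, because each separator is handled by its own match chained to the previous one.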
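Answer 1 (score: 0)
The best I could do with a regex was to match the whole run of values, but I could not get each value into its own capture group. That means I cannot do the match and replace without a callback function. How you do this depends on your language; I will show an example in PHP. Here is the regex: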
(?<="NULL","0","0","0",",","1",",)(?:"[^"]+",?)+(?=,"D\$")
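First, a lookbehind ((?<=...)) checks for the string "NULL","0","0","0",",","1",", immediately before the match. Then a repeated non-capturing group ((?:...)+) matches one or more CSV columns. Inside it, the syntax matches a ", followed by one or more characters that are not ", followed by a " and an optional comma. Finally, a lookahead ((?=...)) checks for ,"D\$" which is the string that ends the list of words.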
Given this string:
"apple","NULL","0","0","0",",","1",","fruit","red","sweet","D$","object"
It will match:
"fruit","red","sweet"
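To check that claim, here is a small sketch that runs the regex with preg_match() on the string above. The names $line and $m are illustrative only.

```php
<?php
// The example string from this answer (note its layout: "1",","fruit").
$line = '"apple","NULL","0","0","0",",","1",","fruit","red","sweet","D$","object"';

// Lookbehind for the fixed prefix, one or more quoted cells, lookahead for ,"D$".
$pattern = '/(?<="NULL","0","0","0",",","1",",)(?:"[^"]+",?)+(?=,"D\$")/';

if (preg_match($pattern, $line, $m)) {
    // $m[0] is the whole run of quoted cells between the two delimiters.
    echo $m[0], PHP_EOL; // "fruit","red","sweet"
}
```

The lookbehind and lookahead are zero-width, so the prefix and the trailing ,"D$" are required but are not part of $m[0].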
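In PHP, I use preg_replace_callback() to go through each match and replace every instance of "," inside it with a plain comma. When $csv holds rows laid out like the string above, this gives you the expected output.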
$csv = preg_replace_callback(
    '/(?<="NULL","0","0","0",",","1",",)(?:"[^"]+",?)+(?=,"D\$")/',
    function ($matches) {
        // $matches[0] is the full run of quoted cells; join them into one cell
        return str_replace('","', ',', $matches[0]);
    },
    $csv
);
Output:
"apple","NULL","0","0","0",",","1",","fruit,red,sweet","D$","object"
"horse","NULL","0","0","0",",","1",","animal,large,tail","D$","object"
"Los Angeles","NULL","0","0","0",",","1",","city,California,smoggy,entertainment","D$","location"
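Note: I do not think this can be done with a simple regex replace because (as far as I know) regex is not good at capturing a variable number of groups. For example, if we replace the repeated non-capturing group with (?:"([^"]+)",?)+ (adding a capture group around the word pattern [^"]+), it still counts as only one captured group; see this example for what I mean. You could literally repeat that non-capturing group and make every copy after the first optional with ?. However, you would have to include at least as many copies as the largest example has values (see here).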
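The single-capture behaviour described in the note can be seen in a short sketch. The preg_match_all() call at the end is not part of the original answer; it is shown only as one assumed way to collect every value when a repeated group cannot.

```php
<?php
$cells = '"fruit","red","sweet"';

// A capture group inside a repeated group is overwritten on every iteration,
// so only the value from the last repetition survives.
preg_match('/^(?:"([^"]+)",?)+$/', $cells, $m);
echo $m[1], PHP_EOL; // sweet

// Matching each cell separately instead yields one capture per value.
preg_match_all('/"([^"]+)"/', $cells, $all);
echo implode(',', $all[1]), PHP_EOL; // fruit,red,sweet
```

This is why the answer falls back to a callback: the regex finds the run of cells, and ordinary string code does the joining.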