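Need a regular expression for use in Notepad++

Time: 2016-10-31 12:37:21

Tags: regex notepad++

I am a regex beginner and need your help finding a suitable regular expression for my project in Notepad++. My goal is a regex that finds and extracts certain strings in single quotes from an HTML document. I need one regex that does everything, and I have to use Notepad++.

Here is the structure of my text document (I cannot use the original text because it contains confidential material):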

{ group: '1', code: '1111', ignored: true, shortDescription: 'This is a short "description", containing commas or quotes', description: '', document: 'documentname.txt', row: '1', original: 'this is the original text', translated: 'this is the translated text', matchRate: {label: "label", value: "value"} } _LF_
{ group: '2', code: '2222', ignored: true, shortDescription: 'This is another short "description", containing commas or quotes', description: '', document: 'documentname.txt', row: '1', original: 'this is the original text', translated: 'this is the translated text', matchRate: {label: "label", value: "value"} } _LF_
{ group: '3', code: '3333', ignored: true, shortDescription: 'This is yet another short "description", containing commas or quotes', description: '', document: 'documentname.txt', row: '1', original: 'this is the original text', translated: 'this is the translated text', matchRate: {label: "label", value: "value"} }

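My document contains 33 lines, all structured like this (the "_LF_" at the end stands for the line break). "group", "code" and so on are always the same; the strings in single quotes differ and may also be empty.

I need to extract the strings inside the single quotes (or delete everything else), separated by commas (or similar), so that I can put them into an Excel document. I need the line breaks as well.

Here is what I have done so far: I am able to find all the strings in single quotes with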

([^']*+'[^\r\n']*+)

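This works, but the text after a closing single quote up to the next opening single quote also shows up in the output.

I would also need all the other text to be removed, including the single quotes around these strings. I could not manage that. The result should look something like this: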

'1', '1111', 'This is a short "description", containing commas or quotes' '', 'documentname.txt', '1', 'this is the original text', 'this is the translated text'
'2', '2222', 'This is another short "description", containing commas or quotes' '', 'documentname.txt', '1', 'this is the original text', 'this is the translated text'
'3', '3333', 'This is yet another short "description", containing commas or quotes' '', 'documentname.txt', '1', 'this is the original text', 'this is the translated text'

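I have also read some threads about regular expressions, like this and this, and I learned a lot (as I said, a beginner talking here...), but I did not manage to find a solution that extracts exactly the strings I need.

I would be very glad if someone could help me. Thanks a lot!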
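To see why the pattern in the question also returns the text between a closing quote and the next opening quote, it can help to run it on a small sample. The following is only a sketch in Python, not in Notepad++: the sample string is shortened and illustrative, the possessive quantifiers (*+) are swapped for plain greedy ones (which should behave the same for this pattern, and Python only accepts the possessive form from version 3.11 on), and the second pattern is just one possible way to pair the quotes.

```python
import re

# Shortened, illustrative sample in the same shape as the question's data.
sample = "{ group: '1', code: '1111' }"

# Greedy form of the question's pattern. Each match runs from the current
# position through one quote and on to just before the next quote, so a
# closing quote starts a match exactly like an opening quote does.
loose = re.findall(r"[^']*'[^\r\n']*", sample)
print(loose)

# Requiring an opening AND a closing quote in the same match, and capturing
# only what lies between them, keeps the quotes paired.
paired = re.findall(r"'([^'\r\n]*)'", sample)
print(paired)
```

In the first list every second entry begins with a closing quote, which is the unwanted text described above. The second list contains only the quoted values.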

2 Answers:

Answer 0: (score: 0)

You can do it in two steps:

Step 1

Find: .*?(?:\s'([^']+)'|(_LF_)).*?

Replace: $1$2,

Step 2

Find: ,_LF_,

Replace: \r\n

That will give you:

1, 1111, This is a short "description", containing commas or quotes, documentname.txt, 1, this is the original text, this is the translated text

2, 2222, This is another short "description", containing commas or quotes, documentname.txt, 1, this is the original text, this is the translated text

3, 3333, This is yet another short "description", containing commas or quotes, documentname.txt, 1, this is the original text, this is the translated text, , matchRate: {label: "label", value: "value"} }

Then you only need to trim the trailing , matchRate: {label: "label", value: "value"} } from the last line.

This only works if there is always a _LF_ at the end of every line.
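The two steps can also be tried outside Notepad++. The following is a minimal Python sketch and not part of the original answer. It rests on several assumptions: the sample text is shortened and illustrative, the replacement is assumed to end with a space after the comma (as the sample output above suggests), and the second pattern is loosened with \s* so that it tolerates that space and the real line break following _LF_. Python's re module is not the engine Notepad++ uses, so this only approximates what happens in the editor.

```python
import re

# Shortened, illustrative sample; only the first line ends with the _LF_ marker.
text = (
    "{ group: '1', code: '1111', description: '', row: '1', "
    'matchRate: {label: "label", value: "value"} } _LF_\n'
    "{ group: '2', code: '2222', description: '', row: '1', "
    'matchRate: {label: "label", value: "value"} }'
)

# Step 1: every match swallows the text up to and including one non-empty
# quoted value (group 1) or the _LF_ marker (group 2) and is replaced by
# that value plus ", ". A group that did not take part is inserted as "".
step1 = re.sub(r".*?(?:\s'([^']+)'|(_LF_)).*?", r"\1\2, ", text)

# Step 2 (loosened, see the assumptions above): turn the marker and the
# separators and whitespace around it into a single line break.
step2 = re.sub(r",\s*_LF_,\s*", "\n", step1)

print(step2)
```

Two things are worth noticing when running this. The empty description value disappears, because [^']+ requires at least one character, so rows with empty fields end up with fewer columns. The last row keeps its matchRate tail, because no _LF_ follows it.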

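Answer 1: (score: 0)

Use Notepad++ regex find and replace; make sure the Regular expression search mode is selected and ". matches newline" is unchecked.

Edit: updated so that a comma inside an item is not captured (only a single comma per item is allowed).

Find: [^'\r\n]*(?:'([^'\r\n,]*),?([^'\r\n,]*)'|([\r\n]+))(,(?=.*'))?

Replace with: \1\2\3\4

It should give the output below: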

1,1111,This is a short "description" containing commas or quotes,,documentname.txt,1,this is the original text,this is the translated text
2,2222,This is another short "description" containing commas or quotes,,documentname.txt,1,this is the original text,this is the translated text
3,3333,This is yet another short "description" containing commas or quotes,,documentname.txt,1,this is the original text,this is the translated text

The only assumption is that every line ends with a line break, and that it is an actual \r\n rather than the literal _LF_.
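The behaviour of this pattern can be checked on a small sample as well. The sketch below uses Python's re module instead of Notepad++, so it is only an approximation of the editor's behaviour. The sample lines are shortened and illustrative, and every line (including the last one) is assumed to end with a real line break, as the answer requires.

```python
import re

# Shortened, illustrative sample; each line ends with a real "\n".
text = (
    "{ group: '1', code: '1111', "
    "shortDescription: 'A short \"description\", with a comma', "
    "description: '', row: '1', "
    'matchRate: {label: "label", value: "value"} }\n'
    "{ group: '2', code: '2222', shortDescription: 'Another one', "
    "description: '', row: '1', "
    'matchRate: {label: "label", value: "value"} }\n'
)

pattern = r"[^'\r\n]*(?:'([^'\r\n,]*),?([^'\r\n,]*)'|([\r\n]+))(,(?=.*'))?"

# Groups 1 and 2 hold a quoted value split around an optional inner comma,
# group 3 holds the line break, and group 4 keeps the separating comma only
# while another single quote follows on the same line.
result = re.sub(pattern, r"\1\2\3\4", text)

print(result)
```

The empty description survives as an empty field between two commas, the comma inside the short description is dropped, and the matchRate tail is removed together with the rest of each line.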