Python regex for multiple delimiters, including double quotes

Time: 2017-08-16 02:21:07

Tags: python regex parsing

Can Python code using a regular expression do something like this?

Input:

> https://test.com, 2017-08-14, "This is the title with , and "anything" in it", "This is the paragraph also with , and "anything" in it"

Desired output:

['https://test.com', '2017-08-14', 'This is the title with , and "anything" in it', 'This is the paragraph also with , and "anything" in it']

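1 answer:

Answer 0 (score: 0)

You can use several splitting approaches.

The plain built-in split method takes a delimiter as an argument and does exactly what it says on the tin: it splits the string at every occurrence of the delimiter you specify and returns the pieces as a list.

In your case the delimiter you want is ',', but only where the comma is not inside quotes. In the general case you could do this: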

foo = 'https://test.com, 2017-08-14, "This is the title with , and "anything" in it", "This is the paragraph also with , and "anything" in it"'


print(foo.split(','))
# Caveat: this only works if the fields themselves contain no ',' characters, because every comma becomes a split point, which you do not want here.

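In this particular case you could also split on, say, ", " (a comma followed by a space), but that fails too: your input contains the fragment 'title with , and "any', which would be split incorrectly.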
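To make the failure of both naive splits concrete, here is a small sketch using the input from the question. It is written with print() calls for Python 3, and the variable names by_comma and by_comma_space are only illustrative.

```python
foo = ('https://test.com, 2017-08-14, '
       '"This is the title with , and "anything" in it", '
       '"This is the paragraph also with , and "anything" in it"')

# Splitting on a bare comma cuts through the quoted fields as well.
by_comma = foo.split(',')

# Splitting on comma + space does no better, because the commas inside the
# quoted text are also followed by a space.
by_comma_space = foo.split(', ')

print(len(by_comma), len(by_comma_space))  # 6 6
print(by_comma_space[2])                   # "This is the title with
```

Both calls return six pieces instead of the four fields we want.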

Instead, we can use shlex and its split method. Note that this split uses whitespace as the delimiter.

So doing:

import shlex

print(shlex.split(foo))

gets us closer to what we want, but not quite:

>>> ['https://test.com,', '2017-08-14,', 'This is the title with , and anything in it,', 'This is the paragraph also with , and anything in it']

As you can see, the elements carry annoying trailing commas, which we don't want.

Unfortunately, we can't simply do

print([s[:-1] for s in shlex.split(foo)])

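because that would cut the final 't' off 'it' in the last element, which has no trailing comma. We can instead use the built-in string method rstrip to remove any commas from the end of each element:

print([s.rstrip(',') for s in shlex.split(foo)])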

which gives the output:

>>> ['https://test.com', '2017-08-14', 'This is the title with , and anything in it', 'This is the paragraph also with , and anything in it']

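Very close to what we want, but not quite right! (The quotes around "anything" are missing; shlex swallowed them!)

However, we are very close, and I'll leave that last little tidbit as homework for you; as others have already said, you should try to find a solution yourself first.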
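For readers who want to compare notes after trying it themselves, here is one possible regex-based sketch. It rests on an assumption that the question does not guarantee: a closing quote is always followed directly by a comma or by the end of the input, which happens to hold for the sample. The names FIELD_RE and split_fields are illustrative, not part of any library.

```python
import re

foo = ('https://test.com, 2017-08-14, '
       '"This is the title with , and "anything" in it", '
       '"This is the paragraph also with , and "anything" in it"')

# Assumed field grammar (a heuristic, not a general CSV parser):
#   1. a quoted field, whose closing quote is the first '"' that is followed
#      by optional whitespace and then a comma or the end of the input, or
#   2. a bare field: a run of characters with no comma and no quote.
FIELD_RE = re.compile(r'"(.*?)"(?=\s*(?:,|$))|([^\s,"][^,"]*)')

def split_fields(text):
    """Return the fields of text, keeping inner quotes and commas intact."""
    fields = []
    for match in FIELD_RE.finditer(text):
        quoted, bare = match.groups()
        # Quoted fields are kept verbatim; bare fields lose trailing spaces.
        fields.append(quoted if quoted is not None else bare.strip())
    return fields

print(split_fields(foo))
```

On the sample input this prints the four fields from the desired output. It would misparse a field whose inner text contains a quote immediately followed by a comma, since that looks like a closing quote under the assumption above.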

Resources:

https://www.tutorialspoint.com/python/string_split.htm

https://docs.python.org/2/library/shlex.html

P.S. Hint: take a look at the csv module as well.
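As a starting point for that hint, here is a minimal sketch using csv.reader with skipinitialspace=True. This is only one guess at how the module might be applied, not the answerer's solution. It keeps the commas inside the quoted fields together, but the inner quotes in the sample input are not escaped the way CSV expects (doubled), so they are not preserved faithfully and some further work is still needed.

```python
import csv

foo = ('https://test.com, 2017-08-14, '
       '"This is the title with , and "anything" in it", '
       '"This is the paragraph also with , and "anything" in it"')

# skipinitialspace=True makes the reader ignore the space after each comma,
# so a field that begins with '"' after ", " is treated as a quoted field.
row = next(csv.reader([foo], skipinitialspace=True))

print(len(row))  # 4
print(row[:2])   # ['https://test.com', '2017-08-14']
print(row[2])    # the title, but with its inner quotes mangled
```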