Question

Can Python code use a regular expression to do something like this?

Input:

> https://test.com, 2017-08-14, "This is the title with , and "anything" in it", "This is the paragraph also with , and "anything" in it"

Desired output:

['https://test.com', '2017-08-14', 'This is the title with , and "anything" in it', 'This is the paragraph also with , and "anything" in it']

Answer 1

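There are several ways to split the string.

The plain built-in split method takes a delimiter as its argument and does exactly what it says on the tin: it splits the string at every occurrence of the delimiter you specify and returns the pieces as a list.

In your case the delimiter you want is ",", but only where the comma is not inside quotes. In the general case you could do this: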

foo = 'https://test.com, 2017-08-14, "This is the title with , and "anything" in it", "This is the paragraph also with , and "anything" in it"'


print(foo.split(','))
# Caveat: this only works if no field contains a ',' itself, because every
# comma becomes a split point, which is not what you want here.

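In this particular case you could also split on, say, ", " (comma plus space), but that fails too: your input contains the fragment title with , and "any, and it would be split incorrectly there.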
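To make that failure concrete, here is a small sketch using the same input string as above. The variable name parts is only for illustration.

```python
foo = 'https://test.com, 2017-08-14, "This is the title with , and "anything" in it", "This is the paragraph also with , and "anything" in it"'

# Splitting on ', ' still cuts inside the quoted fields,
# because each of them contains " , " as ordinary text.
parts = foo.split(', ')

# Print each piece on its own line so the broken fields are easy to spot.
for part in parts:
    print(repr(part))
```

You get six pieces instead of the four you want, and both quoted fields are cut in two.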

In a case like this we can use shlex and its split method. This split method uses whitespace as the delimiter and keeps quoted text together.

So doing:

import shlex
print(shlex.split(foo))

gives us something closer to what we want, but not quite:

>>> ['https://test.com,', '2017-08-14,', 'This is the title with , and anything in it,', 'This is the paragraph also with , and anything in it']

As you can see, the elements still carry annoying trailing commas, which we do not want.

Unfortunately, we cannot simply do

print([item[:-1] for item in shlex.split(foo)])

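because that would also cut off the final 't' of 'it' in the last element. We can, however, use the built-in string method rstrip to strip any commas at the end of each element: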

print([item.rstrip(',') for item in shlex.split(foo)])

which gives the output:

>>> ['https://test.com', '2017-08-14', 'This is the title with , and anything in it', 'This is the paragraph also with , and anything in it']

Very close to what we want, but still not quite right! (The " around 'anything' are missing; shlex swallowed them.)

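Still, we are very close, so I will leave that last little bit as homework for you. You should try to find a solution yourself first, as the others who posted have said.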
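If you get stuck on that last step, one possible direction is a regular expression that matches whole fields instead of splitting on a delimiter. This is only a sketch and it relies on an assumption that is not stated in the question: a quoted field closes only at a double quote that is followed by a comma or by the end of the string, and unquoted fields contain neither commas nor double quotes. The names FIELD and split_fields are made up for this example.

```python
import re

foo = 'https://test.com, 2017-08-14, "This is the title with , and "anything" in it", "This is the paragraph also with , and "anything" in it"'

# Assumption (not from the question): a quoted field closes only at a double
# quote followed by a comma or the end of the string; unquoted fields contain
# neither commas nor double quotes.
FIELD = re.compile(r'"(.*?)"(?=\s*(?:,|$))|([^,"\s][^,"]*)')


def split_fields(line):
    """Return the fields of line, keeping quotes that appear inside a field."""
    # findall yields (quoted, bare) tuples; exactly one side is non-empty
    # for a non-empty field.
    return [quoted if quoted else bare.strip()
            for quoted, bare in FIELD.findall(line)]


fields = split_fields(foo)
print(fields)
```

On the sample input this keeps the inner quotes around anything. A field that itself contains a quote followed by a comma would break the heuristic, so treat it as a starting point and not as a general parser.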

Resources:

https://www.tutorialspoint.com/python/string_split.htm

https://docs.python.org/2/library/shlex.html

P.S. Hint: also take a look at the csv module.

Python regular expression for multiple delimiters, including double quotes
