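I am trying to split a comma-delimited string in Python. The tricky part for me is that some of the fields in the data themselves contain a comma, and those fields are enclosed in quotes (" or '). The resulting split strings should also have the quotes around the fields removed. In addition, some fields can be empty.
Example:
hey,hello,,"hello,world",'hey,world'
needs to be split into 5 parts, like this:
['hey', 'hello', '', 'hello,world', 'hey,world']
Any ideas, suggestions, or help on how to solve this in Python would be greatly appreciated.
Thank you, Vish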
Answer 0 (score: 9)
Sounds like you want the csv module.
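As a rough illustration of that suggestion, here is a minimal sketch using the standard library's csv.reader. The helper name split_with_csv is hypothetical and not part of the original answer. Note that csv.reader accepts a single quotechar, so with the default dialect only the double-quoted field is unwrapped, and the single-quoted field is split at its comma.

```python
import csv

def split_with_csv(line, quotechar='"'):
    """Split one line with csv.reader (hypothetical helper, one quote character only)."""
    # csv.reader takes an iterable of lines, so wrap the single line in a list.
    return next(csv.reader([line], quotechar=quotechar))

# Double-quoted fields are handled by the default dialect.
print(split_with_csv('hey,hello,,"hello,world"'))
# -> ['hey', 'hello', '', 'hello,world']

# The single-quoted field is NOT treated as quoted here, so it is split in two.
print(split_with_csv("hey,hello,,\"hello,world\",'hey,world'"))
# -> ['hey', 'hello', '', 'hello,world', "'hey", "world'"]
```

Passing quotechar="'" switches the reader to single quotes, but then the double-quoted field is no longer recognized, so this alone does not cover input that mixes both styles.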
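Answer 1 (score: 4)
(Edit: because of the way re.findall works, the original answer had problems with empty fields at the edges, so I refactored it a bit and added tests.)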
import re

def parse_fields(text):
    r"""
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\''))
    ['hey', 'hello', '', 'hello,world', 'hey,world']
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\','))
    ['hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(',hey,hello,,"hello,world",\'hey,world\','))
    ['', 'hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(''))
    ['']
    >>> list(parse_fields(','))
    ['', '']
    >>> list(parse_fields('testing,quotes not at "the" beginning \'of\' the,string'))
    ['testing', 'quotes not at "the" beginning \'of\' the', 'string']
    >>> list(parse_fields('testing,"unterminated quotes'))
    ['testing', '"unterminated quotes']
    """
    pos = 0
    exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")
    while True:
        m = exp.search(text, pos)
        result = m.group(2)
        separator = m.group(3)

        yield result

        if not separator:
            break

        pos = m.end(0)

if __name__ == "__main__":
    import doctest
    doctest.testmod()
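(['"]?) matches an optional single or double quote.
(.*?) matches the string itself. This is a non-greedy match, so it matches only as much as necessary rather than eating the whole string. The match is assigned to result, and it is what we actually yield.
\1 is a backreference that matches the same single or double quote we matched earlier (if any).
(,|$) matches either the comma separating each entry or the end of the line. This is assigned to separator.
If separator is false (e.g. empty), there was no separator, so we are at the end of the string and we are done. Otherwise, we update the new start position to where the regular expression finished (m.end(0)) and continue the loop.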
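To make the walkthrough above concrete, the following sketch records what each group captures on every pass through the loop. The helper trace_matches and the constant FIELD are hypothetical names introduced only for this illustration; the pattern itself is the one from the answer.

```python
import re

# Same pattern as in the answer: optional quote, lazy body, same quote, separator.
FIELD = re.compile(r"""(['"]?)(.*?)\1(,|$)""")

def trace_matches(text):
    """Return (quote, field, separator, end_position) for each loop iteration."""
    steps = []
    pos = 0
    while True:
        m = FIELD.search(text, pos)
        steps.append((m.group(1), m.group(2), m.group(3), m.end(0)))
        if not m.group(3):
            # Empty separator means the match ended at the end of the string.
            return steps
        pos = m.end(0)

for quote, field, separator, end in trace_matches('hey,"a,b",'):
    print(repr(quote), repr(field), repr(separator), end)
# '' 'hey' ',' 4
# '"' 'a,b' ',' 10
# '' '' '' 10
```

The last row shows why a trailing comma produces a final empty field: the search restarts at the end of the string and matches an empty body followed by $.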
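Answer 2 (score: 2)
The csv module won't handle the case where both " and ' act as quote characters at the same time. Without a module that provides that kind of dialect, one has to get into the parsing business. To avoid depending on a third-party module, we can use the re module to do the lexical analysis, using the re.MatchObject.lastindex gimmick to associate a token type with the matched pattern.
When run as a script, the following code passes all the tests shown with Python 2.7 and 2.2.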
import re

# lexical token symbols
DQUOTED, SQUOTED, UNQUOTED, COMMA, NEWLINE = xrange(5)

_pattern_tuples = (
    (r'"[^"]*"', DQUOTED),
    (r"'[^']*'", SQUOTED),
    (r",", COMMA),
    (r"$", NEWLINE), # matches end of string OR \n just before end of string
    (r"[^,\n]+", UNQUOTED), # order in the above list is important
    )
_matcher = re.compile(
    '(' + ')|('.join([i[0] for i in _pattern_tuples]) + ')',
    ).match
_toktype = [None] + [i[1] for i in _pattern_tuples]
# need dummy at start because re.MatchObject.lastindex counts from 1

def csv_split(text):
    """Split a csv string into a list of fields.
    Fields may be quoted with " or ' or be unquoted.
    An unquoted string can contain both a " and a ', provided neither is at
    the start of the string.
    A trailing \n will be ignored if present.
    """
    fields = []
    pos = 0
    want_field = True
    while 1:
        m = _matcher(text, pos)
        if not m:
            raise ValueError("Problem at offset %d in %r" % (pos, text))
        ttype = _toktype[m.lastindex]
        if want_field:
            if ttype in (DQUOTED, SQUOTED):
                fields.append(m.group(0)[1:-1])
                want_field = False
            elif ttype == UNQUOTED:
                fields.append(m.group(0))
                want_field = False
            elif ttype == COMMA:
                fields.append("")
            else:
                assert ttype == NEWLINE
                fields.append("")
                break
        else:
            if ttype == COMMA:
                want_field = True
            elif ttype == NEWLINE:
                break
            else:
                print "*** Error dump ***", ttype, repr(m.group(0)), fields
                raise ValueError("Missing comma at offset %d in %r" % (pos, text))
        pos = m.end(0)
    return fields

if __name__ == "__main__":
    tests = (
        ("""hey,hello,,"hello,world",'hey,world'\n""", ['hey', 'hello', '', 'hello,world', 'hey,world']),
        ("""\n""", ['']),
        ("""""", ['']),
        ("""a,b\n""", ['a', 'b']),
        ("""a,b""", ['a', 'b']),
        (""",,,\n""", ['', '', '', '']),
        ("""a,contains both " and ',c""", ['a', 'contains both " and \'', 'c']),
        ("""a,'"starts with "...',c""", ['a', '"starts with "...', 'c']),
        )
    for text, expected in tests:
        result = csv_split(text)
        print
        print repr(text)
        print repr(result)
        print repr(expected)
        print result == expected
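The code above targets Python 2 (xrange, print statements). As a hedged illustration of just the lastindex technique, here is a small sketch in Python 3 syntax that only tokenizes and does not assemble fields. The names TOKEN_SPECS, TOKEN_NAMES, MATCH_TOKEN, and tokenize are hypothetical, and string labels stand in for the integer token symbols used in the answer.

```python
import re

# (name, pattern) pairs; order matters because earlier alternatives win.
TOKEN_SPECS = (
    ("DQUOTED", r'"[^"]*"'),
    ("SQUOTED", r"'[^']*'"),
    ("COMMA", r","),
    ("END", r"$"),            # end of string, or just before a trailing \n
    ("UNQUOTED", r"[^,\n]+"),
)
# Each pattern gets its own capturing group, so lastindex identifies which one matched.
MATCH_TOKEN = re.compile("|".join("(%s)" % p for _, p in TOKEN_SPECS)).match
# Dummy entry at index 0 because group numbers (and lastindex) count from 1.
TOKEN_NAMES = [None] + [name for name, _ in TOKEN_SPECS]

def tokenize(text):
    """Return a list of (token_name, matched_text) pairs for one line."""
    tokens = []
    pos = 0
    while True:
        m = MATCH_TOKEN(text, pos)
        if m is None:
            raise ValueError("Problem at offset %d in %r" % (pos, text))
        name = TOKEN_NAMES[m.lastindex]
        tokens.append((name, m.group(0)))
        if name == "END":
            return tokens
        pos = m.end(0)

print(tokenize('a,"b,c"'))
# [('UNQUOTED', 'a'), ('COMMA', ','), ('DQUOTED', '"b,c"'), ('END', '')]
```

A field assembler like csv_split can then be written as a small state machine over these token names, which is what the want_field flag does in the answer.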
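Answer 3 (score: 2)
I put together something like this. It is probably quite redundant, but it does the job for me. You will have to adapt it to your own specifications: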
import re

def csv_splitter(line):
    splitthese = [0]   # indices where the line will be cut
    splitted = []
    splitpos = True    # True while outside double quotes
    for nr, i in enumerate(line):
        if i == "\"" and splitpos == True:
            splitpos = False
        elif i == "\"" and splitpos == False:
            splitpos = True
        if i == "," and splitpos == True:
            splitthese.append(nr)
    splitthese.append(len(line)+1)
    for i in range(len(splitthese)-1):
        # strip the leading comma and any double quotes from each slice
        splitted.append(re.sub("^,|\"","",line[splitthese[i]:splitthese[i+1]]))
    return splitted
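The function above only tracks double quotes. One possible way to adapt it to the question's mixed quoting is sketched below; this is an assumption about what "adapting" could look like, not the answerer's code, and the name split_quoted and its parameters are hypothetical. It remembers which quote character opened the current field and treats a quote as a delimiter only at the start of a field.

```python
def split_quoted(line, delimiter=",", quotes="\"'"):
    """Split a line on delimiter, honoring fields wrapped in either quote style."""
    fields = []
    current = []          # characters of the field being built
    active_quote = None   # the quote character that opened the field, if any
    for ch in line:
        if active_quote:
            if ch == active_quote:
                active_quote = None       # closing quote: drop it
            else:
                current.append(ch)        # delimiters inside quotes are literal
        elif ch in quotes and not current:
            active_quote = ch             # opening quote at the start of a field
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)            # includes quotes in the middle of a field
    fields.append("".join(current))
    return fields

print(split_quoted("hey,hello,,\"hello,world\",'hey,world'"))
# ['hey', 'hello', '', 'hello,world', 'hey,world']
```

Unlike the regex-based answer, this sketch silently drops the opening quote of an unterminated quoted field, so stricter input validation would need to be added if that matters.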