Question

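I'm trying to do some very simple text processing, or rather reduction. I want to replace every run of spaces (except those inside " ") with a single space. I also have some semantic actions that depend on each character read, which is why I don't want to use a regex. It's a kind of pseudo-FSM model.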

So here's the deal:

s = '''that's my     string, "   keep these spaces     "    but reduce these '''

Desired output:

that's my string, "   keep these spaces     " but reduce these

What I want to do is something like this (I've left out the '"' case to keep the example simple):

out = ""
for i in range(len(s)):

  if s[i].isspace():
    out += ' '
    while s[i].isspace():
      i += 1

  else:
    out += s[i]

I don't quite understand how scopes are created or shared in this case.

Thanks for any advice.

Answer 1

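Use shlex to parse your string into quoted and unquoted parts, then use a regular expression on the unquoted parts to replace each run of whitespace with a single space.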
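
The answer gives no code, so here is a minimal sketch of the second half of the idea. It assumes the double quotes are balanced and never escaped, and it swaps in a plain `str.split('"')` for shlex to separate the quoted parts from the unquoted ones. The helper name `collapse_outside_quotes` is made up for illustration.

```python
import re

def collapse_outside_quotes(s):
    # Assumption: quotes are balanced and unescaped, so after splitting on '"'
    # the even-indexed pieces lie outside quotes and the odd-indexed inside.
    parts = s.split('"')
    for i in range(0, len(parts), 2):
        # Collapse each run of whitespace in an unquoted piece to one space.
        parts[i] = re.sub(r'\s+', ' ', parts[i])
    return '"'.join(parts)

s = '''that's my     string, "   keep these spaces     "    but reduce these '''
print(collapse_outside_quotes(s))
```

Unlike a split-and-join on spaces, `re.sub` keeps one space at the edge of each unquoted piece, so the space next to a quote survives.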

Answer 2

As already suggested, I would use the standard shlex module, with a few tweaks:

import shlex

def reduce_spaces(s):
    lex = shlex.shlex(s)
    lex.quotes = '"'             # ignore single quotes
    lex.whitespace_split = True  # use only spaces to separate tokens
    tokens = iter(lex.get_token, lex.eof)  # exhaust the lexer
    return ' '.join(tokens)

>>> s = '''that's my   string, "   keep these spaces     "   but reduce these '''
>>> reduce_spaces(s)
'that\'s my string, "   keep these spaces     " but reduce these'

Answer 3

I also have some semantic actions that depend on each character read ... it's a kind of pseudo-FSM model.

You can actually implement an FSM:

s = '''that's my     string, "   keep these spaces     "    but reduce these '''


normal, quoted, eating = 0,1,2
state = eating
result = ''
for ch in s:
  if (state, ch) == (eating, ' '):
    continue
  elif (state,ch) == (eating, '"'):
    result += ch
    state = quoted
  elif state == eating:
    result += ch
    state = normal
  elif (state, ch) == (quoted, '"'):
    result += ch
    state = normal
  elif state == quoted:
    result += ch
  elif (state,ch) == (normal, '"'):
    result += ch
    state = quoted
  elif (state,ch) == (normal, ' '):
    result += ch
    state = eating
  else: # state == normal
    result += ch

print(result)

Or, a data-driven version:

actions = {
    'normal' : {
        ' ' : lambda x: ('eating', ' '),
        '"' : lambda x: ('quoted', '"'),
        None: lambda x: ('normal', x)
    },
    'eating' : {
        ' ' : lambda x: ('eating', ''),
        '"' : lambda x: ('quoted', '"'),
        None: lambda x: ('normal', x)
    },
    'quoted' : {
        '"' : lambda x: ('normal', '"'),
        '\\': lambda x: ('escaped', '\\'),
        None: lambda x: ('quoted', x)
    },
    'escaped' : {
        None: lambda x: ('quoted', x)
    }
}

def reduce(s):
    result = ''
    state = 'eating'
    for ch in s:
        state, ch = actions[state].get(ch, actions[state][None])(ch)
        result += ch
    return result

s = '''that's my     string, "   keep these spaces     "    but reduce these '''
print(reduce(s))
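
Since the question mentions semantic actions tied to each character, here is a rough sketch of how a per-character hook could sit in the same kind of loop. The function name `reduce_spaces_fsm` and the `on_char` callback are hypothetical, not part of the answer above, and escaped quotes are not handled.

```python
def reduce_spaces_fsm(s, on_char=None):
    """Collapse runs of spaces outside double quotes.

    on_char (hypothetical hook) is called once per input character with the
    state the machine was in *before* that character was processed.
    """
    out = []
    in_quotes = False
    prev_space = True  # pretend a space was just seen, so leading spaces drop
    for ch in s:
        if on_char is not None:
            on_char('quoted' if in_quotes else 'normal', ch)
        if ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
            prev_space = False
        elif in_quotes:
            out.append(ch)          # keep everything inside quotes
        elif ch == ' ':
            if not prev_space:
                out.append(ch)      # first space of a run is kept
            prev_space = True
        else:
            out.append(ch)
            prev_space = False
    return ''.join(out)

seen = []
print(reduce_spaces_fsm('a   b "c   d"   e', on_char=lambda st, ch: seen.append((st, ch))))
```

Collecting characters in a list and joining once at the end avoids repeated string concatenation, but the state logic is the same as in the versions above.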

Answer 4

i = iter((i for i,char in enumerate(s) if char=='"'))
zones = list(zip(*[i]*2))  # a list of all the "zones" where spaces should not be manipulated
answer = []
space = False
for i,char in enumerate(s):
    if not any(zone[0] <= i <= zone[1] for zone in zones):
        if char.isspace():
            if not space:
                answer.append(char)
        else:
            answer.append(char)
    else:
        answer.append(char)
    space = char.isspace()

print(''.join(answer))

Output:

>>> s = '''that's my     string, "   keep these spaces     "    but reduce these '''
>>> i = iter((i for i,char in enumerate(s) if char=='"'))
>>> zones = list(zip(*[i]*2))
>>> answer = []
>>> space = False
>>> for i,char in enumerate(s):
...     if not any(zone[0] <= i <= zone[1] for zone in zones):
...         if char.isspace():
...             if not space:
...                 answer.append(char)
...         else:
...             answer.append(char)
...     else:
...         answer.append(char)
...     space = char.isspace()
... 
>>> print(''.join(answer))
that's my string, "   keep these spaces     " but reduce these
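
The `zip(*[i]*2)` line is the least obvious step above: it passes the same iterator to `zip` twice, so consecutive quote positions are consumed in pairs. A small sketch of what it produces, using a made-up string and the illustrative name `quote_positions`:

```python
text = 'a "b c" d "e" f'

# Positions of every double quote, produced lazily.
quote_positions = (idx for idx, ch in enumerate(text) if ch == '"')

# The same iterator appears twice, so zip pulls (open, close) pairs from it.
zones = list(zip(*[quote_positions] * 2))
print(zones)  # [(2, 6), (10, 12)]
```

If the string contains an odd number of quotes, `zip` silently drops the last unmatched position, so text after an unclosed quote is treated as unquoted.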

Answer 5

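This is a bit of a hack, but you can do the reduce-to-one-space part with a one-liner.

one_space = lambda s : ' '.join([part for part in s.split(' ') if part])

This joins the non-empty parts, i.e. those that contain no space characters, separated by a single space. Of course, the harder part is separating out the special parts in double quotes. In real production code you would also need to watch out for cases such as escaped double quotes. But assuming you only have well-behaved input, you can split those out too. I assume that in real code you may have more than one double-quoted section.

You can do this by building a list from your string split on double quotes, applying one_space only to the even-indexed items and appending the odd-indexed items unchanged; I believe that is right from working through a few examples.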

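def fix_spaces(s):
  dbl_parts = s.split('"')
  # Even-indexed parts are outside quotes, odd-indexed parts are inside.
  # Note: one_space also drops spaces at the edges of each unquoted part,
  # so a space directly next to a quote is lost.
  normalize = lambda i: one_space(dbl_parts[i]) if not i%2 else dbl_parts[i]
  # Rejoin with '"' to put back the quotes that split() removed.
  return '"'.join([normalize(i) for i in range(len(dbl_parts))])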
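
For the escaped double quotes mentioned above, splitting on every '"' is no longer enough. One possible sketch, assuming a backslash escapes the next character inside a quoted section, is to let a regular expression find the quoted sections. The names `QUOTED` and `fix_spaces_escaped` are made up for this example.

```python
import re

# A double-quoted section; inside it a backslash escapes the next character.
QUOTED = re.compile(r'("(?:\\.|[^"\\])*")')

def fix_spaces_escaped(s):
    # With a capturing group, re.split keeps the quoted sections in the
    # result at odd indices; even indices hold the unquoted text.
    parts = QUOTED.split(s)
    return ''.join(part if i % 2 else re.sub(' +', ' ', part)
                   for i, part in enumerate(parts))

print(fix_spaces_escaped(r'say   "a \"  b"   done'))
```

An unclosed quote does not match the pattern, so the text after it is treated as unquoted and its spaces are collapsed.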

Answer 6

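I'm a bit worried about whether this solution is readable. I modified the OP's suggested string so that it contains several pairs of double quotes.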

s = '''that's my     string,   "   keep these spaces     "" as    well    as these    "    reduce these"   keep these spaces too   "   but not these  '''
s_split = s.split('"')

# The substrings in odd positions of list s_split should retain their spaces.
# These elements have however lost their double quotes during .split('"'),
# so add them back for the new string. For the substrings in even positions,
# remove the multiple spaces in between by splitting them again using .split()
# and joining them with a single space. However, this will not conserve
# leading and trailing spaces. In order to conserve them, add a dummy
# character (in this case '-') at the start and end of the substring before
# the split. Remove the dummy bits after the split.
#
# Finally join the elements in new_string_list to create the desired string.

new_string_list = ['"' + x + '"' if i%2 == 1
                   else ' '.join(('-' + x + '-').split())[1:-1]                   
                   for i,x in enumerate(s_split)]
new_string = ''.join(new_string_list)
print(new_string)

Output:

that's my string, "   keep these spaces     "" as    well    as these    " reduce these"   keep these spaces too   " but not these
