Basically, I want to keep double letters such as 'ss' but collapse long runs such as 'iiiiiiiiiiiiiiii'.
I am cleaning data for text analysis.
import itertools
s_input = "guess who just got shoes boiiiiiiiiiiiiii"
print(''.join(i for i, _ in itertools.groupby(s_input))) #this also takes out the 'ss' in guess
>"gues who just got shoes boi"
The goal is to get the following: "guess who just got shoes boi"
Note that "guess" keeps its "ss".
Answer 0 (score: 1)
You can do it like this:
print(''.join(i if len(g) > 2 else ''.join(g)
              for i, g in itertools.groupby(s_input)
              for g in [list(g)]))
But that would be rather ugly.
Answer 1 (score: 1)
For complex generators, I like to write a generator function:
import itertools

def kill_long_dups(s):
    for key, group in itertools.groupby(s):
        group = list(group)
        if len(group) > 2:
            yield key
        else:
            yield from group

s_input = "guess who just got shoes boiiiiiiiiiiiiii"
print(''.join(kill_long_dups(s_input)))
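If the cutoff of two might need tuning for other data, the same generator idea can take it as a parameter. This is only a sketch: the names `collapse_runs`, `clean`, and `max_keep` are made up for illustration and are not part of the answer above.

```python
import itertools

def collapse_runs(s, max_keep=2):
    """Yield the characters of s, shrinking any run longer than max_keep to one character."""
    for key, group in itertools.groupby(s):
        run_length = sum(1 for _ in group)  # count the run without building a list
        if run_length > max_keep:
            yield key                       # long run: keep a single character
        else:
            yield key * run_length          # short run: keep it unchanged

def clean(s, max_keep=2):
    # Hypothetical convenience wrapper that joins the generator output.
    return ''.join(collapse_runs(s, max_keep))

print(clean("guess who just got shoes boiiiiiiiiiiiiii"))  # guess who just got shoes boi
print(clean("aaa bb c", max_keep=1))                       # a b c
```

With the default `max_keep=2` this behaves like `kill_long_dups`; a smaller value collapses doubles as well.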
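Answer 2 (score: 1)
You can do this with re.sub:
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn't found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed.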
import re
s_input = "guess who just got shoes boiiiiiiiiiiiiii"
print(re.sub(r'(\w)\1{2,}', r'\1', s_input))
Output:
guess who just got shoes boi
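One thing to watch: `\w` matches only word characters, so a run of punctuation such as "!!!!" is left alone. If those should be collapsed too, a possible variant swaps `\w` for `.`. This is a sketch under that assumption; `squeeze_long_runs` and `_LONG_RUN` are hypothetical names, and note that this version also collapses runs of three or more spaces.

```python
import re

# Any single character (including newlines, via DOTALL) repeated 3 or more times.
_LONG_RUN = re.compile(r'(.)\1{2,}', flags=re.DOTALL)

def squeeze_long_runs(text):
    """Replace each run of 3+ identical characters with one copy of that character."""
    return _LONG_RUN.sub(r'\1', text)

print(squeeze_long_runs("guess who just got shoes boiiiiii!!!!"))
# guess who just got shoes boi!
```

Doubles such as 'ss' are still untouched because the pattern needs the captured character plus at least two repeats.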