Python re.sub and re.match not matching the same way?

Asked: 2015-06-01 22:26:43

Tags: python regex

I am trying to remove all instances of the following string from a file:

{ "userID":(some 6 digit number), "array":[]},

Specifically, I want to find all of these substrings and replace them with nothing ('').

I started with re.match to make sure my expression was correct:

matchObj = re.match( r'({.*?"array":\[\]\},?)', g)

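This works fine and returns what I want (I put the question mark in twice to turn off re's greedy default). But when I move to re.sub, it matches many parts of the string that I don't expect it to match. Specifically, these expressions: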

matchObj = re.match( r'({.*?"array":\[\]\},?)', g)
ggg =  re.sub( r'({.*?"array":\[\]\},?)', '', g)

with this value of g:
g = 'fedsgedsgs {"all": [{"userID": 777, "array":[]},azgagaga{"userID": 777, "array":[{"expand":"abs","id":503711372,"sport":18,"start_time":"2015-04-15T16:11:12.000Z","local_start_time":"2015-04-15T17:11:12.000Z","distance":4.281959056854248,"duration":2\
891.0,"speed_avg":5.332083225415182,"speed_max":6.74372,"altitude_min":27.0,"altitude_max":61.0,"ascent":80.0,"descent":86.0},{"expand":"abs","id":470811412,"sport":18,"start_time":"2015-02-11T09:27:10.000Z","local_start_time":"2015-02-\
11T10:27:10.000Z","distance":0.0,"duration":0.0},{"expand":"abs","id":470755226,"sport":18,"start_time":"2015-02-11T09:25:04.000Z","local_start_time":"2015-02-11T10:25:04.000Z","distance":0.0,"duration":0.0,"speed_max":0.0,"altitude_min\
":45.0,"altitude_max":45.0},{"expand":"abs","id":470749841,"sport":18,"start_time":"2015-02-11T09:10:43.000Z","local_start_time":"2015-02-11T10:10:43.000Z","distance":0.7858999967575073,"duration":479.0,"speed_avg":5.90655529922135,"spe\
ed_max":6.82629,"altitude_min":35.0,"altitude_max":57.0,"ascent":45.0,"descent":32.0}]},{"userID": 777, "array":[{"expand":"abs","id":470745921,"sport":0,"start_time":"2015-02-11T09:00:48.000Z","local_start_time":"2015-02-11T15:00:48.00\
0Z","distance":0.0,"duration":15.0,"speed_avg":0.0}]},{"userID": 777, "array":[{"expand":"abs","id":498050248,"sport":2,"start_time":"2015-04-06T14:00:03.000Z","local_start_time":"2015-04-06T19:00:03.000Z","distance":16.55500030517578,"\
duration":2793.51,"speed_avg":21.334450601083514,"speed_max":36.3397,"altitude_min":1.8,"altitude_max":35.5,"ascent":50.7,"descent":61.8},{"expand":"abs","id":498049916,"sport":2,"start_time":"2015-04-06T13:59:35.000Z","local_start_time\
":"2015-04-06T18:59:35.000Z","distance":0.010999999940395355,"duration":10.2,"speed_avg":3.882352920139537,"speed_max":2.072,"altitude_min":8.4,"altitude_max":8.4,"ascent":0.0,"descent":0.0},{"expand":"abs","id":486139822,"sport":2,"sta\
rt_time":"2015-03-15T00:21:08.000Z","local_start_time":"2015-03-15T06:21:08.000Z","distance":23.302000045776367,"duration":3997.54,"speed_avg":20.984705635164357,"speed_max":38.4344,"altitude_min":-7.3,"altitude_max":14.6,"ascent":20.1,\
"descent":42.1},{"expand":"abs","id":486139782,"sport":2,"start_time":"2015-03-15T00:20:50.000Z","local_start_time":"2015-03-15T06:20:50.000Z","distance":0.0,"duration":2.99,"speed_avg":0.0,"speed_max":0.0,"altitude_min":4.8,"altitude_m\
ax":4.8,"ascent":0.0,"descent":0 {"userID": 777, "array":[]}, mmmmmmmm {"userID": 7767, "array":[]}, gggggggg {"userID": 74577, "array":[]}, ggggggggggggggg {"userID": 774447, "array":[]}, hrdshe {"userID": 722277, "array":[]},'

result in this output for ggg:

In[37]:   ggg
Out[37]: 'fedsgedsgs azgagaga mmmmmmmm  gggggggg  ggggggggggggggg  hrdshe '

The expression is replacing strings of this form with '':
  { "userID":(some 6 digit number), "array":[lots of json objects printed here.....]},

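whereas I want to keep those (the entries with non-empty arrays).

I tried removing the escape characters from \[\], since I only want to match "[]", but then I got an error message saying my expression was incomplete. Why am I matching [....stuff....] with junk inside it? How can I match just "[]"?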

Update

So this works:

ggg = re.sub(r'"userID":[0-9]{6,6},"array":\[\]\},', 'found it', g)

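Somehow greediness doesn't seem to be the problem. If anyone can explain to me why the above works and my original attempt doesn't, I'd really like to know.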

2 Answers:

Answer 0 (score: 1)

re.match() is implicitly anchored. That is:

re.match('foo', content)   # find foo only at the beginning of content

...is the same as...

re.match('^foo', content)  # find foo only at the beginning of content

...whereas:

re.sub('foo', 'bar', content) # replace foo with bar everywhere in content

...is implicitly unanchored, making it behave the same as
re.search('foo', content) # find foo everywhere in content

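...which finds instances of foo everywhere in content, not just at the beginning.

Thus, to make a regular expression used with re.sub() behave the same way it does with re.match(), add an explicit ^ anchor (or \A, which still means "start of string" even if re.MULTILINE is set).

(As an aside: trying to modify JSON this way is doomed to end in pain and suffering. Parse, update, and re-serialize; otherwise you needlessly open yourself up to all manner of bugs.)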
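To make the anchoring difference concrete, here is a minimal sketch. The sample string is made up for illustration and is not the data from the question.

```python
import re

content = "bar foo baz foo"  # hypothetical sample text

# re.match only tries the pattern at position 0, so nothing is found here.
print(re.match("foo", content))            # None

# re.search and re.sub scan the whole string.
print(re.search("foo", content).start())   # 4
print(re.sub("foo", "X", content))         # bar X baz X

# With an explicit ^ anchor, re.sub only replaces a match at the very start.
print(re.sub("^foo", "X", content))        # bar foo baz foo (unchanged)
print(re.sub("^foo", "X", "foo baz foo"))  # X baz foo
```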

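Answer 1 (score: 0)

I think you are misunderstanding greedy vs. ungreedy. Being ungreedy does not prevent your regex from matching from a { all the way to a distant "array":[]},.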

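Ungreedy only means the match, once it has started at a {, stops at the nearest "array":[]}, that follows; everything in between is still consumed.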

You can replace the . with [^}] to explicitly forbid your * from "exiting" the {} pair.
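As a rough illustration of that suggestion, the sketch below compares the lazy pattern with a [^}] variant. The string g here is a shortened, made-up stand-in for the data in the question, and the variable names lazy and bounded are only for this example.

```python
import re

# Hypothetical, shortened stand-in for the data in the question.
g = ('x {"userID": 1, "array":[{"id": 5}]}, '
     'y {"userID": 2, "array":[]}, z')

# Lazy dot: starts at the first "{" and runs on to the nearest '"array":[]}'
# after it, swallowing the non-empty entry on the way.
lazy = r'\{.*?"array":\[\]\},?'

# Negated class: the match may not cross a "}", so it cannot leave the
# object it started in.
bounded = r'\{[^}]*?"array":\[\]\},?'

print(re.sub(lazy, '', g))     # x  z
print(re.sub(bounded, '', g))  # x {"userID": 1, "array":[{"id": 5}]}, y  z
```

This is still fragile: an opening brace with no } before the target, such as the {"all": [ prefix in the question's data, would still be swallowed along with the first empty entry.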

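But why not load it with json.loads, clean it up, and write it back out with json.dumps? What if your : and } had some extra spaces or newlines around them (still valid JSON)?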
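A minimal sketch of that parse, filter, and re-serialize approach follows. It assumes the file holds one well-formed JSON object with an "all" list, which is a guess based on the fragment in the question; the sample string raw is hypothetical.

```python
import json

# Hypothetical, well-formed sample; the real file layout is an assumption.
# Note the irregular whitespace in the last entry, which a regex would miss.
raw = ('{"all": [{"userID": 777, "array": []}, '
       '{"userID": 778, "array": [{"id": 1}]}, '
       '{"userID":779,\n "array" : [ ]}]}')

data = json.loads(raw)

# Keep only the entries whose "array" is non-empty.
data["all"] = [entry for entry in data["all"] if entry.get("array")]

cleaned = json.dumps(data)
print(cleaned)  # {"all": [{"userID": 778, "array": [{"id": 1}]}]}
```

Because the parser handles whitespace and nesting, the filter does not depend on how the entries happen to be formatted.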