Regex to extract URLs from a string

Time: 2018-04-25 10:02:25

Tags: python regex python-3.x python-2.7 list-comprehension

This is my string, from which I have to extract the URLs:

s = "'0352442':{url:'https://www.riteaid.com/shop/nexium-24hr-42-ct-capsules-0352442'},'0370009':{url:'https://www.riteaid.com/shop/rite-aid-pharmacy-epsom-salt-first-aid-6-lb-2-72-kg-0370009'},'0303249':{url:'https://www.riteaid.com/shop/huggies-natural-care-unscented-baby-wipes-soft-pack-56-count-0303249'},'0398568':{url:'https://www.riteaid.com/shop/rite-aid-sterile-pads-4-x4-25-ea-0398568'},}"

This is the code I have tried so far:

urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', s)

But it only prints this URL, repeated:

    ['https://www.riteaid.com']

2 answers:

Answer 0: (score: 1)

Since, as you mentioned, your data is a plain string, you need a regex tailored to your particular case.

s = "'0352442':{url:'https://www.riteaid.com/shop/nexium-24hr-42-ct-capsules-0352442'},'0370009':{url:'https://www.riteaid.com/shop/rite-aid-pharmacy-epsom-salt-first-aid-6-lb-2-72-kg-0370009'},'0303249':{url:'https://www.riteaid.com/shop/huggies-natural-care-unscented-baby-wipes-soft-pack-56-count-0303249'},'0398568':{url:'https://www.riteaid.com/shop/rite-aid-sterile-pads-4-x4-25-ea-0398568'},}"

urls = re.findall(r"url:'(https?://.*?)'}", s)

result:
['https://www.riteaid.com/shop/nexium-24hr-42-ct-capsules-0352442',
 'https://www.riteaid.com/shop/rite-aid-pharmacy-epsom-salt-first-aid-6-lb-2-72-kg-0370009',
 'https://www.riteaid.com/shop/huggies-natural-care-unscented-baby-wipes-soft-pack-56-count-0303249',
 'https://www.riteaid.com/shop/rite-aid-sterile-pads-4-x4-25-ea-0398568']

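Explanation:

url:'(http : the literal text url:' followed by http; the ( opens the capture group

s? : an optional literal character "s"

:// : literal string

.*? : any characters, matched non-greedily (as few as possible)

)'} : the ) closes the capture group; '} is a literal string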
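To illustrate why the non-greedy .*? matters, here is a minimal sketch that compares it with a greedy .* on a shortened string. The `sample` string and its example.com URLs are made up for illustration and are not part of the original data.

```python
import re

# Hypothetical, shortened sample in the same shape as the original string.
sample = "'1':{url:'https://example.com/a'},'2':{url:'https://example.com/b'},}"

# Non-greedy: .*? stops at the first '} that follows each URL.
lazy = re.findall(r"url:'(https?://.*?)'}", sample)

# Greedy: .* runs on to the last '} in the string, so everything
# between the first and the last URL ends up in a single match.
greedy = re.findall(r"url:'(https?://.*)'}", sample)

print(lazy)    # one entry per URL
print(greedy)  # a single, over-long entry
```

With the greedy version the list has only one element, which is why the non-greedy quantifier is used here.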

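Answer 1: (score: 0)

If you have to use a regex and, for the current example, want to match between {url:' and '}, you could use a positive lookbehind (?<= and a positive lookahead (?=, and match the URL with a negated character class [^']+ that matches any character other than ' one or more times.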

(?<={url:')[^']+(?='})


You could also make the pattern less strict for the example data and leave out the leading { and the trailing }:

(?<=url:')[^']+(?=')
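As a rough sketch of how the two lookaround patterns could be used from Python, the snippet below runs both with re.findall. The `sample` string and its example.com URLs are invented for illustration, and the braces are escaped here only for readability. Note that Python's re module needs a fixed-width lookbehind, which both patterns satisfy.

```python
import re

# Hypothetical, shortened sample in the same shape as the original string.
sample = "'1':{url:'https://example.com/a'},'2':{url:'https://example.com/b'},}"

# Strict: the URL must sit between {url:' and '}.
strict = re.findall(r"(?<=\{url:')[^']+(?='\})", sample)

# Looser: only require url:' before the URL and a closing ' after it.
loose = re.findall(r"(?<=url:')[^']+(?=')", sample)

print(strict)
print(loose)
```

Because the lookarounds consume no characters, findall returns the bare URLs without needing a capture group.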