Extracting multiple URLs from a string

Date: 2018-06-10 16:36:53

Tags: python regex

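I have a string that contains multiple URLs. I managed to extract the first URL using a regular expression, but I really need all of them. My script so far: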

data = ['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg":[355,342],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg":[441,425],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg":[500,482],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg":[483,466],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg":[399,385]}']
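url = data[0][data[0].find("https://"):]  # data is a one-element list, so search data[0]
url[:url.find('"')]  # cut at the first double quote, which leaves only the first URL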

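Sorry, the script above doesn't use a regex; it was another way I tried to do this. My regex script below does almost the same thing. I don't really mind which approach we use. I just want to get all of the URLs, because my script only extracts the first one.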

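import re

# data is a one-element list, so search the string in data[0]
url = re.search(r'(https)://.*?\.(jpg)', data[0])
if url:
    print(url.group(0))  # re.search stops at the first match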

For context, I'm scraping Amazon products. I've also updated the string to one of the real examples. Thanks, everyone, for the comments and help.

2 Answers:

Answer 0 (score: 1)

Maybe like this:

URL_list = [i for i in data[0].split('"') if 'http' in i]  # data[0] is the string; data itself is a list

It doesn't use a regex, but I don't think a regex is needed for this code.
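
If a regex is preferred after all, `re.findall` returns every match rather than only the first one that `re.search` gives. This is a minimal sketch and not part of the answer above. It assumes the string sits in `data[0]` as in the question, that every URL ends in `.jpg`, and that no URL contains a double quote. The sample is shortened to two of the question's entries.

```python
import re

# Shortened sample in the same shape as the question's data (two of the five entries).
data = ['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg":[355,342],'
        '"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg":[500,482]}']

# findall scans left to right and returns all non-overlapping matches.
# The pattern has no capture groups, so each item is the full matched URL.
# [^"]+? keeps a match from running past the closing quote of a key.
urls = re.findall(r'https://[^"]+?\.jpg', data[0])
print(urls)
```

With capture groups in the pattern, as in the question's `(https)` and `(jpg)`, `findall` would return tuples of the groups instead of whole URLs, which is why the groups are dropped here.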

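Answer 1 (score: 1)

Your new example string (from data[0]) is missing an opening curly brace and double quote, but once you add them you can read it as JSON with the standard library. You probably just copy/pasted it incorrectly.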

In[2]: data = ['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg":[355,342],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg":[441,425],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg":[500,482],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg":[483,466],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg":[399,385]}']
In[3]: import json
In[4]: d = json.loads('{"%s' % data[0])
In[5]: d
Out[5]: 
{'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg': [355,
  342],
 'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg': [441,
  425],
 'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg': [500,
  482],
 'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg': [483,
  466],
 'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg': [399,
  385]}
In[6]: list(d.keys())
Out[6]: 
['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg',
 'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg',
 'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg',
 'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg',
 'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg']
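
The session above could be wrapped in a small function for use in a script. This is a sketch only: the name `extract_urls` is hypothetical, and it assumes the fragment is valid JSON apart from the missing leading `{"`, as in the question's sample. The sample below is shortened to two of the original entries.

```python
import json

def extract_urls(fragment):
    """Return the keys of a JSON object whose leading '{"' was cut off.

    Hypothetical helper; assumes the rest of the fragment is valid JSON.
    """
    try:
        parsed = json.loads('{"' + fragment)  # restore the missing brace and quote
    except json.JSONDecodeError:
        return []  # the fragment was not in the expected shape
    return list(parsed.keys())

# Shortened sample in the same shape as data[0] from the question.
sample = ('https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg":[355,342],'
          '"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg":[500,482]}')
urls = extract_urls(sample)
print(urls)
```

Parsing as JSON also keeps the size pairs available as the dictionary values, which the string-splitting and regex approaches discard.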