I have a string containing multiple URLs. I managed to extract the first URL, but I really need all of them. My script so far:
data = ['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg":[355,342],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg":[441,425],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg":[500,482],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg":[483,466],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg":[399,385]}']
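text = data[0]  # data is a one-element list, so work on the string inside it
url = text[text.find("https://"):]  # drop everything before the first URL
url[:url.find('"')]  # cut at the next double quote: yields only the first URL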
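Sorry, the script above doesn't use a regex; it is another approach I was trying. My regex script below does almost the same thing. I really don't mind which approach we use. I just want to get all of the URLs, because my script only extracts the first one.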
import re

url = re.search(r'(https)://.*?\.(jpg)', data[0])  # re.search stops at the first match
if url:
    print(url.group(0))
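For background: I am scraping Amazon products. I have also updated the string to one of the actual examples. Thanks, everyone, for the comments and help.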
Answer 0: (score: 1)
Maybe something like this:
URL_list = [i for i in data[0].split('"') if 'http' in i]
It doesn't use a regex, but I don't think a regex is needed for this code.
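If a regex is preferred after all, `re.findall` returns every match, whereas the `re.search` call in the question stops at the first one. Here is a minimal sketch, assuming every URL ends in `.jpg` and contains no double quote. The `sample` string and the `extract_urls` name are made up for illustration and are not from the question:

```python
import re

# Hypothetical shortened sample in the same shape as data[0] from the question.
sample = ('https://example.com/a._SX342_.jpg":[355,342],'
          '"https://example.com/b.jpg":[500,482]}')

def extract_urls(text):
    # [^"]*? keeps each match inside a single quoted key;
    # findall collects all non-overlapping matches, in order.
    return re.findall(r'https://[^"]*?\.jpg', text)

urls = extract_urls(sample)
print(urls)
```

With the question's data, this would be called as `extract_urls(data[0])`.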
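Answer 1: (score: 1)
Your new example string (from data[0]) is missing an opening curly brace and a double quote, but after adding them you can read it as JSON with the standard library. You probably just copied and pasted it incorrectly.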
In[2]: data = ['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg":[355,342],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg":[441,425],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg":[500,482],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg":[483,466],"https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg":[399,385]}']
In[3]: import json
In[4]: d = json.loads('{"%s' % data[0])
In[5]: d
Out[5]:
{'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg': [355,
342],
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg': [441,
425],
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg': [500,
482],
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg': [483,
466],
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg': [399,
385]}
In[6]: list(d.keys())
Out[6]:
['https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX342_.jpg',
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX425_.jpg',
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL.jpg',
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX466_.jpg',
'https://images-na.ssl-images-amazon.com/images/I/41M9WbK3MDL._SX385_.jpg']
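The repair-and-parse steps from the session above could be wrapped in a small helper. This is only a sketch: `urls_from_fragment` and the shortened `sample` string are hypothetical, and it assumes the only damage is the missing leading `{"`, as in the question's example:

```python
import json

def urls_from_fragment(fragment):
    # Assumption: the fragment is a JSON object that lost only its leading '{"'.
    if not fragment.startswith('{'):
        fragment = '{"' + fragment
    # The object's keys are the URLs; dicts keep insertion order in Python 3.7+.
    return list(json.loads(fragment).keys())

# Hypothetical shortened sample in the same shape as data[0].
sample = 'https://example.com/a.jpg":[355,342],"https://example.com/b.jpg":[500,482]}'
urls = urls_from_fragment(sample)
print(urls)
```

If the fragment is damaged in some other way, `json.loads` raises `json.JSONDecodeError`, which is arguably better than silently returning partial results.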