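One of my sites went offline a while ago and I need to recover its images. I've managed to write some Python that uses Beautiful Soup to pull the code out of a script tag. Now I need to parse some URLs out of the extracted text. The URLs I need are the ones for the "large"
images. I'm not sure how to add a loop so that I get every image rather than just the first one, and how to strip the quotation marks. Any help would be appreciated.
Extracted text: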
var gallery_items = [{
"type": "image",
"medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-400x267.jpg",
"medium-height": 267,
"medium-width": 400,
"large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg",
"large-height": 450,
"large-width": 675,
"awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755.jpg",
"caption": ""
}, {
"type": "image",
"medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-400x267.jpg",
"medium-height": 267,
"medium-width": 400,
"large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg",
"large-height": 450,
"large-width": 675,
"awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715.jpg",
"caption": ""
}];
Python script:
from bs4 import BeautifulSoup
import urllib.request as request
import re
folder = r'./gallery'
URL = 'https://web.archive.org/web/20180324152250/http://www.example.com:80/project/test-museum-visitors-center/'
response = request.urlopen(URL)
soup = BeautifulSoup(response, 'html.parser')
scriptCnt = soup.find('div', {'class': 'posts-wrapper'})
script = scriptCnt.find('script').text
try:
    found = re.search('"large":(.+?)"', script).group(1)
except AttributeError:
    found = 'None Found!'
print(found)
Output:
"https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg
Answer 0 (score: 1)
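The data you have is JSON, so it can be parsed easily with Python's json library. All you need to do is carefully cut the JSON out of the surrounding JavaScript and feed it to the JSON parser. The code could look something like this: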
import json
script_str = r'''var gallery_items = [{ "type": "image", "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-400x267.jpg", "medium-height": 267, "medium-width": 400, "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg", "large-height": 450, "large-width": 675, "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755.jpg", "caption": "" }, { "type": "image", "medium": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-400x267.jpg", "medium-height": 267, "medium-width": 400, "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg", "large-height": 450, "large-width": 675, "awp-gallery": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715.jpg", "caption": "" }];'''
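prefix = 'var gallery_items = '
# Start right after the assignment and stop at the "]" that closes the array.
start = script_str.find(prefix) + len(prefix)
end = script_str.find('];', start) + 1
# json.loads also turns the escaped "\/" sequences into plain "/".
items = json.loads(script_str[start:end])
for item in items:
    print(item['large'])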
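If the script tag on the real page contains more than this one assignment, the same idea can be wrapped in a small helper. The following is only a sketch: the function name extract_large_urls and its var_name parameter are my own, and it assumes the array ends at the first "];" after the assignment, so it would break if a caption ever contained that sequence.

```python
import json
import re


def extract_large_urls(script_text, var_name="gallery_items"):
    """Return the "large" URL of every gallery item found in script_text."""
    # Capture the array literal assigned to the variable, up to the closing "];".
    pattern = r"var\s+" + re.escape(var_name) + r"\s*=\s*(\[.*?\])\s*;"
    match = re.search(pattern, script_text, re.DOTALL)
    if match is None:
        return []
    # json.loads unescapes "\/" to "/" and drops the surrounding quotes.
    items = json.loads(match.group(1))
    return [item["large"] for item in items if "large" in item]


# Shortened version of the extracted text from the question.
sample = r'''var gallery_items = [{"type": "image", "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5755-675x450.jpg", "caption": ""}, {"type": "image", "large": "https:\/\/web.archive.org\/web\/20180324152250\/http:\/\/www.example.com\/wp-content\/uploads\/2017\/06\/test_hhf_5715-675x450.jpg", "caption": ""}];'''

for url in extract_large_urls(sample):
    print(url)
```

In your script you would pass in the script variable you already get from Beautiful Soup in place of sample. Each returned URL could then be handed to something like urllib.request.urlretrieve to save the file into your gallery folder.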
Hope this helps! Cheers!