Thanks to the excellent people on this forum, I have managed to put together a working script that pulls podcasts from a site. The code below works fine; I just need to extract the image (thumbnail) from the line after the `soup.find_all` command, rather than the link:
def get_playable_podcast(soup):
    """
    @param: parsed html page
    """
    subjects = []
    for content in soup.find_all('item'):
        try:
            # Audio URL comes from the enclosure's url attribute
            link = content.find('enclosure')
            link = link.get('url')
            print("\n\nLink: ", link)

            title = content.find('title')
            title = title.get_text()

            desc = content.find('itunes:subtitle')
            desc = desc.get_text()

            # Thumbnail URL comes from the href attribute of <itunes:image>
            thumbnail = content.find('itunes:image')
            thumbnail = thumbnail.get('href')
        except AttributeError:
            continue

        item = {
            'url': link,
            'title': title,
            'desc': desc,
            'thumbnail': thumbnail
        }

        subjects.append(item)

    return subjects
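For reference, the same extraction can be done with only the standard library. This is a minimal, self-contained sketch using `xml.etree.ElementTree` instead of BeautifulSoup, run against a tiny inline feed (the feed contents here are hypothetical stand-ins, not the real site's data); the key point is that the thumbnail is just the `href` attribute of the `itunes:image` element:

```python
import xml.etree.ElementTree as ET

# Tiny inline RSS fragment standing in for the real feed (hypothetical data).
FEED = """<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
<channel>
<item>
  <title>Episode 1</title>
  <itunes:subtitle>A short description</itunes:subtitle>
  <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg"/>
  <itunes:image href="https://example.com/ep1.jpg"/>
</item>
</channel>
</rss>"""

# itunes-prefixed tags live in this namespace.
NS = {"itunes": "http://www.itunes.com/dtds/podcast-1.0.dtd"}

def parse_feed(xml_text):
    subjects = []
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        subjects.append({
            "url": item.find("enclosure").get("url"),
            "title": item.find("title").text,
            "desc": item.find("itunes:subtitle", NS).text,
            # Store the thumbnail URL as a reference; nothing is downloaded.
            "thumbnail": item.find("itunes:image", NS).get("href"),
        })
    return subjects

print(parse_feed(FEED))
```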
def compile_playable_podcast(playable_podcast):
    """
    @param: list containing dicts of key/value pairs for playable podcasts
    """
    items = []
    for podcast in playable_podcast:
        items.append({
            'label': podcast['title'],
            'path': podcast['url'],
            'info': podcast['desc'],
            'is_playable': True,
        })
    return items
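Since each entry in `subjects` already carries a `thumbnail` URL, that value can be passed straight through to the list-item dict. A sketch of the pass-through (the `'thumbnail'` key is what xbmcswift2-style item dicts conventionally accept for artwork, so treat that key name as an assumption to verify against your plugin framework):

```python
def compile_playable_podcast(playable_podcast):
    """Map podcast dicts to list-item dicts, passing the thumbnail URL through."""
    items = []
    for podcast in playable_podcast:
        items.append({
            'label': podcast['title'],
            'path': podcast['url'],
            'thumbnail': podcast['thumbnail'],  # URL reference, no download needed
            'info': podcast['desc'],
            'is_playable': True,
        })
    return items

# Hypothetical demo data in the same shape as get_playable_podcast's output.
demo = [{'title': 'Ep 1', 'url': 'http://example.com/1.mp3',
         'desc': 'demo', 'thumbnail': 'http://example.com/1.jpg'}]
print(compile_playable_podcast(demo))
```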
With the code below I literally download the images. I need to merge it into the code above so that it creates a reference to each image instead of downloading it, so I can eventually fold it into the `subjects.append` section. Any help would be greatly appreciated:
import re
import urllib.request

import pandas as pd
import requests

resp = requests.get("https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=1000000&page=1").json()
df = pd.DataFrame(resp['posts'], columns=['image'])
df['image'] = df['image'].apply(pd.Series)['large'].replace({'"': '\'', '""': '\'', '"""': '\''}, regex=True)

Regex_Pattern = r"([^\/]+$)"  # everything after the last slash, i.e. the filename

for index, row in df.iterrows():
    match = re.findall(Regex_Pattern, row['image'])
    myfilename = ''.join(match)
    print(row['image'])
    print(myfilename)
    urllib.request.urlretrieve(row['image'], myfilename)
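If the goal is only a URL reference rather than a saved file, the pandas/`urlretrieve` detour can be skipped entirely. A minimal sketch, assuming the JSON has the shape the snippet above implies (`posts[i]['image']['large']` holding the image URL; the sample `resp` below is hypothetical data, not a live response), that collects URL/filename pairs without downloading anything:

```python
import os

# Sample response shaped like the episodes API payload (hypothetical data).
resp = {"posts": [
    {"image": {"large": "https://example.com/uploads/episode-001.jpg"}},
    {"image": {"large": "https://example.com/uploads/episode-002.jpg"}},
]}

def image_references(resp):
    """Return (url, filename) pairs instead of downloading each image."""
    refs = []
    for post in resp["posts"]:
        url = post["image"]["large"]
        filename = os.path.basename(url)  # same effect as the ([^/]+$) regex
        refs.append((url, filename))
    return refs

for url, name in image_references(resp):
    print(url, name)
```

Each `url` here is the reference you would store in the item dict; the filename is only needed if you later decide to cache the image locally.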