Parsing multiple websites with BeautifulSoup

Asked: 2019-07-05 06:05:28

Tags: python pandas parsing beautifulsoup urllib2

Thanks to the excellent people on this forum, I have managed to put together a working script that extracts podcasts from a site. The code below works fine; I just need to extract the image (thumbnail) inside the `soup.find_all` loop, in addition to the enclosure link:

def get_playable_podcast(soup):
    """
    @param soup: parsed RSS/XML page
    """
    subjects = []

    for content in soup.find_all('item'):

        try:
            link = content.find('enclosure')
            link = link.get('url')
            print("\n\nLink: ", link)

            title = content.find('title')
            title = title.get_text()

            desc = content.find('itunes:subtitle')
            desc = desc.get_text()

            thumbnail = content.find('itunes:image')
            thumbnail = thumbnail.get('href')

        except AttributeError:
            continue

        # Build the dict inside the loop so every episode is kept,
        # not just the last one.
        item = {
            'url': link,
            'title': title,
            'desc': desc,
            'thumbnail': thumbnail,
        }

        subjects.append(item)

    return subjects
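The thumbnail extraction the question asks about can be checked in isolation. The snippet below is a minimal sketch using an inline `<item>` fragment as a stand-in for a real feed (the tag names match the code above; the URLs are made up). For real feeds, parsing with `BeautifulSoup(feed_text, "xml")` (which requires lxml) handles namespaced tags like `itunes:image` more robustly than `html.parser`:

```python
from bs4 import BeautifulSoup

# Inline stand-in for one RSS <item>; in practice this would come
# from requests.get(feed_url).text.
sample_item = """
<item>
  <title>Episode 1</title>
  <itunes:subtitle>A short description</itunes:subtitle>
  <enclosure url="https://example.com/ep1.mp3" />
  <itunes:image href="https://example.com/ep1.jpg" />
</item>
"""

soup = BeautifulSoup(sample_item, "html.parser")
content = soup.find("item")
# Same pattern as in get_playable_podcast: the image URL lives in the
# href attribute of the itunes:image tag.
thumbnail = content.find("itunes:image").get("href")
print(thumbnail)  # https://example.com/ep1.jpg
```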

def compile_playable_podcast(playable_podcast):
    """
    @param playable_podcast: list of dicts describing playable podcasts
    """
    items = []

    for podcast in playable_podcast:
        items.append({
            'label': podcast['title'],
            'path': podcast['url'],
            'info': podcast['desc'],
            'is_playable': True,
        })

    return items

The code below literally downloads the images. I need to merge it into the code above so that it creates a reference to each image instead of downloading it, so the reference can eventually be folded into the `subjects.append` section. Any help would be greatly appreciated:

import re
import urllib.request

import pandas as pd
import requests

resp = requests.get("https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=1000000&page=1").json()
df = pd.DataFrame(resp['posts'], columns=['image'])
df['image'] = df['image'].apply(pd.Series)['large'].replace({'"': '\'', '""': '\'', '"""': '\''}, regex=True)
regex_pattern = r"([^/]+$)"

for index, row in df.iterrows():
    match = re.findall(regex_pattern, row['image'])
    myfilename = ''.join(match)
    print(row['image'])
    print(myfilename)
    # Python 3 moved urlretrieve into urllib.request
    urllib.request.urlretrieve(row['image'], myfilename)
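One way to keep references instead of downloading is to collect the image URLs (and, if useful, the derived filenames) into dicts that can be merged into the `subjects` list. The sketch below assumes the JSON shape implied by the code above (`post['image']['large']`); the helper name and the sample data are made up for illustration:

```python
import re

def image_references(posts):
    """Collect thumbnail URLs without downloading anything."""
    refs = []
    for post in posts:
        url = post.get("image", {}).get("large")
        if not url:
            continue
        # Same regex as above: everything after the last slash.
        filename = "".join(re.findall(r"([^/]+$)", url))
        refs.append({"thumbnail": url, "filename": filename})
    return refs

# Sample data mimicking resp['posts']; a real call would pass resp['posts'].
sample_posts = [{"image": {"large": "https://example.com/img/ep1.jpg"}}]
print(image_references(sample_posts))
```

Each resulting dict can then be merged into the corresponding `item` before `subjects.append(item)`.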

0 Answers:

There are no answers yet.