Scraping links from a website with Python / Beautiful Soup for a Kodi add-on

Time: 2019-06-07 00:06:15

Tags: python web-scraping plugins beautifulsoup kodi

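The website I am trying to get media links from (for a Kodi add-on) does not have much markup such as class attributes, but the layout of each link is distinctive.

I have created a basic Kodi add-on from another working add-on, but I am having trouble getting Python / BeautifulSoup to scrape the links. Other add-ons rely on things like class attributes, but the website I want to scrape does not use many of them.

I have tried various forums with no luck; most Kodi add-on forums are fairly old and not very active. The guides I have looked at seem to jump quickly from step 1 to step 1000, and the examples they give are not relevant. I have looked through 30 or so add-ons that I thought should help, but I can't work it out.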

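The media links, episode titles, descriptions and images I want to scrape are listed at www.thisiscriminal.com/episodes

Everything I have done on the add-on so far is in the Github-repository

I can clearly see where they are in the page source (see the code below)

I basically just need to be able to parse the website, find the following for each episode, populate it as a link on the Kodi add-on page, and then list the next one below it. Any help would be greatly appreciated. I have been trying to do this for three days straight, and I am both glad and annoyed that I dropped out of the IT degree I started in 2002.

Website code I need to extract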

(episode image)
<img width="300" height="300" ...
https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art.png" ../>    

(episode title)
<h3><a href="https://thisiscriminal.com/episode-115-cecilia-5-24-19/">Cecilia</a></h3>

(episode number)
<h4>Episode #115</h4>

(episode link)
<p><a href="https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/a91a9494-fb45-48c5-ad4c-2615bfefd81b/Episode_115_Cecilia_Part_1.mp3"

(episode description)
</header>When Cecilia....</article>

CODE

import requests
import re
from bs4 import BeautifulSoup

def get_soup(url):
    """
    @param: url of site to be scraped
    """
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')

    print "type: ", type(soup)
    return soup

get_soup("https://thisiscriminal.com/episodes")

def get_playable_podcast(soup):
    """
    @param: parsed html page
    """
    subjects = []

    for content in soup.find_all('a'):

        try:
            link = content.find('<p><a href="https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/')
            link = link.get('href')
            print "\n\nLink: ", link

            title = content.find('<h4>Episode ')
            title = title.get_text()

            desc = content.find('div', {'class': 'summary'})
            desc = desc.get_text()


            thumbnail = content.find('img')
            thumbnail = thumbnail.get('src')
        except AttributeError:
            continue


        item = {
                'url': link,
                'title': title,
                'desc': desc,
                'thumbnail': thumbnail
        }

        #needto check that item is not null here
        subjects.append(item)

    return subjects

2019-06-09 00:05:35.719 T:1916360240   ERROR: Control 55 in window 10502 has been asked to focus, but it can't
2019-06-09 00:05:41.312 T:1165988576   ERROR: EXCEPTION Thrown (PythonToCppException) : -->Python callback/script returned the following error<--
 - NOTE: IGNORING THIS CAN LEAD TO MEMORY LEAKS!
Error Type:
Error Contents: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
Traceback (most recent call last):
  File "/home/osmc/.kodi/addons/plugin.audio.abcradionational/addon.py", line 44
    desc = soup.get_text().replace('\xa0', ' ').replace('\n', ' ')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
-->End of Python script error report<--
2019-06-09 00:05:41.636 T:1130349280   ERROR: GetDirectory - Error getting plugin://plugin.audio.abcradionational/
2019-06-09 00:05:41.636 T:1916360240   ERROR: CGUIMediaWindow::GetDirectory(plugin://plugin.audio.abcradionational/) failed

2 answers:

Answer 0: (score: 0)

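As Jack pointed out, the page response includes JavaScript code that makes AJAX calls. That code is part of the page response, but it is not executed when the page is fetched outside a browser, so the episode data never appears in the HTML you parse.

While it is possible to have the page rendered for you, I suggest a different approach.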

Navigate to the page using any browser (Chrome is shown here). Press F12 to open the developer tools.

[Screenshot: Developer Tools open]

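We are interested in the Network tab. Select the XHR filter as well. With the developer tools open, press Ctrl + R to reload the page and record the XHR requests.

You should see something like this: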

[Screenshot: Dev Tools XHR]

You can inspect each request. I think you will be interested in the /episodes endpoint:

[Screenshot: Preview]

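This is a structured response, more specifically JSON. To use this endpoint, you simply issue the same GET request from your own code.

This can be done simply as follows:

  1. Right-click the response
  2. Select Copy -> Copy as cURL (if given a choice, pick cURL (Bash))
  3. Paste it into cURL Converter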
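
As a rough illustration of where these steps usually end up, the converted snippet tends to reduce to a single GET call plus JSON decoding. The sketch below assumes the `requests` library used in the question and a top-level `posts` key in the response (the key the other answer's code reads). `fetch_episodes` and `episodes_from_response` are hypothetical helper names, and any headers the converter copies from the browser are left out.

```python
import json


def episodes_from_response(body):
    """Decode the JSON text returned by the endpoint and return its list of posts.

    Assumes the episodes sit under a top-level 'posts' key; returns an empty
    list if that key is missing.
    """
    data = json.loads(body)
    return data.get('posts', [])


def fetch_episodes(url):
    """Issue the same GET request the browser made and return the decoded posts."""
    import requests  # third-party; imported here so the parser above works without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return episodes_from_response(response.text)
```

Calling `fetch_episodes` with the endpoint URL copied from the Network tab would perform the real request. Keeping the parsing step separate makes it possible to check it against a saved response without touching the network.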

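Answer 1: (score: 0)

The good news is that the page loads its content from a WordPress JSON source, which you can request with a simple XHR. The other answer seems to cover well how to find it.

You can then parse whatever information you need out of the JSON. The text description is returned inside the JSON as HTML, so you can pass it to bs4 and parse it as needed. Example below. You can explore the JSON object related to Cecilia here, or paste the following into a JSON viewer: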

{'title': 'Cecilia', 'excerpt': {'short': 'When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another...', 'long': "When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don't.” Sponsors: Article Visit article.com/criminal to get $50 off your...", 'full': "When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don't.” Sponsors: Article Visit article.com/criminal to get $50 off your first purchase..."}, 'content': '<p data-pm-context="[]">When Cecilia Gentili was growing up in Argentina, she felt so different from everyone around her that she thought she might be from another planet. “Some of us find our community with our own family and some of us don&#8217;t.”</p>\n<p data-pm-context="[]">Sponsors:</p>\n<p><strong>Article</strong> Visit <a href="http://article.com/criminal">article.com/criminal </a>to get $50 off your first purchase of $100 or more.</p>\n<p><a href="https://www.therealreal.com/"><strong>The Real Real</strong></a> Shop in-store, online, or download the app, and get 20% off select items with the promo code REAL.</p>\n<p><strong>Simplisafe</strong> Protect your home today and get free shipping at <a href="http://SimpliSafe.com/CRIMINAL">SimpliSafe.com/CRIMINAL</a></p>\n<p><strong>Squarespace</strong> Try <a href="http://Squarespace.com/criminal">Squarespace.com/criminal </a>for a free trial and when you’re ready to launch, use the offer code INVISIBLE to save 10% off your first purchase of a website or domain.</p>\n<p><strong>Sun Basket</strong> Go to <a href="http://sunbasket.com/criminal">sunbasket.com/criminal </a>to get up to $80 off today!</p>\n', 
'image': {'thumb': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-150x150.png', 'medium': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-300x300.png', 'large': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art-1024x1024.png', 'full': 'https://thisiscriminal.com/wp-content/uploads/2019/05/Cecilia_art.png'}, 'episodeNumber': '115', 'audioSource': 'https://dts.podtrac.com/redirect.mp3/dovetail.prxu.org/criminal/a91a9494-fb45-48c5-ad4c-2615bfefd81b/Episode_115_Cecilia_Part_1.mp3', 'musicCredits':"FALSE", 'id': 3129, 'slug': 'episode-115-cecilia-5-24-19', 'date': '2019-05-24 19:43:44', 'permalink': 'https://thisiscriminal.com/episode-115-cecilia-5-24-19/', 'next':"None", 'prev': {'slug': 'episode-114-philip-and-becky', 'title': 'Episode 114: Philip and Becky (5.10.2019)'}}

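The request is a query-string URL, so you can change the number of items returned. The response also lists the total number of pages, so you know how many requests are needed to return everything.

If you look here:

posts=1000&page=1

you will see two parameters that can be changed accordingly.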
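
A minimal sketch of how those two parameters could be driven from a loop. The response key that holds the total page count is not shown here, so this version instead stops at the first page that comes back without posts; that stopping rule is an assumption about the endpoint, not verified behaviour, and `max_pages` is a safety cap in case it does not hold. `build_url`, `iter_all_posts` and the `fetch_page` callable are hypothetical names.

```python
BASE_URL = 'https://thisiscriminal.com/wp-json/criminal/v1/episodes'


def build_url(posts, page):
    """Return the endpoint URL for a given page size and page number."""
    return '{}?posts={}&page={}'.format(BASE_URL, posts, page)


def iter_all_posts(fetch_page, posts_per_page=100, max_pages=50):
    """Yield every post, requesting page 1, 2, 3, ... in turn.

    fetch_page(posts, page) must return the decoded JSON dict for that page.
    Iteration stops at the first page with no 'posts' (assumed end of data)
    or after max_pages requests, whichever comes first.
    """
    for page in range(1, max_pages + 1):
        data = fetch_page(posts_per_page, page)
        batch = data.get('posts') or []
        if not batch:
            break
        for post in batch:
            yield post
```

With `requests`, `fetch_page` could be as small as `lambda posts, page: requests.get(build_url(posts, page)).json()`. Passing the fetcher in keeps the paging logic independent of how the request is made.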

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=1000&page=1').json()

for post in r['posts']:
    title = post['title']
    soup = bs(post['content'], 'html.parser')  # name the parser explicitly so bs4 does not have to guess one
    desc = soup.select_one('p').text  # soup.get_text().replace('\xa0', ' ').replace('\n', ' ')
    img = post['image']['full']
    episode_link = post['audioSource'] #sure this is what you wanted?
    episode_number = post['episodeNumber']
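
The loop above only assigns variables. A sketch of how each post might be turned into the `url` / `title` / `desc` / `thumbnail` dictionary that the question's code builds could look like the following. The `'ascii' codec can't decode byte 0xa0` error in the question's log is what Python 2 raises when a byte-string literal such as `'\xa0'` is mixed with unicode text, so the cleanup here uses unicode literals throughout and assumes the description is already unicode (as bs4's `get_text()` returns). `clean_text`, `to_item` and the title format are illustrative choices, not part of either answer.

```python
import re


def clean_text(text):
    """Replace non-breaking spaces and newlines with spaces, then collapse runs of spaces.

    Expects unicode text; the u'' literals avoid the implicit ASCII decode
    that Python 2 performs when a byte string is mixed with unicode.
    """
    text = text.replace(u'\xa0', u' ').replace(u'\n', u' ')
    return re.sub(u' {2,}', u' ', text).strip()


def to_item(post, description):
    """Build one list entry from a post dict of the JSON response.

    description is the plain text already extracted from post['content'].
    """
    return {
        'url': post['audioSource'],
        'title': u'Episode #{}: {}'.format(post['episodeNumber'], post['title']),
        'desc': clean_text(description),
        'thumbnail': post['image']['full'],
    }
```

Inside the loop this would be called as `subjects.append(to_item(post, soup.get_text()))`, with `subjects` being the list from the question's `get_playable_podcast`.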