Extracting links from an HTML page

Date: 2016-07-11 04:03:24

Tags: python, html

I am trying to get all the movie/show Netflix links and their country names from http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html. For example, from the page source I want http://www.netflix.com/WiMovie/80048948 together with U.S., and so on. I did the following, but it returns all the links instead of just the Netflix ones I want. I'm a bit new to regular expressions. How should I go about this?

from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen('http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html')
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    ##reqlink = re.search('netflix',link.get('href'))
    ##if reqlink:
    print link.get('href')

for link in soup.findAll('img'):
    if link.get('alt') == 'UK' or link.get('alt') == 'USA':
        print link.get('alt')  

If I uncomment the commented lines above, I get the following error:

TypeError: expected string or buffer

What should I do? I then tried the following:
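For what it's worth, that TypeError can be reproduced in isolation: `link.get('href')` returns None for anchors with no href attribute, and `re.search` rejects None (Python 3 shown below; the exact message wording differs slightly from Python 2):

```python
import re

# link.get('href') returns None for <a> tags with no href attribute,
# and re.search() rejects None with a TypeError.
try:
    re.search('netflix', None)
except TypeError as exc:
    print('TypeError:', exc)

# Guarding against None first avoids the crash:
href = None
match = re.search('netflix', href) if href else None
print(match)  # None
```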

from BeautifulSoup import BeautifulSoup
import urllib2
import re
import requests

url = 'http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html'
r = requests.get(url, stream=True)
count = 1
title=[]
country=[]
for line in r.iter_lines():
    if count == 746:
        urllib2.urlopen('http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html')
        soup = BeautifulSoup(line)
        for link in soup.findAll('a', href=re.compile('netflix')):
            title.append(link.get('href'))

        for link in soup.findAll('img'):
            print link.get('alt')
            country.append(link.get('alt'))

    count = count + 1

print len(title), len(country)  

The earlier error has been dealt with. The only remaining concern is movies that appear in more than one country: how do I get those countries together?
For example, for 10.0 Earthquake, link = http://www.netflix.com/WiMovie/80049286, country = UK, USA.

4 answers:

Answer 0 (score: 1)

Your code can be simplified to a couple of CSS selects:

import requests
from bs4 import BeautifulSoup

url = 'http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html'
r = requests.get(url)
soup = BeautifulSoup(r.content)

for a in soup.select("a[href*=netflix]"):
    print(a["href"])

For the img tags:

co = {"UK", "USA"}
for img in soup.select("img[alt]"):
    if img["alt"] in co:
        print(img)
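As a self-contained illustration of those selectors (run against a made-up inline snippet rather than the live page):

```python
from bs4 import BeautifulSoup

# Made-up HTML mimicking one listing row from the page.
html = '''
<tr>
  <td><a href="http://www.netflix.com/WiMovie/80049286">10.0 Earthquake</a></td>
  <td><img class="flag" alt="UK"><img class="flag" alt="USA"></td>
</tr>
<tr>
  <td><a href="http://example.com/other">not netflix</a></td>
</tr>
'''
soup = BeautifulSoup(html, 'html.parser')

# a[href*="netflix"] matches anchors whose href contains "netflix".
links = [a['href'] for a in soup.select('a[href*="netflix"]')]
print(links)

# img[alt] matches images that have an alt attribute at all.
flags = {'UK', 'USA'}
alts = [img['alt'] for img in soup.select('img[alt]') if img['alt'] in flags]
print(alts)
```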

Answer 1 (score: 0)

As for the first problem: the code fails on links that have no href value, so link.get('href') returns None rather than a string.

The following works:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

html_page = urllib2.urlopen('http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html')
soup = BeautifulSoup(html_page)
for link in soup.findAll('a'):
    link_href = link.get('href')
    if link_href:  
        reqlink = re.search('netflix',link_href)       
        if reqlink:
            print link_href       

for link in soup.findAll('img'):
    if link.get('alt') == 'UK' or link.get('alt') == 'USA':
        print link.get('alt')  

As for the second question, I'd suggest keeping a dictionary that maps each movie to the list of countries it appears in; it is then much easier to format the string the way you want.
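A minimal sketch of that suggestion, using illustrative data in place of the scraped rows:

```python
from collections import defaultdict

# Illustrative (link, country) pairs standing in for the scraped results.
rows = [
    ('http://www.netflix.com/WiMovie/80049286', 'UK'),
    ('http://www.netflix.com/WiMovie/80049286', 'USA'),
    ('http://www.netflix.com/WiMovie/80048948', 'USA'),
]

# Group countries under each movie link.
countries_by_movie = defaultdict(list)
for link, country in rows:
    countries_by_movie[link].append(country)

for link, countries in countries_by_movie.items():
    print('%s -> %s' % (link, ', '.join(countries)))
```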

Answer 2 (score: 0)

I think you can more easily iterate over the listing rows and use a generator to assemble the data structure you're looking for (ignore the minor differences in my code; I'm using Python 3):

from bs4 import BeautifulSoup
import requests

url = 'http://netflixukvsusa.netflixable.com/2016/07/' \
      'complete-alphabetical-list-k-sat-jul-9.html'
r = requests.get(url)
soup = BeautifulSoup(r.content)
rows = soup.select('span[class="listings"] tr')


def get_movie_info(rows):
    netflix_url_prefix = 'http://www.netflix.com/'
    for row in rows:
        link = row.find('a',
                        href=lambda href: href and netflix_url_prefix in href)
        if link is not None:
            link = link['href']
        countries = [img['alt'] for img in row('img', class_='flag')]
        yield link, countries


print('\n'.join(map(str, get_movie_info(rows))))

Edit: or, if you're looking for a dictionary instead of a list:

def get_movie_info(rows):
    output = {}
    netflix_url_prefix = 'http://www.netflix.com/'
    for row in rows:
        name = None  # avoids a NameError when a row has no matching link
        link = row.find('a',
                        href=lambda href: href and netflix_url_prefix in href)
        if link is not None:
            name = link.text
            link = link['href']
        countries = [img['alt'] for img in row('img', class_='flag')]
        output[name or 'some_default'] = {'link': link, 'countries': countries}
    return output


print('\n'.join(map(str, get_movie_info(rows).items())))

Answer 3 (score: 0)

from BeautifulSoup import BeautifulSoup
import requests
import re

url = 'http://netflixukvsusa.netflixable.com/2016/07/complete-alphabetical-list-k-sat-jul-9.html'
r = requests.get(url, stream=True)
count = 1
final=[]
for line in r.iter_lines():
    if count == 746:
        soup = BeautifulSoup(line)
        for row in soup.findAll('tr'):
            url = row.find('a', href=re.compile('netflix'))
            if url:
                t=url.string
                u=url.get('href')
                one=[]
                for country in row.findAll('img'):
                    one.append(country.get('alt'))
                final.append({'Title':t,'Url':u,'Countries':one})
    count = count + 1  

final is the resulting list.
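To format the multi-country entries the way the question asks, the Countries list in each record can be joined when printing (the entry below is illustrative, matching the {'Title', 'Url', 'Countries'} structure built above):

```python
# Illustrative entry matching the {'Title', 'Url', 'Countries'} records.
final = [{'Title': '10.0 Earthquake',
          'Url': 'http://www.netflix.com/WiMovie/80049286',
          'Countries': ['UK', 'USA']}]

for entry in final:
    # Join the country codes into a single comma-separated string.
    line = '%s (%s): %s' % (entry['Title'],
                            ', '.join(entry['Countries']),
                            entry['Url'])
    print(line)
```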