How can I make my scraper parse data from the starting page?

Date: 2017-07-25 21:16:04

Tags: python python-3.x web-scraping css-selectors web-crawler

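I've written some code in Python to fetch details from a torrent site. When I run it, I get the results I expect. The only problem with this crawler is that it skips the content of the first page (because the pagination URLs start at page 2), and I can't figure out how to fix that. Any help would be greatly appreciated.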

import requests
from lxml import html

page_link = "https://yts.ag/browse-movies"
b_link = "https://yts.ag"

def get_links(main_link):
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    for item in tree.cssselect('ul.tsc_pagination a'):
        if "page" in item.attrib["href"]:
            movie_details(b_link + item.attrib["href"])

def movie_details(link):
    response = requests.get(link).text
    tree = html.fromstring(response)
    for titles in tree.cssselect("div.browse-movie-wrap"):
        title = titles.cssselect('div.browse-movie-bottom a.browse-movie-title')[0].text
        year = titles.cssselect('div.browse-movie-year')[0].text  # release year (no longer shadows the `link` argument)
        rating = titles.cssselect('figcaption.hidden-xs h4.rating')[0].text
        genre = titles.cssselect('figcaption.hidden-xs h4')[0].text
        genre1 = titles.cssselect('figcaption.hidden-xs h4')[1].text
        print(title, year, rating, genre, genre1)

get_links(page_link)

1 answer:

Answer 0 (score: 1)

Why not call movie_details() on main_link before the loop?

def get_links(main_link):
    response = requests.get(main_link).text
    tree = html.fromstring(response)
    movie_details(main_link)  # scrape the start page first; the pagination links only cover page 2 onward
    for item in tree.cssselect('ul.tsc_pagination a'):
        if "page" in item.attrib["href"]:
            movie_details(b_link + item.attrib["href"])
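
If the pagination widget repeats a link (for example a "Next" link that points at page 2 again), the loop above would scrape that page twice. One way around this, as a rough sketch only, is to build the list of URLs first, with the start page at the front, and then call movie_details() on each one. The helper name collect_page_urls and the sample hrefs below are hypothetical; the real href format on the site may differ.

```python
from urllib.parse import urljoin


def collect_page_urls(start_url, hrefs):
    """Return the start page followed by each unique pagination URL.

    `hrefs` is assumed to be the list of raw href values taken from the
    pagination anchors (item.attrib["href"] in the code above).
    """
    urls = [start_url]      # the first page has no pagination link of its own
    seen = {start_url}
    for href in hrefs:
        if "page" not in href:
            continue        # same filter as the original loop
        url = urljoin(start_url, href)  # resolve relative hrefs against the start page
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical hrefs, only to show the behaviour without hitting the network.
sample_hrefs = [
    "/browse-movies?page=2",
    "/browse-movies?page=3",
    "/browse-movies?page=2",  # a repeated link, e.g. from a "Next" button
    "/login",                 # not a pagination link, so it is skipped
]

for url in collect_page_urls("https://yts.ag/browse-movies", sample_hrefs):
    print(url)
```

In get_links() you would then pass [item.attrib["href"] for item in tree.cssselect('ul.tsc_pagination a')] as hrefs and loop over the returned list, calling movie_details(url) for each entry.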