Web scraping: selecting from a drop-down menu in Python

Time: 2020-09-09 09:12:40

Tags: python-3.x web-scraping drop-down-menu

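I am trying to scrape data from https://www.iplt20.com/teams/sunrisers-hyderabad/squad but I am running into a problem with the drop-down list ("Filter by Year").

I am able to retrieve the names in the drop-down list (i.e. 2020, 2019, and so on), but I cannot retrieve the data behind each list element.

Clicking the Filter by Year list opens a drop-down, and selecting a season (year) changes the players shown (we get that year's players along with a summary). I want to get every player's data for each season. Clicking the drop-down does not produce a new URL either.

I could not find any solution. The following Python code retrieves the season/year values from the drop-down list.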

Python code

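    from bs4 import BeautifulSoup
    from selenium import webdriver

    squad_url = "https://www.iplt20.com/teams/sunrisers-hyderabad/squad"
    # Raw string so the backslash in the Windows path is not read as an escape.
    driver = webdriver.Chrome(executable_path=r".\chromedriver.exe")
    driver.get(squad_url)
    html = driver.page_source
    soup2 = BeautifulSoup(html, "html.parser")
    # Print every season listed in the "Filter by Year" drop-down.
    for llist in soup2.find_all("ul", class_="drop-down__dropdown-list"):
        for year in llist.find_all("li"):
            print(year.text)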

The HTML for the drop-down list is as follows:

<div class="large-squad-list__filter single-filter">
        <div class="stats-table__filter drop-down js-drop-down is-open">
            <div class="drop-down__clickzone js-dropdown-trigger" tabindex="0" role="button"></div>
            <div class="drop-down__label js-drop-down-label">Filter by Year</div>
            <div class="drop-down__current js-drop-down-current">2020</div>
            <ul class="drop-down__dropdown-list js-drop-down-options">
                <li tabindex="0" role="button" class="drop-down__dropdown-list__option" data-option="ipl2020">2020</li>
                <li tabindex="0" role="button" class="drop-down__dropdown-list__option" data-option="ipl2019">2019</li>
                <li tabindex="0" role="button" class="drop-down__dropdown-list__option" data-option="ipl2018">2018</li>
                <li tabindex="0" role="button" class="drop-down__dropdown-list__option" data-option="ipl2017">2017</li>
                <li tabindex="0" role="button" class="drop-down__dropdown-list__option" data-option="ipl2016">2016</li>
                <li tabindex="0" role="button" class="drop-down__dropdown-list__option" data-option="ipl2015">2015</li>
                <li tabindex="0" role="button" class="drop-down__dropdown-list__option" data-option="ipl2014">2014</li>
                <li tabindex="0" role="button" class="drop-down__dropdown-list__option" data-option="ipl2013">2013</li>
                <li tabindex="0" role="button" class="drop-down__dropdown-list__option" data-option="ipl2012">2012</li>
                <li tabindex="0" role="button" class="drop-down__dropdown-list__option" data-option="ipl2011">2011</li>
                <li tabindex="0" role="button" class="drop-down__dropdown-list__option" data-option="ipl2010">2010</li>
                <li tabindex="0" role="button" class="drop-down__dropdown-list__option" data-option="ipl2009">2009</li>
                <li tabindex="0" role="button" class="drop-down__dropdown-list__option" data-option="ipl2008">2008</li>
            </ul>
        </div>
    </div>
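
Each `<li>` in the snippet above carries a `data-option` attribute (for example `ipl2020`) in addition to the visible year, so it may be useful to collect both rather than only the text. The following is a minimal sketch using only the standard library; the names `SeasonOptionParser` and `season_options` are hypothetical and not part of the original code, and the sample markup is a shortened copy of the snippet above.

```python
from html.parser import HTMLParser


class SeasonOptionParser(HTMLParser):
    """Collect (data-option, label) pairs from the drop-down's <li> items."""

    OPTION_CLASS = "drop-down__dropdown-list__option"

    def __init__(self):
        super().__init__()
        self.options = []     # filled with pairs such as ("ipl2020", "2020")
        self._current = None  # data-option of the <li> being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "li" and self.OPTION_CLASS in classes:
            self._current = attrs.get("data-option")

    def handle_data(self, data):
        # Only text inside a matching <li> is recorded.
        if self._current is not None and data.strip():
            self.options.append((self._current, data.strip()))
            self._current = None

    def handle_endtag(self, tag):
        if tag == "li":
            self._current = None


def season_options(html):
    """Return the season keys and labels found in the given HTML."""
    parser = SeasonOptionParser()
    parser.feed(html)
    return parser.options


# Shortened copy of the drop-down markup shown above.
sample = """
<ul class="drop-down__dropdown-list js-drop-down-options">
  <li tabindex="0" role="button" class="drop-down__dropdown-list__option" data-option="ipl2020">2020</li>
  <li tabindex="0" role="button" class="drop-down__dropdown-list__option" data-option="ipl2019">2019</li>
</ul>
"""
print(season_options(sample))
```

The same function could be fed `driver.page_source` from the code above; whether the season key is what the site uses internally to load the squad is an assumption.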

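2 answers:

Answer 0 (score: 0)

This may be only a partial answer, but it does get you what you want (the stats).

The problem is that the data is loaded dynamically by JavaScript. However, if you watch the traffic (check Developer Tools -> Network), you will see that a request is sent to an API.

You can take that URL and parse the response.

That is exactly what this code does: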

import requests


_ids = [18790, 10192, 7749, 5815, 3957, 2785, 2374, 605]


for _id in _ids:
    url = f"https://cricketapi.platform.iplt20.com/stats/" \
          f"players?teamIds=62&tournamentIds={_id}&scope=TOURNAMENT&pageSize=30"
    response = requests.get(url).json()
    print(f"Printing stats for {response['team']['fullName']}")
    for player in response['stats']['content']:
        print(f"{player['player']['fullName']} - {player['stats']}")

However, I could not work out where the tournamentIds come from. Also, there is no data for 2013.

Sample output (truncated for brevity):

Printing stats for Sunrisers Hyderabad
Yusuf Pathan - [{'matchType': 'AGG', 'battingStats': {'50s': 0, '100s': 0, 'inns': 8, 'm': 10, 'r': '40', 'b': 45, '4s': 1, '6s': 1, 'no': 5, 'hs': '16*', 'sr': '88.88', 'a': '13.33'}, 'bowlingStats': {'bbiw': 0, 'bbir': 0, 'bbmw': 0, 'bbmr': 0, '4w': 0, '5w': 0, '10w': 0, 'inns': 1, 'm': 10, 'b': 6, 'r': 8, 'wb': 0, 'nb': 0, 'd': 1, 'w': 0, '4s': 1, '6s': 0, 'maid': 0, 'wmaid': 0, 'ht': 0, 'a': '-', 'e': '8.00', 'sr': '-', 'o': '1.00'}, 'fieldingStats': {'c': 1, 'ro': 0, 's': 0, 'inns': 1, 'm': 10}}, {'matchType': 'IPLT20', 'battingStats': {'50s': 0, '100s': 0, 'inns': 8, 'm': 10, 'r': '40', 'b': 45, '4s': 1, '6s': 1, 'no': 5, 'hs': '16*', 'sr': '88.88', 'a': '13.33'}, 'bowlingStats': {'bbiw': 0, 'bbir': 0, 'bbmw': 0, 'bbmr': 0, '4w': 0, '5w': 0, '10w': 0, 'inns': 1, 'm': 10, 'b': 6, 'r': 8, 'wb': 0, 'nb': 0, 'd': 1, 'w': 0, '4s': 1, '6s': 0, 'maid': 0, 'wmaid': 0, 'ht': 0, 'a': '-', 'e': '8.00', 'sr': '-', 'o': '1.00'}, 'fieldingStats': {'c': 1, 'ro': 0, 's': 0, 'inns': 1, 'm': 10}},
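
Since each player's `stats` value is a list with one entry per `matchType`, it may be easier to work with a flat record per player. The sketch below builds the same URL as the code above with `urlencode` and pulls a few batting fields out of one player entry. The helper names `build_stats_url` and `batting_row` are hypothetical, the sample record is a trimmed copy of the output above, and reading `m`, `r` and `sr` as matches, runs and strike rate is an assumption based on common cricket abbreviations.

```python
from urllib.parse import urlencode

BASE_URL = "https://cricketapi.platform.iplt20.com/stats/players"


def build_stats_url(tournament_id, team_id=62, page_size=30):
    """Build the stats URL used in the answer for one tournament id."""
    query = urlencode({
        "teamIds": team_id,
        "tournamentIds": tournament_id,
        "scope": "TOURNAMENT",
        "pageSize": page_size,
    })
    return f"{BASE_URL}?{query}"


def batting_row(player, match_type="AGG"):
    """Return a flat dict of one player's batting stats, or None if absent."""
    for entry in player["stats"]:
        if entry["matchType"] == match_type:
            batting = entry["battingStats"]
            # Values are kept exactly as the API returns them (some are strings).
            return {
                "name": player["player"]["fullName"],
                "matches": batting["m"],
                "runs": batting["r"],
                "strike_rate": batting["sr"],
            }
    return None


# Trimmed copy of one entry from the sample output above.
sample_player = {
    "player": {"fullName": "Yusuf Pathan"},
    "stats": [
        {"matchType": "AGG",
         "battingStats": {"inns": 8, "m": 10, "r": "40", "sr": "88.88"}},
    ],
}

print(build_stats_url(18790))
print(batting_row(sample_player))
```

With `requests`, each element of `response['stats']['content']` could be passed to `batting_row` in place of the print statement in the loop above.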

Answer 1 (score: 0)

After a full day of effort, I managed to do it:

Import the following in your code

The Python code is below

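The logic above only covers how to click the drop-down list. While that happens, the dynamic data is loaded as well (I confirmed this from the driver.page_source output).

With the requests library, the problem is that the dynamic data is not loaded. With Selenium, however, this is easy to do.

I added a sleep to make sure the scrolling has finished. I have read in many places that WebDriverWait can be used instead of sleep, but I could not get it to work.

I am fairly sure this version can be optimized. (If you find a better one, please post it here.)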
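
As one possible way to replace the fixed sleep with WebDriverWait, here is a minimal sketch. It is not the answer's original code. The helper names `option_selector` and `select_season` are hypothetical, the CSS selectors are taken from the HTML snippet in the question, and it is an assumption that clicking the `js-dropdown-trigger` element opens the list and that the `js-drop-down-current` label changes once a season is selected.

```python
def option_selector(season_key):
    """CSS selector for one drop-down entry, e.g. season_key='ipl2019'."""
    return f'li.drop-down__dropdown-list__option[data-option="{season_key}"]'


def select_season(driver, season_key, year_label, timeout=10):
    """Open the drop-down, pick a season, and return the updated page source."""
    # Imported here so option_selector() stays usable without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    # Wait until the trigger is clickable instead of sleeping a fixed time.
    wait.until(EC.element_to_be_clickable(
        (By.CSS_SELECTOR, "div.js-dropdown-trigger"))).click()
    # Then wait for the requested season entry and click it.
    wait.until(EC.element_to_be_clickable(
        (By.CSS_SELECTOR, option_selector(season_key)))).click()
    # Assumption: the "current" label switches to the chosen year after the click.
    wait.until(EC.text_to_be_present_in_element(
        (By.CSS_SELECTOR, "div.js-drop-down-current"), year_label))
    return driver.page_source


print(option_selector("ipl2019"))
```

A call such as `select_season(driver, "ipl2019", "2019")` would return HTML that can be handed to BeautifulSoup. The label changing does not by itself prove that the player list has finished loading, so an additional wait on an element of the squad list may still be needed.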