How do I scrape a question's follower count from Quora using Ruby?

Time: 2019-03-22 13:40:32

Tags: html ruby debugging web-scraping quora

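I have been trying to build a project that scrapes questions from Quora by topic, using this repository as a starting point: https://github.com/Theminijohn/quora-scraper. According to that page, the follower count of each question should be extracted. However, after setting up the same code on my system, the follower count is reported as zero for every question, even when the real count is not zero. The Followers column always has a zero value, as shown here.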

The code responsible for extracting the follower count is:

follower_count = q.css('.FollowActionItem .icon_action_bar-label span > span:last-child').text.to_i

Everything else works as expected. What am I missing here?

Edit: the entire code snippet is as follows:

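require 'rubygems'
require 'ruby-progressbar'
require 'nokogiri' # lowercase: require paths are case-sensitive on most systems
require 'csv'
require 'date'     # needed for Date and DateTime below
require 'pry'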

ENGAGEMENT_THRESHOLD = 5

# init progressbar
progressbar = ProgressBar.create( format:         '%a %bᗧ%i %p%% %t',
                                  progress_mark:  ' ',
                                  remainder_mark: '・')

# parse file
doc = File.open("input.html") { |x| Nokogiri::HTML(x) }
questions = doc.css('.TopicAllQuestionsList .pagedlist_item')

# identifiers
canonical_link = doc.at('link[rel="canonical"]')['href']
topic_name = canonical_link.match(/quora.com\/topic\/(.*)/)[1]

# update progressbar
progressbar.total = questions.count

# prepare csv
unless File.exist?('quora-data.csv')
  CSV.open("quora-data.csv", "w+") do |csv|
    csv << [
      "Topic", "Title", "Followers", "Answers", "Ratio", "Engagement potential",
      "Last action", "Parsed time", "Question link"
    ]
  end
end

questions.each do |q|
  link = "https://www.quora.com" + q.css('a.question_link').attr('href').value
  title = q.css('a.question_link').text.strip
  answer_count = q.css('.QuestionFooter .answer_count_prominent').text.strip.to_i
  follower_count = q.css('.FollowActionItem .icon_action_bar-label span > span:last-child').text.to_i
  ratio = "#{follower_count}/#{answer_count}"

  if answer_count == 0
    take_action = (follower_count >= ENGAGEMENT_THRESHOLD) ? "Yes" : "No"
  else
    take_action = ((follower_count / answer_count) >= ENGAGEMENT_THRESHOLD) ? "Yes" : "No"
  end

  # timestamps
  raw_time = q.css('.QuestionFooter .question_timestamp').text.strip
  last_action = raw_time.include?("Last requested") ? "Requested" : "Followed"

  if raw_time.include?('ago')
    if raw_time.scan(/(\d*)h/).flatten.any?
      hours_ago = raw_time.scan(/(\d*)h/).flatten[0].to_f
      parsed_time = (DateTime.now - (hours_ago / 24)).strftime('%Y-%m-%d')
    elsif raw_time.scan(/(\d*)m/).flatten.any?
      minutes_ago = raw_time.scan(/(\d*)m/).flatten[0].to_f
      parsed_time = (DateTime.now - (minutes_ago / 24 / 60)).strftime('%Y-%m-%d')
    end
  else
    if raw_time.count("0-9") > 0
      parsed_time = Date.parse(raw_time).strftime("%Y-%m-%d")
    else
      parsed_time =
        (Date.today < Date.parse(raw_time)) ? (Date.parse(raw_time) - 7) : Date.parse(raw_time)
    end
  end

  CSV.open("quora-data.csv", "a+") do |csv|
    csv << [
      topic_name, title, follower_count, answer_count, ratio,
      take_action, last_action, parsed_time, link
    ]
  end

  # move progressbar
  progressbar.increment
end

<!DOCTYPE html>
<!-- saved from url=(0099)file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora.html -->
<html lang="en" class="js-wf-loaded"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><link rel="icon" href="https://qsf.fs.quoracdn.net/-3-images.favicon.ico-26-ae77b637b1e7ed2c.ico"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q-icons.q-icons.woff2-26-9afc20a49e3ef2cf.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q_serif.q_serif_regular.woff2-26-7ace3bc4cbe404d9.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q_serif.q_serif_regular_italic.woff2-26-9d81ab3229809d01.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q_serif.q_serif_semibold.woff2-26-b55bf39d9018ace9.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin="anonymous" href="https://qsf.fs.quoracdn.net/-3-fonts.q_serif.q_serif_semibold_italic.woff2-26-4c39f22524232bf2.woff2"><script src="./input_files/sdk.js.download" async="" crossorigin="anonymous"></script><script src="file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora_files/sdk.js.download" async="" crossorigin="anonymous"></script><script async="" src="file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora_files/analytics.js.download"></script><script type="text/javascript" async="" src="file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora_files/widgets.js.download"></script><script type="text/javascript" async="" src="file:///C:/Users/DIGANTA/quora/quora-scraper/All%20Questions%20on%20Data%20Science%20-%20Quora_files/sdk.js(1).download"></script><script type="text/javascript">window.Q = {"fontFamilies": ["q-icons", "q_serif"], "errorSamplingRate": 1.0, "revision": 
"41e9b4435b78728ddf351e72a6dc45ca9708ebc2", "subdomainSuffix": "quora.com"};window["webpackManifest"] = {"ads_manager": "https://qsc.fs.quoracdn.net/-3-chunk.web.ads_manager.js.out-34-1e09a2ca57288a3c.webpack", "content_widgets": "https://qsc.fs.quoracdn.net/-3-chunk.web.content_widgets.js.out-34-9a6c124eee999cb7.webpack", "dev": "https://qsc.fs.quoracdn.net/-3-chunk.web.dev.js.out-34-5d22ece0a38f03a1.webpack", "internal": "https://qsc.fs.quoracdn.net/-3-chunk.web.internal.js.out-34-2e41b1b9af1f0f88.webpack", "qtext2": "https://qsc.fs.quoracdn.net/-3-chunk.web.qtext2.js.out-34-b3d77df0693a06da.webpack", "main": "https://qsc.fs.quoracdn.net/-3-chunk.web.main.js.out-34-835b38fb05330b9f.webpack", "firebase": "https://qsc.fs.quoracdn.net/-3-chunk.web.firebase.js.out-34-eadc5f3144befc37.webpack", "publisher_dashboard": "https://qsc.fs.quoracdn.net/-3-chunk.web.publisher_dashboard.js.out-34-0c43bcc87e209b23.webpack"};window["webpackChunks"] = ["main"];window["PAGE_IS_MOBILE"] = false;var assetErrs=[];document.addEventListener("DOMContentLoaded",function(e){if(0!==assetErrs.length){var s="assets="+encodeURIComponent(JSON.stringify(assetErrs)),t=new XMLHttpRequest;t.open("POST","/ajax/log_browser_asset_load_error_3RD_PARTY_POST",!0),t.setRequestHeader("Content-Type","application/x-www-form-urlencoded; charset=UTF-8"),t.setRequestHeader("Accept","*/*"),t.send(s.replace(/%20/g,"+"))}}),window.addAssetErr=function(e){e&&assetErrs.push(e)};

The complete HTML file can be found here: https://drive.google.com/file/d/1_X86tq5TTw4ikk-hQ2Ixd13Y_hR4scBg/view?usp=sharing

The HTML that contains the follower count:

<div class="FollowActionItem ItemComponent primary_item u-relative"><span id="wVP1Ux4a11"><a class="ui_button ui_button--styled ui_button--FlatStyle ui_button--FlatStyle--gray ui_button--size_regular u-inline-block ui_button--non_link ui_button--supports_icon ui_button--has_icon" href="#" role="button" action_click="QuestionFollow" action_target="{&quot;qid&quot;: 44394942, &quot;type&quot;: &quot;question&quot;}" id="__w2_wVP1Ux4a27_button"><div class="ui_button_inner" id="__w2_wVP1Ux4a27_inner"><div class="ui_button_icon_wrapper u-relative u-flex-inline"><div id="__w2_wVP1Ux4a27_icon"><span class="ui_button_icon" aria-hidden="true"><svg width="24px" height="24px" viewBox="0 0 24 24" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
    <g stroke="none" fill="none" fill-rule="evenodd" stroke-linecap="round">
        <g id="follow" class="icon_svg-stroke" stroke="#666" stroke-width="1.5">
            <path d="M14.5,19 C14.5,13.3369229 11.1630771,10 5.5,10 M19.5,19 C19.5,10.1907689 14.3092311,5 5.5,5" id="lines"></path>
            <circle id="circle" cx="7.5" cy="17" r="2" class="icon_svg-fill" fill="none"></circle>
        </g>
    </g>
</svg></span></div></div><div class="ui_button_label_count_wrapper"><span class="ui_button_label" id="__w2_wVP1Ux4a27_label">Follow</span><span class="ui_button_count" aria-hidden="true" id="__w2_wVP1Ux4a27_count_wrapper"><span class="bullet"> · </span><span class="ui_button_count_inner" id="__w2_wVP1Ux4a27_count">1</span></span></div></div></a></span></div>

0 answers:

No answers yet