This is my code to extract the number of jobs from the page's HTML:
import re
import urllib2
from bs4 import BeautifulSoup
final_site = 'http://www.careerbuilder.com/jobseeker/jobs/jobresults.aspx?s_rawwords=data+scientist&s_freeloc=San+Francisco%2C+CA'
html = urllib2.urlopen(final_site).read()
soup = BeautifulSoup(html)
num_jobs_area = soup.find('div',{'class':'jobresults_count'}).encode('utf-8')
job_numbers = re.findall('\d+', num_jobs_area)[2]
print job_numbers
This prints 126, but I want 82, which is the number given in the HTML and shown on the CareerBuilder website.
Answer 0 (score: 0)
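When you use Python's urllib, the site you are trying to scrape returns a different result set. If you print the html variable, you will see that the source contains: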
<div id="n_pnlJobResultsCount" class="jobresults_count">
1 - 25 of 126 <span>Jobs Found</span>
</div>
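To see why index 2 yields 126 for this response, here is a small check that runs the same regular expression on the markup shown above. It is only an illustration: the string is copied from the snippet, and the variable names are made up for the example.

```python
import re

# Markup of the count element as returned to the default urllib client
# (copied from the snippet above).
num_jobs_area = (
    '<div id="n_pnlJobResultsCount" class="jobresults_count">\n'
    '1 - 25 of 126 <span>Jobs Found</span>\n'
    '</div>'
)

# Every run of digits, in order of appearance.
numbers = re.findall(r"\d+", num_jobs_area)
print(numbers)     # ['1', '25', '126']
print(numbers[2])  # 126
```

The first two matches are the paging range ("1 - 25"), so index 2 happens to be the total for this particular response.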
To mimic a real browser, you can replace
html = urllib2.urlopen(final_site).read()
with
req = urllib2.Request(final_site, headers={ 'User-Agent': 'Mozilla/5.0' })
html = urllib2.urlopen(req).read()
In that case, you should also change the line
job_numbers = re.findall('\d+', num_jobs_area)[2]
to
job_numbers = re.findall('\d+', num_jobs_area)[0]
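For readers on Python 3, where urllib2 no longer exists, the same idea can be sketched with urllib.request. This is a rough port rather than the code from the question: build_request and fetch_html are hypothetical helper names, and the short User-Agent string is just the one used above.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that sends a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Download the page body as bytes (this performs a network call)."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()


# Building the request does not touch the network, so the header can be
# inspected before anything is sent.
req = build_request("http://www.careerbuilder.com/jobseeker/jobs/jobresults.aspx")
print(req.get_header("User-agent"))
```

Note that urllib stores header names capitalized as "User-agent", which is why the lookup uses that spelling.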
Answer 1 (score: 0)
When you use urllib, you receive different data. The response looks like this:
<div id="n_pnlJobResultsCount" class="jobresults_count">
1 - 25 of 126 <span>Jobs Found</span>
</div>
The cause appears to be the user agent. You can solve this in a couple of ways.
Use requests:
import requests
...
html = requests.get(final_site).content
Or set a User-Agent header with urllib2:
req = urllib2.Request(final_site, headers={ 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0' })
html = urllib2.urlopen(req).read()
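Your job_numbers = line also has a small bug: with this response there is no element at index 2. Change the line to the following, which fixes the problem and prints the expected value: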
job_numbers = re.findall('\d+', num_jobs_area)[0]
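Because the right index depends on which variant of the page comes back, a slightly more defensive extraction may be worth considering. The sketch below is an assumption-laden illustration: extract_job_count is a hypothetical helper, and the "82 Jobs Found" layout is a guess at the browser-style response, which is not shown in this thread.

```python
import re


def extract_job_count(count_text):
    """Return the total job count from the results-count text, or None.

    Prefers the number after "of" (as in "1 - 25 of 126 Jobs Found") and
    falls back to the first number in the text otherwise.
    """
    match = re.search(r"\bof\s+(\d[\d,]*)", count_text)
    if match is None:
        match = re.search(r"(\d[\d,]*)", count_text)
    if match is None:
        return None
    # Drop thousands separators such as "1,204" before converting.
    return int(match.group(1).replace(",", ""))


print(extract_job_count("1 - 25 of 126 Jobs Found"))  # layout shown above
print(extract_job_count("82 Jobs Found"))             # assumed layout
```

With this approach the code does not need a hard-coded index that changes with the response.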