BeautifulSoup scrape returns {{}} with no data

Date: 2017-10-24 14:58:11

Tags: python web-scraping beautifulsoup

I am trying to parse a website that looks like the one below and scrape it with Beautiful Soup:

    <div class="address">
    <div class="hit-company"><a href="https://www.cools.biz/best/celebrities/amy-gold/">Amy  Gold</a></div>
    <div class="speciality hit-speciality">Audiology</div>
    <div class="address hit-address"><i><p translate="no">
    <span class="address-line1">38 Park Drive </span><br>
    <span class="locality">London</span>, <span class="administrative-area">VA</span> <span class="postal-code">22025</span><br>
    </p></i></div>
    <div class="phone hit-phone"><i><a href="tel:+1-xxx-659-xxx">(xxx) 659-xxx</a></i></div>
    <div class="description hit-listing_description hidden-xs"></div>
    <div class="hit-website"><a href="http://coll celebs.com" target="_blank">Visit Website</a></div>
    </div>

I have tried html5lib, lxml, and html.parser. lxml and html.parser do not even pick up the div with class "hit-company"; only html5lib does. And even with html5lib, the div comes back empty.
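
For reference, here is a minimal sketch (not from the original post) that checks how each of the three parsers handles a trimmed copy of the markup above; lxml and html5lib have to be installed separately:

    from bs4 import BeautifulSoup

    snippet = '<div class="hit-company"><a href="#">Amy Gold</a></div>'

    # Parse the same markup with each parser and see what it finds
    for parser in ("html.parser", "lxml", "html5lib"):
        soup = BeautifulSoup(snippet, parser)
        print(parser, soup.find("div", class_="hit-company"))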

Here is my code; when I inspect the HTML output, I notice the following:

    import os
    from urllib.request import Request, urlretrieve, urlopen
    from bs4 import BeautifulSoup

    # Request the page with a browser-like User-Agent so the server does not reject us
    req = Request("https://www.urlxxxxxx.com", headers={'User-Agent': 'Mozilla/5.0'})
    page1 = urlopen(req)

    # Parse the response with html5lib and look for the listing divs
    phtml = BeautifulSoup(page1, 'html5lib')
    print(phtml)
    divs = phtml.find_all("div", attrs={"class": "hit-company"})
    print('aaaaa-----' + str(divs))

The actual data has been replaced by `{{parameter x}}` placeholders. Can you help me solve this?
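
(Editor's note, not part of the original question: a quick check like the one below, run against the `phtml` object from the code above, lists any unrendered `{{ ... }}` template expressions left in the markup, which is a strong hint that the values are filled in by JavaScript after the page loads.)

    import re

    # If the raw HTML still contains {{ ... }} expressions, the server returned a
    # client-side template and the real values are inserted later by JavaScript.
    raw_html = str(phtml)
    placeholders = re.findall(r"\{\{[^}]+\}\}", raw_html)
    print(placeholders[:10])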

Thanks

1 Answer:

Answer 0 (score: 0)

Per @crossal's comment:

The site being scraped (an internal web page) generates its content dynamically; the problem was solved using Selenium with PhantomJS.
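
A minimal sketch of that approach (assuming the phantomjs binary is on the PATH; note that `webdriver.PhantomJS` was removed in Selenium 4, where a headless Chrome or Firefox driver is used instead):

    from bs4 import BeautifulSoup
    from selenium import webdriver

    # PhantomJS executes the page's JavaScript, so the {{ }} placeholders are
    # replaced with real values before we read the markup.
    driver = webdriver.PhantomJS()            # requires the phantomjs binary on PATH
    driver.get("https://www.urlxxxxxx.com")   # same redacted URL as in the question
    html = driver.page_source                 # fully rendered HTML
    driver.quit()

    phtml = BeautifulSoup(html, "html5lib")
    divs = phtml.find_all("div", attrs={"class": "hit-company"})
    print(divs)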