I am trying to parse a website that looks like this:
I am using Beautiful Soup to scrape this:
<div class="address">
<div class="hit-company"><a href="https://www.cools.biz/best/celebrities/amy-gold/">Amy Gold</a></div>
<div class="speciality hit-speciality">Audiology</div>
<div class="address hit-address"><i><p translate="no">
<span class="address-line1">38 Park Drive </span><br>
<span class="locality">London</span>, <span class="administrative-area">VA</span> <span class="postal-code">22025</span><br>
</p></i></div>
<div class="phone hit-phone"><i><a href="tel:+1-xxx-659-xxx">(xxx) 659-xxx</a></i></div>
<div class="description hit-listing_description hidden-xs"></div>
<div class="hit-website"><a href="http://coll celebs.com" target="_blank">Visit Website</a></div>
</div>
I tried html5lib, lxml and html.parser. With lxml and html.parser the div with class "hit-company" is not found at all; only html5lib finds it, and even then the divs come back empty.
When I inspect the HTML output of this code, I notice something, described after the listing:
import os
from urllib.request import Request, urlretrieve, urlopen
from bs4 import BeautifulSoup
# Send a browser-like User-Agent with the request
req = Request("https://www.urlxxxxxx.com", headers={'User-Agent': 'Mozilla/5.0'})
page1 = urlopen(req)
# Parse the response with html5lib and print the parsed tree
phtml = BeautifulSoup(page1, 'html5lib')
print(phtml)
# Collect every <div class="hit-company"> element
divs = phtml.find_all("div", attrs={"class":"hit-company"})
print('aaaaa-----' + str(divs))
The actual data is filled in through {{parameter x}} placeholders. Can you help me solve this?
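One way to confirm that observation is to search the raw response text for unrendered `{{ ... }}` markers. This is only a minimal sketch: `find_placeholders` and the `sample` string are hypothetical names and data made up for illustration, and it assumes the site uses double-curly-brace templating.

```python
import re

# Matches unrendered template expressions such as {{ name }}.
# Assumption: the page uses double-curly-brace placeholders.
PLACEHOLDER = re.compile(r"\{\{\s*(.*?)\s*\}\}")


def find_placeholders(html: str) -> list[str]:
    """Return the expressions found inside {{ ... }} markers, in document order."""
    return PLACEHOLDER.findall(html)


# Hypothetical sample of what the raw (unrendered) response might contain.
sample = '<div class="hit-company"><a href="{{ url }}">{{ name }}</a></div>'
print(find_placeholders(sample))
```

If this returns a non-empty list for the downloaded page, the values are probably inserted by JavaScript after the page loads. In that case switching parsers would likely not help, because the data is not present in the static HTML that `urlopen` receives.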
Thanks.