I have been teaching myself web scraping and want to scrape data from the Google daily search trends here: https://trends.google.com/trends/trendingsearches/daily?geo=US The data would include each day's search keywords, their rank, and their search frequency.
I first tried scraping with R using the rvest library, but the commands I ran to extract the data returned empty results. I'm guessing the site's HTML structure is too complex for rvest? So I'd like to learn a better approach that works for this site.
I searched for information specific to scraping the daily searches but couldn't find any, because most posts focus on extracting Google Trends data rather than the daily searches.
What is an effective way to extract data from this site, or from websites more generally? I'm happy to learn any tool besides R, and I have basic knowledge of Python and JavaScript. If someone can give me a hint, I will dig into it in depth, but at the moment I don't even know where to start.
Thanks
Answer 0 (score: 4)
Have a look at the HTML using the 'Inspect Element' tool in Firefox.
Essentially, we can see that every element you want to scrape from the webpage can be distinguished easily based on the tooltip:
Given that, we can use Selenium to scrape the webpage and retrieve this information.
(Install it first with pip3 install -U selenium, and install your favorite webdriver from the links here.)
Start a browser and point it at the Google Trends page with something like:
╰─ ipython3
Python 3.7.0 (default, Jun 29 2018, 20:13:13)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from selenium import webdriver
In [2]: browser = webdriver.Firefox()
...: browser.get('https://trends.google.com/trends/trendingsearches/daily?geo=US')
You should now see something similar to this:
Again, using the Inspect Element tool, get the class of the div that contains every element to scrape: we need to find the div with a class named feed-list-wrapper.
In [3]: list_div = browser.find_element_by_class_name("feed-list-wrapper")
In [4]: list_div
Out[4]: <selenium.webdriver.firefox.webelement.FirefoxWebElement (session="b889702e-7e2b-7448-9180-c9fb3d1ff641", element="cad96530-3444-9d4f-a8e8-b7da780f5751")>
Once done, just get the list of the detail divs:
In [5]: details_divs = list_div.find_elements_by_class_name("details")
And, for example, get the titles (you should understand the code by now):
In [6]: for detail_div in details_divs:
...: print(detail_div.find_element_by_class_name("details-top").find_element_by_xpath("div/span/a").text)
...:
Captain Marvel
Celia Barquin Arozamena
Yom Kippur
Lethal White
National Cheeseburger Day 2018
Ind vs HK
Mario Kart
Barcelona
Emilia Clarke
Elementary
Angela Bassett
Lenny Kravitz
Lil Uzi Vert
Handmaid's Tale
Mary Poppins Returns trailer
Hannah Gadsby
Another example, to get the search count:
In [7]: for detail_div in details_divs:
   ...:     title = detail_div.find_element_by_class_name("details-top").find_element_by_xpath("div/span/a").text
   ...:     search_count = detail_div.find_element_by_xpath('..').find_element_by_class_name("search-count-title").text
   ...:     print("Title : {title} \t\t\t Searches : {search_count}".format(title=title, search_count=search_count))
   ...:
Title : Captain Marvel 			 Searches : 500 k+
Title : Celia Barquin Arozamena 			 Searches : 200 k+
Title : Yom Kippur 			 Searches : 100 k+
Title : Lethal White 			 Searches : 50 k+
Title : National Cheeseburger Day 2018 			 Searches : 50 k+
Title : Ind vs HK 			 Searches : 50 k+
Title : Mario Kart 			 Searches : 50 k+
Title : Barcelona 			 Searches : 50 k+
Title : Emilia Clarke 			 Searches : 50 k+
Title : Elementary 			 Searches : 20 k+
Title : Angela Bassett 			 Searches : 20 k+
Title : Lenny Kravitz 			 Searches : 20 k+
Title : Lil Uzi Vert 			 Searches : 20 k+
Title : Handmaid's Tale 			 Searches : 20 k+
Title : Mary Poppins Returns trailer 			 Searches : 20 k+
Title : Hannah Gadsby 			 Searches : 20 k+
You should get used to Selenium quickly. If you have any doubt about the methods used here, here is a link to the Selenium docs.
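To tie the steps above together, here is a sketch of a standalone script. Note two caveats: the `find_element_by_*` methods used in the transcript were removed in Selenium 4, so this sketch uses the newer `find_element(By.…, …)` API instead; and the class names (`feed-list-wrapper`, `details`, `details-top`, `search-count-title`) are assumed to still match the live page, which Google may have changed since this answer was written.

```python
def format_trend(title, search_count):
    """Format one (title, search count) pair the same way as the transcript above."""
    return "Title : {} \t\t\t Searches : {}".format(title, search_count)


def scrape_daily_trends(browser):
    """Return a list of (title, search_count) tuples from the daily trends page.

    Assumes the page still uses the class names from the answer above.
    """
    # Imported here so the formatting helper is usable without Selenium installed.
    from selenium.webdriver.common.by import By  # Selenium 4 locator API

    browser.get("https://trends.google.com/trends/trendingsearches/daily?geo=US")
    list_div = browser.find_element(By.CLASS_NAME, "feed-list-wrapper")
    trends = []
    for detail_div in list_div.find_elements(By.CLASS_NAME, "details"):
        title = (detail_div
                 .find_element(By.CLASS_NAME, "details-top")
                 .find_element(By.XPATH, "div/span/a")
                 .text)
        # The search count lives in a sibling of the details div, hence the
        # ".." step up to the parent first, as in the transcript above.
        count = (detail_div
                 .find_element(By.XPATH, "..")
                 .find_element(By.CLASS_NAME, "search-count-title")
                 .text)
        trends.append((title, count))
    return trends


if __name__ == "__main__":
    from selenium import webdriver

    browser = webdriver.Firefox()
    try:
        for title, count in scrape_daily_trends(browser):
            print(format_trend(title, count))
    finally:
        browser.quit()  # always close the browser, even if scraping fails
```

The dynamic content is the reason rvest returned empty data: rvest only fetches the static HTML, while the trends list is rendered by JavaScript after the page loads, so a real browser driven by Selenium (or another headless-browser tool) is needed.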