I'm working on a larger piece of code that will display the links from a Google Newspaper search and then analyze those links for certain keywords, plus context and data. I've got all of that working, but I've now hit a problem when I try to iterate over the pages of results. Without an API (which I wouldn't know how to use anyway), I don't know how to do this. I just need to be able to loop through multiple pages of search results so that I can apply my analysis to them. There seems to be a simple solution for iterating over result pages, but I'm not seeing it.
Does anyone have suggestions on how to approach this? I'm somewhat new to Python and have been teaching myself all of these scraping techniques, so I'm not sure if I'm just missing something simple here. I know this may be an issue with Google restricting automated searches, but even pulling in the first 100 or so links would be beneficial. I've seen examples of this for regular Google searches, but not for Google Newspaper searches.
Here is the body of the code. Any suggestions on specific lines would be helpful. Thanks in advance!
import requests
import csv
from lxml import html

def get_page_tree(url):
    page = requests.get(url=url, verify=False)
    return html.fromstring(page.text)

def find_other_news_sources(initial_url):
    forwarding_identifier = '/url?q='
    google_news_search_tree = get_page_tree(url=initial_url)
    other_news_sources_links = [a_link.replace(forwarding_identifier, '').split('&')[0]
                                for a_link in google_news_search_tree.xpath('//a//@href')
                                if forwarding_identifier in a_link]
    return other_news_sources_links

links = find_other_news_sources("https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=ohio+pay-to-play&oq=ohio+pay-to-play&gs_l=news-cc.3..43j43i53.2737.7014.0.7207.16.6.0.10.10.0.64.327.6.6.0...0.0...1ac.1.NAJRCoza0Ro")

with open('textanalysistest.csv', 'wt') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for row in links:
        print(row)
        wr.writerow([row])
Answer 0 (score: 0)
I was working on building a parser for a site with a structure similar to Google's (i.e. a bunch of consecutive result pages, each with a table of content of interest).
A combination of the Selenium package (for page-element-based site navigation) and BeautifulSoup (for html parsing) seems to be the weapon of choice for harvesting written content. You may find them useful too, although I have no idea what kind of defenses Google has in place to deter scraping.
A possible implementation for Mozilla Firefox using selenium, beautifulsoup and geckodriver:
from bs4 import BeautifulSoup, SoupStrainer
from bs4.diagnose import diagnose
from os.path import isfile
from time import sleep
import codecs
from selenium import webdriver

def first_page(link):
    """Takes a link, and scrapes the desired tags from the html code"""
    driver = webdriver.Firefox(executable_path='C://example/geckodriver.exe')  # Specify the appropriate driver for your browser here
    counter = 1
    driver.get(link)
    html = driver.page_source
    filter_html_table(html)
    counter += 1
    return driver, counter

def nth_page(driver, counter, max_iter):
    """Takes a driver instance, a counter to keep track of iterations, and max_iter for maximum number of iterations. Looks for a page element matching the current iteration (how you need to program this depends on the html structure of the page you want to scrape), navigates there, and calls scrape_page to scrape."""
    while counter <= max_iter:
        pageLink = driver.find_element_by_link_text(str(counter))  # For other strategies to retrieve elements from a page, see the selenium documentation
        pageLink.click()
        scrape_page(driver)
        counter += 1
    else:
        print("Done scraping")
    return

def scrape_page(driver):
    """Takes a driver instance, extracts html from the current page, and calls function to extract tags from html of total page"""
    html = driver.page_source  # Get html from page
    filter_html_table(html)  # Call function to extract desired html tags
    return

def filter_html_table(html):
    """Takes a full page of html, filters the desired tags using beautifulsoup, calls function to write to file"""
    only_td_tags = SoupStrainer("td")  # Specify which tags to keep
    filtered = BeautifulSoup(html, "lxml", parse_only=only_td_tags).prettify()  # Specify how to represent content
    write_to_file(filtered)  # Function call to store extracted tags in a local file.
    return

def write_to_file(output):
    """Takes the scraped tags, opens a new file if the file does not exist, or appends to existing file, and writes extracted tags to file."""
    fpath = "<path to your output file>"
    if isfile(fpath):
        f = codecs.open(fpath, 'a')  # using 'codecs' to avoid problems with utf-8 characters in ASCII format.
        f.write(output)
        f.close()
    else:
        f = codecs.open(fpath, 'w')  # using 'codecs' to avoid problems with utf-8 characters in ASCII format.
        f.write(output)
        f.close()
    return
After that, it is just a matter of:

link = "<link to site to scrape>"
driver, n_iter = first_page(link)
nth_page(driver, n_iter, 1000)  # the 1000 lets us scrape 1000 of the result pages
Note that this script assumes that the result pages you are trying to scrape are sequentially numbered, and that those numbers can be retrieved from the scraped page's html using 'find_element_by_link_text'. For other strategies to retrieve elements from a page, see the selenium documentation here.
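If the pages are not numbered, one alternative strategy is to repeatedly click a "next" control instead. The sketch below is only an illustration: the function name next_page_loop is made up, it reuses scrape_page from the code above, and the element id 'pnnext' is an assumption about Google's result-page markup that you should verify and adjust for the site you are scraping.

from selenium.common.exceptions import NoSuchElementException

def next_page_loop(driver, max_iter):
    """Sketch: scrape the current page, then click a 'next' control until it disappears or max_iter is reached."""
    for _ in range(max_iter):
        scrape_page(driver)  # reuse the scrape_page function defined above
        try:
            # 'pnnext' is assumed to be the id of the next-page link; adjust for your site.
            next_link = driver.find_element_by_id('pnnext')
        except NoSuchElementException:
            print("No further pages found")
            break
        next_link.click()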
Also note that you will need to download the packages this depends on, as well as the driver selenium needs in order to communicate with your browser (in this case geckodriver; download geckodriver, put it in a folder, and then point to the executable via 'executable_path').
If you do end up using these packages, it can also help to spread out your server requests using the time package (native to Python), to avoid exceeding the maximum number of requests allowed by the server you are scraping. I didn't end up needing it for my own project, but see here, the second answer to the original question, for an implementation example using the time module in its fourth code block.
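A minimal sketch of that idea, pausing between page loads inside the same loop as nth_page above (the function name nth_page_throttled and the 5-second delay are just illustrative choices):

from time import sleep

def nth_page_throttled(driver, counter, max_iter, delay=5):
    """Same loop as nth_page above, but sleeps 'delay' seconds between page loads to go easy on the server."""
    while counter <= max_iter:
        pageLink = driver.find_element_by_link_text(str(counter))
        pageLink.click()
        scrape_page(driver)
        sleep(delay)  # wait before requesting the next result page
        counter += 1
    print("Done scraping")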
Yeeeeaaaahhh... it would be great if someone with higher rep could edit in some links to the beautifulsoup, selenium and time documentation, thaaaanks.