Web scraping with BeautifulSoup within a specific time range

Time: 2020-07-02 01:06:24

Tags: python web-scraping beautifulsoup google-colaboratory data-mining

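I am trying to scrape data from Google results using BeautifulSoup in Google Colab. My code returns relevant data, but it seems to ignore the start date/end date elements and only shows the latest 100 headlines. I had trouble setting up Selenium in Colab, so I would like to know whether there is another way to restrict the search to a specific date range besides modifying the URL, or whether there is some other fix. Any input would be greatly appreciated. Thank you.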

import csv
from urllib.parse import quote_plus

import requests
from bs4 import BeautifulSoup


class Scrape:

    def __init__(self, search_term, start_date, end_date):
        # start_date and end_date are indexed as (day, month, year)
        self.search_term = search_term
        self.start_date = start_date
        self.start_day = start_date[0]
        self.start_month = start_date[1]
        self.start_year = start_date[2]
        self.end_day = end_date[0]
        self.end_month = end_date[1]
        self.end_year = end_date[2]
        # quote_plus escapes spaces and special characters in the search term
        self.url = 'https://www.google.com/search?q={0}&biw=1053&bih=1138&source=lnt&tbs=cdr%3A1%2Ccd_min%3A{1}%2F{2}%2F{3}%2Ccd_max%3A{4}%2F{5}%2F{6}&tbm=nws&num=100'.format(quote_plus(self.search_term), self.start_month, self.start_day, self.start_year, self.end_month, self.end_day, self.end_year)
        self.filename = '{0}{1}.csv'.format(self.search_term, self.start_date)
        self.behaviour_index = 0

    def run(self):
        response = requests.get(self.url)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Class name as used in the question; Google's markup may differ
        headlines = soup.find_all('div', {'class': 'BNeawe vvjwJb AP7Wnd'})
        # The with-block closes the file even if writing fails
        with open(self.filename, 'w', newline='') as csv_file:
            csv_writer = csv.writer(csv_file)
            csv_writer.writerow(['text', 'sentiment'])
            # One row per headline, with a placeholder sentiment of 0
            for headline in headlines:
                csv_writer.writerow([headline.get_text(), 0])
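
One way to rule out a malformed query string is to build the URL with `urllib.parse.urlencode` instead of hand-written percent escapes, and then inspect the decoded parameters. The following is only a minimal sketch: the helper name `build_news_url` is hypothetical, the parameter names (`tbs`, `cdr`, `cd_min`, `cd_max`, `tbm`, `num`) and the month/day/year order are copied from the URL in the code above, and whether Google honors them for a plain `requests` call is not confirmed here.

```python
from datetime import date
from urllib.parse import parse_qs, urlencode, urlparse


def build_news_url(search_term, start, end, num=100):
    """Hypothetical helper: build a news search URL with a custom date range."""
    # Same structure as the hand-escaped tbs value above, in month/day/year order
    tbs = 'cdr:1,cd_min:{0},cd_max:{1}'.format(
        start.strftime('%m/%d/%Y'), end.strftime('%m/%d/%Y'))
    params = {'q': search_term, 'tbs': tbs, 'tbm': 'nws', 'num': num}
    # urlencode percent-escapes ':', ',' and '/' and turns spaces into '+'
    return 'https://www.google.com/search?' + urlencode(params)


url = build_news_url('electric cars', date(2020, 1, 1), date(2020, 3, 31))

# Decode the query string again to check what would actually be sent
query = parse_qs(urlparse(url).query)
print(query['tbs'][0])
```

Printing the decoded `tbs` value makes it easy to confirm that the start and end dates reach the request in the intended order before looking for other causes.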

0 answers:

No answers