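I'm trying to scrape Google results with BeautifulSoup in Google Colab. My code returns relevant data, but it seems to ignore the start-date/end-date elements and only shows the 100 most recent headlines. I had trouble setting up Selenium in Colab, so I'd like to know whether there is another way to restrict the search to a specific date range, other than just modifying the URL, or whether there is some other fix. Any input would be much appreciated. Thanks.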

import csv

import requests
from bs4 import BeautifulSoup


class Scrape:
    def __init__(self, search_term, start_date, end_date):
        # start_date and end_date are indexed as (day, month, year)
        self.search_term = search_term
        self.start_date = start_date
        self.start_day = start_date[0]
        self.start_month = start_date[1]
        self.start_year = start_date[2]
        self.end_day = end_date[0]
        self.end_month = end_date[1]
        self.end_year = end_date[2]
        # The tbs parameter carries the date range in month/day/year order
        self.url = 'https://www.google.com/search?q={0}&biw=1053&bih=1138&source=lnt&tbs=cdr%3A1%2Ccd_min%3A{1}%2F{2}%2F{3}%2Ccd_max%3A{4}%2F{5}%2F{6}&tbm=nws&num=100'.format(self.search_term, self.start_month, self.start_day, self.start_year, self.end_month, self.end_day, self.end_year)
        self.filename = '{0}{1}.csv'.format(self.search_term, self.start_date)
        self.behaviour_index = 0

    def run(self):
        response = requests.get(self.url)
        soup = BeautifulSoup(response.text, 'html.parser')
        headlines = soup.find_all('div', {'class': "BNeawe vvjwJb AP7Wnd"})
        # newline='' keeps the csv module from emitting blank rows on Windows
        with open(self.filename, 'w', newline='') as csv_file:
            csv_writer = csv.writer(csv_file)
            csv_writer.writerow(['text', 'sentiment'])
            for headline in headlines:
                # Write each headline once, with a placeholder sentiment of 0
                csv_writer.writerow([headline.get_text(), 0])
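
For reference, here is how the same date-range URL could be built with `urllib.parse.urlencode` instead of hand-escaped `%3A`/`%2F` sequences. This is only a minimal sketch: `build_news_url` is a hypothetical helper name, and the `tbs=cdr:1,cd_min:...,cd_max:...` format is simply copied from the URL in my class above. It does not by itself show whether Google honors the range for a plain `requests` call.

```python
from urllib.parse import urlencode


def build_news_url(search_term, start_date, end_date, num=100):
    """Build a Google News search URL with a custom date range.

    start_date and end_date are (day, month, year) tuples, the same
    order the Scrape class above indexes them in. build_news_url is a
    hypothetical helper; the tbs format is taken from the original URL.
    """
    start_day, start_month, start_year = start_date
    end_day, end_month, end_year = end_date
    # The original URL passes the dates in month/day/year order.
    tbs = 'cdr:1,cd_min:{0}/{1}/{2},cd_max:{3}/{4}/{5}'.format(
        start_month, start_day, start_year, end_month, end_day, end_year)
    params = {'q': search_term, 'tbs': tbs, 'tbm': 'nws', 'num': num}
    # urlencode percent-escapes ':', ',' and '/', and turns spaces into '+'.
    return 'https://www.google.com/search?' + urlencode(params)


if __name__ == '__main__':
    print(build_news_url('climate change', (1, 2, 2020), (15, 3, 2020)))
```

Doing it this way also escapes spaces and other special characters in the search term, which the plain `.format()` call in my class does not.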