Question

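Has anyone working in Scala managed to write a Spark DStream to a single concatenated file in Google Cloud Storage? I have tried several approaches and none of them worked, so I am now trying a new one based on the saveAsNewAPIHadoopFile method. Can anyone confirm that this method allows a DStream to be written to a single concatenated file?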

I used the approach below at first, but I got several part files, which is not the output I want. In fact, I get one part file for every message:

import requests
import openpyxl     
from bs4 import BeautifulSoup

wb = openpyxl.Workbook()
ws = wb.active

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get('https://thebarchive.com/b/page/7', headers=headers)
pages_soup = BeautifulSoup(r.text, 'lxml')
row = 2

for mobile_view in pages_soup.find_all(class_='mobile_view'):
    for thread_link in mobile_view.find_all('a', href=True):
        ws.cell(row=row, column=4).value = thread_link['href']
        row += 1

wb.save('db.xlsx')

With the saveAsNewAPIHadoopFile method I get a compilation error. Does anyone know how to use it? Best regards
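For reference, here is a minimal sketch of how a saveAsNewAPIHadoopFile call is typically shaped in Scala. It is not a confirmed fix for the compilation error. The object name `SingleFilePerBatch`, the helper `saveBatches`, and the `outputPrefix` parameter (for example a `gs://` path, assuming the GCS connector is configured) are hypothetical. The sketch assumes the stream carries plain strings and uses the new-API `TextOutputFormat` from `org.apache.hadoop.mapreduce`, not the older `org.apache.hadoop.mapred` one.

```scala
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.streaming.dstream.DStream

object SingleFilePerBatch {
  // Hypothetical helper: writes each non-empty micro-batch as one part file.
  // outputPrefix is an assumed path prefix, e.g. a gs:// location.
  def saveBatches(lines: DStream[String], outputPrefix: String): Unit = {
    lines.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        rdd
          .coalesce(1) // one partition, so one part file in this batch's directory
          .map(line => (NullWritable.get(), new Text(line))) // key/value pairs for the output format
          .saveAsNewAPIHadoopFile(
            s"$outputPrefix-${time.milliseconds}", // a separate directory per batch
            classOf[NullWritable],
            classOf[Text],
            classOf[TextOutputFormat[NullWritable, Text]]
          )
      }
    }
  }
}
```

Note that this still produces one output directory per micro-batch, each holding a single part file. It does not by itself merge the whole stream into one file, so a separate merge step would still be needed for that.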

Writing a Spark DStream to a single file in Google Cloud Storage

0 answers: