Question

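I'm currently using PyHive (Python 3.6) to read data onto a server outside the Hive cluster, and then analyzing it with Python.

After the analysis I'd like to write the data back to the Hive server. Most of the posts I found while looking for a solution use PySpark. In the long run we will set up our system to use PySpark. In the short term, though, is there an easy way to write data directly to a Hive table using Python from a server outside the cluster?

Thanks for your help!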

Answer 1

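You can use the subprocess module.

The following function works for data you have already saved locally. For example, if you save a DataFrame to a CSV, pass the CSV's name to save_to_hdfs and it will put the file in HDFS. I'm sure there's a way to send the DataFrame up directly, but this should get you started.

Here is an example function that saves a local file, output, to user/<your_name>/<output_name> in HDFS.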

  import os
  from subprocess import PIPE, Popen

  import pandas as pd

  def save_to_hdfs(output):
      """
      Save a local file to HDFS.
      Note: this performs a forced put, so any file with the same name
      will be overwritten.
      """
      hdfs_path = os.path.join(os.sep, 'user', '<your_name>', output)
      put = Popen(["hadoop", "fs", "-put", "-f", output, hdfs_path], stdin=PIPE, bufsize=-1)
      put.communicate()

  # example
  df = pd.DataFrame(...)
  output_file = 'yourdata.csv'
  df.to_csv(output_file)
  save_to_hdfs(output_file)
  # remove the locally created file (so it doesn't pollute nodes)
  os.remove(output_file)
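
As for sending the data up without a temporary file, one possible sketch: `hadoop fs -put` reads from stdin when the source is given as `-`, so CSV text can be piped straight in. The name `stream_csv_to_hdfs` and its `run` parameter are made up for this example (`run` is only there so the command can be checked without a cluster), and the target path is whatever applies to your setup.

```python
import subprocess


def stream_csv_to_hdfs(csv_text, hdfs_path, run=subprocess.run):
    """Pipe CSV text into HDFS without writing a local file first.

    Hypothetical helper: `hadoop fs -put -f - <path>` takes the file body
    from stdin and overwrites any existing file at <path>.
    """
    command = ["hadoop", "fs", "-put", "-f", "-", hdfs_path]
    # check=True raises CalledProcessError if hadoop exits with a non-zero status
    return run(command, input=csv_text.encode("utf-8"), check=True)


# Usage (assumes pandas and a configured hadoop client on this machine):
#   stream_csv_to_hdfs(df.to_csv(index=False), "/user/<your_name>/yourdata.csv")
```

Calling `to_csv` without a path returns the CSV as a string, which is what makes this work. The whole file is held in memory, so for very large frames the temporary-file route may still be the better choice.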

Answer 2

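In which format do you want to write the data to Hive? Parquet / Avro / binary, or a simple CSV / text format? Depending on the SerDe you chose when creating the Hive table, different Python libraries can be used to first convert your DataFrame to the matching format and store the file locally. You can then use something like save_to_hdfs (as in @Jared Wilber's answer) to move that file into the Hive table's location path in HDFS.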
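
For the plain-text case, here is a minimal sketch of turning rows into a file body that a table declared with `ROW FORMAT DELIMITED FIELDS TERMINATED BY ','` could read. Hive's default delimited-text SerDe does not understand CSV quoting, so this version refuses values containing the delimiter or a line break rather than quoting them. `rows_to_delimited` is a hypothetical name, and writing missing values as `\N` relies on Hive's default null marker for text files.

```python
def rows_to_delimited(rows, delimiter=","):
    """Serialize rows as delimited text with no header and no quoting."""
    lines = []
    for row in rows:
        fields = []
        for value in row:
            # \N is Hive's default marker for NULL in delimited text files
            text = r"\N" if value is None else str(value)
            if delimiter in text or "\n" in text or "\r" in text:
                raise ValueError(f"value {text!r} would break the row layout")
            fields.append(text)
        lines.append(delimiter.join(fields))
    return "\n".join(lines) + "\n" if lines else ""


print(rows_to_delimited([(1, "a", None), (2, "b", 3.5)]), end="")
```

With pandas, `df.itertuples(index=False, name=None)` yields rows in the shape this function expects. No header line is written, because Hive would otherwise read it as a data row.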

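When a Hive table is created (managed or external), it reads and stores its data at a specific HDFS location (the default one or one you provide). That HDFS location can be accessed directly to modify the data. If you update the data in a Hive table by hand, there are a few things to keep in mind: SERDE, PARTITIONS, ROW FORMAT DELIMITED, and so on.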
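
To make the PARTITIONS point concrete: a partitioned table keeps each partition in a `key=value` subdirectory under the table location, and a file dropped there by hand is not visible to queries until the partition is registered in the metastore. The sketch below builds both the directory and the registering statement. The function names, the `sales` table, and the warehouse path are made up for illustration, and the values are not escaped, so only use it with trusted input.

```python
import posixpath


def partition_dir(table_location, partition):
    """Return the HDFS directory for one partition, e.g. .../dt=2020-01-01.

    Keys must be given in the same order as the table's partition columns.
    """
    parts = [f"{key}={value}" for key, value in partition.items()]
    return posixpath.join(table_location, *parts)


def add_partition_sql(table, partition, location):
    """Build the statement that registers a hand-written partition directory."""
    spec = ", ".join(f"{key}='{value}'" for key, value in partition.items())
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({spec}) LOCATION '{location}'"
    )


location = partition_dir("/user/hive/warehouse/sales", {"dt": "2020-01-01"})
print(location)
print(add_partition_sql("sales", {"dt": "2020-01-01"}, location))
```

The file goes into the printed directory first, and the `ALTER TABLE` statement is then run through whatever Hive connection you already have.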

Some helpful SerDe libraries in Python:

Parquet: https://fastparquet.readthedocs.io/en/latest/
Avro: https://pypi.org/project/fastavro/

Answer 3

It took some digging, but I was able to find a way to use sqlalchemy to create a Hive table directly from a pandas DataFrame.

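from sqlalchemy import create_engine, text

# Input information
host = 'username@local-host'
port = 10000
schema = 'hive_schema'
table = 'new_table'

# Execution (the hive:// dialect is registered by PyHive)
engine = create_engine(f'hive://{host}:{port}/{schema}')
with engine.connect() as conn:
    # col1-type and col2-type are placeholders for real Hive column types
    conn.execute(text('CREATE TABLE ' + table + ' (col1 col1-type, col2 col2-type)'))
# df is the pandas DataFrame to upload; index=False keeps the index out of the table
df.to_sql(name=table, con=engine, if_exists='append', index=False)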
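
The column types in the CREATE TABLE statement above are placeholders. One way to fill them in, sketched under my own assumptions, is to derive them from the DataFrame's dtypes. The mapping below is a guess at sensible defaults and not something the Hive dialect prescribes, and `create_table_sql` is a hypothetical helper.

```python
# Assumed mapping from pandas dtype names to Hive column types; extend as needed.
HIVE_TYPES = {
    "int64": "BIGINT",
    "int32": "INT",
    "float64": "DOUBLE",
    "float32": "FLOAT",
    "bool": "BOOLEAN",
    "object": "STRING",
    "datetime64[ns]": "TIMESTAMP",
}


def create_table_sql(table, dtypes):
    """Build a CREATE TABLE statement from {column name: dtype name}."""
    columns = []
    for name, dtype in dtypes.items():
        # Anything not in the mapping falls back to STRING
        hive_type = HIVE_TYPES.get(str(dtype), "STRING")
        columns.append(f"`{name}` {hive_type}")
    return f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(columns)})"


print(create_table_sql("new_table", {"id": "int64", "name": "object"}))
```

For a pandas DataFrame, passing its `.dtypes` attribute as the second argument works, since it maps column names to dtypes. Column names are wrapped in backticks so that names clashing with Hive keywords still parse.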

Answer 4

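You can write it back. Convert the DataFrame's data into the form you would use to insert multiple rows into a table at once, e.g. insert into table values (first row of the DataFrame, comma separated), (second row), (third row) ... and so on; then you can insert it.

# Build "(v1, v2, ...), (v1, v2, ...)" from the DataFrame rows.
# repr() quotes strings Python-style and nothing is escaped for SQL, so use this only with trusted data.
bundle = ', '.join('(' + ', '.join(repr(v) for v in row) + ')' for row in df.itertuples(index=False, name=None))

con.cursor().execute('insert into table table_name values ' + bundle)

And you're done.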
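
A slightly more defensive take on the same idea, as a sketch: render each value as a SQL literal with quoting and NULL handling, and split the rows into batches so that a single statement does not grow without bound. The helper names and the batch size of 1000 are my own choices. The escaping assumes Hive's backslash-style escapes inside string literals, and turning NaN into NULL is an assumption about what you want.

```python
import math


def hive_literal(value):
    """Render one Python value as a Hive SQL literal."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return "NULL"  # assumption: missing values should become NULL
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape backslashes first, then single quotes
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{text}'"


def insert_statements(table, rows, batch_size=1000):
    """Yield INSERT statements with at most batch_size rows each."""
    rows = list(rows)
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        values = ", ".join(
            "(" + ", ".join(hive_literal(v) for v in row) + ")" for row in batch
        )
        yield f"INSERT INTO TABLE {table} VALUES {values}"


for statement in insert_statements("table_name", [(1, "O'Brien"), (2, None)]):
    print(statement)
```

With a DataFrame, the rows can come from `df.itertuples(index=False, name=None)`, and each yielded statement goes to `con.cursor().execute(...)` in turn. Each INSERT is a separate Hive job, so this only suits small result sets.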

Inserting a Python DataFrame into Hive from an external server
