Reading blocks from Hadoop with Python

Time: 2018-08-16 13:20:39

Tags: python hadoop hdfs

I have the following problem. I want to extract data from HDFS (a table called "complaint"). I wrote the script below, and it actually works:

import pandas as pd
from hdfs import InsecureClient
import os

file = open("test.txt", "wb")

print("Step 1")
client_hdfs = InsecureClient('http://XYZ')
N = 10
print("Step 2")
with client_hdfs.read('/user/.../complaint/000000_0') as reader:
    print('new line')
    features = reader.read(1000000)
    file.write(features)
    print('end')
file.close()

My problem now is that the folder "complaint" contains 4 files (I don't know which file type), and the read operation gives me bytes that I can't do anything further with (I saved them to a text file as a test, and it looks like this: txt_file).

In HDFS it looks like this: hdfs directory

My question now is: is it possible to split the data into its individual columns in a meaningful way?

I have only found solutions that work with .csv files, and I'm somewhat stuck here... :-)
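
What I imagine could work, if the files turn out to be Hive's default text format (one record per line, fields separated by the '\x01' control character), is something like the sketch below. The delimiter and the missing header row are just assumptions on my part:

import io
import pandas as pd
from hdfs import InsecureClient

client_hdfs = InsecureClient('http://XYZ')

# Read the raw bytes from HDFS, then let pandas split them into columns.
with client_hdfs.read('/user/.../complaint/000000_0') as reader:
    raw = reader.read()

# Assumption: Hive text tables use '\x01' as field delimiter and have no header row.
df = pd.read_csv(io.BytesIO(raw), sep='\x01', header=None)
print(df.head())

But since I don't actually know the file format, this is only a guess.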

EDIT I changed my solution and tried different approaches, but none of them really work. Here is the updated code:

import pandas as pd
from hdfs import InsecureClient
import os
import pypyodbc
import pyspark
from pyspark import SparkConf, SparkContext
from hdfs3 import HDFileSystem
import pyarrow.parquet as pq
import pyarrow as pa
from pyhive import hive


#Step 0: Configurations
#Connections with InsecureClient (this basically works)
#Notes: TMS1 doesn't work because of txt files
#insec_client_tms1 = InsecureClient ('http://some-adress:50070')
insec_client_tms2 = InsecureClient('http://some-adress:50070')

#Connection with Spark (not working at the moment)
#Error: Java gateway process exited before sending its port number
#conf = SparkConf().setAppName('TMS').setMaster('spark://adress-of-node:7077')
#sc = SparkContext(conf=conf)

#Connection via PyArrow (not working)
#Error: File not found
#fs = pa.hdfs.connect(host='hdfs://node-adress', port =8020)
#print("FS: " + fs)

#connection via HDFS3 (not working)
#The module couldn't be loaded
#client_hdfs = HDFileSystem(host='hdfs://node-adress', port=8020)

#Connection via Hive (not working)
#no module named sasl -> I tried to install it, but it also fails
#conn = hive.Connection(host='hdfs://node-adress', port=8020, database='deltatest')

#Step 1: Extractions
print ("starting Extraction")
#Create file
file = open("extraction.txt", "w")


#Extraction with Spark
#text = sc.textFile('/user/hive/warehouse/XYZ.db/baseorder_flags/000000_0')
#first_line = text.first()
#print (first_line)

#extraction with hive
#df = pd.read_sql ('select * from baseorder',conn)
#print ("DF: "+ df)

#extraction with hdfs3
#with client_hdfs.open('/home/deltatest/basedeviation/000000_0') as f:
 #   df = pd.read_parquet(f)


#Extraction with Webclient (not working)
#Error: Arrow error: IOError: seek -> fastparquet has a similar error
with insec_client_tms2.read('/home/deltatest/basedeviation/000000_0') as reader:
    features = pd.read_parquet(reader)
    print(features)
    #features = reader.read()
    #data = features.decode('utf-8', 'replace')
    print("saving data to file")
    file.write(features.to_string())
    print('end')

file.close()
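
One thing I still want to try for the seek error (assuming the files really are Parquet): pd.read_parquet seems to need a seekable file object, and the streaming reader returned by client.read() isn't one, so buffering the whole file in memory first might help. A rough sketch, untested and assuming the file fits into RAM:

import io
import pandas as pd
from hdfs import InsecureClient

insec_client_tms2 = InsecureClient('http://some-adress:50070')

# Buffer the whole file so pandas/pyarrow get a seekable object instead of a stream.
with insec_client_tms2.read('/home/deltatest/basedeviation/000000_0') as reader:
    buffer = io.BytesIO(reader.read())

# Assumption: the warehouse file is really a Parquet file.
df = pd.read_parquet(buffer, engine='pyarrow')
print(df.head())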

0 Answers:

No answers yet