I have a log file with the following structure:
log_type time_stamp kvs
p 2019-06-05 18:53:20 us\tc\td\tus-xx-bb\th\ti-0b1\tvl\t20190605.1833\tvt\t20190605.1833\tvs\t20190508
p 2019-06-05 18:53:20 us\tc\td\tus-xx-bb\th\ti-03a\tvl\t20190605.1833\tvt\t20190605.1833
p 2019-06-05 18:53:20 us\tc\td\tus-xx-bb\th\ti-030
I need to read the kvs field and break each key out into its own column, so the final DataFrame should look like this:
log_type time_stamp us d h vl vt vs
p 2019-06-05 18:53:20 c us-xx-bb 0b1 20190605.1833 20190605.1833 20190508
p 2019-06-05 18:53:20 c us-xx-bb 03a 20190605.1833 20190605.1833
p 2019-06-05 18:53:20 c us-xx-bb 030
Very important: the number of keys in kvs is dynamic, and the key names are dynamic as well.
The kvs column is delimited by \t. If we split the kvs column, the even-indexed elements are the headers and the odd-indexed elements are the values.
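For illustration, splitting one kvs value on \t and pairing even-indexed elements with odd-indexed ones gives something like this (a minimal Scala sketch, assuming the tabs are real tab characters):

val kvs = "us\tc\td\tus-xx-bb\th\ti-0b1\tvl\t20190605.1833\tvt\t20190605.1833\tvs\t20190508"
// Pair up even-indexed keys with odd-indexed values.
val pairs = kvs.split("\t").grouped(2).collect { case Array(k, v) => k -> v }.toMap
// pairs contains: us -> c, d -> us-xx-bb, h -> i-0b1, vl -> 20190605.1833, vt -> 20190605.1833, vs -> 20190508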
I tried reading the log file, creating a DataFrame with an all-string schema, and then using write() to save the DataFrame out to HDFS, but I don't know how to implement the key extraction:
import org.apache.spark.sql.types.{StringType, StructType}

val logSchema = new StructType()
  .add("log_type", StringType)
  .add("time_stamp", StringType)
  .add("kvs", StringType)

val logDF = spark.read
  .option("delimiter", "\t")
  .format("com.databricks.spark.csv")
  .schema(logSchema)
  .load("/tmp/log.tsv")
I have also tried:
logDF
  .withColumn("pairkv", split($"kvs", "\t"))
  .select(
    col("pairkv")(1) as "us",
    col("pairkv")(3) as "d",
    col("pairkv")(5) as "h",
    col("pairkv")(7) as "vl",
    col("pairkv")(9) as "vt",
    col("pairkv")(11) as "vs")
  .show()
But no luck.
Any suggestions?
Answer 0 (score: 0)
In Scala, you can proceed as follows:
object DataFrames {
  case class Person(ID: Int, name: String, age: Int, numFriends: Int)

  def mapper(line: String): Person = {
    val fields = line.split(',')
    Person(fields(0).toInt, fields(1), fields(2).toInt, fields(3).toInt)
  }

  def main(args: Array[String]) {
    ....
    import spark.implicits._
    val lines = spark.sparkContext.textFile("../myfile.csv")
    val people = lines.map(mapper).toDS().cache()
    ....
    // here `people` is the Dataset, and you can run your SQL queries on it
  }
}
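To actually run SQL on it, you can register the Dataset as a temporary view first (a short usage sketch; the view name and query are illustrative):

// Register the Dataset under a name visible to Spark SQL, then query it.
people.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE numFriends > 100").show()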
Answer 1 (score: 0)
I found a solution:
// Note: in a Scala string literal, "\\\\t" becomes the regex \\t, which matches
// a literal backslash followed by 't', i.e. the two characters "\t" in the file.
logDF
.withColumn("us", regexp_extract(col("kvs") ,"(^|\\\\t)us\\\\t([\\w]+)",2))
.withColumn("d", regexp_extract(col("kvs") ,"(\\\\t)d\\\\t([\\w-]+)",2))
.withColumn("h", regexp_extract(col("kvs") ,"(\\\\t)h\\\\t([\\w-]+)",2))
.withColumn("vl", regexp_extract(col("kvs") ,"(\\\\t)vl\\\\t([\\w.]+)",2))
.withColumn("vt", regexp_extract(col("kvs") ,"(\\\\t)vt\\\\t([\\w.]+)",2))
.withColumn("vs", regexp_extract(col("kvs") ,"(\\\\t)vs\\\\t([\\w]+)",2))
.show()
This way we get a separate column for each key in the DF.
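Note that these regexes hardcode the key names, while the question stresses that both the number and the names of the keys are dynamic. A hedged alternative sketch (assuming, like the regexes above, that the file stores a literal backslash-t between tokens) is to split kvs into a map column and discover the keys at runtime:

import org.apache.spark.sql.functions.{col, explode, map_keys, udf}
import spark.implicits._

// Sketch: split on the literal two-character sequence \t ("\\\\t" as a regex)
// and pair even-indexed keys with odd-indexed values.
val kvsToMap = udf { (kvs: String) =>
  kvs.split("\\\\t").grouped(2).collect { case Array(k, v) => k -> v }.toMap
}

val withMap = logDF.withColumn("kvmap", kvsToMap(col("kvs")))

// Collect the union of all keys seen in the data, then project each key
// into its own column (rows missing a key get null).
val keys = withMap.select(explode(map_keys(col("kvmap")))).distinct().as[String].collect()
val result = keys
  .foldLeft(withMap)((df, k) => df.withColumn(k, col("kvmap")(k)))
  .drop("kvmap")
result.show()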
Answer 2 (score: -1)
The problem here is that you have two delimiter characters: ' ' and '\t'.
A simple solution I can see is to reformat the input file so that it has only one delimiter (note that this also splits the timestamp into separate date and time fields, since it contains a space):
import pandas as pd

with open('original_log_file.txt', 'r') as f:
    with open('new_log_file.txt', 'w') as out:
        for line in f:
            out.write(line.replace(' ', '\t'))  # now every separator is '\t'

df = pd.read_csv('new_log_file.txt', delimiter='\t')
# then fix the header and you are done
Another approach is to parse each line of the file, build a one-row DataFrame from it, and append that to the accumulated DataFrame.
Here is an example:
file = '''
p 2019-06-05 18:53:20 us\tc\td\tus-xx-bb\th\ti-0b1\tvl\t20190605.1833\tvt\t20190605.1833\tvs\t20190508
p 2019-06-05 18:53:20 us\tc\td\tus-xx-bb\th\ti-03a\tvl\t20190605.1833\tvt\t20190605.1833
p 2019-06-05 18:53:20 us\tc\td\tus-xx-bb\th\ti-030
'''
columns = ['log_type', 'date', 'time', 'us', 'd', 'h', 'vl', 'vt', 'vs']
df = pd.DataFrame({k: [] for k in columns})  # initial empty df

for line in file.split('\n'):
    if len(line):
        # normalize the line: turn the space separators into tabs, then split
        clean_line = line.strip().replace(' ', '\t').split('\t')
        # remove the key tokens, keeping only the values
        for c in columns:
            if c in clean_line:
                clean_line.remove(c)
        clean_line = [[x] for x in clean_line]
        df = df.append(pd.DataFrame(dict(zip(columns, clean_line))), sort=True)

df = df[columns]
df.head()