How to create a DataFrame from a string of key\tvalue pairs delimited by '\t'

Time: 2019-06-26 10:54:49

Tags: scala apache-spark apache-spark-sql

I have a log file with the following structure:

log_type    time_stamp  kvs
p   2019-06-05 18:53:20 us\tc\td\tus-xx-bb\th\ti-0b1\tvl\t20190605.1833\tvt\t20190605.1833\tvs\t20190508
p   2019-06-05 18:53:20 us\tc\td\tus-xx-bb\th\ti-03a\tvl\t20190605.1833\tvt\t20190605.1833
p   2019-06-05 18:53:20 us\tc\td\tus-xx-bb\th\ti-030

I need to read the kvs field and break each key out into a separate column, so that the final DataFrame looks like this:

log_type    time_stamp us   d   h   vl  vt  vs
p   2019-06-05 18:53:20 c   us-xx-bb    0b1 20190605.1833   20190605.1833   20190508
p   2019-06-05 18:53:20 c   us-xx-bb    03a 20190605.1833   20190605.1833
p   2019-06-05 18:53:20 c   us-xx-bb    030

Very importantly, the number of keys in kvs is dynamic, and so are the key names.

The kvs column is delimited by \t. If we split the kvs column, the even-indexed elements are the headers and the odd-indexed elements are the values.
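For example, a minimal sketch in plain Scala of that even/odd pairing (assuming the \t inside kvs is the literal two-character sequence, which is why it survives the tab-delimited read as a single field):

val kvs = """us\tc\td\tus-xx-bb\th\ti-0b1"""
val parts = kvs.split("""\\t""")  // the regex \\t matches a literal backslash + 't'
val pairs = parts.grouped(2).collect { case Array(k, v) => k -> v }.toMap
// pairs == Map("us" -> "c", "d" -> "us-xx-bb", "h" -> "i-0b1")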

I tried reading the log file, creating a DataFrame with an all-string schema, and then using write() to convert the DataFrame into an HDFS file, but I don't know how to achieve the splitting:

val logSchema = new StructType()
  .add("log_type", StringType)
  .add("time_stamp", StringType)
  .add("kvs", StringType)

val logDF = spark.read
  .option("delimiter", "\t")
  .format("com.databricks.spark.csv")
  .schema(logSchema)
  .load("/tmp/log.tsv")

I have also tried:

logDF.withColumn("pairkv", split($"kvs", "\t"))
  .select(
    col("pairkv")(1) as "us",
    col("pairkv")(3) as "d",
    col("pairkv")(5) as "h",
    col("pairkv")(7) as "vl",
    col("pairkv")(9) as "vt",
    col("pairkv")(11) as "vs")
  .show()

But no luck.

Any suggestions?

3 Answers:

Answer 0 (score: 0):

In Scala, you can do it as follows:

object DataFrames {

    case class Person(ID: Int, name: String, age: Int, numFriends: Int)

    def mapper(line: String): Person = {
      val fields = line.split(',')
      Person(fields(0).toInt, fields(1), fields(2).toInt, fields(3).toInt)
    }

    def main(args: Array[String]) {
        ....
        import spark.implicits._
        val lines = spark.sparkContext.textFile("../myfile.csv")
        val people = lines.map(mapper).toDS().cache()
        ....
        // here `people` is a Dataset[Person]; you can run your SQL queries against it
    }
}
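To apply the same mapper pattern to the kvs format from the question, a sketch (my adaptation, not part of the answer above; it assumes tab-separated top-level columns and literal \t sequences inside kvs) could parse each line into the fixed columns plus a key -> value map:

case class LogLine(logType: String, timeStamp: String, kvs: Map[String, String])

def kvsMapper(line: String): LogLine = {
  val fields = line.split('\t')           // the three top-level columns are real tabs
  val pairs = fields(2).split("""\\t""")  // literal backslash + 't' inside kvs
    .grouped(2)
    .collect { case Array(k, v) => k -> v }  // even index = key, odd index = value
    .toMap
  LogLine(fields(0), fields(1), pairs)
}

lines.map(kvsMapper).toDS() then yields a Dataset whose kvs map can be queried per key.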

Answer 1 (score: 0):

I found a solution:

logDF
  .withColumn("us", regexp_extract(col("kvs"), "(^|\\\\t)us\\\\t([\\w]+)", 2))
  .withColumn("d", regexp_extract(col("kvs"), "(\\\\t)d\\\\t([\\w-]+)", 2))
  .withColumn("h", regexp_extract(col("kvs"), "(\\\\t)h\\\\t([\\w-]+)", 2))
  .withColumn("vl", regexp_extract(col("kvs"), "(\\\\t)vl\\\\t([\\w.]+)", 2))
  .withColumn("vt", regexp_extract(col("kvs"), "(\\\\t)vt\\\\t([\\w.]+)", 2))
  .withColumn("vs", regexp_extract(col("kvs"), "(\\\\t)vs\\\\t([\\w]+)", 2))
  .show()

This way we have separate columns in the DF.
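Since the question stresses that the number and names of the keys are dynamic, the hard-coded key names above only work while the key set is known in advance. A sketch of a dynamic variant (my suggestion, not part of this answer; it assumes the literal \t separators and the logDF read earlier):

import org.apache.spark.sql.functions._

// UDF that pairs even-indexed keys with odd-indexed values;
// a dangling key without a value is silently dropped
val kvsToMap = udf { kvs: String =>
  Option(kvs).getOrElse("")
    .split("""\\t""")
    .grouped(2)
    .collect { case Array(k, v) => k -> v }
    .toMap
}

val withMap = logDF.withColumn("kv_map", kvsToMap(col("kvs")))

// discover whichever keys actually occur in the data
val keys = withMap
  .select(explode(map_keys(col("kv_map"))))
  .distinct()
  .collect()
  .map(_.getString(0))

// one output column per discovered key; missing keys come back as null
withMap.select(
  col("log_type") +: col("time_stamp") +: keys.map(k => col("kv_map")(k).as(k)): _*
).show()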

Answer 2 (score: -1):

The problem here is that you have two delimiter characters: ' ' and '\t'.

A simple solution I can see is to reformat the input file so that it has only one delimiter:

import pandas as pd

with open('original_log_file.txt', 'r') as f:
    with open('new_log_file.txt', 'w') as out:
        for line in f:
            out.write(line.replace(' ', '\t'))  # all separators are now '\t'

df = pd.read_csv('new_log_file.txt', delimiter='\t')
# then fix the header and you are done

Another approach is to parse each line of the file, create a DataFrame from it, and append it to the main DataFrame.

Here is an example:

import pandas as pd

file = '''
p   2019-06-05 18:53:20 us\tc\td\tus-xx-bb\th\ti-0b1\tvl\t20190605.1833\tvt\t20190605.1833\tvs\t20190508
p   2019-06-05 18:53:20 us\tc\td\tus-xx-bb\th\ti-03a\tvl\t20190605.1833\tvt\t20190605.1833
p   2019-06-05 18:53:20 us\tc\td\tus-xx-bb\th\ti-030
'''

columns = ['log_type', 'date', 'time', 'us', 'd', 'h', 'vl', 'vt', 'vs']
df = pd.DataFrame({k: [] for k in columns})  # initial empty df

for line in file.split('\n'):
    if len(line):
        # normalize every separator to '\t', then split the line
        clean_line = line.strip().replace('   ', '\t').replace(' ', '\t').split('\t')
        # remove the redundant key names, keeping only the values
        for c in columns:
            if c in clean_line:
                clean_line.remove(c)
        clean_line = [[x] for x in clean_line]
        df = df.append(pd.DataFrame(dict(zip(columns, clean_line))), sort=True)

df = df[columns]
df.head()