I have a simple requirement: I have a DataFrame with just one string field, and the string value is very large. I just want to slice it up to pick out the information I need.
The string field in my DataFrame contains the following value:
Table(tableName:partition_chk, dbName:stage, owner:hive, createTime:1559243466, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:dw_date, type:string, comment:null)], location:hdfs://share/dev/stage/partition_chk, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{line.delim=, field.delim=,, serialization.format=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:dw_date, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1559243466}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)
What I want to get out of this value is the base location of all the partitions: "hdfs://share/dev/stage/partition_chk"
Note that I only want the quoted string above (without the "location:" prefix). Any idea how to do this in pyspark?
Thanks!
Sid
Answer (score: 1)
There are several ways to do this, but I think a regular expression is the most straightforward one. In pyspark you need the regexp_extract function to apply the regex and pull out the matching group. The regex itself is the next important piece. The following regex:
location:([a-zA-Z:\/_]*)
matches all characters in the class [a-zA-Z:/_] that follow location:. Of course you could also use something like location:([^,]*), which matches everything after location: up to the first comma, but that really depends on what your possible matches look like. Below is an example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# assumes an active SparkSession; getOrCreate() returns the existing one in the pyspark shell
spark = SparkSession.builder.getOrCreate()
l = [
( "Table(tableName:partition_chk, dbName:stage, owner:hive, createTime:1559243466, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:dw_date, type:string, comment:null)], location:hdfs://share/dev/stage/partition_chk, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{line.delim=, field.delim=,, serialization.format=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:dw_date, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1559243466}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)" , )
]
columns = ['hugeString']
df = spark.createDataFrame(l, columns)
# collect() turns the DataFrame into a Python list of Rows
# I don't know if you need this or not
# In case you want to extract it into a new column, use withColumn instead of select
df.select(F.regexp_extract('hugeString', r"location:([a-zA-Z:\/_]*)", 1).alias('match')).collect()[0]['match']
Output:
hdfs://share/dev/stage/partition_chk
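If you would rather keep the extracted value as a column instead of collecting it to the driver, here is a minimal sketch building on the df above; it uses the comma-delimited pattern mentioned earlier, and the names df_with_loc and extracted_location are just illustrative choices:

from pyspark.sql import functions as F

# Alternative pattern: grab everything after "location:" up to the first comma,
# and keep it as a new column via withColumn instead of collecting to the driver
df_with_loc = df.withColumn(
    'extracted_location',
    F.regexp_extract('hugeString', r"location:([^,]*)", 1)
)
df_with_loc.select('extracted_location').show(truncate=False)

show(truncate=False) is only there so the full HDFS path is visible; the new column can of course be used in further transformations instead.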