Substring based on 2 string identifiers in a long string

Time: 2019-06-10 19:41:35

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

I have a simple requirement: I have a DataFrame with a single string field whose value is very long. I just want to slice it up and pick out the piece of information I need.

The string field in my dataframe contains the following value -

Table(tableName:partition_chk, dbName:stage, owner:hive, createTime:1559243466, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:dw_date, type:string, comment:null)], location:hdfs://share/dev/stage/partition_chk, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{line.delim=, field.delim=,, serialization.format=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:dw_date, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1559243466}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)

What I want to get out of this value is the base location of all the partitions - "hdfs://share/dev/stage/partition_chk"

Note that I only want the quoted string above (without the "location:" prefix). Any ideas on how to pull this off in pyspark?

Thanks!

Sid

1 Answer:

Answer 0: (score: 1)

There are several ways to do this, but I think a regular expression is the most straightforward one. In pyspark you need the regexp_extract function to apply a regex and extract the matching group. The regex itself is the other important piece. The following regex:

location:([a-zA-Z:\/_]*)

matches all of the following characters:

  • lowercase letters
  • uppercase letters
  • :
  • /
  • _

after the literal location:. Of course, you could also use something like location:([^,]*), which matches everything after location: up to the first comma, but it really depends on what the possible matches can look like. A full pyspark example is shown further below.
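As a quick sanity check outside of Spark (not part of the original answer, just an illustration), you can try the same pattern with Python's built-in re module; the variable sample here is only an abbreviated snippet of the big string from the question:

import re

# Abbreviated snippet of the huge string shown in the question
sample = "..., location:hdfs://share/dev/stage/partition_chk, inputFormat:org.apache.hadoop.mapred.TextInputFormat, ..."

# Same character class as in the pyspark call below
m = re.search(r"location:([a-zA-Z:/_]*)", sample)
print(m.group(1))  # hdfs://share/dev/stage/partition_chk

And here is the full pyspark example: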

from pyspark.sql import functions as F

# Assumes an active SparkSession is available as `spark`
# (as in the pyspark shell or a spark-submit script).
l = [
    ("Table(tableName:partition_chk, dbName:stage, owner:hive, createTime:1559243466, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:dw_date, type:string, comment:null)], location:hdfs://share/dev/stage/partition_chk, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{line.delim=, field.delim=,, serialization.format=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:dw_date, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1559243466}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)",)
]

columns = ['hugeString']

df = spark.createDataFrame(l, columns)

# collect() turns the dataframe into a python list of Rows.
# I don't know if you need this or not.
# In case you want to extract it into a new column, use withColumn instead of select.
df.select(
    F.regexp_extract('hugeString', r"location:([a-zA-Z:/_]*)", 1).alias('match')
).collect()[0]['match']

Output:

hdfs://share/dev/stage/partition_chk
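If you would rather keep the extracted value as a new column instead of collecting it to the driver, a minimal sketch of the withColumn variant mentioned in the comments above could look like this (the column name location_match is just an illustrative choice, and the pattern used here is the comma-delimited alternative discussed earlier):

df_with_loc = df.withColumn(
    'location_match',
    F.regexp_extract('hugeString', r"location:([^,]*)", 1)
)
df_with_loc.select('location_match').show(truncate=False)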