Substring based on 2 string identifiers in a long string

Time: 2019-06-10 19:41:35

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

I have a simple requirement: I have a DataFrame with a single string field whose value is very long. I just want to slice it up and pick out the piece of information I need.

The string field in my dataframe contains the following value -

Table(tableName:partition_chk, dbName:stage, owner:hive, createTime:1559243466, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:dw_date, type:string, comment:null)], location:hdfs://share/dev/stage/partition_chk, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{line.delim=, field.delim=,, serialization.format=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:dw_date, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1559243466}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)

What I want to get out of this value is the base location of all the partitions - "hdfs://share/dev/stage/partition_chk"

Note that I only want the quoted string above (without the "location:" prefix). Any ideas on how to pull this off in pyspark?

Thanks!

Sid

1 Answer:

Answer 0: (score: 1)

There are several ways to do this, but I think a regular expression is the most straightforward one. In pyspark you need the regexp_extract function to apply a regex and extract the matching group. The regex itself is the other important piece. The following regex:

location:([a-zA-Z:\/_]*)

matches all of the following characters:

  • lowercase letters
  • uppercase letters
  • :
  • /
  • _

after the literal location:. Of course, you could also use something like location:([^,]*), which matches everything after location: up to the first comma, but it really depends on what the possible matches can look like. A full pyspark example is shown further below.
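As a quick sanity check outside of Spark (not part of the original answer, just an illustration), you can try the same pattern with Python's built-in re module; the variable sample here is only an abbreviated snippet of the big string from the question:

import re

# Abbreviated snippet of the huge string shown in the question
sample = "..., location:hdfs://share/dev/stage/partition_chk, inputFormat:org.apache.hadoop.mapred.TextInputFormat, ..."

# Same character class as in the pyspark call below
m = re.search(r"location:([a-zA-Z:/_]*)", sample)
print(m.group(1))  # hdfs://share/dev/stage/partition_chk

And here is the full pyspark example: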

from pyspark.sql import functions as F

# Assumes an active SparkSession is available as `spark`
# (as in the pyspark shell or a spark-submit script).
l = [
    ("Table(tableName:partition_chk, dbName:stage, owner:hive, createTime:1559243466, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:name, type:string, comment:null), FieldSchema(name:dw_date, type:string, comment:null)], location:hdfs://share/dev/stage/partition_chk, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{line.delim=, field.delim=,, serialization.format=,}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:dw_date, type:string, comment:null)], parameters:{EXTERNAL=TRUE, transient_lastDdlTime=1559243466}, viewOriginalText:null, viewExpandedText:null, tableType:EXTERNAL_TABLE)",)
]

columns = ['hugeString']

df = spark.createDataFrame(l, columns)

# collect() turns the dataframe into a python list of Rows.
# I don't know if you need this or not.
# In case you want to extract it into a new column, use withColumn instead of select.
df.select(
    F.regexp_extract('hugeString', r"location:([a-zA-Z:/_]*)", 1).alias('match')
).collect()[0]['match']

Output:

hdfs://share/dev/stage/partition_chk
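If you would rather keep the extracted value as a new column instead of collecting it to the driver, a minimal sketch of the withColumn variant mentioned in the comments above could look like this (the column name location_match is just an illustrative choice, and the pattern used here is the comma-delimited alternative discussed earlier):

df_with_loc = df.withColumn(
    'location_match',
    F.regexp_extract('hugeString', r"location:([^,]*)", 1)
)
df_with_loc.select('location_match').show(truncate=False)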