Question

I have a lot of paths (stored as strings) in Redshift:

/foo/bar/abc/keyword/<random_id>/def/ghi
/bar/abc/xyz/lmn/keyword/<another_random_id>/qwe
/bar/keyword/<another_random_id>/tsf/qft

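Note that a keyword always comes right before the randomly generated ID. What I want to do is clean these up and replace every ID with a generic string, like this: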

/foo/bar/abc/keyword/generic_string/def/ghi
/bar/abc/xyz/lmn/keyword/generic_string/qwe
/bar/keyword/generic_string/tsf/qft

I really don't care about the IDs for this particular use case. I already have something like this:

select substring(column_with_strings, 0, charindex('keyword/',column_with_strings) + 8)

which gets everything before the ID, and:

select 
substring(column_with_strings,
          len(substring(column_with_strings, 0, charindex('keyword/',column_with_strings) + 9)),
          len(column_with_strings) - len(substring(column_with_strings, 0, charindex('keyword/',column_with_strings) + 8)))

which gets everything after that.

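There must be a better way to achieve what I want. Even with the code above I'm stuck, because I don't know how to take everything after the first '/' in that remainder so that the ID is dropped.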

Any thoughts?

Edit: the ID is not numeric; it is alphanumeric and of variable length.

Answer 1

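If you are using Amazon Redshift, you can create a Python UDF for this. It is much easier to handle in Python than in SQL. The body of the function would look something like this: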

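arr = path.split('/')
# stop one short of the end so arr[i+1] always exists
for i in range(len(arr) - 1):
    if arr[i] == 'keyword':
        # overwrite the id that follows the keyword
        arr[i+1] = 'generic_string'
return '/'.join(arr)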

More about Python UDFs: Creating a Scalar UDF
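For testing outside Redshift, the same logic can be wrapped in an ordinary Python function. This is only a minimal sketch: the name `replace_id` and the `keyword` and `placeholder` parameters are hypothetical, and the `None` check is an assumed guard for missing values rather than something the answer specifies.

```python
def replace_id(path, keyword='keyword', placeholder='generic_string'):
    """Replace the path segment that follows each `keyword` segment."""
    if path is None:  # assumed guard: tolerate a missing value
        return None
    parts = path.split('/')
    # Stop one short of the end so parts[i + 1] always exists.
    for i in range(len(parts) - 1):
        if parts[i] == keyword:
            parts[i + 1] = placeholder
    return '/'.join(parts)


print(replace_id('/foo/bar/abc/keyword/a1B2c3/def/ghi'))
# /foo/bar/abc/keyword/generic_string/def/ghi
```

Because the comparison is on whole segments, a segment such as `mykeyword` is not treated as the keyword.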

Answer 2

regexp_replace is probably the simplest approach, though it is not efficient.

regexp_replace(column_with_strings, '/keyword/[^/]+', '/keyword/generic_string')
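One way to check a pattern for this task before running it on a table is Python's `re` module. The sketch below uses `/keyword/[^/]+`, which matches only the single segment after the keyword, so later segments are left alone. `clean_path` is a hypothetical helper name, and it is an assumption that Python and Redshift treat this simple pattern the same way; more complex patterns may behave differently between the two.

```python
import re

# Match '/keyword/' plus exactly one following path segment.
ID_SEGMENT = re.compile(r'/keyword/[^/]+')


def clean_path(path):
    """Replace the id after each '/keyword/' with a fixed placeholder."""
    return ID_SEGMENT.sub('/keyword/generic_string', path)


for p in ['/foo/bar/abc/keyword/a1B2c3/def/ghi', '/bar/keyword/x9/tsf/qft']:
    print(clean_path(p))
```

A greedy `.*` in place of `[^/]+` would run up to the last `/` in the string and swallow intermediate segments such as `def`.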

Cleaning strings in Redshift
