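Extracting a value that appears after a specific word in a string

Time: 2019-03-06 12:41:39

Tags: regex hive

A JSON document is passed in as a string, and I need to extract the numeric value that follows AdminSerializer for further mapping. Sample data below: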

class AdminSerializer(serializers.ModelSerializer):
    user = AdminUserSerializer() # make sure user_type is read-only in whatever serializer you specify here

    class Meta:
        model = models.Admin
        fields = ('user', 'first_name', 'last_name', 'dob', 'gender')

    def create(self, validated_data):
        user_data = validated_data.pop('user')
        user = models.User.objects.create(**user_data, user_type=constants.Constants.ADMIN)
        admin = models.Admin.objects.create(user=user, **validated_data)
        return admin

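The parameters are dynamic, so I cannot extract the value with the substr function, nor by counting a fixed number of special characters and extracting what follows.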

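3 answers:

Answer 0: (score: 0)

The JSON in your example is malformed: it contains an extra ] and trailing content after the closing }. For well-formed JSON you can use get_json_object, for example: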

select get_json_object(src_json,'$.url.content_id') from
    (
     select '{"url": {"phone": "videos/hssportint/hssport/jocaasd/6_3818e20a9e/19098311205/phone", "tv": "/mnt/c81292786e1e368e12144c302007/output/", "sample_aspect_ratio": "1:1", "subsample": 25,  "content_id": "1000231205", "encryption_enabled": false, "non_ad_time_intervals": [2330.68, 2898.36], "packager_path": "/opt/bento4"}}' as src_json 
     )s
    ;

Result:

OK
1000231205
Time taken: 21.606 seconds, Fetched: 1 row(s)
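
To illustrate why well-formedness matters, here is a minimal Python sketch (not Hive code) of the same path lookup. The helper name content_id_from_json and the shortened sample strings are assumptions made for illustration. It parses the string and walks url -> content_id, returning None when the input is not valid JSON, which is roughly analogous to get_json_object yielding NULL for an invalid JSON string.

```python
import json


def content_id_from_json(src_json):
    """Return url.content_id from a JSON string, or None if parsing or lookup fails.

    Hypothetical helper for illustration only; it is not part of Hive.
    """
    try:
        doc = json.loads(src_json)
    except json.JSONDecodeError:
        # Malformed input (for example a stray ']') ends up here.
        return None
    url = doc.get("url") if isinstance(doc, dict) else None
    if not isinstance(url, dict):
        return None
    return url.get("content_id")


# Shortened versions of the strings discussed in this thread.
well_formed = (
    '{"url": {"subsample": 25, "content_id": "1000231205", '
    '"non_ad_time_intervals": [2330.68, 2898.36]}}'
)
malformed = (
    '{"url": {"content_id": "1000231205", '
    '"non_ad_time_intervals": [2330.68, 2898.36]]}}'
)

print(content_id_from_json(well_formed))
print(content_id_from_json(malformed))
```

The second call fails to parse because of the extra ], which is the situation where a regex-based extraction becomes the more practical option.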

Answer 1: (score: 0)

You can use the regexp_extract function in Hive with a regular expression that captures only the digits of content_id.

Example:

select regexp_extract(col1,'"content_id":\\s"(\\d+)"',1) from (
select string('{"url": {"phone": "videos/hssportint/hssport/jocaasd/6_3818e20a9e/19098311205/phone", "tv": "/mnt/c81292786e1e368e12144c302007/output/", "sample_aspect_ratio": "1:1", "subsample": 25,  "content_id": "1000231205", "encryption_enabled": false, "non_ad_time_intervals": [2330.68, 2898.36]], "packager_path": "/opt/bento4"}}], "vmaf_path": "/vmaf"}')col1
)t;
+-------------+--+
|     _c0     |
+-------------+--+
| 1000231205  |
+-------------+--+
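
The pattern used in the query above can also be tried outside Hive. The following is a minimal Python sketch, not the answerer's code: the names STRICT, TOLERANT, and extract_content_id are hypothetical, and the whitespace-tolerant variant is an assumption about inputs that might omit the space after the colon. Hive needs doubled backslashes inside its string literal, whereas a Python raw string takes single ones.

```python
import re

# Same pattern as in the Hive query: exactly one whitespace character after the colon.
STRICT = re.compile(r'"content_id":\s"(\d+)"')

# Assumed variant: tolerate any amount of whitespace around the colon.
TOLERANT = re.compile(r'"content_id"\s*:\s*"(\d+)"')


def extract_content_id(text, pattern=TOLERANT):
    """Return the digits captured by group 1, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Shortened form of the malformed string from the question.
sample = (
    '{"url": {"subsample": 25,  "content_id": "1000231205", '
    '"non_ad_time_intervals": [2330.68, 2898.36]], '
    '"packager_path": "/opt/bento4"}}], "vmaf_path": "/vmaf"}'
)
compact = '{"content_id":"42"}'

print(extract_content_id(sample, STRICT))
print(extract_content_id(compact, STRICT))
print(extract_content_id(compact, TOLERANT))
```

Because the regex never parses the surrounding structure, it works on the malformed string too; the trade-off is that it depends on the exact spacing and quoting around the key.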

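Regular expression explanation:

"content_id":\\s"(\\d+)" // matches the literal "content_id": followed by one whitespace character, then captures the one or more digits inside the double quotes as group 1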

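Answer 2: (score: 0)

I found a way to do it, although an expensive one, by combining a regular expression with the split and substr functions: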

substr(split(regexp_extract(message,'content_id([^&]*)'), '"')[3],1) as content_id