Scanning .parquet files in GCS with Google Cloud Data Loss Prevention (DLP)

Time: 2017-09-01 22:41:53

Tags: google-api google-cloud-dlp

I am new to Google Cloud DLP. I ran a POST https://dlp.googleapis.com/v2beta1/inspect/operations to scan a .parquet file in a Google Cloud Storage directory, and used cloudStorageOptions to save the .csv output.

The .parquet file is 53.93 MB.

When I make the API call on the .parquet file, I get:

    "processedBytes": "102308122",
    "infoTypeStats": [
      { "infoType": { "name": "AMERICAN_BANKERS_CUSIP_ID" }, "count": "1" },
      { "infoType": { "name": "IP_ADDRESS" }, "count": "17" },
      { "infoType": { "name": "US_TOLLFREE_PHONE_NUMBER" }, "count": "148" },
      { "infoType": { "name": "EMAIL_ADDRESS" }, "count": "30" },
      { "infoType": { "name": "US_STATE" }, "count": "22" }
    ]

When I convert the .parquet file to .csv, I get a 360.58 MB file. If I then make the API call on the .csv file, I get:

    "processedBytes": "377530307",
    "infoTypeStats": [
      { "infoType": { "name": "CREDIT_CARD_NUMBER" }, "count": "56546" },
      { "infoType": { "name": "EMAIL_ADDRESS" }, "count": "372527" },
      { "infoType": { "name": "NETHERLANDS_BSN_NUMBER" }, "count": "5" },
      { "infoType": { "name": "US_TOLLFREE_PHONE_NUMBER" }, "count": "1331321" },
      { "infoType": { "name": "AUSTRALIA_TAX_FILE_NUMBER" }, "count": "52269" },
      { "infoType": { "name": "PHONE_NUMBER" }, "count": "28" },
      { "infoType": { "name": "US_DRIVERS_LICENSE_NUMBER" }, "count": "114" },
      { "infoType": { "name": "US_STATE" }, "count": "141383" },
      { "infoType": { "name": "KOREA_RRN" }, "count": "56144" }
    ],

Clearly, when I scan the .parquet file, not all infoTypes are detected compared to running the scan on the .csv file, where I confirmed that all of them were detected.
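
To make the gap between the two scans easier to see, here is a minimal Python sketch that compares the two infoTypeStats arrays quoted above. The helper names (to_counts, compare) are hypothetical and are not part of any DLP client library; the counts are copied from the two responses.

```python
import json

# infoTypeStats arrays copied from the two responses quoted in the question.
PARQUET_STATS = json.loads("""[
  {"infoType": {"name": "AMERICAN_BANKERS_CUSIP_ID"}, "count": "1"},
  {"infoType": {"name": "IP_ADDRESS"}, "count": "17"},
  {"infoType": {"name": "US_TOLLFREE_PHONE_NUMBER"}, "count": "148"},
  {"infoType": {"name": "EMAIL_ADDRESS"}, "count": "30"},
  {"infoType": {"name": "US_STATE"}, "count": "22"}
]""")

CSV_STATS = json.loads("""[
  {"infoType": {"name": "CREDIT_CARD_NUMBER"}, "count": "56546"},
  {"infoType": {"name": "EMAIL_ADDRESS"}, "count": "372527"},
  {"infoType": {"name": "NETHERLANDS_BSN_NUMBER"}, "count": "5"},
  {"infoType": {"name": "US_TOLLFREE_PHONE_NUMBER"}, "count": "1331321"},
  {"infoType": {"name": "AUSTRALIA_TAX_FILE_NUMBER"}, "count": "52269"},
  {"infoType": {"name": "PHONE_NUMBER"}, "count": "28"},
  {"infoType": {"name": "US_DRIVERS_LICENSE_NUMBER"}, "count": "114"},
  {"infoType": {"name": "US_STATE"}, "count": "141383"},
  {"infoType": {"name": "KOREA_RRN"}, "count": "56144"}
]""")


def to_counts(info_type_stats):
    """Flatten an infoTypeStats list into {infoType name: integer count}."""
    return {
        entry["infoType"]["name"]: int(entry["count"])
        for entry in info_type_stats
    }


def compare(a, b):
    """Return (names only in a, names only in b, {shared name: (count_a, count_b)})."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    shared = {name: (a[name], b[name]) for name in sorted(set(a) & set(b))}
    return only_a, only_b, shared


only_parquet, only_csv, shared = compare(
    to_counts(PARQUET_STATS), to_counts(CSV_STATS)
)

print("Only in .parquet scan:", only_parquet)
print("Only in .csv scan:", only_csv)
for name, (parquet_count, csv_count) in shared.items():
    print(f"{name}: parquet={parquet_count} csv={csv_count}")
```

On these numbers, six infoTypes appear only in the .csv scan, two appear only in the .parquet scan, and the three shared infoTypes have far higher counts in the .csv scan.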

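I could not find any documentation about compressed files such as Parquet, so I am assuming Google Cloud DLP does not offer this capability.

Any help is greatly appreciated.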

1 answer:

Answer 0: (score: 1)

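Parquet files are currently scanned as binary objects, because the system does not yet parse them intelligently. In the V2 API, the supported file types are listed here: https://cloud.google.com/dlp/docs/reference/rpc/google.privacy.dlp.v2#filetype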
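
Since Parquet objects are treated as binary, one option is to detect them before scanning and convert them to a text format such as .csv first, as the question did by hand. Below is a minimal sketch of the detection step only, assuming the file has already been downloaded locally. It relies on the fact that unencrypted Parquet files begin and end with the 4-byte magic PAR1. The function name looks_like_parquet is hypothetical, and none of this is part of the DLP API.

```python
import os

PARQUET_MAGIC = b"PAR1"  # magic bytes at both ends of an unencrypted Parquet file


def looks_like_parquet(path):
    """Heuristic: True if the file starts and ends with the Parquet magic bytes."""
    magic_len = len(PARQUET_MAGIC)
    # A file too short to hold both magic markers cannot be Parquet.
    if os.path.getsize(path) < 2 * magic_len:
        return False
    with open(path, "rb") as f:
        head = f.read(magic_len)
        f.seek(-magic_len, os.SEEK_END)
        tail = f.read(magic_len)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A file for which this returns True could then be routed to a conversion step instead of being sent to the scan directly. The check is only a heuristic and does not validate the rest of the file.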