How to efficiently read the metadata of files in a data lake

Date: 2021-06-16 15:26:44

Tags: azure apache-spark pyspark databricks azure-data-lake-gen2

I want to read the last-modified datetime of files in a data lake from a Databricks script. Ideally, it would be read efficiently as a column while reading the data from the data lake.
Thanks :)

2 answers:

Answer 0: (score: 1)

Since there is no direct way to get the modification time and date of files in a data lake, we can retrieve these details with Python code.

Here is the code:

from azure.storage.blob import BlockBlobService
from datetime import datetime

block_blob_service = BlockBlobService(account_name='account-name', account_key='account-key')
container_name = 'container-firstname'
second_container_name = 'container-Second'
#block_blob_service.create_container(container_name)
generator = block_blob_service.list_blobs(container_name, prefix="Recovery/")
report_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

for blob in generator:
    # Fetch the blob's properties once and read both size and last-modified from them
    properties = block_blob_service.get_blob_properties(container_name, blob.name).properties
    file_size = properties.content_length
    last_modified = properties.last_modified
    line = container_name + '|' + second_container_name + '|' + blob.name + '|' + str(file_size) + '|' + str(last_modified) + '|' + str(report_time)
    print(line)

For more details, refer to the SO thread that addresses a similar issue.
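
If the metadata is needed as a queryable column rather than printed text, the listing can be collected into rows and turned into a small Spark DataFrame. A minimal sketch, assuming the Databricks-provided spark session and the block_blob_service object from the code above; the column names are illustrative:

# Re-list the blobs and collect (name, size, last_modified) tuples
rows = []
for blob in block_blob_service.list_blobs(container_name, prefix="Recovery/"):
    props = block_blob_service.get_blob_properties(container_name, blob.name).properties
    rows.append((blob.name, props.content_length, str(props.last_modified)))

# Build a metadata DataFrame that can be joined or queried like any other table
metadata_df = spark.createDataFrame(rows, ["path", "size", "last_modified"])
metadata_df.show(truncate=False)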

Answer 1: (score: 1)

Regarding the question, please refer to the following code:

# Access the Hadoop FileSystem API on the JVM through the Spark gateway
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
conf = sc._jsc.hadoopConfiguration()

# Authenticate to the ADLS Gen2 account
conf.set(
  "fs.azure.account.key.<account-name>.dfs.core.windows.net",
  "<account-access-key>")

# Resolve the filesystem for the abfss path and list the files under it
path = Path('abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/')
fs = path.getFileSystem(conf)
status = fs.listStatus(path)

for file_status in status:
  print(file_status)
  # getModificationTime() returns the modification time in epoch milliseconds
  print(file_status.getModificationTime())

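To get the modification time as a column while reading the data, as the question asks, one option is to turn the listing above into a path-to-timestamp DataFrame and join it against input_file_name(). A minimal sketch, reusing the fs and path objects from the code above; the csv format is illustrative, and it assumes the path strings from getPath().toString() match what input_file_name() returns (you may need to normalize them):

from datetime import datetime
from pyspark.sql.functions import input_file_name

# Map each file's full abfss path to its modification timestamp
rows = [(f.getPath().toString(),
         str(datetime.fromtimestamp(f.getModificationTime() / 1000.0)))
        for f in fs.listStatus(path)]
times_df = spark.createDataFrame(rows, ["file_path", "last_modified"])

# Tag each data row with the file it came from, then join in the timestamp
data_df = (spark.read.format("csv")
           .load('abfss://<container-name>@<account-name>.dfs.core.windows.net/<file-path>/')
           .withColumn("file_path", input_file_name()))
result_df = data_df.join(times_df, on="file_path", how="left")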