Question

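Over the past few days I have spent a lot of time learning how to structure data science projects so that they stay simple, reusable, and Pythonic. Following this guideline, I created my_project. You can see its structure below.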

├── README.md          
├── data
│   ├── processed          <-- data files
│   └── raw                            
├── notebooks  
│   └── notebook_1
├── setup.py
│
├── settings.py            <-- settings file   
└── src                
    ├── __init__.py    
    │
    └── data           
        └── get_data.py    <-- script

I defined a function that loads data from ./data/processed. I want to use this function in other scripts and also in the Jupyter notebooks located in ./notebooks.

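import random
import pandas as pd

def data_sample(code=None):
    # The relative path is resolved against the current working directory.
    df = pd.read_parquet('../../data/processed/my_data')
    if not code:
        code = random.choice(df.code.unique())
    df = df[df.code == code].sort_values('Date')
    return df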

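Obviously this function will not work anywhere unless I run it directly from the script where it is defined. My idea was to create settings.py, where I would declare: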

from os.path import join, dirname

DATA_DIR = join(dirname(__file__), 'data', 'processed')

So now I can write:

import os
import random

import pandas as pd
from my_project import settings

def data_sample(code=None):
    file_path = os.path.join(settings.DATA_DIR, 'my_data')
    df = pd.read_parquet(file_path)
    if not code:
        code = random.choice(df.code.unique())
    df = df[df.code == code].sort_values('Date')
    return df

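Questions:

Is it common practice to refer to files this way? settings.DATA_DIR looks kind of ugly.
Is this how settings.py is supposed to be used at all? And should it be placed in this directory? I have seen it in a repo under .samr/settings.py

I understand there may be no single right answer; I am just trying to find a logical, elegant way of handling these things.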

Answer 1

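That is fine as long as you are not committing large amounts of data and you keep a clear distinction between snapshots of the uncontrolled outside world and your own derived data: (code + raw) == state. It is sometimes useful to keep raw append-only-ish and to think about a symlink step such as raw/interesting_source/2018.csv.gz -> raw_appendonly/interesting_source/2018.csv.gz.20180401T12:34:01, or some similar pattern, to establish a "use the latest" input structure.

Try to clearly separate the config settings that may need to change depending on the environment (my_project/__init__.py, config.py, settings.py, or whatever you call it); imagine swapping the filesystem for a blob store or something else. setup.py usually lives at the top level, my_project/setup.py, and anything related to runnable code (not docs; not sure about examples) goes in my_project/my_project. Define _mydir = os.path.dirname(os.path.realpath(__file__)) in one place (config.py) and rely on it everywhere else to avoid pain.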
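As a rough illustration of the "one place" idea, a config.py might look something like the sketch below. The function name resolve_data_dir, the environment variable MY_PROJECT_DATA_DIR, and the assumed layout (<repo>/my_project/config.py with data in <repo>/data) are all hypothetical; it uses pathlib's resolve() where the answer uses os.path.realpath.

```python
# config.py: a hypothetical sketch, not the answerer's actual module
import os
from pathlib import Path


def resolve_data_dir(package_file, env=None):
    """Return the data directory, honouring an optional environment override."""
    env = os.environ if env is None else env
    override = env.get("MY_PROJECT_DATA_DIR")  # hypothetical variable name
    if override:
        return Path(override)
    # Assumed layout: <repo>/my_project/config.py, data lives in <repo>/data.
    package_dir = Path(package_file).resolve().parent
    return package_dir.parent / "data"


# Inside the real config.py this would be called exactly once:
# DATA_DIR = resolve_data_dir(__file__)
```

Every other module would then import DATA_DIR from config rather than computing paths from its own location, so moving to a different storage location only touches this one file.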

Answer 2

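No, using settings.py is only a convention if you are using Django. As for referencing the data directory this way, it depends on whether you want users to be able to change this value. The way you have it set up, changing the value requires editing the settings.py file. If you want users to have a default but also be able to change it easily when they call your function, just create the base path value inline and make it a default in def data_sample(..., datadir=filepath):.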
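A minimal sketch of that default-argument idea is shown here. DEFAULT_DATA_DIR and data_path are hypothetical names, and the example only builds paths so that it runs without pandas or any data files.

```python
from pathlib import Path

# Assumed default location; a hypothetical name, not taken from the question.
DEFAULT_DATA_DIR = Path("data") / "processed"


def data_path(filename, datadir=DEFAULT_DATA_DIR):
    """Build the path to *filename*, letting callers override the directory."""
    return Path(datadir) / filename


# Default location versus a caller-supplied one.
print(data_path("my_data"))
print(data_path("my_data", datadir="/mnt/shared/processed"))
```

data_sample could take the same datadir keyword and pass data_path('my_data', datadir) to pd.read_parquet, so callers who are happy with the default never have to think about it.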

Answer 3

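I maintain an economics data project based on the DataDriven Cookiecutter, which I think is a great template.

Separating the data folders from the code seems like an advantage to me: it lets you treat your work as a directed flow of transformations (a 'DAG'), starting from immutable initial data and ending with the final results.

Initially I looked at pkg_resources, but I decided against it (the syntax is long and I did not understand packaging well enough) in favour of my own helper functions/classes that navigate the directory tree.

Essentially, the helpers do two things:

1. Persist the project root folder and other constant paths: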

from pathlib import Path

# shorter version
ROOT = Path(__file__).parents[3]

# longer version
def find_repo_root():
    """Returns root folder for repository.
    Current file is assumed to be:
        <repo_root>/src/kep/helper/<this file>.py
    """
    levels_up = 3
    return Path(__file__).parents[levels_up]

ROOT = find_repo_root()
DATA_FOLDER = ROOT / 'data'
UNPACK_RAR_EXE = str(ROOT / 'bin' / 'UnRAR.exe')
XL_PATH = str(ROOT / 'output' / 'kep.xlsx')

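This is similar to what you do with DATA_DIR. A possible weak point is that I hardcode the location of the helper file relative to the project root by hand. If the helper file is moved, this has to be adjusted. But hey, this is the same way it is done in Django.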
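One possible way around the hardcoded levels_up, which is not what this project does, is to walk up the directory tree until a marker file is found. The sketch below assumes that setup.py sits in the repository root; find_root_by_marker is a hypothetical name.

```python
from pathlib import Path


def find_root_by_marker(start, marker="setup.py"):
    """Return the nearest ancestor folder of *start* that contains *marker*."""
    start = Path(start).resolve()
    # If *start* is a file, begin the search in its parent folder.
    candidates = [start, *start.parents] if start.is_dir() else list(start.parents)
    for folder in candidates:
        if (folder / marker).exists():
            return folder
    raise FileNotFoundError(f"no {marker!r} found above {start}")


# In a helper module this would be: ROOT = find_root_by_marker(__file__)
```

The trade-off is that the result now depends on the marker file staying in the root, but the helper module itself can be moved freely.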

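2. Allow access to specific data in the raw, interim and processed folders.

This can be a simple function that returns the full path for a file name in a folder, for example: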

def interim(filename):
    """Return path for *filename* in 'data/interim folder'."""
    return str(ROOT / 'data' / 'interim' / filename)

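In my project I have year-month subfolders in the interim and processed directories, and I address data by year, month and sometimes frequency. For this data structure I have InterimCSV and ProcessedCSV classes that give references to specific paths, like: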

from .helper import ProcessedCSV, InterimCSV
# somewhere in code
csv_text = InterimCSV(self.year, self.month).text()
# later in code
path = ProcessedCSV(2018, 4).path(freq='q')

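The code for the helper is here. Additionally, the classes create the subfolders if they do not exist (I want this for tests that run in a temp directory), and there are methods for checking that files exist and for reading their contents.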
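To give a feel for what such classes might look like, here is a minimal sketch. It is not the linked helper code: the base class _MonthlyFolder, the root argument, the data/<stage>/<year>/<month> layout and the file names are all assumptions made for illustration; only the class names and the text() and path(freq=...) calls come from the snippet above.

```python
from pathlib import Path


class _MonthlyFolder:
    """Folder data/<stage>/<year>/<month> under a project root (assumed layout)."""

    stage = ""  # set by subclasses

    def __init__(self, year, month, root="."):
        self.folder = Path(root) / "data" / self.stage / str(year) / f"{month:02d}"
        # Create missing subfolders so the helper also works in an empty temp dir.
        self.folder.mkdir(parents=True, exist_ok=True)


class InterimCSV(_MonthlyFolder):
    stage = "interim"

    def path(self):
        return self.folder / "tab.csv"  # hypothetical file name

    def exists(self):
        return self.path().exists()

    def text(self):
        return self.path().read_text(encoding="utf-8")


class ProcessedCSV(_MonthlyFolder):
    stage = "processed"

    def path(self, freq):
        return self.folder / f"df{freq}.csv"  # hypothetical naming scheme
```

Passing root explicitly is what makes this easy to point at a temporary directory in tests; in the project itself the default would presumably be the ROOT constant defined earlier.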

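In your example you can easily fix the root directory in settings.py, but I think you can go a step further in abstracting your data.

Currently data_sample() mixes file access and data transformation, which is not a good sign, and it also relies on a global name, another bad sign for a function. I suggest you consider the following: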

import os
import random

import pandas as pd

# keep this in settings.py, next to DATA_DIR
def processed(filename):
    return os.path.join(DATA_DIR, filename)

# this works on a dataframe - your argument is a dataframe,
# and you return a dataframe
def transform_sample(df: pd.DataFrame, code=None) -> pd.DataFrame:
    # FIXME: what is `code`?
    if not code:
        code = random.choice(df.code.unique())
    return df[df.code == code].sort_values('Date')

# make a small but elegant pipeline of data transformation
file_path = processed('my_data')
df0 = pd.read_parquet(file_path)
df = transform_sample(df0)

Answer 4

You can open the file with open(), store the result in a variable, and keep using that variable wherever you need to refer to the file.

with open('Test.txt', 'r') as f:
    contents = f.read()

or

f = open('Test.txt', 'r')

and then use f to refer to the file. If you want the file to be both readable and writable, use 'r+' instead of 'r'.

Elegant way to reference files in a data science project

4 answers: