Question

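I'm having trouble modeling a data structure in Haskell. Suppose I run an animal research facility and want to keep track of my rats. I want to track the assignment of rats to cages and to experiments. I also want to track the weights of my rats and the volumes of my cages, and keep notes on my experiments.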

In SQL, I would probably do something like this:

create table cages (id integer primary key, volume double);
create table experiments (id integer primary key, notes text);
create table rats (
    id integer primary key,
    weight double,
    cage_id integer references cages (id),
    experiment_id integer references experiments (id)
);

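(I realize this allows me to assign two rats from different experiments to the same cage. That is intentional. I don't actually run an animal research facility.)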

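Two operations must be possible: (1) given a rat, find the volume of its cage, and (2) given a rat, get the notes for the experiment it belongs to.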

In SQL, those would be

select cages.volume from rats
  inner join cages on cages.id = rats.cage_id
  where rats.id = ...; -- (1)
select experiments.notes from rats
  inner join experiments on experiments.id = rats.experiment_id
  where rats.id = ...; -- (2)

How might I model this data structure in Haskell?

One way to do it is

type Weight = Double
type Volume = Double

data Rat = Rat Cage Experiment Weight
data Cage = Cage Volume
data Experiment = Experiment String

data ResearchFacility = ResearchFacility [Rat]

ratCageVolume :: Rat -> Volume
ratCageVolume (Rat (Cage volume) _ _) = volume

ratExperimentNotes :: Rat -> String
ratExperimentNotes (Rat _ (Experiment notes) _) = notes

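But wouldn't this structure introduce a bunch of copies of the Cages and Experiments? Or should I just not worry about it and hope the optimizer takes care of it?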

Answer 1

Here is a short file I used for testing:

type Weight = Double
type Volume = Double

data Rat = Rat Cage Experiment Weight deriving (Eq, Ord, Show, Read)
data Cage = Cage Volume               deriving (Eq, Ord, Show, Read)
data Experiment = Experiment String   deriving (Eq, Ord, Show, Read)

volume     = 30
name       = "foo"
weight     = 15
cage       = Cage volume
experiment = Experiment name
rat        = Rat cage experiment weight

Then I fired up ghci and imported System.Vacuum.Cairo, which is available from the delightful vacuum-cairo package.

*Main System.Vacuum.Cairo> view (rat, Rat (Cage 30) (Experiment "foo") 15)

not-shared

*Main System.Vacuum.Cairo> view (rat, Rat (Cage 30) experiment 15)

shared-experiment

(I'm not sure why the arrows are doubled in this one, but you can ignore/collapse them.)

*Main System.Vacuum.Cairo> view (rat, Rat cage experiment weight)

shared-args

*Main System.Vacuum.Cairo> view (rat, rat)

shared-all

*Main System.Vacuum.Cairo> view (rat, Rat cage experiment (weight+1))

shared-modified

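As demonstrated above, the rule of thumb is that a new object is created exactly when you call a constructor; if you merely name an object that already exists, no new object is created. This sharing is safe in Haskell because values are immutable.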
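To make the rule of thumb concrete, here is a minimal sketch that reuses the types from the test file. The names sharedCage, sharedExperiment, colony, and unsharedColony are made up for illustration, and the sharing described in the comments is what GHC typically does (as observed in ghci), not something the language definition promises.

```haskell
type Weight = Double
type Volume = Double

data Cage = Cage Volume               deriving (Eq, Ord, Show, Read)
data Experiment = Experiment String   deriving (Eq, Ord, Show, Read)
data Rat = Rat Cage Experiment Weight deriving (Eq, Ord, Show, Read)

-- Each constructor is applied once here, so there is a single Cage
-- and a single Experiment value to refer to.
sharedCage :: Cage
sharedCage = Cage 30

sharedExperiment :: Experiment
sharedExperiment = Experiment "foo"

-- Every Rat below just names the existing Cage and Experiment;
-- only the Rat cells themselves (and their weights) are new.
colony :: [Rat]
colony = [Rat sharedCage sharedExperiment w | w <- [15, 16, 17]]

-- Here the Cage and Experiment constructors are called inside the
-- comprehension, so each Rat may get its own copy. (An optimizing
-- compiler is free to share these as well.)
unsharedColony :: [Rat]
unsharedColony = [Rat (Cage 30) (Experiment "foo") w | w <- [15, 16, 17]]
```

Both lists compare equal with (==); the difference is only in how much memory they may occupy.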

Answer 2

A more natural Haskell representation of your model would be for the cages to contain the actual rat objects rather than their ids:

data Rat = Rat RatId Weight
data Cage = Cage [Rat] Volume
data Experiment = Experiment [Rat] String

You would then create ResearchFacility objects with a smart constructor to ensure that they follow the rules. It might look like:

research_facility :: [Rat] -> Map Rat Cage -> Map Rat Experiment -> ResearchFacility
research_facility rats cage_assign experiment_assign = ...

where cage_assign and experiment_assign are maps containing the same information as the cage_id and experiment_id foreign keys in SQL.
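One possible shape for such a smart constructor is sketched below. It is not the answer's actual implementation: to keep it short it handles cages only, keys the assignment map by RatId instead of by Rat, and introduces a hypothetical CageId type plus a map of known cage volumes. Experiments would follow the same pattern.

```haskell
import qualified Data.Map as Map
import Data.Map (Map)

type RatId  = Int
type CageId = Int      -- hypothetical: not part of the answer's types
type Weight = Double
type Volume = Double

data Rat  = Rat RatId Weight  deriving (Eq, Show)
data Cage = Cage [Rat] Volume deriving (Eq, Show)

-- Cages only; an [Experiment] field would be added the same way.
newtype ResearchFacility = ResearchFacility [Cage] deriving (Eq, Show)

ratId :: Rat -> RatId
ratId (Rat i _) = i

-- Smart constructor: succeeds only if every rat is assigned to a
-- known cage, and otherwise reports the first problem it finds.
researchFacility
  :: [Rat]
  -> Map CageId Volume   -- known cages and their volumes
  -> Map RatId CageId    -- plays the role of the cage_id foreign key
  -> Either String ResearchFacility
researchFacility rats cageVolumes cageAssign = do
  placed <- mapM place rats
  -- Group rats by cage, preserving their input order.
  let byCage = Map.fromListWith (flip (++)) [(c, [r]) | (c, r) <- placed]
  pure (ResearchFacility
          [ Cage (Map.findWithDefault [] c byCage) v
          | (c, v) <- Map.toList cageVolumes ])
  where
    place rat = case Map.lookup (ratId rat) cageAssign of
      Nothing -> Left ("rat " ++ show (ratId rat) ++ " has no cage")
      Just c
        | c `Map.member` cageVolumes -> Right (c, rat)
        | otherwise ->
            Left ("rat " ++ show (ratId rat) ++ " is in unknown cage " ++ show c)
```

Returning Either rather than a bare ResearchFacility is a design choice of this sketch; it lets the caller see why a set of assignments was rejected.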

Answer 3

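A first observation: you should learn to use records. Record field names in Haskell are treated as functions, so these definitions will at the very least save you some typing: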

data Rat = Rat { getCage       :: Cage
               , getExperiment :: Experiment
               , getWeight     :: Weight }

data Cage = Cage { getVolume :: Volume }

-- Now this function is so trivial to define that you might actually not bother:
ratCageVolume :: Rat -> Volume
ratCageVolume = getVolume . getCage

As for the data representation, I would probably go along these lines:

type Weight = Double
type Volume = Double

-- Rats and Cages have identity that goes beyond their properties;
-- two distinct rats of the same weight can be in the same cage, and
-- two cages can have same volume.
-- 
-- So should we give each Rat and Cage an additional field to
-- represent its key?  We could do that, or we could abstract that out
-- into this:

data Identity i a = Identity { getId  :: i
                             , getVal :: a }
            deriving Show

instance Eq i => Eq (Identity i a) where
    a == b = getId a == getId b

instance Ord i => Ord (Identity i a) where
    a `compare` b = getId a `compare` getId b


-- And to simplify a common case:
type Id a = Identity Int a


-- Rats' only real intrinsic property is their weight.  Cage and Experiment?
-- Situational, I say.
data Rat = Rat { getWeight :: Weight  }

data Cage = Cage { getVolume :: Volume }

data Experiment = Experiment { getNotes :: String }
                  deriving (Eq, Show)

-- The data that you're manipulating is really this:
type RatData = (Id Rat, Id Cage, Id Experiment)

type ResearchFacility = [RatData]
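With this representation, the two operations from the question become lookups over the list of RatData rows. The following is a minimal sketch: the types are repeated so the block stands alone, and lookupRat, ratCageVolume, and ratExperimentNotes are names assumed here for illustration. The linear scan is fine for a small list; a Map keyed by rat id would be the obvious next step.

```haskell
import Data.List (find)

type Weight = Double
type Volume = Double

data Identity i a = Identity { getId :: i, getVal :: a } deriving Show
type Id a = Identity Int a

data Rat = Rat { getWeight :: Weight } deriving Show
data Cage = Cage { getVolume :: Volume } deriving Show
data Experiment = Experiment { getNotes :: String } deriving (Eq, Show)

type RatData = (Id Rat, Id Cage, Id Experiment)
type ResearchFacility = [RatData]

-- Find the row whose rat has the given key (linear scan).
lookupRat :: Int -> ResearchFacility -> Maybe RatData
lookupRat key = find (\(rat, _, _) -> getId rat == key)

-- (1) Given a rat's key, find the volume of its cage.
ratCageVolume :: Int -> ResearchFacility -> Maybe Volume
ratCageVolume key facility = do
  (_, cage, _) <- lookupRat key facility
  pure (getVolume (getVal cage))

-- (2) Given a rat's key, get the notes of its experiment.
ratExperimentNotes :: Int -> ResearchFacility -> Maybe String
ratExperimentNotes key facility = do
  (_, _, experiment) <- lookupRat key facility
  pure (getNotes (getVal experiment))
```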

Answer 4

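I use Haskell most of the time in my day job and have run into this problem. In my experience, the number of copies of a data structure that get created is not the real issue; the data dependencies involved matter more. We use similar data structures to interface with a relational database that stores the actual data. This means we have queries like this: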

getCageById       :: IdType -> IO (Maybe Cage)
getRatById        :: IdType -> IO (Maybe Rat)
getExperimentById :: IdType -> IO (Maybe Experiment)

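We started out building data structures that contained the data structures they link to. This turned out to be a huge mistake. The problem is that if you use the following definition for Rat...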

data Rat = Rat Cage Experiment Weight

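...then the getRatById function has to run three database queries to return its result. It looked like a nice, convenient way of doing things at first, but it ended up being a huge performance problem, especially when we wanted a query to return a bunch of results. The data structure forces us to do the joins even when we only want the rows from the rat table. The extra database queries are the problem, not the possibility of extra objects in RAM.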

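Our policy now is that when we make data structures that correspond to database tables, we always keep them as flat as the tables themselves, storing references as ids rather than as nested structures. So your example would become something like this: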

type IdType = Int
type Weight = Double
type Volume = Double

data Rat = Rat
    { ratId        :: IdType
    , cageId       :: IdType
    , experimentId :: IdType
    , weight       :: Weight
    }
data Cage = Cage IdType Volume
data Experiment = Experiment IdType String

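(You might even want to use newtypes to distinguish the different ids.) It takes more work to get the whole structure, but it lets you efficiently fetch just parts of it. Of course, if you never need to fetch individual pieces of the structure, my advice may not apply. But in my experience partial queries are common, and I don't want to make them artificially slow. If you want the convenience of a function that does the joins for you, you can certainly write one. Just don't use a data model that locks you into that usage pattern.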
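As a rough illustration of both suggestions, here is a sketch that wraps each kind of id in its own newtype and writes the "join" as an ordinary function. The Cages and Experiments maps are hypothetical in-memory stand-ins for the database tables; in the setup described above these lookups would be IO queries such as getCageById.

```haskell
import qualified Data.Map as Map
import Data.Map (Map)

type Weight = Double
type Volume = Double

-- Distinct newtypes so a cage id cannot be passed where a rat id is expected.
newtype RatId        = RatId Int        deriving (Eq, Ord, Show)
newtype CageId       = CageId Int       deriving (Eq, Ord, Show)
newtype ExperimentId = ExperimentId Int deriving (Eq, Ord, Show)

data Rat = Rat
    { ratId        :: RatId
    , cageId       :: CageId
    , experimentId :: ExperimentId
    , weight       :: Weight
    } deriving Show
data Cage = Cage CageId Volume deriving Show
data Experiment = Experiment ExperimentId String deriving Show

-- Hypothetical stand-ins for the cages and experiments tables.
type Cages       = Map CageId Cage
type Experiments = Map ExperimentId Experiment

-- The join is performed only when a caller asks for it.
ratCageVolume :: Cages -> Rat -> Maybe Volume
ratCageVolume cages rat = do
  Cage _ volume <- Map.lookup (cageId rat) cages
  pure volume

ratExperimentNotes :: Experiments -> Rat -> Maybe String
ratExperimentNotes experiments rat = do
  Experiment _ notes <- Map.lookup (experimentId rat) experiments
  pure notes
```

Code that only needs a rat's weight never touches the cage or experiment data, which is the point of keeping the structures flat.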

Advice on defining data structures in Haskell
