Question

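I'm trying to find a way to develop groups of things based on data points that relate to the individual "things" but also to each other.

For example, say I'm trying to group the junk mail I receive at home. When I get a letter, I record:

Postmark city and state
Return address street, business name, city, state and ZIP code
Telephone number for the business
Size and color of the envelope
Font used to print the envelope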

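Let's also say that over time I see the same return address and/or phone number show up with different business names. I can infer that all of those letters were probably sent by the same company, and I can associate the address, phone number and business names as one "entity".

Or I see completely different addresses and phone numbers, but the envelope size, color, postmark and font are all the same. I can infer (with less certainty) that these also probably come from the same business.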

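What I'm looking for is the best way to take this kind of data and group it into "buckets" (entities) based on overlapping data, using SQL Server, Analysis Services, or some combination of the two, so that I end up with a way to enter a single data point and see whether it's related to any other data (for example, enter a phone number from one letter to see the entity or group of letters it's related to).

Can someone point me in the right direction?

Thanks in advance!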

Answer 1

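For each piece of data I'd ask "how unique is this?" and then start breaking it down logically from there...

State: a low-uniqueness data set with a high chance of repetition. Create a [State] table with an identity column and, if possible, pre-populate it with all possible values to reduce fragmentation of the nonclustered index.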

Create  Table [dbo].[State] (StateID Int Identity, StateName Varchar(32))
Create  Unique Clustered Index ix_stateID On [dbo].[State] (StateID)
Create  Unique NonClustered Index ix_SN On [dbo].[State] (StateName)

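ZipCode: a medium-uniqueness data set with a good chance of duplication, where each ZIP code is tied to a single state. Again, pre-populating this table may help avoid progressive index fragmentation, but depending on how fast you expect it to grow, it may be fine to let it fill up as you go and reindex periodically. If you're only tracking US addresses, pre-populating just the first five digits would be fine (you could change the ZipCode column to Int if you do that, but note that Int drops leading zeros, so Char(5) is safer for ZIP codes such as 02134).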

Create  Table [dbo].[ZipCode] (ZipCodeID Int Identity, StateID Int, ZipCode Varchar(16))
Create  Unique Clustered Index ix_zipcodeID On [dbo].[ZipCode] (ZipCodeID)
Create  Unique NonClustered Index ix_stateID_ZC On [dbo].[ZipCode] (StateID, ZipCode)
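
As a rough sketch of the "reindex periodically" option, something like the following could be run from time to time. It assumes the [dbo].[ZipCode] table and ix_stateID_ZC index defined above and that the current database is the one holding them; how much fragmentation justifies a rebuild is a judgment call for your workload.

```sql
-- Report fragmentation for every index on [dbo].[ZipCode].
SELECT i.name AS IndexName,
       ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.ZipCode'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id;

-- If the nonclustered index looks badly fragmented, rebuild it.
-- (ALTER INDEX ... REORGANIZE is the lighter-weight alternative.)
ALTER INDEX ix_stateID_ZC ON [dbo].[ZipCode] REBUILD;
```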

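City: this table holds a fairly large data set, but there is still plenty of chance for duplication, so we'll create an identity value again. This time, though, I definitely would not pre-populate.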

Create  Table [dbo].[City] (CityID Int Identity, ZipCodeID Int, CityName Varchar(64))
Create  Unique Clustered Index ix_cityID On [dbo].[City] (CityID)
Create  Unique NonClustered Index ix_zipcodeID_C On [dbo].[City] (ZipCodeID, CityName)

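StreetAddress: this is about as unique as the address data gets, but we still want an ID column because we may receive a lot of mail from the same address.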

Create  Table [dbo].[StreetAddress] (StreetAddressID Int Identity, CityID Int, StreetAddress Varchar(256))
Create  Unique Clustered Index ix_streetaddressID On [dbo].[StreetAddress] (StreetAddressID)
Create  Unique NonClustered Index ix_cityID_SA On [dbo].[StreetAddress] (CityID, StreetAddress)

For the phone number, I'd probably break it down into [AreaCode] and [PhoneNumber]...

Create  Table [dbo].[AreaCode] (AreaCodeID Int Identity, AreaCode Int)
Create  Unique Clustered Index ix_areacodeID On [dbo].[AreaCode] (AreaCodeID)
Create  Unique NonClustered Index ix_AC On [dbo].[AreaCode] (AreaCode)

Create  Table [dbo].[PhoneNumber] (PhoneNumberID Int Identity, AreaCodeID Int, PhoneNumber Int)
Create  Unique Clustered Index ix_phonenumberID On [dbo].[PhoneNumber] (PhoneNumberID)
Create  Unique NonClustered Index ix_acID_PN On [dbo].[PhoneNumber] (AreaCodeID, PhoneNumber)

Then I'd create single-depth lookup tables (size, color, font, etc.), each following this pattern:

Create  Table [dbo].[Characteristic] (CharacteristicID Int Identity, Characteristic AppropriateDataType)
Create  Unique Clustered Index ix_characteristicID On [dbo].[Characteristic] (CharacteristicID)
Create  Unique NonClustered Index ix_abrevCharact On [dbo].[Characteristic] (Characteristic)

And finally you have your most unique item, the letter itself...

Create  Table [dbo].[Letter] (LetterID Int Identity, Received DateTime, StreetAddressID Int, PhoneNumberID Int, CharacteristicIDs ...)
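
With letters stored this way, the "same entity" grouping from the question could be sketched as simple label propagation: every letter starts as its own entity, and labels are merged until letters that share an address or a phone number all carry the same one. This is only an illustration, not a tuned solution. The #Entity temp table and the EntityID column are hypothetical names introduced here, and only the StreetAddressID and PhoneNumberID columns of [dbo].[Letter] are used as overlap signals.

```sql
-- Hypothetical working table: one row per letter, labelled with its own ID to start.
CREATE TABLE #Entity (
    LetterID        Int PRIMARY KEY,
    StreetAddressID Int,
    PhoneNumberID   Int,
    EntityID        Int
);

INSERT INTO #Entity (LetterID, StreetAddressID, PhoneNumberID, EntityID)
SELECT LetterID, StreetAddressID, PhoneNumberID, LetterID
FROM [dbo].[Letter];

DECLARE @Changed Int = 1;

-- Repeat until no label changes: each pass lowers a letter's label to the
-- smallest label among letters sharing its address or its phone number.
WHILE @Changed > 0
BEGIN
    UPDATE e
    SET    EntityID = m.MinEntityID
    FROM   #Entity AS e
    CROSS APPLY (
        SELECT MIN(o.EntityID) AS MinEntityID
        FROM   #Entity AS o
        WHERE  o.StreetAddressID = e.StreetAddressID
           OR  o.PhoneNumberID   = e.PhoneNumberID
    ) AS m
    WHERE  m.MinEntityID < e.EntityID;

    SET @Changed = @@ROWCOUNT;
END;

-- One row per inferred entity, with the number of letters assigned to it.
SELECT EntityID, COUNT(*) AS LetterCount
FROM   #Entity
GROUP BY EntityID;
```

Weaker signals such as envelope size, color and font could be folded in as extra OR conditions, but since the question treats them as less certain, it may be better to keep them as a separate, lower-confidence pass.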

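Based on the queries you run most often, work out which indexes make sense on the [dbo].[Letter] table. After that, efficient lookups should be as simple as writing a proper query with the necessary joins and logic. :)

That's my 2 cents.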
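
As an example of such a query, here is a sketch of the lookup described in the question: enter a phone number from one letter and see every letter (and return address) recorded with it. It assumes the tables defined above; the @AreaCode and @PhoneNumber values are placeholders.

```sql
-- Placeholder input: the area code and local number taken from one letter.
DECLARE @AreaCode Int = 555, @PhoneNumber Int = 5550123;

SELECT l.LetterID,
       l.Received,
       sa.StreetAddress,
       c.CityName,
       z.ZipCode,
       s.StateName
FROM   [dbo].[AreaCode]      AS ac
JOIN   [dbo].[PhoneNumber]   AS pn ON pn.AreaCodeID      = ac.AreaCodeID
JOIN   [dbo].[Letter]        AS l  ON l.PhoneNumberID    = pn.PhoneNumberID
JOIN   [dbo].[StreetAddress] AS sa ON sa.StreetAddressID = l.StreetAddressID
JOIN   [dbo].[City]          AS c  ON c.CityID           = sa.CityID
JOIN   [dbo].[ZipCode]       AS z  ON z.ZipCodeID        = c.ZipCodeID
JOIN   [dbo].[State]         AS s  ON s.StateID          = z.StateID
WHERE  ac.AreaCode    = @AreaCode
  AND  pn.PhoneNumber = @PhoneNumber
ORDER BY l.Received;
```

If this is a common query, a nonclustered index on [dbo].[Letter] (PhoneNumberID) would be a natural candidate.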
