I have a dataset of hotel reviews. Each file in the dataset is for a different hotel. I have been asked to "Write down the relation you identify in the dataset. Ensure you include data types and the primary key." Here is an example file from my dataset:
<Overall Rating>4
<Avg. Price>$173
<URL>http://...
<Author>everywhereman2
<Content>Old seattle getaway...
<Date>Jan 6, 2009
<img src="http://cdn.tripadvisor.com/img2/new.gif" alt="New"/>
<No. Reader>-1
<No. Helpful>-1
<Overall>5
<Value>5
<Rooms>5
<Location>5
<Cleanliness>5
<Check in / front desk>5
<Service>5
<Business service>5
<Author>RW53
<Content>Location! Location? view from room of nearby freeway
<Date>Dec 26, 2008
<No. Reader>-1
<No. Helpful>-1
<Overall>3
<Value>4
<Rooms>3
<Location>2
<Cleanliness>4
<Check in / front desk>3
<Service>-1
<Business service>-1
...next review, etc.
The section from Author to Business service (lines 5 to 18) represents one review of the hotel. The file then continues for however many reviews that hotel has, repeating lines 5 through 18. I hope that makes sense. Here is what I think the relation is:
HotelReview(String: Author, String: Content, Date: Date, String: img src, Int: No. Reader, Int: No. Helpful, Int: Overall, Int: Value, Int: Rooms, Int: Location, Int: Cleanliness, Int: Checkin / front desk, Int: Service, Int: Business Service)
or would it be
HotelReview(Int: Overall Rating, Int: Avg. Price, String: URL)
I may be way off, as I am new to this. I appreciate any help. Thanks.
Answer 0 (score: 1)
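I may not be the best person to answer this, but I'll give it a try.
First, you may want to look up some books or blogs on database schema design. That should give you general guidance on how to approach this task.
Then, based on the data shown, you can identify two entities: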
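Hotel (from the header section)
with the attributes:
Overall Rating, Avg. Price, URL
(The first two may actually be values derived (calculated) from other sources, but as shown they look like genuine attribute values.)
URL could serve as the primary key, provided the data contains no more suitable value that the sample does not show.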
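As a rough sketch of the Hotel relation with data types, assuming every file starts with header lines shaped like the sample in the question and that the rating and price are whole numbers: the class name `Hotel`, its field names, and the `parse_hotel_header` helper are made up for illustration and are not part of the dataset.

```python
import re
from dataclasses import dataclass

# Matches one "<Tag>value" line, e.g. "<Avg. Price>$173".
TAG_LINE = re.compile(r"^<([^>]+)>(.*)$")


@dataclass(frozen=True)
class Hotel:
    url: str             # candidate primary key
    overall_rating: int  # assumed to be an integer, as in the sample
    avg_price: int       # assumed whole dollars, "$" stripped


def parse_hotel_header(lines):
    """Build a Hotel from the header lines of one file (hypothetical helper)."""
    fields = {}
    for line in lines:
        match = TAG_LINE.match(line.strip())
        if match:
            fields[match.group(1)] = match.group(2)
    return Hotel(
        url=fields["URL"],
        overall_rating=int(fields["Overall Rating"]),
        avg_price=int(fields["Avg. Price"].lstrip("$").replace(",", "")),
    )


# Header lines copied from the sample file; the URL is elided there too.
header = ["<Overall Rating>4", "<Avg. Price>$173", "<URL>http://..."]
hotel = parse_hotel_header(header)
print(hotel)
```

If the real files contain fractional ratings or prices, the two `int` fields would need to become `float` or a decimal type.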
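Hotel review (from the repeating section)
with the attributes of the repeating block (Author through Business service).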
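For the data given, this relation has no "good" primary key. All you have is Author and Date as a composite key. But using that means an author can submit only one review per day. If that is a reasonable restriction, go with it. Otherwise you need to introduce further attributes (e.g. the time of the review) to get past this limitation, or simply introduce a serial number for reviews, so that each review is uniquely identified and the number can serve as the primary key.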
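The trade-off between the two key choices can be tried out with a small sketch using Python's built-in `sqlite3` module. The table and column names are hypothetical, the rating columns are left out for brevity, and the hotel's URL is added to the composite key on the assumption that reviews from all files end up in one table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores REFERENCES otherwise
conn.executescript("""
CREATE TABLE hotel (
    url            TEXT PRIMARY KEY,
    overall_rating INTEGER,
    avg_price      INTEGER
);

-- Option 1: composite natural key (one review per author, hotel and day).
CREATE TABLE review_composite (
    hotel_url TEXT NOT NULL REFERENCES hotel(url),
    author    TEXT NOT NULL,
    date      TEXT NOT NULL,   -- ISO 8601 text, e.g. 2009-01-06
    content   TEXT,
    PRIMARY KEY (hotel_url, author, date)
);

-- Option 2: surrogate serial number as the primary key.
CREATE TABLE review_serial (
    review_id INTEGER PRIMARY KEY,  -- assigned by SQLite on insert
    hotel_url TEXT NOT NULL REFERENCES hotel(url),
    author    TEXT NOT NULL,
    date      TEXT NOT NULL,
    content   TEXT
);
""")

conn.execute("INSERT INTO hotel VALUES (?, ?, ?)", ("http://...", 4, 173))
key = ("http://...", "everywhereman2", "2009-01-06")

# Composite key: a second review by the same author on the same day fails.
insert_composite = "INSERT INTO review_composite VALUES (?, ?, ?, ?)"
conn.execute(insert_composite, key + ("first",))
try:
    conn.execute(insert_composite, key + ("second",))
    composite_rejected = False
except sqlite3.IntegrityError:
    composite_rejected = True

# Serial number: both reviews are stored and get distinct ids.
insert_serial = (
    "INSERT INTO review_serial (hotel_url, author, date, content) "
    "VALUES (?, ?, ?, ?)"
)
conn.execute(insert_serial, key + ("first",))
conn.execute(insert_serial, key + ("second",))
serial_count = conn.execute("SELECT COUNT(*) FROM review_serial").fetchone()[0]

print(composite_rejected, serial_count)
```

The composite table rejects the duplicate while the serial-number table keeps both rows, which is exactly the restriction described above.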
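As for specifying types, you should of course stick to the types your type system provides. If no type system is given, state the most precise type you can and supply a list of the types with their semantics. For example, you could use a type Score to indicate an integer value from -1 to 10, where -1 means "intentionally no value" and the other values are possible scores, with 10 being the best (the sample only shows values up to 5, so adjust the upper bound to the real scale). Then use that type for the different categories.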
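One way to make such a Score type concrete is a small validating parser. This is only a sketch: `parse_score`, `NO_VALUE`, and `MAX_SCORE` are illustrative names, and the upper bound of 10 is the one suggested above, not something the sample data confirms.

```python
from typing import Optional

NO_VALUE = -1   # sentinel used in the files for "intentionally no value"
MAX_SCORE = 10  # assumed upper bound; adjust to the dataset's real scale


def parse_score(raw: str) -> Optional[int]:
    """Convert a raw score field to an int, or None for the -1 sentinel."""
    value = int(raw)
    if value == NO_VALUE:
        return None
    if not 0 <= value <= MAX_SCORE:
        raise ValueError(f"score out of range: {value}")
    return value


# Category values taken from the second review in the sample file.
raw_review = {"Overall": "3", "Value": "4", "Service": "-1", "Business service": "-1"}
scores = {name: parse_score(value) for name, value in raw_review.items()}
print(scores)
```

Mapping -1 to `None` keeps the "no value" case out of any later averages, while the same function can be reused for every rating category.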