What is the Relation in my Dataset of Hotel Reviews?

Time: 2016-04-04 17:55:36

Tags: dataset relation relational-algebra

I have a dataset of hotel reviews. Each file in the dataset is for a different hotel. I have been asked to "Write down the relation you identify in the dataset. Ensure you include data types and the primary key." Here is an example file from my dataset:

<Overall Rating>4
<Avg. Price>$173
<URL>http://...

<Author>everywhereman2
<Content>Old seattle getaway...
<Date>Jan 6, 2009
<img src="http://cdn.tripadvisor.com/img2/new.gif" alt="New"/>
<No. Reader>-1
<No. Helpful>-1
<Overall>5
<Value>5
<Rooms>5
<Location>5
<Cleanliness>5
<Check in / front desk>5
<Service>5
<Business service>5

<Author>RW53
<Content>Location! Location?       view from room of nearby freeway 
<Date>Dec 26, 2008
<No. Reader>-1
<No. Helpful>-1
<Overall>3
<Value>4
<Rooms>3
<Location>2
<Cleanliness>4
<Check in / front desk>3
<Service>-1
<Business service>-1

...next review, etc.

The section from Author to Business service (lines 5 to 18) represents one review for the hotel. The file then continues for however many reviews there are for that hotel, repeating lines 5 through 18. I hope that makes sense. Here is what I think the relation is:

HotelReview(String: Author, String: Content, Date: Date, String: img src, Int: No. Reader, Int: No. Helpful, Int: Overall, Int: Value, Int: Rooms, Int: Location, Int: Cleanliness, Int: Checkin / front desk, Int: Service, Int: Business Service) 

or would it be

HotelReview(Int: Overall Rating, Int: Avg. Price, String: URL) 

I may be way off as I am new to this stuff. I appreciate any help. Thanks

1 Answer:

Answer 0 (score: 1)

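I may not be the best person to answer this, but I'll give it a try.

First, you might want to look up some books or blogs on database schema design. That should give you general guidance on how to approach this task.

Then, based on the data shown, you can probably identify two entities: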

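  • Hotel (from the header section)
    with the attributes:

    • OverallRating int
    • AveragePrice int
    • URL url or text

    (The first two may actually be values derived (computed) from other sources, but as shown they look like real attribute values.)
    The URL could serve as the primary key, if there is no other, more suitable value that is not shown in the example.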

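  • HotelReview (from the repeating section)
    with the attributes:

    • Hotel (foreign key to the Hotel entity)
    • Author text (possibly a foreign key to an author table, if authors are "well known")
    • Date date
    • img url or text (or does this refer to another table?)
    • all the value attributes, with type int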

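    For the given data, there is no "good" primary key for this relation. All you have is author and date as a composite key. But using it means an author could provide only one review per day. If that is a reasonable restriction, go ahead. Otherwise, you need to introduce more attributes (e.g. the time of the review, to get around this restriction) or simply introduce a sequence number for reviews, so that each review is uniquely identified and the number can serve as the primary key.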
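
    The two entities above could be sketched as tables, for example with Python's built-in sqlite3 module. This is only a sketch under assumptions: the table and column names (hotel, hotel_review, review_id and so on) are made up for illustration, the content column is taken from the sample file in the question, dates are stored as ISO text because SQLite has no dedicated date type, and a sequence number is used as the primary key of the review table.

```python
import sqlite3

# Hypothetical table and column names; types follow the attribute lists above.
SCHEMA = """
CREATE TABLE hotel (
    url            TEXT PRIMARY KEY,      -- URL used as the key, as suggested
    overall_rating INTEGER,
    average_price  INTEGER
);
CREATE TABLE hotel_review (
    review_id        INTEGER PRIMARY KEY, -- sequence number identifying a review
    hotel_url        TEXT NOT NULL REFERENCES hotel(url),
    author           TEXT NOT NULL,
    review_date      TEXT NOT NULL,       -- ISO date string, e.g. '2008-12-26'
    content          TEXT,
    img              TEXT,
    no_reader        INTEGER,
    no_helpful       INTEGER,
    overall          INTEGER,
    value            INTEGER,
    rooms            INTEGER,
    location         INTEGER,
    cleanliness      INTEGER,
    check_in         INTEGER,
    service          INTEGER,
    business_service INTEGER
);
"""


def make_db():
    """Create an in-memory database containing the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    return conn


def add_review(conn, hotel_url, author, review_date, overall):
    """Insert a review (only a few columns, for brevity) and return its review_id."""
    cur = conn.execute(
        "INSERT INTO hotel_review (hotel_url, author, review_date, overall) "
        "VALUES (?, ?, ?, ?)",
        (hotel_url, author, review_date, overall),
    )
    return cur.lastrowid
```

    With the sequence number as the key, the same author can review the same hotel twice on one day. Replacing it with a composite key on (hotel_url, author, review_date) would enforce the one-review-per-day restriction instead.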

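As for the types, you should of course stick to the types available in your type system. If there is none, indicate the most precise type and provide a list of the types and their semantics. E.g. you could use a type Score to indicate an integer value from -1 to 10, where -1 means "deliberately no value" and the other values are possible scores, with 10 being the best. Then use this type for the different categories.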
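
A Score type like that might be sketched as follows in Python. The names parse_score and NO_VALUE are hypothetical, and mapping -1 to None is just one way to model "deliberately no value". The upper bound of 10 follows the description above and may need adjusting to the actual data.

```python
from typing import Optional

NO_VALUE = -1  # sentinel used in the files for "deliberately no value"


def parse_score(raw: str, best: int = 10) -> Optional[int]:
    """Convert a raw field such as '5' or '-1' into a score, or None if absent."""
    value = int(raw.strip())
    if value == NO_VALUE:
        return None
    if not 0 <= value <= best:
        raise ValueError(f"score out of range: {value}")
    return value
```

Each of the category fields (Overall, Value, Rooms and so on) could then go through the same function, so the range check and the meaning of -1 are defined in one place.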