Question

I have a dataset of hotel reviews. Each file in the dataset is for a different hotel. I have been asked to "Write down the relation you identify in the dataset. Ensure you include data types and the primary key." Here is an example file from my dataset:

<Overall Rating>4
<Avg. Price>$173
<URL>http://...

<Author>everywhereman2
<Content>Old seattle getaway...
<Date>Jan 6, 2009
<img src="http://cdn.tripadvisor.com/img2/new.gif" alt="New"/>
<No. Reader>-1
<No. Helpful>-1
<Overall>5
<Value>5
<Rooms>5
<Location>5
<Cleanliness>5
<Check in / front desk>5
<Service>5
<Business service>5

<Author>RW53
<Content>Location! Location?       view from room of nearby freeway 
<Date>Dec 26, 2008
<No. Reader>-1
<No. Helpful>-1
<Overall>3
<Value>4
<Rooms>3
<Location>2
<Cleanliness>4
<Check in / front desk>3
<Service>-1
<Business service>-1

...next review, etc.

The section from Author to Business service (line 5 to line 18) represents one review for the hotel. The file then continues for however many reviews the hotel has, repeating lines 5 through 18. I hope that makes sense. Here is what I think the relation is:

HotelReview(String: Author, String: Content, Date: Date, String: img src, Int: No. Reader, Int: No. Helpful, Int: Overall, Int: Value, Int: Rooms, Int: Location, Int: Cleanliness, Int: Checkin / front desk, Int: Service, Int: Business Service)

or would it be

HotelReview(Int: Overall Rating, Int: Avg. Price, String: URL)

I may be way off as I am new to this stuff. I appreciate any help. Thanks

Answer 1

I may not be the best person to answer, but I'll give it a try.

First, you may want to look up some books or blogs on database schema design. That should give you general guidance on how to approach this task.

Then, from the data shown, you can identify two entities:

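Hotel (from the header part)
with attributes:
- OverallRating int
- AveragePrice int
- URL url or text
(The first two may actually be derived (calculated) values from other sources, but as shown they look like true attribute values.)
The URL may serve as the primary key, if there is no more suitable value that is not shown in the sample.

HotelReview (from the repeating part)
with attributes:
- Hotel (foreign key to the Hotel entity)
- Author text (possibly a foreign key to an author table, if authors are "well known")
- Date date
- img url or text (or does this refer to another table?)
- the remaining attributes, with type Int
For the given data, this relation has no "good" primary key. All you have is author and date as a composite key, but using it would imply that an author can submit only one review per day. If that is a reasonable restriction, go ahead. Otherwise you need to introduce more attributes (e.g. the time of the review, to get around this restriction) or simply introduce a sequence number for reviews, so that each review is uniquely identified and the number can serve as the primary key.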
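
As a rough illustration of the two relations and the sequence-number option, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and the example URL are assumptions made for the sketch; only the attributes themselves come from the sample file, and storing the date as ISO text is just one possible choice.

```python
import sqlite3

# Hypothetical table and column names; only the attributes come from the sample file.
SCHEMA = """
CREATE TABLE hotel (
    url            TEXT PRIMARY KEY,   -- URL used as the key, as suggested above
    overall_rating INTEGER,
    average_price  INTEGER             -- "$173" stored as 173 (assumption)
);
CREATE TABLE hotel_review (
    review_id   INTEGER PRIMARY KEY,   -- surrogate sequence number
    hotel_url   TEXT NOT NULL REFERENCES hotel(url),
    author      TEXT NOT NULL,
    content     TEXT,
    review_date TEXT,                  -- ISO date string, e.g. "2009-01-06"
    img_src     TEXT,
    no_reader   INTEGER,
    no_helpful  INTEGER,
    overall     INTEGER
    -- the other rating columns (value, rooms, ...) would follow the same pattern
);
"""


def build_db():
    """Create an in-memory database with both relations."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    return conn


def add_review(conn, hotel_url, author, review_date, overall):
    """Insert one review and return its generated sequence number."""
    cur = conn.execute(
        "INSERT INTO hotel_review (hotel_url, author, review_date, overall) "
        "VALUES (?, ?, ?, ?)",
        (hotel_url, author, review_date, overall),
    )
    return cur.lastrowid


if __name__ == "__main__":
    conn = build_db()
    hotel_url = "http://example.com/hotel-1"  # placeholder, not the real URL
    conn.execute(
        "INSERT INTO hotel (url, overall_rating, average_price) VALUES (?, ?, ?)",
        (hotel_url, 4, 173),
    )
    # Two reviews by the same author on the same day are both accepted,
    # because the key is the sequence number, not (author, date).
    print(add_review(conn, hotel_url, "RW53", "2008-12-26", 3))
    print(add_review(conn, hotel_url, "RW53", "2008-12-26", 4))
```

With (author, date) as the key instead, the second insert would have to be rejected, which is exactly the one-review-per-day restriction described above.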

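When specifying the types, you should of course stick to the types available in your type system. If there is none, state the most precise type you can and provide a list of the types and their semantics. For example, you could use a type Score to indicate an integer value ranging from -1 to 10, where -1 means "deliberately no value" and anything else is a score out of 10, with 10 being the best. Then use that type for the different categories.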
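
The Score idea could be sketched in Python roughly as follows. The class name, the constants, and the `is_missing` helper are illustrative assumptions; the range of -1 to 10 is taken from the description above, so adjust `MAX_SCORE` if your data uses a different scale.

```python
from dataclasses import dataclass

NO_VALUE = -1   # sentinel meaning "deliberately no value"
MAX_SCORE = 10  # upper bound taken from the description above (assumption)


@dataclass(frozen=True)
class Score:
    """Integer rating in the range -1..10, where -1 marks a missing rating."""

    value: int

    def __post_init__(self):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(self.value, bool) or not isinstance(self.value, int):
            raise TypeError("Score must be an integer")
        if not NO_VALUE <= self.value <= MAX_SCORE:
            raise ValueError(f"Score out of range: {self.value}")

    @property
    def is_missing(self):
        """True when the reviewer left this category unrated."""
        return self.value == NO_VALUE


if __name__ == "__main__":
    service = Score(-1)   # e.g. <Service>-1 in the sample file
    rooms = Score(3)      # e.g. <Rooms>3
    print(service.is_missing, rooms.is_missing)
```

Each rating category (Overall, Value, Rooms, and so on) would then be declared with this one type instead of a bare Int.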

What is the Relation in my Dataset of Hotel Reviews?
