
I need to process a lot of CSV files that contain three columns: date, TV channel ID, and movie ID.

Based on those columns, I need to classify the genre of each movie and the genre of each TV channel.

I'm new to big data processing, and I was wondering how I can classify this data when all I have is an ID (I cannot use another source to look up the IDs, or generate random data to train my algorithm).

The solution I found is to define some ranges of hours and assign the movies played within each range to a genre. Example:

movies played between 01:00-04:00: genre 1;
movies played between 04:01-06:00: genre 2;
etc.

After classifying the movies, I can classify the TV channels based on the movies they have played.

I'm planning to do this using Spark :)

Does anyone have another solution or any advice? It's kind of hard because the data looks so abstract.

Thank you

When you say "I need to classify the genre of the movies", do you mean "drama", "comedy", "action", or "Genre1", "Genre2"? I'll assume the second case below.

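Don't assign genres manually - use a clustering algorithm

First of all, I would not assign genres based only on the time at which a movie is played. More generally, I would discourage you from doing the clustering by hand. That is what clustering algorithms are made for. They use features to group individuals that are related to each other in some way.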
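To make the idea concrete, here is a minimal sketch of k-means, one common clustering algorithm, in plain Python. The function name `kmeans`, the toy points, and the choice of k-means itself are assumptions for illustration; the answer does not name a specific algorithm.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on a list of equal-length numeric tuples.

    Returns (labels, centroids); labels[i] is the cluster index of points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k input rows
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Toy data (made up): two obvious groups of 2-D feature vectors.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
```

The cluster indices play the role of "Genre1", "Genre2", and so on. On Spark you would more likely reach for the KMeans implementation that ships with MLlib than write your own.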

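In your case there is one tricky part: each data point/row is not a movie. So a movie may end up in different clusters, meaning it would have different genres.

There are several options:

A movie fits several genres - that is only natural.
You pick just one genre, based on the cluster in which the movie appears most often.
If you decide to assign several genres to each movie, you might think about a threshold: for example, if a movie appears fewer than N times in a cluster, it does not belong to that cluster (unless it is the only cluster it appears in).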
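The options above could be sketched roughly as follows. The function name `genres_per_movie`, the `min_count` parameter (the N of the threshold), and the sample labels are hypothetical. The fallback to the most frequent cluster when no cluster reaches the threshold is my own assumption; it generalises the "only cluster it appears in" exception.

```python
from collections import Counter, defaultdict

def genres_per_movie(movie_ids, cluster_labels, min_count=None):
    """Map each movie to the cluster(s) ("genres") its rows fell into.

    min_count=None -> single genre: the cluster the movie appears in most often.
    min_count=N    -> every cluster holding at least N rows of the movie;
                      if none reaches N, fall back to the most frequent one.
    """
    counts = defaultdict(Counter)
    for movie, label in zip(movie_ids, cluster_labels):
        counts[movie][label] += 1

    result = {}
    for movie, counter in counts.items():
        top_label, _ = counter.most_common(1)[0]
        if min_count is None:
            result[movie] = [top_label]
        else:
            kept = sorted(lab for lab, n in counter.items() if n >= min_count)
            result[movie] = kept or [top_label]
    return result

# One cluster label per CSV row (made-up values).
movie_ids = ["m1"] * 6 + ["m2"] * 2 + ["m3"]
labels = [0, 0, 1, 1, 1, 2, 1, 1, 2]

single = genres_per_movie(movie_ids, labels)
multi = genres_per_movie(movie_ids, labels, min_count=2)
```

Here `single` keeps one genre per movie, while `multi` lets `m1` keep both clusters it shows up in at least twice.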

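Create new features

You should engineer as many new features as possible*, to help the clustering algorithm separate the data well and create homogeneous clusters.

Off the top of my head, you could do the following:

Add a boolean feature for each time range you consider (0:00 - 3:59; 4:00 - 6:00; ...). Only one of these features is set: the one for the range in which the movie was played. All the others are empty.
A feature that counts how many times the movie was played (Men in Black is played more often than 12 Angry Men).
A feature that says how many channel IDs played the movie (Star Wars is played on more channels than some Bollywood movie).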
...
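A rough sketch of how these three kinds of features could be built from the raw rows. The date format, the bucket boundaries in `BUCKET_STARTS`, the function name `build_features`, and the sample rows are all assumptions, since the question does not show the actual data.

```python
from bisect import bisect_right
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical bucket start hours: 0:00-3:59, 4:00-5:59, 6:00-11:59, ...
BUCKET_STARTS = (0, 4, 6, 12, 18)

def build_features(rows, time_format="%Y-%m-%d %H:%M"):
    """rows: iterable of (date_string, channel_id, movie_id) tuples.

    Returns one feature vector per row:
    [one-hot time bucket..., play count of the movie, distinct channels of the movie].
    """
    rows = list(rows)
    play_count = Counter(movie for _, _, movie in rows)
    channels = defaultdict(set)
    for _, channel, movie in rows:
        channels[movie].add(channel)

    features = []
    for date_string, _, movie in rows:
        hour = datetime.strptime(date_string, time_format).hour
        # Index of the last bucket whose start hour is <= hour.
        bucket = bisect_right(BUCKET_STARTS, hour) - 1
        one_hot = [1 if i == bucket else 0 for i in range(len(BUCKET_STARTS))]
        features.append(one_hot + [play_count[movie], len(channels[movie])])
    return features

# Made-up sample rows in the assumed format.
rows = [
    ("2020-01-01 02:30", "c1", "m1"),
    ("2020-01-01 05:00", "c2", "m1"),
    ("2020-01-02 20:15", "c1", "m2"),
]
features = build_features(rows)
```

The counts live on a different scale than the 0/1 flags, so it is probably worth rescaling the columns before handing them to a distance-based clustering algorithm.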

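Think about how genres are presented/played across all channels and create features accordingly.

PS: *Don't get me wrong, as many features as possible means more than your three features, but beware of the so-called curse of dimensionality.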

Analyse abstract data

1 answer: