Question

I have been assigned a problem in which I need to match mobile apps and publishers between two datasets (one from Google Play, the other from iTunes).

Below is a description of the variables used in the iTunes dataset (the variable names in the Google Play dataset are similar or identical).

anon_ios_app_id: anonymized iOS app id

anon_ios_publisher_id: anonymized iOS publisher id

points: the “worth” of the match, 10 points is highest worth and 0.5 is the lowest.

ios_name: name of the mobile app in the iTunes Store

ios_publisher_name: name of the publisher of the app in the iTunes Store

category_name: the category of the app

type: Game or Non-game

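I have already done some analysis to find apps that share the same name and publisher across the two datasets. As an example, I searched for apps with "Walmart" in their name.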

GooglePlay <- read.csv("...\\GooglePlay.csv", header = TRUE)
iTunes <- read.csv("...\\iTunes.csv", header = TRUE)

grep("Walmart", iTunes$ios_name)
[1]  41203  51026  63522  64330 112441 113516 115510 117588 117788 119558 119605 120002 165514 248817
[15] 277425 290010 463244 546799 565806
grep("Walmart", GooglePlay$gp_name)
[1]    154  31984 162284 162342 162792 168722 168774 169339 325520 325601 357122 360050 436084 437144
[15] 441458 447177 503260

During my analysis, I found that some apps have the same name and publisher in both datasets. For example:

GooglePlay$gp_name[154]
[1] Walmart Photo
GooglePlay$gp_publisher_name[154]
[1] Kodak Alaris Inc.
iTunes$ios_name[165514]
[1] Walmart Photo
iTunes$ios_publisher_name[165514]
[1] Kodak Alaris Inc.

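My goals are: (1) provide a unified file containing all the corresponding IDs and names of the matched apps/publishers; (2) provide a single number: the SUM (iOS points + GP points) over the matched apps.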

Which functions should I use to match apps and publishers across the two datasets? And how do I produce a unified file of these matches?
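To make the two goals concrete, here is a minimal sketch of one possible approach using base R's `merge()` on a normalized name/publisher key. It runs on small toy data frames rather than the real CSV files. Only `gp_name` and `gp_publisher_name` are known from the output above; the other Google Play column names (`anon_gp_app_id`, `anon_gp_publisher_id`, and a `points` column on that side), the helper `norm()`, and the output file name are assumptions made for illustration.

```r
# Toy stand-ins for the two datasets; the real ones come from read.csv().
# Google Play columns other than gp_name / gp_publisher_name are ASSUMED
# to mirror the iOS ones.
iTunes <- data.frame(
  anon_ios_app_id       = c(101, 102, 103),
  anon_ios_publisher_id = c(11, 12, 13),
  points                = c(10, 0.5, 3),
  ios_name              = c("Walmart Photo", "Walmart", "Other App"),
  ios_publisher_name    = c("Kodak Alaris Inc.", "Walmart", "Someone"),
  stringsAsFactors = FALSE
)
GooglePlay <- data.frame(
  anon_gp_app_id       = c(201, 202),
  anon_gp_publisher_id = c(21, 22),
  points               = c(5, 2),
  gp_name              = c("walmart photo ", "Unrelated"),
  gp_publisher_name    = c("Kodak Alaris Inc.", "Nobody"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case and trim so that differences in case or
# stray whitespace do not block an otherwise exact match.
norm <- function(x) tolower(trimws(as.character(x)))

iTunes$key_name     <- norm(iTunes$ios_name)
iTunes$key_pub      <- norm(iTunes$ios_publisher_name)
GooglePlay$key_name <- norm(GooglePlay$gp_name)
GooglePlay$key_pub  <- norm(GooglePlay$gp_publisher_name)

# Inner join: keeps only rows whose name AND publisher appear in both sets.
# The shared "points" column is disambiguated as points_ios / points_gp.
matched <- merge(iTunes, GooglePlay,
                 by = c("key_name", "key_pub"),
                 suffixes = c("_ios", "_gp"))

# Goal 2: a single number, the sum of iOS points + GP points over matches.
total_points <- sum(matched$points_ios + matched$points_gp)

# Goal 1: one unified file with the IDs and names from both stores.
# The file name is a placeholder; tempdir() is used so the sketch is safe to run.
write.csv(matched, file.path(tempdir(), "matched_apps.csv"), row.names = FALSE)
```

On the toy data only "Walmart Photo" by "Kodak Alaris Inc." matches, so `matched` has one row. Note that `merge()` produces one output row per pair of matching rows, so if a name/publisher key is duplicated within either dataset the sum will count those combinations more than once; checking `duplicated()` on the keys first may be worthwhile.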

Matching columns of two datasets

0 answers: