Calculating similarity between texts to find duplicates

Time: 2019-04-09 10:26:21

Tags: r duplicates similarity

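I have some data similar to the following. Because of the way the data is processed, I end up with some duplicated or near-duplicated rows, which is unavoidable.

I would like to compute the cosine distance between the texts and then try to remove the duplicated values (keeping the observation with the most text).

Is this the best way to find duplicate texts in the data? The texts can differ slightly because some words have been removed, so unique(text) only solves part of the problem.

Data: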

text <- c("Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football is understood to refer to whichever form of football is the most popular in the regional context in which the word appears. Sports commonly called football in certain places include association football (known as soccer in some countries); gridiron football (specifically American football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); and Gaelic football.[1][2] These different variations of football are known as football codes.",
          "Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football is understood to refer to whichever form of football is the most popular in the regional context in which the word appears. Sports commonly called football in certain places include association football (known as soccer in some countries); gridiron football (specifically American football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); and Gaelic football.[1][2]",
          "Tennis is a racket sport that can be played individually against a single opponent (singles) or between two teams of two players each (doubles). Each player uses a tennis racket that is strung with cord to strike a hollow rubber ball covered with felt over or around a net and into the opponent's court. The object of the game is to maneuver the ball in such a way that the opponent is not able to play a valid return. The player who is unable to return the ball will not gain a point, while the opposite player will.",
          "Tennis is a racket sport that can be played individually against a single opponent (singles) or between two teams of two players each (doubles). Each player uses a tennis racket that is strung with cord to strike a hollow rubber ball covered with felt over or around a net and into the opponent's court. The object of the game is to maneuver the ball in such a way that the opponent is not able to play a valid return.",
          "Rugby refers to the team sports rugby league and rugby union. Legend claims that rugby football was started about 1845 in Rugby School, Rugby, Warwickshire, England, although forms of football in which the ball was carried and tossed date to medieval times. Rugby eventually split into two sports in 1895 when twenty-one clubs split from the original Rugby Football Union, to form the Northern Union (later to be named rugby league in 1922) in the George Hotel, Huddersfield, Northern England over the issue of payment to players, thus making rugby league the first code to turn professional and pay its players, rugby union turned fully professional in 1995. Both sports are run by their respective world governing bodies World Rugby (rugby union) and the Rugby League International Federation (rugby league). Rugby football was one of many versions of football played at English public schools in the 19th century.[1][2] Although rugby league initially used rugby union rules, they are now wholly separate sports. In addition to these two codes, both American and Canadian football evolved from rugby football.")


ID <- c("Foot123", "Foot123", "Ten123", "Ten123", "Rugby123")

data <- data.frame(text, ID)

1 Answer:

Answer 0: (score: 1)

Perhaps you could use jarowinkler from the RecordLinkage package.

Here is some sample code.

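The answer's original code is not available here. As a stand-in, this is a minimal sketch of how jarowinkler could be applied pairwise. It assumes the RecordLinkage package is installed, and the three short strings are hypothetical placeholders for data$text from the question.

```r
# Assumption: RecordLinkage is installed (install.packages("RecordLinkage"))
library(RecordLinkage)

# Hypothetical stand-in strings; substitute as.character(data$text)
text <- c(
  "Football is a family of team sports that involve kicking a ball to score a goal. These are known as football codes.",
  "Football is a family of team sports that involve kicking a ball to score a goal.",
  "Tennis is a racket sport that can be played individually or between two teams."
)

# jarowinkler() compares two character vectors element by element,
# so outer() can be used to build the full pairwise similarity matrix.
# Values lie between 0 and 1, where 1 means the strings are identical.
sim <- outer(text, text, FUN = jarowinkler)
dimnames(sim) <- list(seq_along(text), seq_along(text))

round(sim, 2)
```

The truncated football text should score much closer to its longer version than to the tennis text, which is the property the duplicate search relies on.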

Now you need to decide how similar two texts must be for you to treat them as duplicates.
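Once a cutoff is chosen, removing the duplicates might look like the sketch below. It uses only base R and swaps in the bag-of-words cosine similarity mentioned in the question rather than Jaro-Winkler. The shortened texts and the 0.8 threshold are assumptions made for illustration, not recommended values.

```r
# Hypothetical shortened versions of the question's data
text <- c(
  "Football is a family of team sports that involve kicking a ball to score a goal. These variations are known as football codes.",
  "Football is a family of team sports that involve kicking a ball to score a goal.",
  "Tennis is a racket sport played against a single opponent or between two teams of two players. A missed return loses the point.",
  "Tennis is a racket sport played against a single opponent or between two teams of two players.",
  "Rugby refers to the team sports rugby league and rugby union."
)
ID <- c("Foot123", "Foot123", "Ten123", "Ten123", "Rugby123")
data <- data.frame(text, ID, stringsAsFactors = FALSE)

# Tokenise: lower-case, replace punctuation with spaces, split on spaces
tokens <- strsplit(tolower(gsub("[^[:alnum:] ]", " ", data$text)), " +")
tokens <- lapply(tokens, function(tok) tok[nzchar(tok)])

# Word-count matrix: one row per text, one column per vocabulary word
vocab  <- unique(unlist(tokens))
counts <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

# Cosine similarity between every pair of rows
norms <- sqrt(rowSums(counts^2))
sim   <- (counts %*% t(counts)) / (norms %o% norms)

# Assumed cutoff: pairs with similarity of at least 0.8 count as duplicates
threshold <- 0.8

# Group the texts; complete linkage keeps every pair inside a group
# at or above the cutoff
groups <- cutree(hclust(as.dist(1 - sim), method = "complete"),
                 h = 1 - threshold)

# Within each group, keep the row with the most text
keep <- unlist(lapply(split(seq_len(nrow(data)), groups),
                      function(idx) idx[which.max(nchar(data$text[idx]))]))
deduped <- data[sort(keep), ]
deduped$ID
```

With these stand-in texts the two football rows and the two tennis rows each collapse to their longer version, and the rugby row is kept unchanged. On real data it is worth inspecting sim near the cutoff before deleting anything.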