Question

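I'm trying to think of a good way to combine several datasets, where each dataset holds a slightly different set of information about the same set of items and therefore contains slightly different data. For example: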

+----+------+--------+-------+------------------+----------+---------+---------+
| h  |  db  |  name  |  age  |     location     |  colour  |  fruit  |  height |
+----+------+--------+-------+------------------+----------+---------+---------+
| 1  |  b   |  joe   |   22  |  redbush ave     |  blue    |  pear   |  _      |
| 2  |  b   |  joe   |   22  |  redbush avenue  |  blue    |  paer   |  _      |
| 3  |  c   |  macy  |   38  |  high street     |  green   |  apple  |  1.65   |
| 4  |  c   |  j. h  |   22  |  redbush         |  blue    |  pear   |  1.59   |
+----+------+--------+-------+------------------+----------+---------+---------+

From that set of rows (i.e. merged from DBs b and c), I would like to get:

+----+------+-----------+-------+-------------------------------+----------+--------------+---------+
| h  |  db  |  name     |  age  |        location               |  colour  |    fruit     |  height |
+----+------+-----------+-------+-------------------------------+----------+--------------+---------+
| 1  |  X   |  joe, j.h |  22   |  redbush ave, redbush avenue  |  blue    |  pear, paer  |    1.59 |
| 2  |  X   |  macy     |  38   |  high street                  |  green   |  apple       |    1.65 |
+----+------+-----------+-------+-------------------------------+----------+--------------+---------+

i.e. the 3 very similar rows have been combined, and wherever their data differs, all versions have been kept.

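I'm trying to learn Python, so I've come across pandas (using groupby, and concatenating all the columns and comparing them) as well as fuzzywuzzy, but nothing seems to quite fit. I'm guessing the answer will involve edit/Levenshtein distance, but I'm struggling to work out how.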
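
To make the idea concrete, here is a minimal sketch of the kind of thing I have in mind, not a tested solution. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for a true Levenshtein distance, averages the per-column similarity of each pair of rows, and links any pair that scores above a cut-off. The `FIELDS` list, the `THRESHOLD` value of 0.7, and the helper names `similarity`, `cluster` and `merge` are all my own assumptions for illustration.

```python
from difflib import SequenceMatcher

# Columns compared between rows (assumed; "db" is excluded on purpose).
FIELDS = ["name", "age", "location", "colour", "fruit", "height"]
THRESHOLD = 0.7  # hypothetical cut-off; would need tuning on real data

rows = [
    {"db": "b", "name": "joe", "age": 22, "location": "redbush ave",
     "colour": "blue", "fruit": "pear", "height": None},
    {"db": "b", "name": "joe", "age": 22, "location": "redbush avenue",
     "colour": "blue", "fruit": "paer", "height": None},
    {"db": "c", "name": "macy", "age": 38, "location": "high street",
     "colour": "green", "fruit": "apple", "height": 1.65},
    {"db": "c", "name": "j. h", "age": 22, "location": "redbush",
     "colour": "blue", "fruit": "pear", "height": 1.59},
]


def similarity(a, b):
    """Mean per-field similarity, skipping fields missing on either side."""
    scores = [
        SequenceMatcher(None, str(a[f]), str(b[f])).ratio()
        for f in FIELDS
        if a[f] is not None and b[f] is not None
    ]
    return sum(scores) / len(scores) if scores else 0.0


def cluster(rows):
    """Group rows whose pairwise similarity reaches THRESHOLD (union-find)."""
    parent = list(range(len(rows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if similarity(rows[i], rows[j]) >= THRESHOLD:
                parent[find(j)] = find(i)

    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(find(i), []).append(row)
    return list(groups.values())


def merge(group):
    """Collapse a group into one row, keeping every distinct value."""
    merged = {"db": "X"}
    for f in FIELDS:
        seen = []
        for row in group:
            value = row[f]
            if value is not None and str(value) not in seen:
                seen.append(str(value))
        merged[f] = ", ".join(seen)
    return merged


merged = [merge(g) for g in cluster(rows)]
for h, row in enumerate(merged, start=1):
    print(h, row)
```

With the sample rows above this puts rows 1, 2 and 4 in one group and leaves macy on her own. Because every distinct value is kept, the merged location also lists plain "redbush". Comparing every pair of rows is quadratic, so I assume a real dataset would need some kind of blocking (for example, only comparing rows with the same age) before this becomes practical.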

Thanks for your help,

Matt

How to group similar rows in a dataset?

0 answers: