Question

My table is set up as follows:

id
origin
destination
carrier_id

So a typical row might be:

100: London    Manchester  366

Now, every route is bidirectional, so there should never be a row like this:

233: Manchester    London    366

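because that is essentially the same route (for my purposes, anyway).

Unfortunately, I have ended up with a few duplicates. I have over 50,000 routes made up of about 2,000 points of origin (or destination, however you want to look at it), so I think looping through every origin to find the duplicates would be insane.

So I don't even know where to start working out a query to identify them. Any ideas?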

Answer 1

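I think you just need a self-join (the table joined to itself). The following will identify all the "duplicate" records, joined together.

Here's an example.

Say SELECT * FROM FLIGHTS produced: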

id  origin     destination  carrierid
1   toronto    quebec       1
2   quebec     toronto      2
3   edmonton   calgary      3
4   calgary    edmonton     4
5   hull       vancouver    5
6   vancouver  edmonton     6
7   edmonton   toronto      7
9   edmonton   quebec       8
10  toronto    edmonton     9
11  quebec     edmonton     10
12  calgary    lethbridge   11

So there are a bunch of duplicates (4 of the routes are duplicates of some other route).

select  *
from    flights t1 inner join flights t2 on t1.origin = t2.destination 
        AND t2.origin = t1.destination

would yield just the duplicates:

id  origin    destination  carrierid  id  origin    destination  carrierid
1   toronto   quebec       1          2   quebec    toronto      2
2   quebec    toronto      2          1   toronto   quebec       1
3   edmonton  calgary      3          4   calgary   edmonton     4
4   calgary   edmonton     4          3   edmonton  calgary      3
7   edmonton  toronto      7          10  toronto   edmonton     9
9   edmonton  quebec       8          11  quebec    edmonton     10
10  toronto   edmonton     9          7   edmonton  toronto      7
11  quebec    edmonton     10         9   edmonton  quebec       8
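Every pair shows up twice in that result, once from each side. As a rough sketch of how to list each pair only once, the snippet below runs the same self-join with an added t1.id < t2.id condition. It uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for the MySQL table (that substitution is my assumption, not part of the original answer), loaded with the sample rows from above.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (id INTEGER PRIMARY KEY, origin TEXT, "
    "destination TEXT, carrierid INTEGER)"
)
rows = [
    (1, "toronto", "quebec", 1),
    (2, "quebec", "toronto", 2),
    (3, "edmonton", "calgary", 3),
    (4, "calgary", "edmonton", 4),
    (5, "hull", "vancouver", 5),
    (6, "vancouver", "edmonton", 6),
    (7, "edmonton", "toronto", 7),
    (9, "edmonton", "quebec", 8),
    (10, "toronto", "edmonton", 9),
    (11, "quebec", "edmonton", 10),
    (12, "calgary", "lethbridge", 11),
]
conn.executemany("INSERT INTO flights VALUES (?, ?, ?, ?)", rows)

# Same self-join as in the answer, plus t1.id < t2.id so that each
# reversed pair is reported once instead of twice.
pairs = conn.execute(
    """
    SELECT t1.id AS first_id, t2.id AS second_id
    FROM flights t1
    INNER JOIN flights t2
            ON t1.origin = t2.destination
           AND t2.origin = t1.destination
    WHERE t1.id < t2.id
    ORDER BY t1.id
    """
).fetchall()

print(pairs)
```

If the carrier has to match as well for two rows to count as the same route, an extra join condition on the carrier column would presumably do it; the sample data here uses a different carrier on every row, so it is left out.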

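At this point you could delete the first occurrence of each pair. The t1.id < t2.id condition picks only the lower id of each pair (without it, both rows of every pair would be deleted), and the extra derived table is there because MySQL will not let a DELETE select directly from its own target table in a subquery.

delete from flights
where id in (
    select id from (
        select  t1.id
        from    flights t1 inner join flights t2 on t1.origin = t2.destination 
                AND t2.origin = t1.destination
        where   t1.id < t2.id  -- the later row of each pair is kept
    ) as dup
)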
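Before running a delete like this on 50,000 rows it is probably worth trying it on a small copy first. Here is a minimal sketch, again using Python's sqlite3 with an in-memory table as a stand-in for MySQL (my assumption). It removes only the lower-id row of each reversed pair and then checks what is left. The derived table named dup is hypothetical naming on my part; it is the common workaround for MySQL's restriction on reading the delete target in a subquery, and SQLite does not need it.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (id INTEGER PRIMARY KEY, origin TEXT, "
    "destination TEXT, carrierid INTEGER)"
)
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?, ?)",
    [
        (1, "toronto", "quebec", 1),
        (2, "quebec", "toronto", 2),
        (3, "edmonton", "calgary", 3),
        (4, "calgary", "edmonton", 4),
        (5, "hull", "vancouver", 5),
        (12, "calgary", "lethbridge", 11),
    ],
)

# Delete only the lower-id row of each reversed pair, so one row per
# route survives. Without t1.id < t2.id both rows would be removed.
conn.execute(
    """
    DELETE FROM flights
    WHERE id IN (
        SELECT id FROM (
            SELECT t1.id AS id
            FROM flights t1
            INNER JOIN flights t2
                    ON t1.origin = t2.destination
                   AND t2.origin = t1.destination
            WHERE t1.id < t2.id
        ) AS dup
    )
    """
)

remaining = [r[0] for r in conn.execute("SELECT id FROM flights ORDER BY id")]
print(remaining)
```

Rows 1 and 3 are removed, while their reversed counterparts (2 and 4) and the unpaired routes stay.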

Good luck!

Answer 2

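Bummer! Off the top of my head (and in pseudo-SQL):

select route_key, count(*) from (
  select id, concat(origin, '_', destination, '_', carrier_id) as route_key from ....
  union
  select id, concat(destination, '_', origin, '_', carrier_id) as route_key from ....
) as k
group by route_key
having count(*) > 1;

For the records above, the inner union gives you:

100, London_Manchester_366
100, Manchester_London_366
233, Manchester_London_366
233, London_Manchester_366

Each key shows up twice, so the HAVING clause flags both of them.

This is really, really hackish, and it doesn't give you exactly what you're after; it only narrows things down. Maybe it will give you a starting point? Maybe it will give someone else some ideas they can help with.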
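A variation on the same idea, sketched under some assumptions: instead of concatenating both orderings, put the two cities into one canonical order and group on that, so each route yields a single key. The example below uses Python's sqlite3 as a stand-in for MySQL, with a hypothetical table name routes. SQLite's two-argument min() and max() play the role that LEAST() and GREATEST() would play in MySQL.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL; "routes" is a
# hypothetical table name, since the question does not give one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE routes (id INTEGER PRIMARY KEY, origin TEXT, "
    "destination TEXT, carrier_id INTEGER)"
)
conn.executemany(
    "INSERT INTO routes VALUES (?, ?, ?, ?)",
    [
        (100, "London", "Manchester", 366),
        (121, "London", "CityA", 240),
        (233, "Manchester", "London", 366),
    ],
)

# Order the two cities alphabetically so both directions of a route
# collapse onto the same (city_a, city_b, carrier_id) group.
duplicates = conn.execute(
    """
    SELECT min(origin, destination) AS city_a,
           max(origin, destination) AS city_b,
           carrier_id,
           COUNT(*) AS n
    FROM routes
    GROUP BY city_a, city_b, carrier_id
    HAVING COUNT(*) > 1
    """
).fetchall()

print(duplicates)
```

Unlike the reversed-pair self-join, grouping on a canonical key should also catch rows that repeat in the same direction.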

Answer 3

If you don't mind a bit of shell scripting, and you can get a dump of the input in the form you've shown here... here's my sample input:

100: London Manchester 366
121: London CityA 240
144: Manchester CityA 300
150: CityA CityB 90
233: Manchester London 366

You could do something like this:

$ cat m.txt | awk '{ if ($2 < $3) print $2, $3, $1; else print $3, $2, $1}' | sort
CityA CityB 150:
CityA London 121:
CityA Manchester 144:
London Manchester 100:
London Manchester 233:

That way you at least get the pairs grouped together. Not sure what the best move is from there.

OK, here's a beast of a command line:

$ cat m.txt | awk '{ if ($2 < $3) print $2, $3, $1; else print $3, $2, $1}' | (sort; echo "") | awk '{ if (fst == $1 && snd == $2) { printf "%s%s", num, $3 } else { print fst, snd; fst = $1; snd = $2; num = $3} }' | grep "^[0-9]"
150:151:150:255:CityA CityB
100:233:London Manchester

where m.txt has the following new contents:

100: London Manchester 366
121: London CityA 240
144: Manchester CityA 300
150: CityA CityB 90
151: CityB CityA 90
233: Manchester London 366
255: CityA CityB 90

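Perl would probably be a better choice than awk, but here goes: first we put the two city names in sorted order and move the ID to the end of the line, which I did in the first part. Then we sort the lines so that matching pairs end up next to each other, and we have to tack on an extra blank line so the awk script can finish up. Then we loop over every line. If we see a new pair of cities, we print the cities we saw previously and store the new cities and the new ID. If we see the same cities as last time, we print the stored ID and the ID of the current line. Finally, we grep for only the lines that start with a number, so that the non-duplicated pairs are discarded.

If a pair occurs more than twice you'll get a repeated ID, but that's no big deal.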
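For comparison, here is roughly the same grouping as a short Python sketch. It assumes the same dump format as m.txt (an ID with a trailing colon, two single-word city names, then the carrier), and the variable names are my own. Each ID is listed once per group, so the repeated-ID quirk does not come up.

```python
from collections import defaultdict

# Same contents as the second m.txt above, inlined for the example.
dump = """\
100: London Manchester 366
121: London CityA 240
144: Manchester CityA 300
150: CityA CityB 90
151: CityB CityA 90
233: Manchester London 366
255: CityA CityB 90
"""

# Map each unordered city pair to the list of route IDs that use it.
groups = defaultdict(list)
for line in dump.splitlines():
    route_id, origin, destination, _carrier = line.split()
    key = tuple(sorted((origin, destination)))  # direction-independent key
    groups[key].append(route_id.rstrip(":"))

# Keep only the pairs that occur more than once.
duplicates = {pair: ids for pair, ids in groups.items() if len(ids) > 1}

for (city_a, city_b), ids in sorted(duplicates.items()):
    print(city_a, city_b, ":".join(ids))
```

Like the awk version, this ignores the carrier column; adding it to the key would be a small change if the carrier has to match too.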

Clear as mud?

Spotting duplicates that are not exact duplicates in a MySQL table
