Algorithm for matching redirect URLs when migrating a website

Time: 2017-04-28 13:05:24

Tags: algorithm machine-learning statistics string-matching fuzzy-search

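Suppose we have two sets of URLs: a source pool and a target pool. The source pool is essentially a flat list of the URLs that exist on a website, while the target pool contains all URLs of a rebuilt version of the same website. For most entries in the source pool, there should therefore be a corresponding, similar entry in the target pool.

We want to create a redirect map by matching one URL from the target pool to each URL in the source pool. The same target may be mapped to multiple source URLs.

Example source pool: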

{
  "rooms": [
    {
      "_id": "590312ded3cd574e753833ae",
      "hostel": {
        "_id": "5902d6efa6aeca127a76d993",
        "category": "5902e9dc9b42c32bdacdc55f",
        "name": "New Hostel",
        "address": "#18-6-7,4th Line,Kedareswar Pet,VIJAYAWADA – 520003",
        "description": "Vel ei vide nulla conclusionemque, ut ius dolore vituperatoribus, iisque prodesset no mel. Pri id populo ceteros molestie, audiam evertitur ne nec. Vix ne doctus volutpat omittantur, at nominavi accommodare est. Minimum persequeris id per, ferri magna utinam pri id, per ubique scripta et",
        "phone": "2020202020",
        "__v": 0,
        "_created": "2017-04-28T05:36:51.520Z",
        "_creator": {
          "_username": "Administrator",
          "_id": "58e8a1234b82b216404827d8"
        },
        "warden": {
          "name": "James Harden",
          "address": "#18-6-7,4th Line,Kedareswar Pet,VIJAYAWADA – 520003",
          "phone": "2020202020"
        }
      },
      "floor": "Ground Floor",
      "roomNumber": "201",
      "numBeds": 3,
      "cost": 1000,
      "__v": 0,
      "_created": "2017-04-28T10:00:33.793Z",
      "_tenants": [

      ],
      "numOccupied": 0
    },
    {
      "_id": "5903133271e4ed4ec3cee1c8",
      "hostel": {
        "_id": "5902d6efa6aeca127a76d993",
        "category": "5902e9dc9b42c32bdacdc55f",
        "name": "New Hostel",
        "address": "#18-6-7,4th Line,Kedareswar Pet,VIJAYAWADA – 520003",
        "description": "Vel ei vide nulla conclusionemque, ut ius dolore vituperatoribus, iisque prodesset no mel. Pri id populo ceteros molestie, audiam evertitur ne nec. Vix ne doctus volutpat omittantur, at nominavi accommodare est. Minimum persequeris id per, ferri magna utinam pri id, per ubique scripta et",
        "phone": "2020202020",
        "__v": 0,
        "_created": "2017-04-28T05:36:51.520Z",
        "_creator": {
          "_username": "Administrator",
          "_id": "58e8a1234b82b216404827d8"
        },
        "warden": {
          "name": "James Harden",
          "address": "#18-6-7,4th Line,Kedareswar Pet,VIJAYAWADA – 520003",
          "phone": "2020202020"
        }
      },
      "floor": "Ground Floor",
      "roomNumber": "201",
      "numBeds": 3,
      "cost": 1000,
      "__v": 0,
      "_created": "2017-04-28T10:02:21.487Z",
      "_tenants": [

      ],
      "numOccupied": 0
    },
    {
      "_id": "590313555c13a24ef493721b",
      "hostel": {
        "_id": "5902d6efa6aeca127a76d993",
        "category": "5902e9dc9b42c32bdacdc55f",
        "name": "New Hostel",
        "address": "#18-6-7,4th Line,Kedareswar Pet,VIJAYAWADA – 520003",
        "description": "Vel ei vide nulla conclusionemque, ut ius dolore vituperatoribus, iisque prodesset no mel. Pri id populo ceteros molestie, audiam evertitur ne nec. Vix ne doctus volutpat omittantur, at nominavi accommodare est. Minimum persequeris id per, ferri magna utinam pri id, per ubique scripta et",
        "phone": "2020202020",
        "__v": 0,
        "_created": "2017-04-28T05:36:51.520Z",
        "_creator": {
          "_username": "Administrator",
          "_id": "58e8a1234b82b216404827d8"
        },
        "warden": {
          "name": "James Harden",
          "address": "#18-6-7,4th Line,Kedareswar Pet,VIJAYAWADA – 520003",
          "phone": "2020202020"
        }
      },
      "floor": "Ground Floor",
      "roomNumber": "201",
      "numBeds": 3,
      "cost": 1000,
      "__v": 0,
      "_created": "2017-04-28T10:02:58.857Z",
      "_creator": {
        "_id": "58e8a1234b82b216404827d8",
        "_username": "Administrator"
      },
      "_tenants": [

      ],
      "numOccupied": 0
    },
    {
      "_id": "590319882326be569b6cca9c",
      "hostel": {
        "_id": "58f212b655d9d353b25e742a",
        "name": "Some Hostel",
        "__v": 0,
        "description": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. At ego quem huic anteponam non audeo dicere; Ad eas enim res ab Epicuro praecepta dantur",
        "phone": "8095478346",
        "address": "22d - 6 - 4, Ramakrishnapuram",
        "category": "5901b735f1b274473e710c66",
        "_created": "2017-04-15T12:31:41.923Z",
        "_creator": {
          "_username": "Administrator",
          "_id": "58e8a1234b82b216404827d8"
        },
        "warden": {
          "name": "Warned Marlin Monroe",
          "address": "22d - 6 - 4, Ramakrishnapuram",
          "phone": "8095478346"
        }
      },
      "floor": "Test Floor",
      "roomNumber": "Test Number",
      "numBeds": 2,
      "cost": 2000,
      "__v": 0,
      "_created": "2017-04-28T10:25:50.825Z",
      "_creator": {
        "_id": "58e8a1234b82b216404827d8",
        "_username": "Administrator"
      },
      "_tenants": [

      ],
      "numOccupied": 0
    },
    {
      "_id": "59030943c9bf7846386f4da1",
      "__v": 0,
      "cost": 0,
      "numBeds": 0,
      "roomNumber": "000",
      "floor": "Unknown Floor",
      "hostel": {
        "_id": "58f212b655d9d353b25e742a",
        "name": "Some Hostel",
        "__v": 0,
        "description": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. At ego quem huic anteponam non audeo dicere; Ad eas enim res ab Epicuro praecepta dantur",
        "phone": "8095478346",
        "address": "22d - 6 - 4, Ramakrishnapuram",
        "category": "5901b735f1b274473e710c66",
        "_created": "2017-04-15T12:31:41.923Z",
        "_creator": {
          "_username": "Administrator",
          "_id": "58e8a1234b82b216404827d8"
        },
        "warden": {
          "name": "Warned Marlin Monroe",
          "address": "22d - 6 - 4, Ramakrishnapuram",
          "phone": "8095478346"
        }
      },
      "_created": "2017-04-28T09:20:02.382Z",
      "_tenants": [

      ],
      "numOccupied": 0
    }
  ]
}

Example target pool:

{
  "hostels": [
    {
      "_id": "5902d6efa6aeca127a76d993",
      "category": "5902e9dc9b42c32bdacdc55f",
      "name": "New Hostel",
      "address": "#18-6-7,4th Line,Kedareswar Pet,VIJAYAWADA – 520003",
      "description": "Vel ei vide nulla conclusionemque, ut ius dolore vituperatoribus, iisque prodesset no mel. Pri id populo ceteros molestie, audiam evertitur ne nec. Vix ne doctus volutpat omittantur, at nominavi accommodare est. Minimum persequeris id per, ferri magna utinam pri id, per ubique scripta et",
      "phone": "2020202020",
      "__v": 0,
      "_created": "2017-04-28T05:36:51.520Z",
      "_creator": {
        "_username": "Administrator",
        "_id": "58e8a1234b82b216404827d8"
      },
      "warden": {
        "name": "James Harden",
        "address": "#18-6-7,4th Line,Kedareswar Pet,VIJAYAWADA – 520003",
        "phone": "2020202020"
      }
    },
    {
      "_id": "58f212b655d9d353b25e742a",
      "name": "Some Hostel",
      "__v": 0,
      "description": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. At ego quem huic anteponam non audeo dicere; Ad eas enim res ab Epicuro praecepta dantur",
      "phone": "8095478346",
      "address": "22d - 6 - 4, Ramakrishnapuram",
      "category": "5901b735f1b274473e710c66",
      "_created": "2017-04-15T12:31:41.923Z",
      "_creator": {
        "_username": "Administrator",
        "_id": "58e8a1234b82b216404827d8"
      },
      "warden": {
        "name": "Warned Marlin Monroe",
        "address": "22d - 6 - 4, Ramakrishnapuram",
        "phone": "8095478346"
      }
    }
  ]
}

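Note: your answer should not be too specific to the examples provided, e.g. by special-casing the author namespace. Instead, assume there are simply two sets of string-based pretty URLs with varying degrees of overlap/similarity. We are looking for a general solution that creates a redirect map based on the likelihood that two URLs refer to the same content.

So here are the questions:

  1. Is there a ready-made solution for this application (redirects for a website migration) that takes two lists of URLs and returns a mapping?
  2. If not, which general algorithm is suitable for picking the single most likely candidate from a list?
  3. An additional output of the match probability would be helpful for manually reviewing the list later.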

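1 answer:

Answer 0: (score: 1)

OK, answering my own question:

  1. In the meantime, a tool designed for exactly this application has come into existence. It can be found here: https://github.com/jsphpl/redirect-mapper
  2. For my current application, Levenshtein distance has proven to be a good similarity metric, but it may not work well in every case. In theory, you can use any existing string metric, or build your own algorithm that best suits your needs.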
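
To illustrate point 2, here is a minimal sketch of Levenshtein-based matching in plain Python. It is not the implementation used by redirect-mapper; the function names (`levenshtein`, `similarity`, `build_redirect_map`) and the example URLs are made up for illustration. The score is the edit distance normalized by the longer string's length, which is one possible way to get the "match probability" asked for in question 3, not a true probability.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized score in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def build_redirect_map(
    sources: List[str], targets: List[str]
) -> Dict[str, Tuple[str, float]]:
    """Map every source URL to its most similar target URL plus the score.

    Several sources may end up with the same target, as the question allows.
    """
    if not targets:
        raise ValueError("target pool must not be empty")
    mapping: Dict[str, Tuple[str, float]] = {}
    for source in sources:
        best = max(targets, key=lambda target: similarity(source, target))
        mapping[source] = (best, similarity(source, best))
    return mapping


if __name__ == "__main__":
    # Hypothetical pools, not taken from the question.
    old_urls = ["/about-us.html", "/contact", "/blog/hello-world.html"]
    new_urls = ["/about", "/contact", "/posts/hello-world"]
    for source, (target, score) in build_redirect_map(old_urls, new_urls).items():
        print(f"{source} -> {target} ({score:.2f})")
```

This compares every source against every target, which is fine for a few thousand URLs but grows quadratically. Sorting the result by score makes it easy to review the least certain matches by hand first.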