Which Python libraries are best suited to testing strings

Time: 2018-05-15 11:10:59

Tags: python list text pattern-matching

I have a list of the suburbs of Auckland, New Zealand.

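I also have several million rows of data in which each address has a suburb field, but the values were entered free-form. That means they contain misspellings and all sorts of other oddities (for example MT used instead of Mount, and similar non-standard conventions adopted by the data-entry operators).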
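The clean-up this implies can be illustrated with a minimal sketch. The abbreviation table below is a hypothetical starting point inferred from the sample entries in this question, not an authoritative list, and treating ST as SAINT is an assumption that will be wrong wherever it means STREET.

```python
import re

# Hypothetical abbreviation table; extend it from the conventions seen in the data.
ABBREVIATIONS = {
    "MT": "MOUNT",
    "PT": "POINT",
    "ST": "SAINT",   # assumption: could also mean STREET
    "HGTS": "HEIGHTS",
    "HTS": "HEIGHTS",
    "STH": "SOUTH",
}

def normalise(raw):
    """Upper-case, drop parenthesised notes, expand abbreviations, collapse whitespace."""
    text = raw.upper()
    text = re.sub(r"\([^)]*\)", " ", text)   # remove notes such as "(SYMONDS ST)"
    text = re.sub(r"[^A-Z0-9 ]", " ", text)  # turn punctuation into spaces
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)

print(normalise("MT WELLINGTON"))
print(normalise("MOUNT ALBERT (SYMONDS ST)"))
```

Normalising first means any later fuzzy comparison only has to cope with genuine misspellings rather than with formatting differences.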

I would like to use Python to work out which suburb each entry refers to, but I don't even know where to start. (I'm still at a fairly basic level with Python.)

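I had thought about assigning a number to each letter and then trying to build some functions around kNN matching; others have suggested using Jaccard similarity in some way.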
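For reference, the Jaccard idea could look roughly like the sketch below. Comparing sets of character bigrams is one assumed choice among several (whole words or trigrams would also work), and the function names are made up for illustration.

```python
def bigrams(text):
    """Return the set of two-character substrings of text, ignoring spaces and case."""
    compact = text.upper().replace(" ", "")
    return {compact[i:i + 2] for i in range(len(compact) - 1)}

def jaccard(a, b):
    """Jaccard similarity of the two bigram sets: |A & B| / |A | B|."""
    set_a, set_b = bigrams(a), bigrams(b)
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

def best_match(entry, suburbs):
    """Return the suburb whose bigram set overlaps most with the entry."""
    return max(suburbs, key=lambda suburb: jaccard(entry, suburb))

candidates = ["STONEFIELDS", "SUNNYHILLS", "SILVERDALE"]
print(best_match("STONFIELDS", candidates))
```

A score of 1.0 means the bigram sets are identical and 0.0 means they share nothing, so a cutoff can be used to reject entries that resemble no known suburb.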

The list of suburbs I want to match against:

['ABBOTSFORD', 'ACACIA BAY', 'ADDISON', 'AHUROA', 'AIRPORT OAKS', 'ALBANY', 'ALBANY HEIGHTS', 'ALBANY NORTH', 'ALBANY SOUTH', 'ALBANY WEST', 'ALEXANDRA', 'ALFRISTON', 'ALGIES BAY', 'AOTEA', 'ARARIMU', 'ARCH HILL', 'ARDMORE', 'ARKLES  BAY', 'ARKLES BAY', 'ARMY BAY', 'AUCKLAND CENTRAL', 'AVONDALE', 'AVONDALE NORTH', 'AVONDALE SOUTH', 'AVONDALE WEST', 'AWANUI', 'AWHITU', 'BALMORAL', 'BAYSWATER', 'BAYVIEW', 'BEACH HAVEN', 'BEACH HAVEN NORTH', 'BEACH HAVEN SOUTH', 'BEACHLANDS', 'BELLEVUE', 'BELMONT', 'BETHELLS BEACH', 'BIRKDALE', 'BIRKDALE NORTH', 'BIRKDALE SOUTH', 'BIRKENHEAD', 'BIRKENHEAD POINT', 'BLOCKHOUSE BAY', 'BOMBAY', 'BOTANY', 'BOTANY DOWNS', 'BROOKBY', 'BROWNS BAY', 'BUCKLANDS BEACH', 'BURSWOOD', 'CAMPBELLS BAY', 'CASTOR BAY', 'CHAPEL DOWNS', 'CHATSWOOD', 'CHELTENHAM', 'CLAUDELANDS', 'CLENDON PARK', 'CLEVEDON', 'CLOVER PARK', 'COATESVILLE', 'COCKLE BAY', 'CONIFER GROVE', 'CROWN HILL', 'DAIRY FLAT', 'DANNEMORA', 'DARGAVILLE', 'DEVONPORT', 'DOME FOREST', 'EAST TAMAKI', 'EASTERN BEACH', 'EDEN TERRACE', 'ELLERSLIE', 'EPSOM', 'EPSOM NORTH', 'EPSOM SOUTH', 'FAIRVIEW HEIGHTS', 'FARM COVE', 'FAVONA', 'FLAT BUSH', 'FORREST HILL', 'FREEMANS BAY', 'GLEN EDEN', 'GLEN INNES', 'GLENBROOK', 'GLENDENE', 'GLENDOWIE', 'GLENFIELD', 'GOLFLANDS', 'GOODWOOD HEIGHTS', 'GRAFTON', 'GREAT BARRIER ISLAND', 'GREEN BAY', 'GREENHITHE', 'GREENLANE', 'GREENWOODS CORNER', 'GREY LYNN', 'GULF HARBOUR', 'HALF MOON BAY', 'HATFIELDS BEACH', 'HAURAKI', 'HELENSVILLE', 'HENDERSON', 'HERALD ISLAND', 'HERNE BAY', 'HIGHBURY', 'HIGHLAND PARK', 'HILL PARK', 'HILLCREST', 'HILLSBOROUGH', 'HOBSONVILLE', 'HOKIANGA', 'HOMAI', 'HOWICK', 'HUAPAI', 'HUIA', 'HUNTERS CORNER', 'HUNTINGTON', 'HUNTINGTON PARK', 'HUNTLY', 'HUNUA', 'ISLAND BLOCK', 'JACKS BAY', 'KAEO', 'KAIAUA', 'KAIKOHE', 'KAINGAROA', 'KAIPARA FLATS', 'KAITAIA', 'KAIWAKA', 'KAMO', 'KAMO EAST', 'KAMO WEST', 'KARAKA', 'KARAKA SOUTH', 'KAREKARE', 'KARIKARI', 'KARIKARI PENINSULA', 'KARORI', 'KAUKAPAKAPA', 'KAWAU ISLAND', 'KELSTON', 'KERIKERI', 'KINGSLAND', 
'KOHIMARAMA', 'KONINI', 'KUMEU', 'LAINGHOLM', 'LINCOLN', 'LONG BAY', 'LONGFORD PARK', 'LYNFIELD', 'MAHIA PARK', 'MAIRANGI BAY', 'MANGAWHAI', 'MANGERE', 'MANGERE BRIDGE', 'MANGERE EAST', 'MANLY', 'MANUKAU', 'MANUKAU HEIGHTS', 'MANUREWA', 'MANUREWA EAST', 'MARAETAI', 'MARLBOROUGH', 'MASSEY', 'MAUNGATOROTO', 'MAUNU', 'MCLAREN PARK', 'MEADOWBANK', 'MEADOWLANDS', 'MEADOWOOD', 'MELLONS BAY', 'MIDDLEMORE', 'MILFORD', 'MILLWATER', 'MISSION BAY', 'MORNINGSIDE', 'MOUNT ALBERT', 'MOUNT EDEN', 'MOUNT ROSKILL', 'MOUNT WELLINGTON', 'MURRAYS BAY', 'NARROW NECK', 'NEW LYNN', 'NEW PLYMOUTH', 'NEW WINDSOR', 'NEWMARKET', 'NEWTON', 'NGUNGURU', 'NORTH HARBOUR', 'NORTH PARK', 'NORTHCOTE', 'NORTHCOTE CENTRAL', 'NORTHCOTE POINT', 'NORTHCROSS', 'OKURA', 'ONE TREE HILL', 'ONEHUNGA', 'OPAHEKE', 'OPONONI', 'ORAKEI', 'ORANGA', 'ORATIA', 'ORERE POINT', 'OREWA', 'OTAHUHU', 'OTARA', 'OTEHA', 'OWAIRAKA', 'PAHUREHURE', 'PAKURANGA', 'PANMURE', 'PAPAKURA', 'PAPATOETOE', 'PARAKAI', 'PAREMOREMO', 'PARNELL', 'PATUMAHOE', 'PENROSE', 'PIHA', 'PINEHILL', 'POINT CHEVALIER', 'POINT ENGLAND', 'POINT WELLS', 'PONSONBY', 'PORCHESTER PARK', 'PORT ALBERT', 'PUHINUI', 'PUHOI', 'RANDWICK PARK', 'RANUI', 'RED BEACH', 'RED HILL', 'REMUERA', 'RICHMOND PARK', 'RIVERHEAD', 'ROSEDALE', 'ROSEHILL', 'ROTHESAY BAY', 'ROYAL HEIGHTS', 'ROYAL OAK', 'RUATANGATA', 'SAINT HELIERS', 'SAINT JOHNS', 'SAINT MARYS BAY', 'SANDRINGHAM', 'SANDSPIT', 'SCHNAPPER ROCK', 'SHELLY BEACH', 'SHELLY PARK', 'SILKWOOD HEIGHTS', 'SILVERDALE', 'SOMERVILLE', 'STANLEY BAY', 'STANLEY POINT', 'STANMORE BAY', 'STONEFIELDS', 'SUNNYHILLS', 'SUNNYNOOK', 'SUNNYVALE', 'SWANSON', 'TAKANINI', 'TAKAPUNA', 'TAMAKI', 'TARADALE', 'TE ATATU', 'TE ATATU PENINSULA', 'TE ATATU SOUTH', 'THE GARDENS', 'THREE KINGS', 'TIKIPUNGA', 'TINDALLS BAY', 'TITIRANGI', 'TORBAY', 'TOTARA HEIGHTS', 'TOTARA VALE', 'Unkown', 'UNSWORTH HEIGHTS', 'WAIAKE', 'WAIHEKE', 'WAIKOWHAI', 'WAIMAUKU', 'WAIRAU VALLEY', 'WAIUKU', 'WAIWERA', 'WARKWORTH', 'WATERVIEW', 'WATTLE COVE', 'WATTLE DOWNS', 
'WELLSFORD', 'WEST HARBOUR', 'WESTERN HEIGHTS', 'WESTERN SPRINGS', 'WESTGATE', 'WESTLAKE', 'WESTMERE', 'WEYMOUTH', 'WHAKATANE', 'WHAKATIWAI', 'WHANGAPARAOA', 'WHANGAREI', 'WHANGAREI HEADS', 'WHAREORA', 'WHENUAPAI', 'WHITFORD', 'WINDSOR PARK', 'WIRI', 'MOUNT MANGANUI, BAY OF PLENTY', 'GLENBERVIE, NEW ZEALAND', 'MATAPOURI, NORTHLAND', 'MAUNGATAPERE, NORTHLAND', 'PAIHIA, NORTHLAND', 'RAWENE, NORTHLAND', 'WAIMAMAKU, NORTHLAND', 'MEREMERE, WAIKATO', 'MORRINSVILLE, WAIKATO', 'TE KAUWHATA, WAIKATO', 'KENSINGTON, WHANGAREI', 'ONE TREE POINT, WHANGAREI', 'ONERAHI, WHANGAREI', 'RAUMANGA, WHANGAREI', 'RAUMAUNGA, WHANGAREI']

Here is a small sample of the badly entered suburbs:

['STONFIELDS', 'MT WELLINGTON', 'RD4 ALBANY', 'HAURAKI, NSC', 'GOODWOOD HGTS', 'TE ATATU STH',  'BROWNS BAY NSC', 'AUCKLAND CETNRAL', '    KAMO', 'POINT CHEVALIER SOUTH', 'UNSWORTH HGTS',  'DAIRY FLAT (NORTH SHORE MAIL CENTRE)', 'MANUREWA  (GOODWOOD HGTS)', 'PAKURANG HEIGHTS', 'MANGRE BRIDGE, MANUKAU', 'STANMORE BAY,WHANGAPARAOA', 'SUNNYNOOK, NORTH SHORE', 'SURFDALAE, WAIHEKE  ISLAND', 'ST HELEIRS', 'HENDESON', 'STAAMORE BAY', 'PT CHEVILIER', 'KARAKA PUKEKOHE', 'ALFRISTON  MANUREWA', 'MOUNT ALBERT (SYMONDS ST)', 'STANMOREBAY', 'UNSWORTH HEIGHTS GLENFIELD', 'TE ATATU PENINSULA, WAITAKERE CITY', 'REMUERA         (REMUERA)', 'DANNEMORA EASTTAMAKI HEIGHTS', 'ST JOHNS, REMUERA', 'TAKAPUNA HAURAKI', 'MANUKAU HTS', 'MANUREWA   MANUKAU', 'ORITIA', 'ONE TREE HIL', 'AVONDLAE NORTH', 'OENRAHI', 'SOMMERVALE', 'GLEN EDEN CENTRAL , WAITAKERE', 'MENGERE', 'MOUNT WELLINGTON  (PAKURANGA)', 'WEST HABRBOUR', 'MTROSKILL', 'WATTLKE DOWNS', 'SILVRDALE', 'MT WELLINGTON  CENTRAL', 'POINT CHVEVALIER', 'DAIRY FLAT, RD4 ALBANY', 'PAKIRI, WELLSFORD', 'BEACHLANDS (BOTANY)', 'ROTHESAY BAY, NSC', 'GRAFTON (SYMONDS STREET)', 'OTATHUHU', 'ST MARYS BAY (MATAKANA)', 'OTARA SOUTH  MANUKAU', 'HOMAI, MANUKAU MANUREWA', 'FLATBUSH: MANUKAU', 'ROEHESAY BAY']

Any ideas gratefully received!
To be clear: I'm hoping people will suggest which functions or libraries I should try first.

1 answer:

Answer 0 (score: 1)

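I have used the fuzzywuzzy Python library in the past to "fuzzy" match strings. This means you can match two strings to within a certain degree of similarity (say 95%).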

https://github.com/seatgeek/fuzzywuzzy

https://marcobonzanini.com/2015/02/25/fuzzy-string-matching-in-python/

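I wouldn't try to reinvent the wheel by writing your own matching algorithm. If this isn't what you need, there are many other libraries that do similar things.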
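One such alternative ships with Python itself: the standard library's difflib module. The sketch below uses difflib.get_close_matches rather than fuzzywuzzy, so it runs without installing anything. The shortened suburb list, the helper name closest_suburb and the 0.8 cutoff are assumptions for illustration and would need tuning against the real data.

```python
import difflib

# A few suburbs from the question; in practice this would be the full list.
SUBURBS = [
    "STONEFIELDS", "HENDERSON", "AUCKLAND CENTRAL",
    "ONE TREE HILL", "ORATIA", "OTAHUHU",
]

def closest_suburb(entry, suburbs=SUBURBS, cutoff=0.8):
    """Return the most similar suburb, or None if nothing reaches the cutoff."""
    cleaned = " ".join(entry.upper().split())  # collapse stray whitespace
    matches = difflib.get_close_matches(cleaned, suburbs, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for raw in ["STONFIELDS", "HENDESON", "AUCKLAND CETNRAL", "ZZZZZZ"]:
    print(repr(raw), "->", closest_suburb(raw))
```

Returning None for entries below the cutoff keeps doubtful rows available for manual review instead of forcing a wrong match.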