How do I find the most probable string pair in a two-column dataset?

Time: 2013-08-27 16:33:24

Tags: python hashmap nested

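Given columns A and B, how can I find, for each item in column A, the most probable item in column B? Would something based on a nested hash map work? I want to do this in Python.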

INPUT:

a,abd37534c7d9a2efb9465de931cd7055ffdb8879563ae98078d6d6d5
a,abd37534c7d9a2efb9465de931cd7055ffdb8879563ae98078d6d6d5
a,abd37534c7d9a2efb9465fghfghfghfghfghrewresdasdzfdghhgfhg
a,abd3753dfrtdgfdg563ae98078d6dfgfdgdfghdgasdaSADFBVFDGFD5
b,c681e18b81edaf2b66dd22376734dba5992e362bc3f91ab225854c17

OUTPUT:

a,abd37534c7d9a2efb9465de931cd7055ffdb8879563ae98078d6d6d5
b,c681e18b81edaf2b66dd22376734dba5992e362bc3f91ab225854c17

1 answer:

Answer 0: (score: 0)

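I'll assume that "most probable" means the column B value that occurs most often for each column A value.

The following should work, though it may contain some syntax issues. In any case, it should give you an idea of how to approach the problem, even if it does not solve it outright.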

tupleList = [('a','abd37534c7d9a2efb9465de931cd7055ffdb8879563ae98078d6d6d5'),
             ('a','abd37534c7d9a2efb9465de931cd7055ffdb8879563ae98078d6d6d5'),
             ('a','abd37534c7d9a2efb9465fghfghfghfghfghrewresdasdzfdghhgfhg'),
             ('a','abd3753dfrtdgfdg563ae98078d6dfgfdgdfghdgasdaSADFBVFDGFD5'),
             ('b','c681e18b81edaf2b66dd22376734dba5992e362bc3f91ab225854c17')]
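# Replace the sample tupleList above with your own list of (a, blah) pairs.

# Count occurrences with a nested hash map: column A -> column B -> count.
myHashMap = {}
for col1, col2 in tupleList:
    if col1 not in myHashMap:
        myHashMap[col1] = {}
    if col2 not in myHashMap[col1]:
        myHashMap[col1][col2] = 0
    myHashMap[col1][col2] += 1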

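# Now iterate over each column A key to find the column B value with the highest count.
for col1 in myHashMap:
    maxKey = ''
    maxVal = 0
    for col2 in myHashMap[col1]:
        # Strict '>' keeps the first value seen when two counts are equal.
        if myHashMap[col1][col2] > maxVal:
            maxVal = myHashMap[col1][col2]
            maxKey = col2
    # print(...) with a single argument works in both Python 2 and Python 3.
    print('Most probable for %s is %s' % (col1, maxKey))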
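
For what it's worth, the same nested-counting idea can probably be written more compactly with the standard library's collections.defaultdict and collections.Counter. This is only a sketch under a few assumptions: the function name most_frequent_pairs is made up for illustration, each input line is assumed to look like "a,value" with the key before the first comma, and ties are left to whatever Counter.most_common returns first.

```python
from collections import Counter, defaultdict


def most_frequent_pairs(lines):
    """Map each column A key to its most frequent column B value.

    Hypothetical helper: assumes each line has the form "key,value".
    """
    counts = defaultdict(Counter)  # column A -> Counter of column B values
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first comma only, so the value may itself contain commas.
        key, value = line.split(',', 1)
        counts[key][value] += 1
    # most_common(1) returns a one-element list of (value, count).
    return {key: counter.most_common(1)[0][0] for key, counter in counts.items()}


rows = [
    'a,abd37534c7d9a2efb9465de931cd7055ffdb8879563ae98078d6d6d5',
    'a,abd37534c7d9a2efb9465de931cd7055ffdb8879563ae98078d6d6d5',
    'a,abd37534c7d9a2efb9465fghfghfghfghfghrewresdasdzfdghhgfhg',
    'a,abd3753dfrtdgfdg563ae98078d6dfgfdgdfghdgasdaSADFBVFDGFD5',
    'b,c681e18b81edaf2b66dd22376734dba5992e362bc3f91ab225854c17',
]

# Print in the same "key,value" layout as the desired output.
for key, value in sorted(most_frequent_pairs(rows).items()):
    print('%s,%s' % (key, value))
```

To read from a file instead, you could pass the open file object straight in, since iterating over it yields one line at a time.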