Unable to get a unique ID for strings in Python 2.7

Time: 2016-09-04 12:26:19

Tags: python string python-2.7 hash unique

I am trying to create unique IDs from a list of words. I want these numbers to be globally unique: if another list comes along, I want the unique IDs to be the same. For example, the ID for "density" might be some particular number, and it should be that same number if "density" occurs in a different list.

As you can see, my current approach of calling id() on the interned string does not work: the ID for rrb is exactly the same as the ID for lrb.

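What am I doing wrong? The reason I need to convert the words to floats or numbers is that the sentence above goes into a classifier that requires numeric/vectorized features.

4 answers:

Answer 0 (score: 2)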

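From the docs:

Interned strings are not immortal (like they used to be in Python 2.2 and before); you must keep a reference to the return value of intern() around to benefit from it.

When the next string is interned, the previous one may already have been garbage collected, so the new string can occasionally end up with the same id. The fix is to keep the references in a container. I would use a dict: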

featureList = [u'guinea', u'bissau', u'compared', u'countriesthe', u'population', u'density', u'guinea', u'bissau', u'similar', u'iran', u'afghanistan', u'cameroon', u'panama', u'montenegro', u'guinea', u'belarus', u'palau', u'location_slot', u'south', u'africa', u'respective', u'population', u'density', u'lrb', u'capita', u'per', u'square', u'kilometer', u'rrb', u'global', u'rank', u'number_slot', u'years', u'growthguinea', u'bissau', u'population', u'density', u'positive', u'growth', u'lrb', u'rrb', u'last', u'years', u'lrb', u'rrb', u'LOCATION_SLOT~-appos+LOCATION~-prep_of', u'LOCATION~-prep_of+that~-prep_to', u'that~-prep_to+similar~prep_with', u'similar~prep_with+density~prep_of', u'density~prep_of+NUMBER~appos', u'NUMBER~appos+NUMBER~amod', u'NUMBER~amod+NUMBER_SLOT']

# dict of id:featureVal pairs 
seen = {}

for featureID,featureVal in enumerate(featureList):
    print "featureID is",featureID
    print "featureVal is ",featureVal
    interned = intern(str(featureVal.encode("utf-8")))
    interned_id = id(interned)

    # ensure that no other string with the same id has been seen
    assert interned_id not in seen or seen[interned_id] == featureVal

    # change this to seen[interned_id] = None and you'll (probably) get AssertionError
    # from the line above
    seen[interned_id] = interned

    print "Encoded feature value is", interned_id

Answer 1 (score: 1)

You can use the word itself, a hash of the word, or even convert the string to a number.
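As one possible reading of "convert the string to a number", the sketch below interprets the UTF-8 bytes of a word as a single big integer. The helper name word_to_int is hypothetical and not from the answer; the code is written to run on both Python 2.7 and Python 3.

```python
import binascii


def word_to_int(word):
    """Hypothetical helper: read the UTF-8 bytes of `word` as one big integer."""
    data = word.encode("utf-8")
    if not data:
        # hexlify() of empty input is empty, which int() cannot parse
        return 0
    # hexlify() turns the bytes into hex digits; int(..., 16) parses them
    return int(binascii.hexlify(data), 16)


for word in [u"lrb", u"rrb", u"density"]:
    print(word_to_int(word))
```

The same word always gives the same number in any list or process, and different words give different numbers as long as they do not start with NUL bytes. The integers grow with word length, so converting them to float would lose precision once they exceed 53 bits.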

Answer 2 (score: 1)

Perhaps the simplest approach is to use a defaultdict together with an itertools.count that starts from a float, for example:

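# Needed on Python 2.7 so that print() produces the output shown below
from __future__ import print_function
from collections import defaultdict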
from itertools import count

# Start from 1.0 and increment by one - can change to start from any value or even add a step
# eg: `count(716345.0, 9)` will start at 716345.0 and increment by 9 for new keys
unique_id = defaultdict(lambda c=count(1.0): next(c))
featureList = [u'guinea', u'bissau', u'compared', u'countriesthe', u'population', u'density', u'guinea', u'bissau', u'similar', u'iran', u'afghanistan', u'cameroon', u'panama', u'montenegro', u'guinea', u'belarus', u'palau', u'location_slot']
for feature in featureList:
    print(feature, unique_id[feature])

This prints:

guinea 1.0
bissau 2.0
compared 3.0
countriesthe 4.0
population 5.0
density 6.0
guinea 1.0
bissau 2.0
similar 7.0
iran 8.0
afghanistan 9.0
cameroon 10.0
panama 11.0
montenegro 12.0
guinea 1.0
belarus 13.0
palau 14.0
location_slot 15.0

We can do a couple of other checks:

unique_id['cameroon'] 
# 10.0
unique_id['this is new']
# 16.0
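The counter-based IDs above are only consistent as long as the same mapping object is reused. A minimal sketch of carrying the mapping over to a later list, assuming it is acceptable to serialize it with json; make_id_map is a hypothetical helper, and it uses integer IDs rather than floats for simplicity:

```python
import json
from collections import defaultdict
from itertools import count


def make_id_map(existing=None):
    """Hypothetical helper: build an ID map that continues after saved IDs."""
    existing = dict(existing or {})
    # New keys get IDs after the largest one already assigned
    start = max(existing.values()) + 1 if existing else 1
    counter = count(start)
    id_map = defaultdict(lambda: next(counter))
    id_map.update(existing)
    return id_map


ids = make_id_map()
first = [ids[w] for w in [u"guinea", u"density", u"guinea"]]

# Simulate saving the mapping and loading it again for another list
saved = json.dumps(dict(ids))
restored = make_id_map(json.loads(saved))
second = [restored[w] for w in [u"density", u"iran"]]

print(first, second)
```

Here "density" keeps the ID it received in the first list, and the unseen word "iran" gets the next free ID.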

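Answer 3 (score: -1)

You can use Python's built-in hash() function directly. It returns an integer hash that can be used as an ID for any given string. Note that two different strings can occasionally share a hash, and the value may differ across platforms (32-bit/64-bit, operating system, Python version).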

hash("answer")
-8597262460139880008

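If you want the hash to be the same everywhere, you can use Python's hashlib module, but it will not give you a number directly. It returns the hash as a hexadecimal string.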

import hashlib
test = hashlib.sha224()
test.update("HI How are you")
test.hexdigest()
'3284ec5f391e0c6b4f974d3bc317a77bb50875081d2bcb2436fc2001'

You can choose from various algorithms:

 hashlib.algorithms
 ('md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512')
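Since the hexdigest is just a number written in base 16, it can be turned into an integer with int(..., 16). The sketch below assumes a 64-bit ID is wide enough and keeps only the leading hex digits; stable_id and its bits parameter are hypothetical names, and the code runs on both Python 2.7 and Python 3.

```python
import hashlib


def stable_id(word, bits=64):
    """Hypothetical helper: derive an integer ID from a SHA-224 digest."""
    digest = hashlib.sha224(word.encode("utf-8")).hexdigest()
    # Each hex digit carries 4 bits, so keep the first bits // 4 digits
    return int(digest[: bits // 4], 16)


for word in [u"lrb", u"rrb", u"density"]:
    print(word, stable_id(word))
```

The result depends only on the bytes of the word, so it is the same across lists, processes, and machines. Truncating the digest makes collisions more likely than with the full hash, although at 64 bits they remain very unlikely for a vocabulary of words.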