How to compare Twitter Ids in python

Time: 2018-06-04 17:44:04

Tags: python string python-2.7

I'm working with twitter ids which are strings because they are so huge.

Twitter's api has a "Since_id" and I want to search tweets since the earliest tweet in a list.

For example:

tweet_ids = [u'1003659997241401843', u'1003659997241401234234', u'100365999724140136236'] # etc
since_id = min(tweet_ids)

So far min(tweet_ids) works, but I want to understand why. Did it just happen to work on the few samples I gave it, or is it guaranteed to always work?

Edit: To clarify, I need to get the lowest tweet id. How do I get the lowest tweet id if the ids are strings whose values are > 2^32-1 and therefore (as far as I know) can't be represented as integers in python 2.7 on a 32 bit machine?

I am using python 2.7 if that matters

2 answers:

Answer 0 (score: 0)

Python will compare these strings exactly as it compares any other strings; that is, it will compare them lexicographically.

Thus, it will put 12 before 2, which may be undesirable for you.
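As a small illustration of the point above, the following sketch runs both comparisons on the IDs from the question. The variable names other than tweet_ids are made up for this example.

```python
# IDs copied from the question; the u'' prefix is also accepted by Python 3.3+.
tweet_ids = [u'1003659997241401843', u'1003659997241401234234', u'100365999724140136236']

# Strings are compared character by character, so '12' sorts before '2'.
short_example = '12' < '2'

# min() on strings returns the lexicographically smallest element.
lexicographic_min = min(tweet_ids)

# Converting to integers first gives the numerically smallest ID.
numeric_min = min(int(t) for t in tweet_ids)

print(short_example)
print(lexicographic_min)
print(numeric_min)
```

On this list the two results differ: string comparison picks the 22-digit ID because its 17th character, '2', is the smallest at the first position where the strings differ, while the numeric minimum is the 19-digit ID.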

Here's a function that will compute the numerical minimum of strings representing integers for you.

# A is a non-empty sequence of strings representing non-negative integers
# without leading zeros.
def numerical_min(A):
    cur_min = A[0]
    for x in A[1:]:
        # A shorter digit string always represents the smaller number.
        if len(x) < len(cur_min):
            cur_min = x
        # For equal lengths, lexicographic order matches numeric order.
        elif len(x) == len(cur_min) and x < cur_min:
            cur_min = x
    return cur_min
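The same idea (compare by length first, then character by character) can probably be written more compactly with a key function. This is only a sketch: the name numerical_min_by_key is made up here, and it assumes every string is a non-negative integer with no sign, whitespace, or leading zeros.

```python
def numerical_min_by_key(ids):
    # Key is (length, text): shorter strings win, and strings of equal
    # length fall back to ordinary lexicographic comparison.
    # Assumption: digits only, no leading zeros, no signs.
    return min(ids, key=lambda s: (len(s), s))


tweet_ids = [u'1003659997241401843', u'1003659997241401234234', u'100365999724140136236']
since_id = numerical_min_by_key(tweet_ids)
print(since_id)
```

No integer conversion takes place, so the result is one of the original strings.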

Answer 1 (score: 0)

According to the Python documentation, all strings, including strings that are long sequences of digits as in your case, are compared lexicographically. This has some consequences:

  • The "lesser integer" string "2" is greater than the "greater integer" string "100", because the first characters are compared first and "2" comes after "1".
  • Negative integers do not sort correctly either. The hyphen (ASCII 45) sorts before every digit (ASCII 48 to 57), so negative strings sort before non-negative ones, but among negatives the order is reversed: "-1" is less than "-2" when compared as strings.
  • Equal integers "2" and "02" aren't necessarily equal in terms of string comparison. "02" is less than "2" string-wise because of the leading zero.

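It is better to convert each str to an integer and then compare. Python 2.7 integers are promoted to arbitrary-precision longs automatically, even on a 32-bit machine, so values above 2^32-1 are not a problem. For example: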

  • tweet_ids = [long('1003659997241401843'), long('1003659997241401234234'), long('100365999724140136236')]
  • since_id = min(tweet_ids)

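Integers this large are not handled exactly by every JSON consumer, so since_id should end up as a str again. The simplest way is to keep tweet_ids as the original list of strings and replace the since_id line with

  • since_id = min(tweet_ids, key=int)

which compares the IDs by numeric value but returns the original string.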
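A minimal sketch of that approach on the IDs from the question follows. The helper name lowest_tweet_id is hypothetical, and the code assumes every element parses with int().

```python
def lowest_tweet_id(ids):
    """Return the numerically smallest ID as the original string."""
    # key=int makes min() compare numeric values, but min() still returns
    # the original element, so nothing has to be converted back to str.
    return min(ids, key=int)


tweet_ids = [u'1003659997241401843', u'1003659997241401234234', u'100365999724140136236']
since_id = lowest_tweet_id(tweet_ids)

# The value is far above 2**32 - 1, yet int() handles it without overflow.
fits_in_32_bits = int(since_id) <= 2**32 - 1

print(since_id)
print(fits_in_32_bits)
```

One design note: because the comparison goes through int(), strings with leading zeros such as "02" and "2" are treated as equal, and min() returns whichever appears first in the list.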