Is there a faster way? (Python Twitter locations)

Time: 2014-12-26 18:39:54

Tags: python algorithm twitter python-twitter

I am trying to return a dictionary that groups tweets by the nearest state center. I iterate over all the tweets, and for each tweet I check every state to see which state's center is closest.

What would be a better way to do this?

def group_tweets_by_state(tweets):
    """The keys of the returned dictionary are state names, and the values are
    lists of tweets that appear closer to that state's center than to any other.

    tweets -- a sequence of tweet abstract data types
    """
    tweets_by_state = {}
    for tweet in tweets:
        position = tweet_location(tweet)
        min, result_state = 100000, 'CA'
        for state in us_states:
            if geo_distance(position, find_state_center(us_states[state])) < min:
                min = geo_distance(position, find_state_center(us_states[state]))
                result_state = state
        if result_state not in tweets_by_state:
            tweets_by_state[result_state] = []
            tweets_by_state[result_state].append(tweet)
        else:
            tweets_by_state[result_state].append(tweet)
    return tweets_by_state

1 Answer:

Answer 0 (score: 5)

When the number of tweets is very large, every small improvement inside that big for loop adds up to a large performance gain. A few things come to mind:

1. Call geo_distance() only once per state, especially when it is expensive:

distance = geo_distance(position, find_state_center(us_states[state]))
if distance < min:
    min = distance

instead of

if geo_distance(position, find_state_center(us_states[state])) < min:
    min = geo_distance(position, find_state_center(us_states[state]))

2. If positions tend to repeat often, cache the closest state per position:

position_closest_state = {}  # cache: position -> closest state already found
tweets_by_state = {}
for tweet in tweets:
    position = tweet_location(tweet)
    min, result_state = 100000, 'CA'

    if position in position_closest_state:
        result_state = position_closest_state[position]
    else:
        for state in us_states:
            distance = geo_distance(position, find_state_center(us_states[state]))
            if distance < min:
                min = distance
                result_state = state
        position_closest_state[position] = result_state

So, suppose you have 1000 tweets coming from 200 distinct locations and us_states has 50 entries: your original algorithm calls geo_distance() 1000 * 50 * 2 times, while now that drops to 200 * 50 * 1 calls.
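If you want to sanity-check those call counts on your own data, a small counting wrapper works; this is only an illustrative sketch, and fake_geo_distance is a stand-in rather than part of the original code:

from functools import wraps

def count_calls(fn):
    """Wrap fn so the number of times it gets invoked can be inspected."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return fn(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

# Wrap the real function the same way, e.g. geo_distance = count_calls(geo_distance).
@count_calls
def fake_geo_distance(a, b):
    return abs(a - b)

for position in [1, 2, 1, 2, 1]:   # 5 "tweets" from 2 distinct positions
    for center in [0, 10, 20]:     # 3 "state centers"
        fake_geo_distance(position, center)

print(fake_geo_distance.calls)     # 15 = tweets * states, since nothing is cached yet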

3. Reduce the number of calls to find_state_center()

Similar to #2: right now it is called redundantly once per tweet per state, even though each state's center never changes. Compute the centers once up front:

state_center_dict = {}
for state in us_states:
    state_center_dict[state] = find_state_center(us_states[state])

position_closest_state = {}  # cache: position -> closest state already found
tweets_by_state = {}
for tweet in tweets:
    position = tweet_location(tweet)
    min, result_state = 100000, 'CA'

    if position in position_closest_state:
        result_state = position_closest_state[position]
    else:
        for state in us_states:
            distance = geo_distance(position, state_center_dict[state])
            if distance < min:
                min = distance
                result_state = state
        position_closest_state[position] = result_state

Now find_state_center() is called only 50 times (the number of states) instead of 50 * 1000 (states times tweets), another big improvement!

Summary of the performance gains

With #1 we roughly double the performance. With #2 we improve it by a factor of (number of tweets / number of distinct positions). #3 is the biggest of the three: compared to the original code, it cuts the number of find_state_center() calls down to 1/(number of tweets) of what it was.
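For reference, here is a minimal sketch of the whole function with all three suggestions combined (same helper names as in the question; using min() with a key function and dict.setdefault() is an extra stylistic choice, not something the steps above require):

def group_tweets_by_state(tweets):
    """Group tweets by the state whose center is closest to each tweet."""
    # #3: compute every state's center exactly once.
    state_centers = {state: find_state_center(us_states[state])
                     for state in us_states}

    position_closest_state = {}  # #2: memoize the closest state per position
    tweets_by_state = {}
    for tweet in tweets:
        position = tweet_location(tweet)
        if position not in position_closest_state:
            # #1: geo_distance() is evaluated once per state inside min().
            position_closest_state[position] = min(
                state_centers,
                key=lambda state: geo_distance(position, state_centers[state]))
        result_state = position_closest_state[position]
        tweets_by_state.setdefault(result_state, []).append(tweet)
    return tweets_by_state

Like the caching in #2, this assumes the position values are hashable.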