Getting the density of messages for distinct IDs

Time: 2015-03-20 11:57:25

Tags: python

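Imagine there are 10 houses, each of which can hold anywhere from one to an unlimited number of people. Each of these people sends a number of messages, each containing their user ID and the house number. This can be anywhere from 1 to an unlimited number of messages. I want to know the average number of messages sent per person, for each house, so that I can later determine which house got the highest average number of messages.

Now that I have explained it conceptually: the houses are not houses but latitudes, e.g. from -90 to -89 and so on. A person can send messages from different houses.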

So I have a database with latitude and senderID. I want to plot the density of latitudes per unique senderID, that is, for each latitude in an interval:

Number of rows / Number of unique userids

Here is a sample input:

lat = [-83.76, -44.88, -38.36, -35.50, -33.99, -31.91, -27.56, -22.95,
        40.72,  47.59,  54.42,  63.84,  76.77, 77.43, 78.54]

userid= [5, 7, 6, 6, 6, 6, 5, 2,
         2, 2, 1, 5, 10, 9 ,8]

Here are the corresponding densities:

-80 to -90: 1
-40 to -50: 1
-30 to -40: 4
-20 to -30: 1
  40 to 50: 2
  50 to 60: 1
  60 to 70: 1
  70 to 80: 1

Another input:

lat = [70,70,70,70,70,80,80,80]
userid = [1,2,3,4,5,1,1,2]

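The density at latitude 70 is 1, while the density at latitude 80 is 1.5.

If I were doing this through database queries/pseudocode, I would do something like this: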

SELECT count(latitude) FROM messages WHERE latitude < 79 AND latitude > 69
SELECT count(distinct userid) FROM messages WHERE latitude < 79 AND latitude > 69

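The density would be count(latitude)/count(distinct userid), which can also be read as totalmessagesFromCertainLatitude / distinctUserIds. This would be repeated for each interval from -90 to 90, i.e. -90<latitude<-89 up to 89<latitude<90.

It may be a long stretch to get any help with this, but I cannot organize my thoughts well enough to do it while being confident there are no errors. I would be happy with anything. I am sorry if I was unclear.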
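For reference, the ratio described in the question can be sketched in plain Python, assuming the data is already in two lists. The function name `density_per_bin` and its `width` parameter are illustrative and not part of the original question. Bins here are half-open, `[lower, lower + width)`, and keyed by their lower edge, which differs slightly from the strict `<` and `>` comparisons in the pseudocode.

```python
from collections import defaultdict
from math import floor


def density_per_bin(lat, userid, width=1):
    """Return {bin lower edge: message count / distinct sender count}."""
    rows = defaultdict(int)    # number of messages per bin
    users = defaultdict(set)   # distinct sender IDs per bin
    for la, uid in zip(lat, userid):
        lower = floor(la / width) * width
        rows[lower] += 1
        users[lower].add(uid)
    return {b: rows[b] / len(users[b]) for b in sorted(rows)}


# Second example from the question, using ten-degree bins.
lat = [70, 70, 70, 70, 70, 80, 80, 80]
userid = [1, 2, 3, 4, 5, 1, 1, 2]
print(density_per_bin(lat, userid, width=10))  # {70: 1.0, 80: 1.5}
```

Empty bins simply do not appear in the result, so no division by zero can occur.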

4 answers:

Answer 0: (score: 2)

Because it is neatly wrapped up in pandas' built-ins, this is likely to be fast in pandas for large datasets.

lat = [-83.76, -44.88, -38.36, -35.50, -33.99, -31.91, -27.56, -22.95,
        40.72,  47.59,  54.42,  63.84,  76.77, 77.43, 78.54]

userid= [5, 7, 6, 6, 6, 6, 5, 2,
         2, 2, 1, 5, 10, 9 ,8]
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# list(...) so the zip is materialized under Python 3
df = pd.DataFrame(list(zip(userid, lat)), columns=['userid', 'lat'])
zonewidth = 10  # ten-degree zones
#zonewidth = 1  # one-degree zones
# floor division gives the lower edge of each zone, also for negative latitudes
df['zone'] = (df.lat // zonewidth) * zonewidth

dfz = df.groupby('zone')  # GroupBy object; iterating yields (zone, sub-DataFrame) pairs

#for k, v in dfz: # useful for exploring the GroupBy object
#    print(k, v.userid.values, float(len(v.userid.values))/len(set(v.userid.values)))

# (zone lower edge, number of rows / number of distinct userids) per zone
p = [(k, float(len(v.userid.values))/len(set(v.userid.values))) for k, v in dfz]

# plotting could be tightened up -- PatchCollection?  
R = [Rectangle((x, 0), zonewidth, y, facecolor='red', edgecolor='black',fill=True) for x, y in p]
fig, ax = plt.subplots()
for r in R:
    ax.add_patch(r)
plt.xlim((-90, 90))
tall = max([r.get_height() for r in R])
plt.ylim((0, tall + 0.5))
plt.show()

For the first set of test data:

[Image: plot produced by the code above]
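As a more compact variant, offered as a sketch rather than as this answer's own method, the two counts can be taken directly from the grouped `userid` column with `agg`. The names `counts` and `density` are illustrative.

```python
import numpy as np
import pandas as pd

lat = [-83.76, -44.88, -38.36, -35.50, -33.99, -31.91, -27.56, -22.95,
        40.72,  47.59,  54.42,  63.84,  76.77, 77.43, 78.54]
userid = [5, 7, 6, 6, 6, 6, 5, 2,
          2, 2, 1, 5, 10, 9, 8]

df = pd.DataFrame({'userid': userid, 'lat': lat})
df['zone'] = np.floor(df['lat'] / 10) * 10  # lower edge of each ten-degree zone

# Rows per zone ('size') and distinct user IDs per zone ('nunique').
counts = df.groupby('zone')['userid'].agg(['size', 'nunique'])
density = counts['size'] / counts['nunique']
print(density)
```

The resulting Series is indexed by the zone's lower edge, so `density.loc[-40.0]` is the value for the zone from -40 to -30.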

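Answer 1: (score: 1)

I am not 100% sure I have understood the output you want, but this produces a stepped, cumulative, histogram-like plot, with (binned) latitude on the x-axis and the density you defined above on the y-axis.

From your sample code, you already have numpy installed and are happy to use it. The approach I would take is to build two datasets much like the ones your SQL sample would return, use them to compute the density, and then plot it. Using your existing latitude/userid data format, it might look something like this.

Edit: removed the first version of the code from here, along with some comments that became redundant after the OP's clarification and question edit.


Following the comments and the OP's clarification, I think this is what is wanted: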

import numpy as np
import matplotlib.pyplot as plt
from itertools import groupby

def draw_hist(latitudes,userids):
    min_lat = -90
    max_lat = 90
    binwidth = 1

    bin_range = np.arange(min_lat,max_lat,binwidth)

    # 1-based index of the bin each latitude falls into
    binned_latitudes = np.digitize(latitudes,bin_range)
    # sorted() returns lists (zip is a one-shot iterator in Python 3),
    # ordered by bin so that groupby sees each bin as one run
    all_in_bins = sorted(zip(binned_latitudes,userids))
    unique_in_bins = sorted(set(all_in_bins))

    bin_count_all = []
    for bin, group in groupby(all_in_bins, lambda x: x[0]):
        bin_count_all += [(bin, len([k for k in group]))]

    bin_count_unique = []
    for bin, group in groupby(unique_in_bins, lambda x: x[0]):
        bin_count_unique += [(bin, len([ k for k in group]))]        

    # bin_count_all and bin_count_unique now contain the data
    # corresponding to the SQL / pseudocode in your question
    # for each latitude bin

    bin_density = [(bin_range[b-1],a*1.0/u) for ((b,a),(_,u)) in zip(bin_count_all, bin_count_unique)]

    bin_density =  np.array(bin_density).transpose()

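    # plot as standard bar - note you can put uneven widths in 
    # as an array-like here if necessary
    # the * simply unpacks the x and y values from the density
    # align='edge' starts each bar at its bin's lower edge
    # (newer matplotlib versions center bars by default)
    plt.bar(*bin_density, width=binwidth, align='edge')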
    plt.show()
    # can save away plot here if desired


latitudes = [-70.5, 5.3, 70.32, 70.43, 5, 32, 80, 80, 87.3]
userids = [1,1,2,2,4,5,1,1,2]

draw_hist(latitudes,userids)

Example output on the OP's dataset with different bin widths:

[Image: Output with bin widths 0.1, 1 and 10]
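To make the `bin_range[b-1]` lookup in the code above easier to follow, here is a small sketch of what `np.digitize` returns with its default settings. The sample latitudes are made up for illustration.

```python
import numpy as np

bin_range = np.arange(-90, 90, 1)        # lower edges -90, -89, ..., 89
samples = np.array([-83.76, 70.0, 89.5])  # illustrative latitudes

# For increasing bins, digitize returns i such that
# bin_range[i-1] <= x < bin_range[i]; values past the last edge get len(bin_range).
idx = np.digitize(samples, bin_range)
lower_edges = bin_range[idx - 1]          # map each index back to its bin's lower edge
print(idx, lower_edges)
```

Here the three samples map to the bins starting at -84, 70 and 89. A value below the first edge would get index 0, so `idx - 1` would wrap around to the last bin; that cannot happen for valid latitudes with edges starting at -90.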

Answer 2: (score: 0)

I think this solves the problem, although it is not efficient at all:

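import sqlite3 as lite
import time
from operator import itemgetter

import matplotlib.pyplot as plt

# databasepath must hold the path to the SQLite database file
con = lite.connect(databasepath)
binwidth = 1
latitudes = []
userids = []
info = []
densities = []
with con:
    cur = con.cursor()
    cur.execute('SELECT latitude, userid FROM dynamicMessage')
    con.commit()
    print("executed")
    # read all rows as [latitude, userid] pairs of floats
    while True:
        tmp = cur.fetchone()
        if tmp is not None:
            info.append([float(tmp[0]),float(tmp[1])])
        else:
            break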
    info = sorted(info, key=itemgetter(0))
    for x in info:
        latitudes.append(x[0])
        userids.append(x[1])
    x = 0
    latitudecount = 0
    for b in range(int(min(latitudes)),int(max(latitudes))+1):
        numlatitudes = sum(i<b for i in latitudes)
        if numlatitudes > 1:
            tempdensities = latitudes[0:numlatitudes]
            latitudes = latitudes[numlatitudes:]
            tempuserids = userids[0:numlatitudes]
            userids = userids[numlatitudes:]
            # integer division, which is what Python 2's / did for two ints
            density = numlatitudes//len(set(tempuserids))
            if density>1:
                tempdensities = [b]*int(density)
                densities.extend(tempdensities)
    plt.hist(densities, bins=len(list(set(densities))))
    plt.savefig('latlongstats'+'t'+str(time.strftime("%H:%M:%S")), format='png')
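One way to avoid pulling every row into Python is to let SQLite do both counts per bin, much like the pseudocode in the question. The following is a sketch under stated assumptions: it uses an in-memory database filled with the question's second example, keeps the `dynamicMessage` table and its `latitude` and `userid` columns from the code above, and uses half-open bins, so a latitude of exactly 90 would not be counted.

```python
import sqlite3

# In-memory stand-in for the real database, filled with the second example.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE dynamicMessage (latitude REAL, userid INTEGER)')
rows = zip([70, 70, 70, 70, 70, 80, 80, 80], [1, 2, 3, 4, 5, 1, 1, 2])
con.executemany('INSERT INTO dynamicMessage VALUES (?, ?)', rows)

binwidth = 10
densities = {}
for lower in range(-90, 90, binwidth):
    # Total messages and distinct senders in [lower, lower + binwidth).
    total, distinct = con.execute(
        'SELECT COUNT(*), COUNT(DISTINCT userid) FROM dynamicMessage '
        'WHERE latitude >= ? AND latitude < ?',
        (lower, lower + binwidth),
    ).fetchone()
    if distinct:  # skip empty bins to avoid dividing by zero
        densities[lower] = total / distinct
con.close()
print(densities)  # {70: 1.0, 80: 1.5}
```

This issues one query per bin, so an index on `latitude` would probably matter for a large table.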

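Answer 3: (score: 0)

What follows is not a complete solution for plotting the desired histogram, but I think it is worth reporting

  1. The bulk of the solution: we scan the array of tuples to select those within the desired range, and we count

    • the number of selected tuples
    • the unique IDs, using the trick of creating a set (which automatically discards duplicates) and counting its members

    Finally we return the required ratio, or zero if the number of distinct IDs is zero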

    def ratio(d, mn, mx):
        tmp = [(lat, uid) for lat, uid in d if mn <= lat < mx]
        nlats, nduids = len(tmp), len({t[1] for t in tmp})
        return 1.0*nlats/nduids if nduids>0 else 0
    
  2. Input the data and use zip to turn it into a list of tuples

    lat = [-83.76, -44.88, -38.36, -35.50, -33.99, -31.91, -27.56, -22.95,
           -19.00, -12.32,  -6.14,  -1.11,   4.40,  10.23,  19.40,  31.18,
            40.72,  47.59,  54.42,  63.84,  76.77]
    userid= [52500.0, 70100.0, 35310.0, 47776.0, 70100.0, 30991.0, 37328.0, 25575.0,
             37232.0,  6360.0, 52908.0, 52908.0, 52908.0, 77500.0,   345.0,  6360.0,
              3670.0, 36690.0,  3720.0,  2510.0,  2730.0]
    data = list(zip(lat,userid))  # a list, so every ratio() call can rescan it
    
  3. Prepare the bins

    extremes = range(-90,91,10)
    intervals = zip(extremes[:-1],extremes[1:])
    
  4. The actual computation; the result is a list of floats that can be passed to the relevant pyplot function

    ratios = [ratio(data,*i) for i in intervals]
    print(ratios)
    # [1.0, 0, 0, 0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 0, 1.0, 1.0, 1.0, 1.0, 1.0, 0]
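To complete the picture, here is one possible way to plot those ratios, as a sketch only. The choice of `bar`, the axis labels and the `Agg` backend are assumptions made for illustration, not part of the answer.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, assumed so the sketch runs headless
import matplotlib.pyplot as plt

# Ratios as printed above, one per ten-degree interval from -90 to 90.
ratios = [1.0, 0, 0, 0, 1.0, 1.0, 1.0, 1.0, 2.0,
          1.0, 1.0, 0, 1.0, 1.0, 1.0, 1.0, 1.0, 0]
lefts = list(range(-90, 90, 10))  # lower edge of each interval

fig, ax = plt.subplots()
# align='edge' makes each bar span its interval from the lower edge.
bars = ax.bar(lefts, ratios, width=10, align='edge', edgecolor='black')
ax.set_xlim(-90, 90)
ax.set_xlabel('latitude')
ax.set_ylabel('messages per distinct userid')
```

From here, `fig.savefig(...)` would write the figure to a file, or `plt.show()` would display it if an interactive backend is used instead.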