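I have a script that assigns a value based on two columns in a pandas df. The code below handles the first step, but I'm struggling with the second.
So the script should initially:
1) Assign a Person for each individual string in [Area] and the first 3 unique values in [Place]
2) Look to reassign People with fewer than 3 unique values
Example. The df below has 6 unique combinations of [Area] and [Place], but 3 People are assigned. Ideally, 2 People would have 3 unique values each.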
import numpy as np
import pandas as pd

d = ({
    'Time' : ['8:03:00','8:17:00','8:20:00','10:15:00','10:15:00','11:48:00','12:00:00','12:10:00'],
    'Place' : ['House 1','House 2','House 1','House 3','House 4','House 5','House 1','House 1'],
    'Area' : ['X','X','Y','X','X','X','X','X'],
    })
df = pd.DataFrame(data=d)

def g(gps):
    # number the unique Places of one Area in blocks of 3: 1,1,1,2,2,2,...
    s = gps['Place'].unique()
    d = dict(zip(s, np.arange(len(s)) // 3 + 1))
    gps['Person'] = gps['Place'].map(d)
    return gps

# group_keys=False keeps the original row index on pandas >= 2.0
df = df.groupby('Area', sort=False, group_keys=False).apply(g)
s = df['Person'].astype(str) + df['Area']
df['Person'] = pd.Series(pd.factorize(s)[0] + 1).map(str).radd('Person ')
Output:
Time Place Area Person
0 8:03:00 House 1 X Person 1
1 8:17:00 House 2 X Person 1
2 8:20:00 House 1 Y Person 2
3 10:15:00 House 3 X Person 1
4 10:15:00 House 4 X Person 3
5 11:48:00 House 5 X Person 3
6 12:00:00 House 1 X Person 1
7 12:10:00 House 1 X Person 1
As you can see, the first step works fine. For each string in [Area], the first 3 unique values in [Place] are assigned to a Person. This leaves Person 1 with 3 values, Person 2 with 1 value, and Person 3 with 2 values.
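For comparison, the first step can also be sketched without `groupby(...).apply`, whose treatment of the index and of the grouping column has changed between pandas versions. This is only a sketch, not the code from the question: the helper name `assign_persons` and its parameter `n` are made up for illustration, and it assumes the rows are already in the order in which Places should be counted.

```python
import pandas as pd

def assign_persons(df, n=3):
    """Hypothetical helper: one Person per block of n unique Places within an Area."""
    # Number the unique Places inside each Area by first appearance: 0, 1, 2, ...
    place_rank = df.groupby('Area', sort=False)['Place'].transform(
        lambda s: pd.factorize(s)[0]
    )
    # Every run of n consecutive Place numbers forms one block within its Area.
    block = place_rank // n
    # An (Area, block) pair identifies one Person; number the pairs in row order.
    key = df['Area'].astype(str) + '|' + block.astype(str)
    codes = pd.factorize(key)[0] + 1
    return pd.Series(codes, index=df.index).map(lambda c: f'Person {c}')

df = pd.DataFrame({
    'Time': ['8:03:00', '8:17:00', '8:20:00', '10:15:00',
             '10:15:00', '11:48:00', '12:00:00', '12:10:00'],
    'Place': ['House 1', 'House 2', 'House 1', 'House 3',
              'House 4', 'House 5', 'House 1', 'House 1'],
    'Area': ['X', 'X', 'Y', 'X', 'X', 'X', 'X', 'X'],
})
df['Person'] = assign_persons(df)
print(df)
```

On the sample data this should reproduce the Person column shown above (1, 1, 2, 1, 3, 3, 1, 1); it does not attempt the second step.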
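The second step is where I'm struggling.
If a Person has fewer than 3 unique values assigned, change the assignment so that each Person has up to 3 unique values.
Intended output:
Time Place Area Person
0 8:03:00 House 1 X Person 1
1 8:17:00 House 2 X Person 1
2 8:20:00 House 1 Y Person 2
3 10:15:00 House 3 X Person 1
4 10:15:00 House 4 X Person 2
5 11:48:00 House 5 X Person 2
6 12:00:00 House 1 X Person 1
7 12:10:00 House 1 X Person 1
Description:
Person 1 already has 3 unique values assigned, so it stays as it is. Person 2 and Person 3 each have fewer than 3 unique values, so they should be combined. All duplicate values should keep the same assignment.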
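The rule for the second step can be checked mechanically by counting distinct (Place, Area) pairs per Person. The snippet below is a small sketch of such a check applied to the intended output; the helper name `unique_values_per_person` is an assumption introduced here, not something from the question.

```python
import pandas as pd

def unique_values_per_person(df):
    """Hypothetical helper: count distinct (Place, Area) pairs held by each Person."""
    pairs = df.drop_duplicates(subset=['Person', 'Place', 'Area'])
    return pairs.groupby('Person').size()

# The intended output from the question (Time omitted, it plays no role in the count).
intended = pd.DataFrame({
    'Place': ['House 1', 'House 2', 'House 1', 'House 3',
              'House 4', 'House 5', 'House 1', 'House 1'],
    'Area': ['X', 'X', 'Y', 'X', 'X', 'X', 'X', 'X'],
    'Person': ['Person 1', 'Person 1', 'Person 2', 'Person 1',
               'Person 2', 'Person 2', 'Person 1', 'Person 1'],
})

counts = unique_values_per_person(intended)
print(counts)
print('every Person has at most 3 unique values:', bool((counts <= 3).all()))
```

Running the same helper on the output of the first step is a quick way to see which Persons are the "leftovers" with fewer than 3 values.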
Answer 0 (score: 4)
Below, I have added a few lines before the last line of your code:
import numpy as np
import pandas as pd

d = ({'Time': ['8:03:00', '8:17:00', '8:20:00', '10:15:00', '10:15:00', '11:48:00', '12:00:00', '12:10:00'],
      'Place': ['House 1', 'House 2', 'House 1', 'House 3', 'House 4', 'House 5', 'House 1', 'House 1'],
      'Area': ['X', 'X', 'Y', 'X', 'X', 'X', 'X', 'X']})
df = pd.DataFrame(data=d)

def g(gps):
    s = gps['Place'].unique()
    d = dict(zip(s, np.arange(len(s)) // 3 + 1))
    gps['Person'] = gps['Place'].map(d)
    return gps

# group_keys=False keeps the original row index on pandas >= 2.0
df = df.groupby('Area', sort=False, group_keys=False).apply(g)
s = df['Person'].astype(str) + df['Area']

# added lines
t = s.value_counts()
# rows whose Person/Area key occurs fewer than 3 times
df_sub = df.loc[s[s.isin(t[t < 3].index)].index].copy()
df_sub["tag"] = df_sub["Place"] + df_sub["Area"]
tags = list(df_sub.tag.unique())
# regroup the leftover tags in blocks of 3: R1, R1, R1, R2, ...
f = lambda x: f'R{int(tags.index(x) / 3) + 1}'
df_sub['reassign'] = df_sub.tag.apply(f)
s[s.isin(t[t < 3].index)] = df_sub['reassign']

df['Person'] = pd.Series(pd.factorize(s)[0] + 1).map(str).radd('Person ')
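To be honest, I am not sure it works in all cases, but it gives the expected output for the test case.
Let me see whether I can help, given my limited understanding of what you are trying to do.
You have sequential data (I'll call them events) and you want to assign a "person" identifier to each event. The identifier you assign to each successive event depends on the previous assignments, and it seems to me that it needs to be governed by the following rules, applied in sequence:
I know you: I can reuse a previous identifier if the same values of "Place" and "Area" have already appeared for a given identifier (does time have something to do with it?).
I do NOT know you: I will create a new identifier if a new value of Area appears (so Place and Area play different roles?).
Do I know you?: I might reuse a previously used identifier if that identifier has not been assigned to at least three events (what if this happens for several identifiers? I will assume I use the oldest...).
Nah, I don't: if none of the preceding rules apply, I will create a new identifier.
Assuming the above, the following is an implementation of a solution: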
# dict of list of past events assigned to each person. key is person identifier
people = dict()
# new column for df (as list) it will be appended at the end to dataframe
persons = list()

# first we define the rules
def i_know_you(people, now):
    def conditions(now, past):
        return [e for e in past if (now.Place == e.Place) and (now.Area == e.Area)]
    i_do = [person for person, past in people.items() if conditions(now, past)]
    if i_do:
        return i_do[0]
    return False

def i_do_not_know_you(people, now):
    conditions = not bool([e for past in people.values() for e in past if e.Area == now.Area])
    if conditions:
        return f'Person {len(people) + 1}'
    return False

def do_i_know_you(people, now):
    i_do = [person for person, past in people.items() if len(past) < 3]
    if i_do:
        return i_do[0]
    return False

# then we process the sequential data
for event in df.itertuples():
    print('event:', event)
    for rule in [i_know_you, i_do_not_know_you, do_i_know_you]:
        person = rule(people, event)
        print('\t', rule.__name__, person)
        if person:
            break
    if not person:
        person = f'Person {len(people) + 1}'
        print('\t', "nah, I don't", person)
    if person in people:
        people[person].append(event)
    else:
        people[person] = [event]
    persons.append(person)
df['Person'] = persons
Output:
event: Pandas(Index=0, Time='8:00:00', Place='House 1', Area='X', Person='Person 1')
     i_know_you False
     i_do_not_know_you Person 1
event: Pandas(Index=1, Time='8:30:00', Place='House 2', Area='X', Person='Person 1')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 1
event: Pandas(Index=2, Time='9:00:00', Place='House 1', Area='Y', Person='Person 2')
     i_know_you False
     i_do_not_know_you Person 2
event: Pandas(Index=3, Time='9:30:00', Place='House 3', Area='X', Person='Person 1')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 1
event: Pandas(Index=4, Time='10:00:00', Place='House 4', Area='X', Person='Person 2')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 2
event: Pandas(Index=5, Time='10:30:00', Place='House 5', Area='X', Person='Person 2')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 2
event: Pandas(Index=6, Time='11:00:00', Place='House 1', Area='X', Person='Person 1')
     i_know_you Person 1
event: Pandas(Index=7, Time='11:30:00', Place='House 6', Area='X', Person='Person 3')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you False
     nah, I don't Person 3
event: Pandas(Index=8, Time='12:00:00', Place='House 7', Area='X', Person='Person 3')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 3
event: Pandas(Index=9, Time='12:30:00', Place='House 8', Area='X', Person='Person 3')
     i_know_you False
     i_do_not_know_you False
     do_i_know_you Person 3
and the final dataframe is what you want:
Time Place Area Person
0 8:00:00 House 1 X Person 1
1 8:30:00 House 2 X Person 1
2 9:00:00 House 1 Y Person 2
3 9:30:00 House 3 X Person 1
4 10:00:00 House 4 X Person 2
5 10:30:00 House 5 X Person 2
6 11:00:00 House 1 X Person 1
7 11:30:00 House 6 X Person 3
8 12:00:00 House 7 X Person 3
9 12:30:00 House 8 X Person 3
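Remarks: note that I deliberately avoided group-by operations and processed the data sequentially. I think this kind of complexity (and not really understanding what you want to do...) calls for that approach. You can also make the rules more complicated using the same structure as above (is time really playing a role or not?).
Looking at the new data, it is clear that I did not understand what you are trying to do (in particular, the assignment does not seem to follow sequential rules). I do have a solution that works on your second dataset, but it gives a different result on the first dataset.
This solution is much simpler, and it adds a column (which you can drop later if you want):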
df["tag"] = df["Place"] + df["Area"]
tags = list(df.tag.unique())
f = lambda x: f'Person {int(tags.index(x) / 3) + 1}'
df['Person'] = df.tag.apply(f)
On the second dataset, it gives:
Time Place Area tag Person
0 8:00:00 House 1 X House 1X Person 1
1 8:30:00 House 2 X House 2X Person 1
2 9:00:00 House 3 X House 3X Person 1
3 9:30:00 House 1 Y House 1Y Person 2
4 10:00:00 House 1 Z House 1Z Person 2
5 10:30:00 House 1 V House 1V Person 2
On the first dataset, it gives:
Time Place Area tag Person
0 8:00:00 House 1 X House 1X Person 1
1 8:30:00 House 2 X House 2X Person 1
2 9:00:00 House 1 Y House 1Y Person 1
3 9:30:00 House 3 X House 3X Person 2
4 10:00:00 House 4 X House 4X Person 2
5 10:30:00 House 5 X House 5X Person 2
6 11:00:00 House 1 X House 1X Person 1
7 11:30:00 House 6 X House 6X Person 3
8 12:00:00 House 7 X House 7X Person 3
9 12:30:00 House 8 X House 8X Person 3
This differs from your expected output at indices 2 and 3. Is this output acceptable for your requirements? If not, why not?
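As a side note, the tag lookup in the simpler solution (`tags.index(x)` inside a lambda) scans the list of tags once per row. A sketch of the same numbering with `pd.factorize`, which labels each tag by order of first appearance in one pass, might look like this. The variable names `codes` and `n` are introduced here for illustration, and the data is the first dataset as printed above.

```python
import pandas as pd

df = pd.DataFrame({
    'Place': ['House 1', 'House 2', 'House 1', 'House 3', 'House 4',
              'House 5', 'House 1', 'House 6', 'House 7', 'House 8'],
    'Area': ['X', 'X', 'Y', 'X', 'X', 'X', 'X', 'X', 'X', 'X'],
})

n = 3  # unique (Place, Area) tags per Person
tag = df['Place'] + df['Area']

# factorize labels each distinct tag 0, 1, 2, ... in order of first appearance,
# which is the position tags.index(x) would return.
codes, _ = pd.factorize(tag)
df['Person'] = ['Person ' + str(code // n + 1) for code in codes]
print(df)
```

Integer division `code // n` gives the same block number as `int(tags.index(x) / 3)` for these non-negative positions, so the Person column should match the table for the first dataset.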
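Answer 1 (score: 3)
As far as I understand, you are happy with everything up to the Person allocation. So here is a plug-and-play solution to "merge" Persons with fewer than 3 unique values, so that each Person ends up with 3 unique values, except obviously the last one (based on the second-to-last df you posted, under "Output:"). It does not touch those that already have 3 unique values and only merges the others.
EDIT: greatly simplified code. Again, taking your df as input: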
n = 3
df['complete'] = df.Person.apply(lambda x: 1 if df.Person.tolist().count(x) == n else 0)
df['num'] = df.Person.str.replace('Person ','')
df.sort_values(by=['num','complete'],ascending=True,inplace=True) #get all persons that are complete to the top
c = 0
person_numbers = []
for x in range(0,999): #Create the numbering [1,1,1,2,2,2,3,3,3,...] with n defining how often a person is 'repeated'
    if x % n == 0:
        c += 1
    person_numbers.append(c)
df['Person_new'] = person_numbers[0:len(df)] #Add the numbering to the df
df.Person = 'Person ' + df.Person_new.astype(str) #Fill the person column with the new numbering
df.drop(['complete','Person_new','num'],axis=1,inplace=True)
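The loop that builds `person_numbers` pre-generates 999 entries and then slices them to the length of the frame, so it would fall short on a frame with more than 999 rows. The same 1,1,1,2,2,2,... sequence can be sketched with integer division; `block_numbers` is a hypothetical helper name used only for this illustration.

```python
import numpy as np

def block_numbers(length, n=3):
    """Hypothetical helper: return 1,1,1,2,2,2,... (each number repeated n times)."""
    return np.arange(length) // n + 1

print(block_numbers(8).tolist())
# prints [1, 1, 1, 2, 2, 2, 3, 3]
```

In the answer's code this would stand in for `person_numbers[0:len(df)]`, for example `df['Person_new'] = block_numbers(len(df), n)`.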
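Answer 2 (score: 0)
First of all, this answer does not meet your requirement of only reassigning the leftovers (so I don't expect you to accept it). That said, I am posting it anyway because your time-window constraint is tricky to solve in the pandas world. Maybe my solution won't be useful to you right now, but maybe later ;) At the very least it was a learning experience for me, and perhaps others can benefit from it too.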
import pandas as pd
from datetime import datetime, time, timedelta
import random

# --- helper functions for demo
random.seed( 0 )

def makeRandomTimes( nHours = None, mMinutes = None ):
    nHours = 10 if nHours is None else nHours
    mMinutes = 3 if mMinutes is None else mMinutes
    times = []
    for _ in range(nHours):
        hour = random.randint(8,18)
        for _ in range(mMinutes):
            minute = random.randint(0,59)
            times.append( datetime.combine( datetime.today(), time( hour, minute ) ) )
    return times

def makeDf():
    times = makeRandomTimes()
    houses = [ str(random.randint(1,10)) for _ in range(30) ]
    areas = [ ['X','Y'][random.randint(0,1)] for _ in range(30) ]
    df = pd.DataFrame( {'Time' : times, 'House' : houses, 'Area' : areas } )
    return df.set_index( 'Time' ).sort_index()

# --- real code begins
def evaluateLookback( df, idx, dfg ):
    mask = df.index >= dfg.Lookback.iat[-1]
    personTotals = df[ mask ].set_index('Loc')['Person'].value_counts()
    currentPeople = set(df.Person[ df.Person > -1 ])
    noAllocations = currentPeople - set(personTotals.index)
    available = personTotals < 3
    if noAllocations or available.sum():
        # allocate to first available person
        person = min( noAllocations.union(personTotals[ available ].index) )
    else:
        # allocate new person
        person = len( currentPeople )
    df.Person.at[ idx ] = person
    # debug
    df.Verbose.at[ idx ] = ( noAllocations, available.sum() )

def lambdaProxy( df, colName ):
    [ dff[1][colName].apply( lambda f: f(df,*dff) ) for dff in df.groupby(df.index) ]

lookback = timedelta( minutes = 120 )
df1 = makeDf()
df1[ 'Loc' ] = df1[ 'House' ] + df1[ 'Area' ]
df1[ 'Person' ] = None
df1[ 'Lambda' ] = evaluateLookback
df1[ 'Lookback' ] = df1.index - lookback
df1[ 'Verbose' ] = None
lambdaProxy( df1, 'Lambda' )
print( df1[ [ col for col in df1.columns if col != 'Lambda' ] ] )
Sample output on my machine looks like this:
House Area Loc Person Lookback Verbose
Time
2018-09-30 08:16:00 6 Y 6Y 0 2018-09-30 06:16:00 ({}, 0)
2018-09-30 08:31:00 4 Y 4Y 0 2018-09-30 06:31:00 ({}, 1)
2018-09-30 08:32:00 10 X 10X 0 2018-09-30 06:32:00 ({}, 1)
2018-09-30 09:04:00 4 X 4X 1 2018-09-30 07:04:00 ({}, 0)
2018-09-30 09:46:00 10 X 10X 1 2018-09-30 07:46:00 ({}, 1)
2018-09-30 09:57:00 4 X 4X 1 2018-09-30 07:57:00 ({}, 1)
2018-09-30 10:06:00 1 Y 1Y 2 2018-09-30 08:06:00 ({}, 0)
2018-09-30 10:39:00 10 X 10X 0 2018-09-30 08:39:00 ({0}, 1)
2018-09-30 10:48:00 7 X 7X 0 2018-09-30 08:48:00 ({}, 2)
2018-09-30 11:08:00 1 Y 1Y 0 2018-09-30 09:08:00 ({}, 3)
2018-09-30 11:18:00 2 Y 2Y 1 2018-09-30 09:18:00 ({}, 2)
2018-09-30 11:32:00 9 X 9X 2 2018-09-30 09:32:00 ({}, 1)
2018-09-30 12:22:00 5 Y 5Y 1 2018-09-30 10:22:00 ({}, 2)
2018-09-30 12:30:00 9 X 9X 1 2018-09-30 10:30:00 ({}, 2)
2018-09-30 12:34:00 6 X 6X 2 2018-09-30 10:34:00 ({}, 1)
2018-09-30 12:37:00 1 Y 1Y 2 2018-09-30 10:37:00 ({}, 1)
2018-09-30 12:45:00 4 X 4X 0 2018-09-30 10:45:00 ({}, 1)
2018-09-30 12:58:00 8 X 8X 0 2018-09-30 10:58:00 ({}, 1)
2018-09-30 14:26:00 7 Y 7Y 0 2018-09-30 12:26:00 ({}, 3)
2018-09-30 14:48:00 2 X 2X 0 2018-09-30 12:48:00 ({1, 2}, 1)
2018-09-30 14:50:00 8 X 8X 1 2018-09-30 12:50:00 ({1, 2}, 0)
2018-09-30 14:53:00 8 Y 8Y 1 2018-09-30 12:53:00 ({2}, 1)
2018-09-30 14:56:00 6 X 6X 1 2018-09-30 12:56:00 ({2}, 1)
2018-09-30 14:58:00 9 Y 9Y 2 2018-09-30 12:58:00 ({2}, 0)
2018-09-30 17:09:00 2 Y 2Y 0 2018-09-30 15:09:00 ({0, 1, 2}, 0)
2018-09-30 17:19:00 4 X 4X 0 2018-09-30 15:19:00 ({1, 2}, 1)
2018-09-30 17:57:00 6 Y 6Y 0 2018-09-30 15:57:00 ({1, 2}, 1)
2018-09-30 18:21:00 3 X 3X 1 2018-09-30 16:21:00 ({1, 2}, 0)
2018-09-30 18:30:00 9 X 9X 1 2018-09-30 16:30:00 ({2}, 1)
2018-09-30 18:35:00 8 Y 8Y 1 2018-09-30 16:35:00 ({2}, 1)
>>>
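Notes:
The lookback variable controls how far back in time to look when considering the locations allocated to a person.
The Lookback column shows the cutoff time.
evaluateLookback is called with df being the whole DataFrame, idx the current index/label, and dfg the current row (the group of rows sharing that index label).
lambdaProxy controls the calls to evaluateLookback.
The number of locations per person is set to 3, but it can be adjusted as needed.
The function held in the Lambda column is evaluated through lambdaProxy, and the result is then stored and used within evaluateLookback.
There are some interesting corner cases in the demo output: 10:39:00, 14:48:00, 17:09:00.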
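To make the lookback idea concrete, here is a minimal, self-contained sketch of selecting the events that fall inside the window ending at a given event. It is not the answer's code: the frame `events`, the variable `now`, and the upper bound on the window are assumptions added for illustration (the answer's own mask only applies the lower cutoff).

```python
from datetime import datetime, timedelta

import pandas as pd

# A few timestamps in the style of the demo output (illustrative values only).
times = [
    datetime(2018, 9, 30, 8, 16),
    datetime(2018, 9, 30, 9, 4),
    datetime(2018, 9, 30, 10, 39),
    datetime(2018, 9, 30, 12, 58),
]
events = pd.DataFrame(
    {'Loc': ['6Y', '4X', '10X', '8X']},
    index=pd.DatetimeIndex(times, name='Time'),
)

lookback = timedelta(minutes=120)
now = events.index[2]        # the 10:39 event
cutoff = now - lookback      # 08:39, the value the Lookback column would hold

# Keep only events between the cutoff and the current event (both inclusive).
window = events[(events.index >= cutoff) & (events.index <= now)]
print(window)
```

Here the 08:16 event falls before the cutoff and the 12:58 event lies after `now`, so only the 09:04 and 10:39 events remain in the window.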
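Also: it might be interesting to see a "function column" in pandas, perhaps with memoization. Ideally, the Person column would hold a function and compute its value on request, using either its own row or some variable window view. Has anyone seen something like that?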
Answer 3 (score: 0)
How about this for the second step?
import numpy as np

def reduce_df(df):
    values = df['Area'] + df['Place']
    df1 = df.loc[~values.duplicated(),:] # ignore duplicate values for this part..
    person_count = df1.groupby('Person')['Person'].agg('count')
    leftover_count = person_count[person_count < 3] # the 'leftovers'
    # try merging pairs together
    nleft = leftover_count.shape[0]
    to_try = np.arange(nleft - 1)
    to_merge = (leftover_count.values[to_try] +
                leftover_count.values[to_try + 1]) <= 3
    to_merge[1:] = to_merge[1:] & ~to_merge[:-1]
    to_merge = to_try[to_merge]
    merge_dict = dict(zip(leftover_count.index.values[to_merge+1],
                          leftover_count.index.values[to_merge]))
    def change_person(p):
        if p in merge_dict.keys():
            return merge_dict[p]
        return p
    reduced_df = df.copy()
    # update df with the merges you found
    reduced_df['Person'] = reduced_df['Person'].apply(change_person)
    return reduced_df

print(
    reduce_df(reduce_df(df)) # call twice in case 1,1,1 -> 2,1 -> 3
)
Output:
Area Place Time Person
0 X House 1 8:03:00 Person 1
1 X House 2 8:17:00 Person 1
2 Y House 1 8:20:00 Person 2
3 X House 3 10:15:00 Person 1
4 X House 4 10:15:00 Person 2
5 X House 5 11:48:00 Person 2
6 X House 1 12:00:00 Person 1
7 X House 1 12:10:00 Person 1
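To see what the pair-merging step inside `reduce_df` does, it may help to run it in isolation on the leftover counts from the question (Person 2 with 1 unique value, Person 3 with 2). The snippet below is a sketch that mirrors those lines; the hard-coded `leftover_count` and the names `pair_sum` and `left` are assumptions made for this walk-through.

```python
import numpy as np
import pandas as pd

# Unique-value counts of the Persons that hold fewer than 3 values after step 1.
leftover_count = pd.Series([1, 2], index=['Person 2', 'Person 3'])

# Positions of the left-hand member of each neighbouring pair: here just [0].
to_try = np.arange(len(leftover_count) - 1)

# A pair can be merged when the combined count still fits within 3.
pair_sum = leftover_count.values[to_try] + leftover_count.values[to_try + 1]
to_merge = pair_sum <= 3

# Drop a pair when the pair immediately before it also qualified,
# so that one Person is not merged in two directions at once.
to_merge[1:] = to_merge[1:] & ~to_merge[:-1]

left = to_try[to_merge]
# Map the right-hand Person of each merged pair onto its left-hand neighbour.
merge_dict = dict(zip(leftover_count.index.values[left + 1],
                      leftover_count.index.values[left]))
print(merge_dict)
```

For these counts the mapping sends Person 3 to Person 2, which is the relabelling visible in the output above.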