Question

我是python的新手。我有一个巨大的dataframe，具有数百万的行和ID。我的数据如下：

Time    ID  X   Y
8:00    A   23  100
9:00    B   24  110
10:00   B   25  120
11:00   C   26  130
12:00   C   27  140
13:00   A   28  150
14:00   A   29  160
15:00   D   30  170
16:00   C   31  180
17:00   B   32  190
18:00   A   33  200
19:00   C   34  210
20:00   A   35  220
21:00   B   36  230
22:00   C   37  240
23:00   B   38  250

我对ID和时间上的数据进行了排序。

Time    ID  X   Y
8:00    A   23  100
13:00   A   28  150
14:00   A   29  160
18:00   A   33  200
20:00   A   35  220
9:00    B   24  110
10:00   B   25  120
17:00   B   32  190
21:00   B   36  230
23:00   B   38  250
11:00   C   26  130
12:00   C   27  140
16:00   C   31  180
19:00   C   34  210
22:00   C   37  240
15:00   D   30  170

，我只想选择ID的“第一个和最后一个”，并消除其余的。结果看起来像这样：

Time    ID  X   Y
8:00    A   23  100
20:00   A   35  220
9:00    B   24  110
23:00   B   38  250
11:00   C   26  130
22:00   C   37  240
15:00   D   30  170

我使用了以下代码：

df = pd.read_csv("data.csv")
g = df.groupby('ID')
g_1 = pd.concat([g.head(1),g.tail(1)]).drop_duplicates().sort_values('ID').reset_index(drop=True)
g_1.to_csv('result.csv')

但是我想在新列中将每一行分配或赋予注释为“第一”和“最后”。
我的预期结果如下：

Time    ID  X   Y   Annotation
8:00    A   23  100 First
20:00   A   35  220 Last
9:00    B   24  110 First
23:00   B   38  250 Last
11:00   C   26  130 First
22:00   C   37  240 Last
15:00   D   30  170

有人可以帮助我吗？请给我建议，谢谢。

Answer 1

排序后无需"Manager must exist"使用groupby

drop_duplicates

Answer 2

您可以使用groupby agg，first和last。非常适合列注释。另外，它可以在原始数据框上使用，因此无需排序

@relation phishing

@attribute having_IP_Address  { -1,1 }
@attribute URL_Length   { 1,0,-1 }
@attribute Shortining_Service { 1,-1 }
@attribute having_At_Symbol   { 1,-1 }
@attribute double_slash_redirecting { -1,1 }
@attribute Prefix_Suffix  { -1,1 }
@attribute having_Sub_Domain  { -1,0,1 }
@attribute SSLfinal_State  { -1,1,0 }
@attribute Domain_registeration_length { -1,1 }
@attribute Favicon { 1,-1 }
@attribute port { 1,-1 }
@attribute HTTPS_token { -1,1 }
@attribute Request_URL  { 1,-1 }
@attribute URL_of_Anchor { -1,0,1 }
@attribute Links_in_tags { 1,-1,0 }
@attribute SFH  { -1,1,0 }
@attribute Submitting_to_email { -1,1 }
@attribute Abnormal_URL { -1,1 }
@attribute Redirect  { 0,1 }
@attribute on_mouseover  { 1,-1 }
@attribute RightClick  { 1,-1 }
@attribute popUpWidnow  { 1,-1 }
@attribute Iframe { 1,-1 }
@attribute age_of_domain  { -1,1 }
@attribute DNSRecord   { -1,1 }
@attribute web_traffic  { -1,0,1 }
@attribute Page_Rank { -1,1 }
@attribute Google_Index { 1,-1 }
@attribute Links_pointing_to_page { 1,0,-1 }
@attribute Statistical_report { -1,1 }
@attribute Result  { -1,1 }


@data
-1,1,1,1,-1,-1,-1,-1,-1,1,1,-1,1,-1,1,-1,-1,-1,0,1,1,1,1,-1,-1,-1,-1,1,1,-1,-1
1,1,1,1,1,-1,0,1,-1,1,1,-1,1,0,-1,-1,1,1,0,1,1,1,1,-1,-1,0,-1,1,1,1,-1
1,0,1,1,1,-1,-1,-1,-1,1,1,-1,1,0,-1,-1,-1,-1,0,1,1,1,1,1,-1,1,-1,1,0,-1,-1
1,0,1,1,1,-1,-1,-1,1,1,1,-1,-1,0,0,-1,1,1,0,1,1,1,1,-1,-1,1,-1,1,-1,1,-1
1,0,-1,1,1,-1,1,1,-1,1,1,1,1,0,0,-1,1,1,0,-1,1,-1,1,-1,-1,0,-1,1,1,1,1
-1,0,-1,1,-1,-1,1,1,-1,1,1,-1,1,0,0,-1,-1,-1,0,1,1,1,1,1,1,1,-1,1,-1,-1,1
1,0,-1,1,1,-1,-1,-1,1,1,1,1,-1,-1,0,-1,-1,-1,0,1,1,1,1,1,-1,-1,-1,1,0,-1,-1
1,0,1,1,1,-1,-1,-1,1,1,1,-1,-1,0,-1,-1,1,1,0,1,1,1,1,-1,-1,0,-1,1,0,1,-1
1,0,-1,1,1,-1,1,1,-1,1,1,-1,1,0,1,-1,1,1,0,1,1,1,1,1,-1,1,1,1,0,1,1
1,1,-1,1,1,-1,-1,1,-1,1,1,1,1,0,1,-1,1,1,0,1,1,1,1,1,-1,0,-1,1,0,1,-1
1,1,1,1,1,-1,0,1,1,1,1,1,-1,0,0,-1,-1,-1,0,1,1,1,1,-1,1,1,1,1,-1,-1,1
1,1,-1,1,1,-1,1,-1,-1,1,1,1,1,-1,-1,-1,-1,-1,0,1,1,1,1,-1,-1,-1,-1,1,0,-1,-1
-1,1,-1,1,-1,-1,0,0,1,1,1,-1,-1,-1,1,-1,1,1,0,-1,1,-1,1,1,-1,-1,-1,1,0,1,-1
1,1,-1,1,1,-1,0,-1,1,1,1,1,-1,-1,-1,-1,1,1,0,1,1,1,1,-1,-1,0,-1,1,1,1,-1
1,1,-1,1,1,1,-1,1,-1,1,1,-1,1,0,1,1,1,1,0,1,1,1,1,1,-1,1,-1,1,-1,1,1
1,-1,-1,-1,1,-1,0,0,1,1,1,1,-1,-1,0,-1,1,1,0,1,1,1,1,1,-1,-1,-1,1,0,1,-1
1,-1,-1,1,1,-1,1,1,-1,1,1,-1,1,0,-1,-1,-1,-1,0,1,1,1,1,1,-1,0,-1,1,1,-1,-1

Answer 3

尝试：

df.groupby('ID').agg(['first','last'])\
  .stack(1).reset_index()\
  .rename(columns={'level_1':'Annotation'})

输出：

  ID Annotation   Time   X    Y
0  A      first   8:00  23  100
1  A       last  20:00  35  220
2  B      first   9:00  24  110
3  B       last  23:00  38  250
4  C      first  11:00  26  130
5  C       last  22:00  37  240
6  D      first  15:00  30  170
7  D       last  15:00  30  170

如何在Groupby和Concat之后在熊猫中分配数据

3 个答案: