How to unroll a multi-level pandas dataframe

Time: 2021-02-19 16:43:23

Tags: python pandas dataframe

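I have a pandas dataframe with three columns: the first two are factors and the third contains counts. I'd like to "explode" or "unroll" the dataframe so that, instead of one row per unique combination of the first and second columns, the number of rows equals the sum of the count column, with each new row carrying a unique, incrementing identifier. However, I want a separate counter for each level of one of the two columns. Note that this question is similar to one I asked yesterday, How can I 'unroll' a pandas dataframe?, but with an added complication that I didn't appreciate the first time, and I haven't been able to work out for myself how to generalize that solution.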

Here is the dataframe:

import pandas as pd

data = [['van', 'bc', 1], ['abb', 'bc', 3], ['vic','bc',3], ['cal', 'ab', 1], ['edm', 'ab', 2], ['cal','ab', 2], ['van', 'bc', 1]]
df = pd.DataFrame(data, columns = ['city', 'state', 'count'])

I'd like to turn it into this:


data = [['van', 'bc', 'dr0001'], ['abb', 'bc', 'dr0002'], ['abb', 'bc', 'dr0003'], ['abb', 'bc', 'dr0004'],  ['vic', 'bc', 'dr0005'], ['vic', 'bc', 'dr0006'], ['vic', 'bc', 'dr0007'], ['cal', 'ab', 'dr0001'], ['edm', 'ab', 'dr0002'], ['edm', 'ab', 'dr0003'], ['edm', 'ab', 'dr0004'], ['edm', 'ab', 'dr0005'], ['van', 'bc', 'dr0008']]
df = pd.DataFrame(data, columns = ['city', 'state', 'id'])

Thanks

2 answers:

Answer 0: (score: 5)

Try this; I think you need an additional groupby and some formatting to get your output:
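
As an illustration of the idea (not necessarily the answerer's exact code), here is a minimal sketch that repeats each row `count` times, numbers the rows within each `state` using a groupby counter, and formats that number as the `dr0001`-style id requested in the question. The variable names `out` and `counter` are assumptions made for this example.

```python
import pandas as pd

data = [['van', 'bc', 1], ['abb', 'bc', 3], ['vic', 'bc', 3], ['cal', 'ab', 1],
        ['edm', 'ab', 2], ['cal', 'ab', 2], ['van', 'bc', 1]]
df = pd.DataFrame(data, columns=['city', 'state', 'count'])

# Repeat each row `count` times, keeping the original row order.
out = df.loc[df.index.repeat(df['count']), ['city', 'state']].reset_index(drop=True)

# Running counter within each state (1, 2, 3, ...), independent of city.
counter = out.groupby('state').cumcount() + 1

# Zero-pad to four digits and add the 'dr' prefix, e.g. 1 becomes 'dr0001'.
out['id'] = 'dr' + counter.astype(str).str.zfill(4)

print(out)
```

Because the counter comes from `groupby('state').cumcount()`, the second `van`/`bc` row continues the `bc` sequence rather than starting over.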

Answer 1: (score: 4)

  • Generate a list, then explode() it
  • The id increments within state, so generate it once the DataFrame has the right shape

import pandas as pd

data = [['van', 'bc', 1], ['abb', 'bc', 3], ['vic','bc',3], ['cal', 'ab', 1], ['edm', 'ab', 2], ['cal','ab', 2], ['van', 'bc', 1]]
df = pd.DataFrame(data, columns = ['city', 'state', 'count'])

# first pass, explode
df2 = (df.assign(id=df["count"].apply(lambda n: [f"dr{i+1:05}" for i in range(n)]))
       .explode("id")
       .drop(columns="count").reset_index(drop=True))

# ids increment within state
df2["id"] = df2.groupby("state")["id"].transform(lambda s: [f"dr{i+1:05}" for i,v in enumerate(s)])

Output

   city state       id
0   van    bc  dr00001
1   abb    bc  dr00002
2   abb    bc  dr00003
3   abb    bc  dr00004
4   vic    bc  dr00005
5   vic    bc  dr00006
6   vic    bc  dr00007
7   cal    ab  dr00001
8   edm    ab  dr00002
9   edm    ab  dr00003
10  cal    ab  dr00004
11  cal    ab  dr00005
12  van    bc  dr00008
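
A result of this shape can be checked against the requirements stated in the question: one row per unit of count, and ids that do not repeat within a state. The sketch below rebuilds the frame with the explode-then-number steps and runs those checks; the helper name `check_unrolled` is hypothetical and only serves this example.

```python
import pandas as pd

data = [['van', 'bc', 1], ['abb', 'bc', 3], ['vic', 'bc', 3], ['cal', 'ab', 1],
        ['edm', 'ab', 2], ['cal', 'ab', 2], ['van', 'bc', 1]]
df = pd.DataFrame(data, columns=['city', 'state', 'count'])

# Explode one placeholder per unit of count, then number rows within each state.
df2 = (df.assign(id=df["count"].apply(lambda n: list(range(n))))
         .explode("id")
         .drop(columns="count")
         .reset_index(drop=True))
df2["id"] = df2.groupby("state")["id"].transform(
    lambda s: [f"dr{i + 1:05}" for i in range(len(s))])


def check_unrolled(original, unrolled):
    """Hypothetical helper: True if `unrolled` matches the question's rules."""
    # The total number of rows equals the sum of the count column.
    rows_ok = len(unrolled) == original["count"].sum()
    # No (state, id) pair appears twice, so each state has its own counter.
    unique_ok = not unrolled.duplicated(subset=["state", "id"]).any()
    # Each state has as many rows as its counts add up to.
    per_state = unrolled.groupby("state").size().to_dict()
    expected = original.groupby("state")["count"].sum().to_dict()
    return bool(rows_ok and unique_ok and per_state == expected)


print(check_unrolled(df, df2))
```

Note that a count of 0 is not covered here: exploding an empty list leaves a row with a missing value, so such rows would need to be filtered out first.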