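I have a DataFrame and need to group it by the "sev" column into the buckets 1&2, 3, and 4&5, then find the count for each bucket.
Is there a way to do this?
I tried the following, but it returns a count for each individual value in the sev column: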
df.groupby(['sev']).ids.agg('count').to_frame('count').reset_index()
The pandas DataFrame:
import pandas as pd
df = pd.DataFrame({'ids': {0: 'D1791272223', 1: 'V25369085223', 2: 'V25117230523', 3: 'V25104327323', 4: 'V24862169823', 5: 'P3944221523', 6: 'V24776335823', 7: 'V24722584123', 8: 'V24716191923', 9: 'V24575876123', 10: 'V24791923'}, 'status': {0: 'Resolved', 1: 'Resolved', 2: 'Resolved', 3: 'Resolved', 4: 'Open', 5: 'Open', 6: 'Closed', 7: 'Resolved', 8: 'Resolved', 9: 'Open', 10: 'Resolved'}, 'action': {0: 'Comment', 1: 'Implementation', 2: 'Comment', 3: 'Implementation', 4: 'Comment', 5: 'Implementation', 6: 'Comment', 7: 'Comment', 8: 'Implementation', 9: 'Comment', 10: 'Implementation'}, 'sev': {0: 3, 1: 2, 2: 1, 3: 3, 4: 4, 5: 4, 6: 3, 7: 2, 8: 2, 9: 1, 10: 5}})
Expected output
| ids | status | action | sev |
|--------------|----------|----------------|-----|
| D1791272223 | Resolved | Comment | 3 |
| V25369085223 | Resolved | Implementation | 2 |
| V25117230523 | Resolved | Comment | 1 |
| V25104327323 | Resolved | Implementation | 3 |
| V24862169823 | Open | Comment | 4 |
| P3944221523 | Open | Implementation | 4 |
| V24776335823 | Closed | Comment | 3 |
| V24722584123 | Resolved | Comment | 2 |
| V24716191923 | Resolved | Implementation | 2 |
| V24575876123 | Open | Comment | 1 |
| V24791923 | Resolved | Implementation | 5 |
Answer 0: (score: 0)
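The main problem is that you need to collapse the severity levels into fewer categories. This can be done with pd.cut, because sev is numeric and you want contiguous intervals. If it were not numeric, or the intervals were not contiguous (e.g. 1&4, 2, 3&5), you would need df.replace with a mapping dictionary instead.
The reshaping can then be done with df.pivot_table, or "manually" with groupby/unstack. I prefer groupby because it is more flexible in other situations.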
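For the non-contiguous case mentioned above, a minimal sketch might look like the following. The grouping 1&4, 2, 3&5 is the hypothetical example from the text, the sev values are copied from the question's data, and Series.map is used here in place of df.replace (both accept a mapping dictionary; this is just one way to do it).

```python
import pandas as pd

# sev values copied from the question's DataFrame
sev = pd.Series([3, 2, 1, 3, 4, 4, 3, 2, 2, 1, 5], name="sev")

# Hypothetical non-contiguous grouping: 1&4, 2, 3&5
mapping = {1: "1&4", 4: "1&4", 2: "2", 3: "3&5", 5: "3&5"}

# Map each severity level to its group label, then count rows per label
sev_group = sev.map(mapping)
counts = sev_group.value_counts().sort_index()
print(counts)
```

Any level missing from the dictionary would become NaN under Series.map, so the mapping should cover every value that can occur.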
# Bin sev into (0, 2], (2, 3], (3, 5] and label the bins
df['sev_group'] = pd.cut(df['sev'], bins=[0, 2, 3, 5],
                         labels=['1&2', '3', '4&5'])
# Count rows per (sev_group, status); fill_value=0 avoids NaN for missing combinations
summary = (df.groupby(['sev_group', 'status'], observed=False)
             .size().unstack(fill_value=0))
# or
# summary = df.pivot_table(values='ids', index='sev_group',
#                          columns='status', aggfunc='count', fill_value=0)
summary['count'] = summary.sum(axis=1)
summary['Closed/Resolved'] = summary['Closed'] + summary['Resolved']
summary = summary[['count', 'Open', 'Closed/Resolved']]
Output
status     count  Open  Closed/Resolved
sev_group
1&2            5     1                4
3              3     0                3
4&5            3     2                1
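As a quick sanity check on the binning, the sketch below shows which label each severity level receives. It relies on pd.cut using right-closed intervals by default, so bins=[0, 2, 3, 5] corresponds to (0, 2], (2, 3], (3, 5]; the levels Series is just an illustrative input, not part of the original data.

```python
import pandas as pd

# One value per severity level, purely for illustration
levels = pd.Series([1, 2, 3, 4, 5])

# Same bins and labels as in the answer; intervals are right-closed by default
groups = pd.cut(levels, bins=[0, 2, 3, 5], labels=["1&2", "3", "4&5"])

# Pair each level with the label it was assigned
assignment = dict(zip(levels.tolist(), groups.tolist()))
print(assignment)
```

A value of 0 or anything above 5 would fall outside the bins and come back as NaN, so the bin edges may need adjusting if the severity scale changes.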