Convert an entire dataset to percentages

Time: 2019-11-10 05:20:23

Tags: python pandas dataframe

I want to convert an entire dataset to percentages.

https://cocl.us/datascience_survey_data

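Each value should become a percentage of its row total.

For example, for Big Data (Spark / Hadoop) the row total is 1332 + 729 + 127 = 2188.

So the percentage for Very interested would be 1332 / 2188 ≈ 60.88%.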
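
Spelled out for that single row, the arithmetic looks like the sketch below. The three counts are the ones quoted above; the variable names are only illustrative.

```python
# Counts for "Big Data (Spark / Hadoop)" as quoted in the question.
counts = {
    "Very interested": 1332,
    "Somewhat interested": 729,
    "Not interested": 127,
}

row_total = sum(counts.values())  # 1332 + 729 + 127 = 2188

# Each count as a percentage of the row total.
percentages = {label: count / row_total * 100 for label, count in counts.items()}

for label, pct in percentages.items():
    print(f"{label}: {pct:.2f}%")
```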

I want to do this automatically for all rows. How can I do that?

3 answers:

Answer 0: (score: 3)

You can divide every column by the row-wise sum with DataFrame.div, then multiply by 100:

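import pandas as pd

# index_col=0 uses the topic names as the row index
df = pd.read_csv('Topic_Survey_Assignment.csv', index_col=0)

# divide each row by its own total, then scale to percent
df1 = df.div(df.sum(axis=1), axis=0).mul(100)
print(df1)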
                            Very interested  Somewhat interested  \
Big Data (Spark / Hadoop)         60.877514            33.318099   
Data Analysis / Statistics        77.007299            20.255474   
Data Journalism                   20.235849            50.990566   
Data Visualization                61.580882            33.731618   
Deep Learning                     58.229599            35.500231   
Machine Learning                  74.724771            21.880734   

                            Not interested  
Big Data (Spark / Hadoop)         5.804388  
Data Analysis / Statistics        2.737226  
Data Journalism                  28.773585  
Data Visualization                4.687500  
Deep Learning                     6.270171  
Machine Learning                  3.394495  

Details (the row totals used as divisors):

print(df.sum(axis=1))
Big Data (Spark / Hadoop)     2188
Data Analysis / Statistics    2192
Data Journalism               2120
Data Visualization            2176
Deep Learning                 2169
Machine Learning              2180
dtype: int64
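
A quick way to sanity-check the result is to confirm that every row of df1 sums to 100. The sketch below builds a two-row frame inline instead of reading the CSV, so it runs on its own. The first row uses the counts quoted in the question; the second row's counts are back-computed from the percentages and row totals printed above and should be treated as illustrative.

```python
import numpy as np
import pandas as pd

# Small stand-in for the survey CSV (see the note above about the second row).
df = pd.DataFrame(
    {
        "Very interested": [1332, 429],
        "Somewhat interested": [729, 1081],
        "Not interested": [127, 610],
    },
    index=["Big Data (Spark / Hadoop)", "Data Journalism"],
)

# Same row-wise normalisation as in the answer.
df1 = df.div(df.sum(axis=1), axis=0).mul(100)

# Every row should add up to 100 (up to floating-point error).
row_sums = df1.sum(axis=1)
assert np.allclose(row_sums, 100)

# Round for display only, keeping df1 itself at full precision.
print(df1.round(2))
```

A row whose total is 0 would come out as NaN, since the division becomes 0 / 0, so such rows may need separate handling.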

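The NumPy alternative is very similar:

import numpy as np
import pandas as pd

df = pd.read_csv('Topic_Survey_Assignment.csv', index_col=0)

arr = df.values
# [:, None] turns the row sums into a column so they broadcast across the columns
df1 = pd.DataFrame(arr / np.sum(arr, axis=1)[:, None] * 100,
                   index=df.index,
                   columns=df.columns)
print(df1)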
                            Very interested  Somewhat interested  \
Big Data (Spark / Hadoop)         60.877514            33.318099   
Data Analysis / Statistics        77.007299            20.255474   
Data Journalism                   20.235849            50.990566   
Data Visualization                61.580882            33.731618   
Deep Learning                     58.229599            35.500231   
Machine Learning                  74.724771            21.880734   

                            Not interested  
Big Data (Spark / Hadoop)         5.804388  
Data Analysis / Statistics        2.737226  
Data Journalism                  28.773585  
Data Visualization                4.687500  
Deep Learning                     6.270171  
Machine Learning                  3.394495  
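
The `[:, None]` indexing is what makes the division line up row by row: it turns the 1-D array of row sums into a column vector so NumPy can broadcast it across the columns. A minimal sketch of that step, using a small hypothetical array rather than the survey file, and showing `keepdims=True` as an equivalent spelling:

```python
import numpy as np

# Hypothetical counts: 2 rows, 3 columns.
arr = np.array([[1332, 729, 127],
                [10, 30, 60]])

row_sums = arr.sum(axis=1)       # shape (2,)
as_column = row_sums[:, None]    # shape (2, 1), broadcasts over the columns
pct = arr / as_column * 100

# keepdims=True keeps the summed axis, giving the (2, 1) shape directly.
pct_keepdims = arr / arr.sum(axis=1, keepdims=True) * 100

assert row_sums.shape == (2,)
assert as_column.shape == (2, 1)
assert np.allclose(pct, pct_keepdims)
```

Dividing by the 1-D `row_sums` directly would try to align it with the columns instead. That raises a shape error here, and on a square array it would silently divide by the wrong totals.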

Answer 1: (score: 1)

The fastest option is to use NumPy. Its vectorized operations keep the computation fast even on large datasets.

import numpy as np
import pandas as pd

# load the data (file name assumed, taken from the answer above);
# without index_col the topic names stay in the 'Unnamed: 0' column
data = pd.read_csv('Topic_Survey_Assignment.csv')
cols = ['Very interested', 'Somewhat interested', 'Not interested']

# get the values
values = data[cols].values
# get the sum of each row
sums = values.sum(axis=1)
# reshape the sums into a column so they broadcast in the division
sums = np.reshape(sums, (-1, 1))
# divide each value by its row sum and multiply by 100
percentages = (values / sums) * 100
# assign the calculation back to the original data
data[cols] = percentages
# print the data
print(data)
                   Unnamed: 0  Very interested  Somewhat interested  Not interested
0   Big Data (Spark / Hadoop)        60.877514            33.318099        5.804388
1  Data Analysis / Statistics        77.007299            20.255474        2.737226
2             Data Journalism        20.235849            50.990566       28.773585
3          Data Visualization        61.580882            33.731618        4.687500
4               Deep Learning        58.229599            35.500231        6.270171
5            Machine Learning        74.724771            21.880734        3.394495
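
Whether NumPy is actually faster than `DataFrame.div` depends on the data and the library versions, so it is worth measuring. A rough sketch using `timeit` on randomly generated counts (the frame size, the seed, and the repeat count are arbitrary choices for illustration, not part of the original answer):

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic survey-like counts; shape and values are arbitrary.
rng = np.random.default_rng(0)
cols = ["Very interested", "Somewhat interested", "Not interested"]
data = pd.DataFrame(rng.integers(1, 1000, size=(100_000, 3)), columns=cols)


def with_pandas(df):
    """Row percentages via DataFrame.div."""
    return df.div(df.sum(axis=1), axis=0).mul(100)


def with_numpy(df):
    """Row percentages via plain NumPy broadcasting."""
    values = df.to_numpy()
    return values / values.sum(axis=1, keepdims=True) * 100


# Both approaches should give the same numbers.
assert np.allclose(with_pandas(data).to_numpy(), with_numpy(data))

for fn in (with_pandas, with_numpy):
    seconds = timeit.timeit(lambda: fn(data), number=20)
    print(f"{fn.__name__}: {seconds:.4f} s for 20 runs")
```

The timings will differ between machines, so treat them as a relative comparison only.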

Answer 2: (score: 0)

import pandas as pd

df = pd.read_csv('filename.csv')
# row total across the three answer columns
total = df['Very interested'] + df['Somewhat interested'] + df['Not interested']
df['very_interested_pct'] = df['Very interested'] / total * 100

This creates a new column named very_interested_pct. You can do the same for the other two columns and then drop the original columns.
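
Repeating that line for every column can be written as a loop. A minimal sketch, assuming the three column names from the question and using a small inline frame in place of 'filename.csv'; the `_pct` names for the other two columns are an extrapolation from `very_interested_pct`, not something the answer specifies:

```python
import pandas as pd

# Inline stand-in for the CSV; the counts for this row are from the question.
df = pd.DataFrame(
    {
        "Very interested": [1332],
        "Somewhat interested": [729],
        "Not interested": [127],
    },
    index=["Big Data (Spark / Hadoop)"],
)

count_cols = ["Very interested", "Somewhat interested", "Not interested"]

# Compute the row totals once, before any new columns are added.
row_total = df[count_cols].sum(axis=1)

for col in count_cols:
    pct_name = col.lower().replace(" ", "_") + "_pct"  # e.g. very_interested_pct
    df[pct_name] = df[col] / row_total * 100

# Drop the original count columns, keeping only the percentages.
df = df.drop(columns=count_cols)
print(df)
```

Computing `row_total` before the loop matters: summing inside the loop over all columns would start including the newly added percentage columns.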