I have a DataFrame that looks like this:
InvoiceID PayerAccountId ... user:Project user:Purpose
0 314758801 123456789012 ... NaN NaN
1 314758801 123456789012 ... NaN NaN
2 314758801 123456789012 ... NaN NaN
3 314758801 123456789012 ... NaN NaN
4 314758801 123456789012 ... NaN NaN
... ... ... ... ... ...
1726119 NaN 123456789012 ... NaN NaN
1726120 NaN 123456789012 ... NaN NaN
1726121 NaN 123456789012 ... NaN NaN
1726122 NaN 123456789012 ... NaN NaN
1726123 NaN 123456789012 ... NaN NaN
[1726124 rows x 27 columns]
Here is its info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1726124 entries, 0 to 1726123
Data columns (total 27 columns):
InvoiceID object
PayerAccountId object
LinkedAccountId object
RecordType object
ProductName object
RateId object
SubscriptionId object
UsageType object
Operation object
AvailabilityZone object
ReservedInstance object
ItemDescription object
UsageStartDate datetime64[ns]
UsageEndDate datetime64[ns]
UsageQuantity float64
BlendedRate float64
BlendedCost float64
UnBlendedRate float64
UnBlendedCost float64
ResourceId object
aws:cloudformation:stack-name object
user:Cost object
user:CostNo object
user:Dept object
user:Name object
user:Project object
user:Purpose object
dtypes: datetime64[ns](2), float64(5), object(20)
memory usage: 355.6+ MB
I want to set the index using the columns of type object:
Index(['InvoiceID', 'PayerAccountId', 'LinkedAccountId', 'RecordType',
'ProductName', 'RateId', 'SubscriptionId', 'UsageType', 'Operation',
'AvailabilityZone', 'ReservedInstance', 'ItemDescription', 'ResourceId',
'aws:cloudformation:stack-name', 'user:Cost', 'user:CostNo',
'user:Dept', 'user:Name', 'user:Project', 'user:Purpose'],
dtype='object')
Then, for each index value, I want to get the sum of the float columns and the sum of UsageEndDate - UsageStartDate. How can I achieve that? Thanks in advance.
Thanks to @Joshua Maerker for the help. Your code inspired me, and this is the final solution:
import pandas as pd
# Define the column data types
# (np.float and np.object were removed in NumPy 1.24, so plain dtype names are used)
data_type = {
    "UsageStartDate": "datetime64[ns]",
    "UsageEndDate": "datetime64[ns]",
    "UsageQuantity": "float64",
    "BlendedRate": "float64",
    "BlendedCost": "float64",
    "UnBlendedRate": "float64",
    "UnBlendedCost": "float64",
}
# Read every column as object first; the types are converted below
df = pd.read_csv("data.csv", dtype=object)
# Drop the useless columns
list_drop = ["RecordId", "PricingPlanId"]
df.drop(columns=list_drop, inplace=True)
# Convert the date and numeric columns to their target types
for k, v in data_type.items():
    df[k] = df[k].astype(v)
# Get the unique attributes
df1 = df.drop(columns=list(data_type.keys())).drop_duplicates().reset_index(drop=True)
# Add the auxiliary column
df["Auxiliary"] = df[df1.columns].apply(lambda row: ''.join(row.values.astype(str)), axis=1)
df1["Auxiliary"] = df1[df1.columns].apply(lambda row: ''.join(row.values.astype(str)), axis=1)
# Add the duration column
df["Duration"] = df["UsageEndDate"] - df["UsageStartDate"]
# Define the aggregation applied to each group
agg = {
    "UsageQuantity": "sum",
    "Duration": "sum",
    "BlendedCost": "sum",
    "UnBlendedCost": "sum",
}
# Get the result
result = df.groupby("Auxiliary", sort=False).agg(agg)
# Combine the result
cleaned = pd.merge(df1, result, how="inner", on="Auxiliary")
# Drop auxiliary column
df = cleaned.drop(columns="Auxiliary")
# Write the result to a MySQL database
# (engine is not defined in this snippet; it is a SQLAlchemy engine created elsewhere)
df.to_sql(name="cleaned_result", con=engine, if_exists="replace", index=False)
By the way, your approach for creating the auxiliary column did not work for me, perhaps because my rows contain NaN.
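The NaN problem described above can be reproduced on a small frame. This is a minimal sketch, not the original data: the frame `toy` and its columns `a` and `b` are made up for illustration. Adding two object columns gives NaN wherever either side is missing, so the key is lost for that row. Filling the gaps with an empty string first keeps the key, and a separator (here `|`, an arbitrary choice) avoids collisions between keys such as "ab" + "" and "a" + "b".

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column "b" has a missing value in the second row
toy = pd.DataFrame({"a": ["x", "y"], "b": ["1", np.nan]}, dtype=object)

# Plain concatenation: a NaN in any column turns the whole key into NaN
naive_key = toy["a"] + toy["b"]

# Fill missing values first and join with a separator so the key survives
safe_key = toy["a"].fillna("") + "|" + toy["b"].fillna("")

print(naive_key.isna().tolist())
print(safe_key.tolist())
```

The final solution above sidesteps the same issue differently, by converting each row to strings with `row.values.astype(str)` before joining.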
Answer 0 (score: 0)
Try this:
# select all columns with object Type
dtypOpj = df.select_dtypes(include=['object'])
# create a new column
df['indexstring'] = ""
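# iterate over all columns with object type and concatenate them
# (note: string + NaN gives NaN, so a row with any missing value gets a NaN key)
for column in dtypOpj.columns:
    df['indexstring'] = df['indexstring'] + df[column]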
# Set new Column as Index
df = df.set_index('indexstring')
# select all float types
dtypFloat = df.select_dtypes(include=['float64', 'float32'])
# sum of all Float Columns
sumFloats = df[dtypFloat.columns].sum()
# sum of UsageEndDate - UsageStartDate
df['sumDifference'] = df["UsageEndDate"] - df["UsageStartDate"]
df['sumDifference'].sum()
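As a possible alternative to building a concatenated key, the object columns can be passed to `groupby` directly. The sketch below is an assumption-laden miniature, not the original billing file: the frame `toy`, its three rows, and the choice of two key columns are invented for illustration. It relies on the `dropna=False` option of `groupby` (available since pandas 1.1) so that groups whose key contains NaN are kept rather than dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the billing data (three rows, two key columns)
toy = pd.DataFrame({
    "ProductName": ["EC2", "EC2", "S3"],
    "user:Project": [np.nan, np.nan, "web"],
    "UsageStartDate": pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 00:00"]),
    "UsageEndDate": pd.to_datetime(
        ["2020-01-01 01:00", "2020-01-01 02:00", "2020-01-01 01:00"]),
    "BlendedCost": [1.0, 2.0, 4.0],
})

# Duration per row, as in the solution above
toy["Duration"] = toy["UsageEndDate"] - toy["UsageStartDate"]

# Stand-in for the full list of object columns
key_cols = ["ProductName", "user:Project"]

# dropna=False keeps the group whose key contains NaN
summary = (
    toy.groupby(key_cols, dropna=False)[["BlendedCost", "Duration"]]
    .sum()
    .reset_index()
)
print(summary)
```

On the real data, `key_cols` could be taken from `df.select_dtypes(include=['object']).columns.tolist()`, as in the answer above. Whether this is faster than the auxiliary-column approach on 1.7 million rows has not been measured here.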