Tags: python pandas dataframe


         InvoiceID PayerAccountId  ... user:Project user:Purpose
0        314758801   123456789012  ...          NaN          NaN
1        314758801   123456789012  ...          NaN          NaN
2        314758801   123456789012  ...          NaN          NaN
3        314758801   123456789012  ...          NaN          NaN
4        314758801   123456789012  ...          NaN          NaN
...            ...            ...  ...          ...          ...
1726119        NaN   123456789012  ...          NaN          NaN
1726120        NaN   123456789012  ...          NaN          NaN
1726121        NaN   123456789012  ...          NaN          NaN
1726122        NaN   123456789012  ...          NaN          NaN
1726123        NaN   123456789012  ...          NaN          NaN

[1726124 rows x 27 columns]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1726124 entries, 0 to 1726123
Data columns (total 27 columns):
InvoiceID                        object
PayerAccountId                   object
LinkedAccountId                  object
RecordType                       object
ProductName                      object
RateId                           object
SubscriptionId                   object
UsageType                        object
Operation                        object
AvailabilityZone                 object
ReservedInstance                 object
ItemDescription                  object
UsageStartDate                   datetime64[ns]
UsageEndDate                     datetime64[ns]
UsageQuantity                    float64
BlendedRate                      float64
BlendedCost                      float64
UnBlendedRate                    float64
UnBlendedCost                    float64
ResourceId                       object
aws:cloudformation:stack-name    object
user:Cost                        object
user:CostNo                      object
user:Dept                        object
user:Name                        object
user:Project                     object
user:Purpose                     object
dtypes: datetime64[ns](2), float64(5), object(20)
memory usage: 355.6+ MB

I want to set the index using the object-type columns:

Index(['InvoiceID', 'PayerAccountId', 'LinkedAccountId', 'RecordType',
       'ProductName', 'RateId', 'SubscriptionId', 'UsageType', 'Operation',
       'AvailabilityZone', 'ReservedInstance', 'ItemDescription', 'ResourceId',
       'aws:cloudformation:stack-name', 'user:Cost', 'user:CostNo',
       'user:Dept', 'user:Name', 'user:Project', 'user:Purpose'],
      dtype='object')

Then, for each index value, I want to get the sum of the float-type columns and the sum of UsageEndDate - UsageStartDate. How can I achieve that? Thanks in advance.

Thanks to @Joshua Maerker for the help. Your code inspired me, so here is the final solution:

import pandas as pd
import numpy as np

# Target dtypes for the date and numeric columns
data_type = {
    "UsageStartDate": "datetime64[ns]",
    "UsageEndDate": "datetime64[ns]",
    "UsageQuantity": float,
    "BlendedRate": float,
    "BlendedCost": float,
    "UnBlendedRate": float,
    "UnBlendedCost": float,
}

# Read every column as object (string) first; the types are converted below
df = pd.read_csv("data.csv", dtype=object)

# Drop the useless columns
list_drop = ["RecordId", "PricingPlanId"]
df.drop(columns=list_drop, inplace=True)

# Convert the date and numeric columns to their target dtypes
for k, v in data_type.items():
    df[k] = df[k].astype(v)

# Get the unique attributes
df1 = df.drop(columns=list(data_type.keys())).drop_duplicates().reset_index(drop=True)

# Add an auxiliary key column: all attribute values of a row joined into one string
df["Auxiliary"] = df[df1.columns].apply(lambda row: ''.join(row.values.astype(str)), axis=1)
df1["Auxiliary"] = df1[df1.columns].apply(lambda row: ''.join(row.values.astype(str)), axis=1)

# Add the duration column
df["Duration"] = df["UsageEndDate"] - df["UsageStartDate"]

# Aggregation rules to apply to each group
agg = {
    "UsageQuantity": "sum",
    "Duration": "sum",
    "BlendedCost": "sum",
    "UnBlendedCost": "sum",
}

# Get the result
result = df.groupby("Auxiliary", sort=False).agg(agg)

# Combine the result
cleaned = pd.merge(df1, result, how="inner", on="Auxiliary")

# Drop auxiliary column
df = cleaned.drop(columns="Auxiliary")

# Write the result to a MySQL database
# (`engine` must be a SQLAlchemy engine created beforehand; its setup is not shown here)
df.to_sql(name="cleaned_result", con=engine, if_exists="replace", index=False)
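
A possible alternative to the auxiliary string key, offered only as a sketch: group directly on the attribute columns and pass `dropna=False` so that rows whose tags are NaN still form their own groups. This assumes pandas 1.1 or later (where `groupby` accepts `dropna`). The toy data, `key_cols` and `value_cols` below are made up for illustration and are not part of the original solution.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the billing export (values are invented).
df = pd.DataFrame({
    "ProductName": ["EC2", "EC2", "S3", "S3"],
    "user:Project": ["alpha", "alpha", np.nan, np.nan],
    "UsageStartDate": pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 00:00", "2020-01-01 02:00"]
    ),
    "UsageEndDate": pd.to_datetime(
        ["2020-01-01 01:00", "2020-01-01 02:00", "2020-01-01 01:00", "2020-01-01 03:00"]
    ),
    "UsageQuantity": [1.0, 2.0, 0.5, 0.5],
    "BlendedCost": [0.10, 0.20, 0.01, 0.02],
})

# Duration of each usage record as a timedelta
df["Duration"] = df["UsageEndDate"] - df["UsageStartDate"]

# Columns to sum, and the remaining attribute columns used as group keys
value_cols = ["UsageQuantity", "BlendedCost", "Duration"]
date_cols = ["UsageStartDate", "UsageEndDate"]
key_cols = [c for c in df.columns if c not in value_cols + date_cols]

# dropna=False keeps groups whose key contains NaN (the default would drop them)
result = (
    df.groupby(key_cols, dropna=False)[value_cols]
      .sum()
      .reset_index()
)
print(result)
```

Grouping on the columns themselves also avoids a risk of the joined-string key: two different rows can produce the same string once their values are concatenated without a separator.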


# Select all columns with object dtype
dtypOpj = df.select_dtypes(include=['object'])

# Create a new, empty string column
df['indexstring'] = ""

# Concatenate every object column into it (NaN is treated as an empty string,
# otherwise a single NaN would turn the whole key into NaN)
for column in dtypOpj.columns:
    df['indexstring'] = df['indexstring'] + df[column].fillna("").astype(str)

# Set the new column as the index
df = df.set_index('indexstring')

# Select all float columns
dtypFloat = df.select_dtypes(include=['float64', 'float32'])

# Sum of every float column
sumFloats = df[dtypFloat.columns].sum()

# Per-row difference UsageEndDate - UsageStartDate, then its total
df['sumDifference'] = df["UsageEndDate"] - df["UsageStartDate"]
totalDifference = df['sumDifference'].sum()
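
To make the last two steps concrete, here is a small self-contained sketch on invented data. It sums the float columns and totals the UsageEndDate - UsageStartDate differences, then converts that total to hours. The sample values and the names `float_sums`, `total_duration` and `total_hours` are assumptions for illustration only.

```python
import pandas as pd

# Two invented usage records
df = pd.DataFrame({
    "UsageStartDate": pd.to_datetime(["2020-01-01 00:00", "2020-01-01 01:00"]),
    "UsageEndDate": pd.to_datetime(["2020-01-01 01:00", "2020-01-01 03:00"]),
    "UsageQuantity": [1.5, 2.5],
    "BlendedCost": [0.25, 0.75],
})

# Column-wise sum of all float64 columns
float_sums = df.select_dtypes(include=["float64"]).sum()

# Subtracting two datetime columns gives a timedelta column; its sum is a Timedelta
total_duration = (df["UsageEndDate"] - df["UsageStartDate"]).sum()

# Convert the total to a plain float number of hours
total_hours = total_duration.total_seconds() / 3600

print(float_sums)
print(total_duration, total_hours)
```

Converting the total to hours (or seconds) gives a plain float, which is often easier to store in a database than a timedelta.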