Pandas: performance problem when adding new columns

Time: 2020-02-01 17:11:56

Tags: python pandas

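I am trying to add 2 new columns that extract the day and the month from a full date. My dataset currently has about 1.2 million records and is expected to exceed 20 million by the end of the year. Adding the columns takes a long time, so I would like to ask what the best practice is.

I am using SQLite. Here is my code: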

import sqlite3
import numpy as np
import pandas as pd

cnx = sqlite3.connect('data/firstline.db')
df = pd.read_sql_query("SELECT * FROM firstline_srs", cnx)
df['day'] = pd.DatetimeIndex(df['Open_Date']).day
df['month'] = pd.DatetimeIndex(df['Open_Date']).month

df['Product_Name'].replace('', np.nan, inplace=True)
df['Product_Name'].fillna("N", inplace = True) 

df['product_Type'].replace('', np.nan, inplace=True)
df['product_Type'].fillna("A", inplace = True) 

df['full_path'] = df['Type'] + "/" + df['Area'] + "/" + df['Sub_Area'] + "/" + df['product_Type'] + "/" + df['Product_Name']

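Thanks a lot for your continued support :)

1 Answer:

Answer 0: (score: 0)

If the original DataFrame has no missing data, the solution can be simplified a little.

I also think that using inplace is not good practice; see this and this.

Concatenating all the columns in a single expression is also a good solution and one of the fastest approaches; see this.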

df = pd.read_sql_query("SELECT * FROM firstline_srs", cnx)
df['Open_Date'] = pd.to_datetime(df['Open_Date'])

df['day'] = df['Open_Date'].dt.day
df['month'] = df['Open_Date'].dt.month

df['Product_Name'] = df['Product_Name'].replace('', 'N')
df['product_Type'] = df['product_Type'].replace('', 'A')


df['full_path'] = df['Type'] + "/" + df['Area'] + "/" + df['Sub_Area'] + "/" + df['product_Type'] + "/" + df['Product_Name']
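As a side note on the concatenation step, the same path can probably also be built with `Series.str.cat`, which takes a list of Series and a separator. The sketch below uses a small made-up DataFrame (the values are hypothetical; only the column names come from the question) and does not claim that either variant is faster. That would have to be timed on the real data.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "Type": ["HW", "SW"],
    "Area": ["North", "South"],
    "Sub_Area": ["Sub1", "Sub2"],
    "product_Type": ["A", "B"],
    "Product_Name": ["Router", "N"],
})

path_cols = ["Type", "Area", "Sub_Area", "product_Type", "Product_Name"]

# Variant 1: chained "+" on whole columns, as in the answer.
plus_path = (
    df["Type"] + "/" + df["Area"] + "/" + df["Sub_Area"]
    + "/" + df["product_Type"] + "/" + df["Product_Name"]
)

# Variant 2: Series.str.cat joins the first column with the others using sep.
cat_path = df[path_cols[0]].str.cat([df[c] for c in path_cols[1:]], sep="/")

df["full_path"] = cat_path
```

Both variants operate on whole columns at once rather than row by row, and both yield a missing value for any row where one of the parts is missing, so the empty-string and NaN handling still has to happen first.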

If there are missing values:

df = pd.read_sql_query("SELECT * FROM firstline_srs", cnx)
df['Open_Date'] = pd.to_datetime(df['Open_Date'])

df['day'] = df['Open_Date'].dt.day
df['month'] = df['Open_Date'].dt.month

df['Product_Name'] = df['Product_Name'].replace('', np.nan).fillna("N")
df['product_Type'] = df['product_Type'].replace('', np.nan).fillna("A")


df['full_path'] = df['Type'] + "/" + df['Area'] + "/" + df['Sub_Area'] + "/" + df['product_Type'] + "/" + df['Product_Name']
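To try the steps above without the real database, here is a minimal self-contained sketch. It assumes an in-memory SQLite table with a few made-up rows (the data is hypothetical; the table and column names follow the question). It also uses the `parse_dates` argument of `pd.read_sql_query` so that `Open_Date` is converted while loading, which is a swapped-in alternative to the separate `pd.to_datetime` call, not something the answer itself uses.

```python
import sqlite3

import numpy as np
import pandas as pd

# Hypothetical rows standing in for the real firstline_srs table.
rows = [
    ("2020-01-15", "HW", "North", "Sub1", "", "Router"),
    ("2020-02-03", "SW", "South", "Sub2", "B", ""),
    ("2020-03-27", "HW", "East", "Sub3", None, None),
]

cnx = sqlite3.connect(":memory:")
cnx.execute(
    "CREATE TABLE firstline_srs "
    "(Open_Date TEXT, Type TEXT, Area TEXT, Sub_Area TEXT, "
    "product_Type TEXT, Product_Name TEXT)"
)
cnx.executemany("INSERT INTO firstline_srs VALUES (?, ?, ?, ?, ?, ?)", rows)

# parse_dates converts Open_Date to datetime64 while the query result is loaded.
df = pd.read_sql_query(
    "SELECT * FROM firstline_srs", cnx, parse_dates=["Open_Date"]
)
cnx.close()

# Vectorised extraction through the .dt accessor.
df["day"] = df["Open_Date"].dt.day
df["month"] = df["Open_Date"].dt.month

# Treat empty strings as missing, then fill every missing value with a default.
df["Product_Name"] = df["Product_Name"].replace("", np.nan).fillna("N")
df["product_Type"] = df["product_Type"].replace("", np.nan).fillna("A")

df["full_path"] = (
    df["Type"] + "/" + df["Area"] + "/" + df["Sub_Area"]
    + "/" + df["product_Type"] + "/" + df["Product_Name"]
)
```

If the real table has columns that are not needed, listing only the required ones in the `SELECT` instead of using `*` should reduce how much data pandas has to load.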