How to merge two programs with scheduled execution

Time: 2015-07-09 10:28:43

Tags: python csv pandas merge

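I am trying to merge two programs, or to write a third program that calls these two programs as functions. They should run one after the other, with an interval of a few minutes between them. Something like a makefile, which will include more programs later. I have not been able to merge them, nor to put them into a form that lets me call them from a new main program.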

Program_1 master_id.py picks up the *.csv files from a folder location and, after computation, appends to the master_ids.csv file in another folder location.

Program_2 master_count.py divides each count by the number of occurrences of the corresponding Ids in the time series.

Program_1 master_id.py

import pandas as pd
import numpy as np

# csv file contents
# Need to change to path as the Transition_Data has several *.CSV files

csv_file1 = 'Transition_Data/Test_1.csv' 
csv_file2 = '/Transition_Data/Test_2.csv'

#master file to be appended only

master_csv_file = 'Data_repository/master_lac_Test.csv'

csv_file_all = [csv_file1, csv_file2]

# read each csv file into a DataFrame using a list comprehension

df_all = [pd.read_csv(csv_file) for csv_file in csv_file_all]

# processing
# =====================================================
# concat along axis=0, outer join on axis=1
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')

# custom function to handle/merge duplicates on Ids (axis=0)
def apply_func(group):
    return group.ffill().iloc[-1]

# remove Ids duplicates
merged_unique = merged.groupby(level='Ids').apply(apply_func)

# do the subtraction

df_master = pd.read_csv(master_csv_file, index_col=['Ids']).sort_index()

# select matching records and horizontal concat
df_matched = pd.concat([df_master,merged_unique.reindex(df_master.index)], axis=1)

# use broadcasting
df_matched.iloc[:, 1:] = df_matched.iloc[:, 1:].sub(df_matched.iloc[:, 0], axis=0)

print(df_matched)

Program_2 master_count.py # This neither raises an error nor produces any output.

import pandas as pd
import numpy as np

csv_file1 = '/Data_repository/master_lac_Test.csv'
csv_file2 = '/Data_repository/lat_lon_master.csv'

df1 = pd.read_csv(csv_file1).set_index('Ids')

# need to sort index in file 2
df2 = pd.read_csv(csv_file2).set_index('Ids').sort_index()

# df1 and df2 have a duplicated column 00:00:00, use df1 without 1st column
temp = df2.join(df1.iloc[:, 1:])

# do the division by number of occurrences of each Ids
# and add column 00:00:00
def my_func(group):
    num_obs = len(group)
    # process with column name after 00:30:00 (inclusive)
    group.iloc[:,4:] = (group.iloc[:,4:]/num_obs).add(group.iloc[:,3], axis=0)
    return group

result = temp.groupby(level='Ids').apply(my_func)

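I am trying to write a main program that first calls master_id.py and then calls master_count.py. Is there a way to merge them into one program, or to write them as functions and call those functions from a new program? Please suggest.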

1 answer:

Answer 0 (score: 1)

Okay, suppose you have program1.py:

import pandas as pd
import numpy as np

def main_program1():
    csv_file1 = 'Transition_Data/Test_1.csv' 
    ...
    return df_matched

and then program2.py:

import pandas as pd
import numpy as np

def main_program2():
    csv_file1 = '/Data_repository/master_lac_Test.csv'
    ...
    result = temp.groupby(level='Ids').apply(my_func)
    return result

You can now use them from a separate Python program, say main.py:

import time
import program1 # imports program1.py
import program2 # imports program2.py

df_matched = program1.main_program1()
print(df_matched)
# wait
min_wait = 1
time.sleep(60*min_wait)
# call the second one
result = program2.main_program2()
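
Because the question mentions adding more programs later, the two calls in main.py could be generalised to a list of steps. This is only a sketch under assumptions: run_pipeline, step_one and step_two are hypothetical names that do not appear in the original, and each step is assumed to be a zero-argument function such as program1.main_program1. The sleep parameter exists only so that the wait can be swapped out, for example when testing.

```python
import time


def run_pipeline(steps, wait_seconds=60, sleep=time.sleep):
    """Run each callable in `steps` in order, pausing between consecutive steps.

    `steps` is a list of zero-argument callables (hypothetical stand-ins for
    program1.main_program1, program2.main_program2, ...).
    Returns a list with the return value of every step.
    """
    results = []
    for position, step in enumerate(steps):
        if position > 0:
            # wait only between steps, not before the first one
            sleep(wait_seconds)
        results.append(step())
    return results


# Stand-in steps so the sketch runs without the CSV files
def step_one():
    return "ids done"


def step_two():
    return "counts done"


# wait_seconds=0 keeps the demonstration instant
print(run_pipeline([step_one, step_two], wait_seconds=0))
```

In main.py the list would presumably be [program1.main_program1, program2.main_program2] with wait_seconds=60; adding a third program later would then mean appending one more function to the list.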

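There are many ways to "improve" this, but hopefully it shows you the gist. In particular, I would suggest using the construct described in "What does if __name__ == "__main__": do?" in each file, so that they can easily be executed from the command line or called from Python.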
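
As an illustration of that idiom, program1.py might end up looking roughly like this. It is a sketch only: the body of main_program1 is a placeholder, since the real processing is elided in the answer above.

```python
# program1.py (sketch)


def main_program1():
    # Placeholder for the real processing, which is elided in the answer;
    # the actual function would build and return df_matched.
    return "placeholder result"


if __name__ == "__main__":
    # This branch runs for `python program1.py`,
    # but not when another file does `import program1`.
    print(main_program1())
```

With this guard in place, main.py can import program1 without triggering the processing at import time, while running the file directly still executes it.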

Another option is a shell script, which for 'master_id.py' and 'master_count.py' becomes (in its simplest form):

python master_id.py
sleep 60
python master_count.py

Saved as 'main.sh', this can be executed with:

sh main.sh
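
If a shell is not available, roughly the same sequence could be driven from Python with the standard subprocess module. This is a hedged sketch rather than part of the original answer: run_scripts is a hypothetical helper, and the demo uses stand-in commands so that it runs without the two scripts being present.

```python
import subprocess
import sys
import time


def run_scripts(commands, wait_seconds=60, sleep=time.sleep):
    """Run each command in order, pausing between consecutive commands.

    `commands` is a list of argument lists, e.g. [sys.executable, "script.py"].
    Returns the list of CompletedProcess objects.
    """
    completed = []
    for position, command in enumerate(commands):
        if position > 0:
            sleep(wait_seconds)
        # check=True raises CalledProcessError on a non-zero exit status
        completed.append(subprocess.run(command, check=True))
    return completed


if __name__ == "__main__":
    # Stand-in commands; for the scripts above one would pass
    # [sys.executable, "master_id.py"] and [sys.executable, "master_count.py"].
    demo = [
        [sys.executable, "-c", "print('step 1')"],
        [sys.executable, "-c", "print('step 2')"],
    ]
    run_scripts(demo, wait_seconds=0)
```

One difference from main.sh as written: with check=True the sequence stops as soon as a script exits with an error, whereas the plain shell script carries on to the next line.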