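I'm working in Jupyter and have a DataFrame with many columns, a lot of which are dates. I want to write a loop that creates a new column holding the date difference between two similarly named columns.
For example: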
df['Site Visit ACD']
df['Site Visit ECD']
df['Sold ACD (Loc A)']
df['Sold ECD (Loc A)']
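The new column would be df['Site Visit Cycle Time'] = the date difference between ACD and ECD. In general, it will always be the column containing "ACD" minus the column containing "ECD". How can I write this?
Any help is appreciated!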
Answer 0 (score: 1)
The following code should do what you want. Hope it helps.
import pandas as pd
from fuzzywuzzy import fuzz

name = pd.read_excel('Book1.xlsx', sheet_name='name')
# Snapshot the original columns so newly added columns are not compared
columns = list(name.columns)
for i in columns:
    for j in columns:
        # Pair each "ACD" column with its similarly named "ECD" column,
        # so the result is always ACD minus ECD
        if 'ACD' in i and 'ECD' in j and fuzz.ratio(i, j) > 90:
            if 'Site Visit' in i:
                name['Site Visit'] = name[i] - name[j]
            else:
                name['difference between '+i+' and '+j] = name[i] - name[j]
print(name)
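To see why a threshold of 90 pairs up these particular columns, the similarity score can be approximated with the standard library. This is only a sketch: difflib.SequenceMatcher is used here as an assumed stand-in for fuzz.ratio, and the helper name similarity is hypothetical, so the exact scores fuzzywuzzy reports may differ slightly.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Hypothetical helper: similarity on a 0-100 scale via difflib
    (a stand-in for illustration, not fuzzywuzzy itself)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

columns = ['Site Visit ACD', 'Site Visit ECD',
           'Sold ACD (Loc A)', 'Sold ECD (Loc A)']

# Keep only (ACD, ECD) pairs whose names are nearly identical
pairs = [(a, b) for a in columns for b in columns
         if 'ACD' in a and 'ECD' in b and similarity(a, b) > 90]
print(pairs)
```

With the four example column names from the question, only the two intended pairs pass the threshold; unrelated combinations such as 'Site Visit ACD' and 'Sold ECD (Loc A)' score far lower.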
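Answer 1 (score: 1)
"In general, it will always be the column containing "ACD" minus the column containing "ECD"."
This answer assumes that the column headers are not noisy, i.e. paired names differ only in "ACD" / "ECD" and are otherwise identical (including upper/lower case). It also assumes that a matching column always exists. The code does not check whether writing the date difference overwrites an existing column.
This approach runs in time linear in the number of columns, because we iterate over the columns once and then access the matching column directly by name.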
test.csv
Site Visit ECD,Site Visit ACD,Sold ECD (Loc A),Sold ACD (Loc A)
2018-06-01,2018-06-04,2018-07-05,2018-07-06
2017-02-22,2017-03-02,2017-02-27,2017-03-02
Code
import pandas as pd

df = pd.read_csv("test.csv", delimiter=",")
for col_name_acd in df.columns:
    # Skip columns that don't have "ACD" in their name
    if "ACD" not in col_name_acd:
        continue
    col_name_ecd = col_name_acd.replace("ACD", "ECD")
    # we assume there is always a matching "ECD" column
    assert col_name_ecd in df.columns
    col_name_diff = col_name_acd.replace("ACD", "Cycle Time")
    df[col_name_diff] = df[col_name_acd].astype('datetime64[ns]') - df[col_name_ecd].astype('datetime64[ns]')
print(df.head())
Output
  Site Visit ECD Site Visit ACD Sold ECD (Loc A) Sold ACD (Loc A)  \
0     2018-06-01     2018-06-04       2018-07-05       2018-07-06
1     2017-02-22     2017-03-02       2017-02-27       2017-03-02

  Site Visit Cycle Time Sold Cycle Time (Loc A)
0                3 days                  1 days
1                8 days                  3 days
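If the assumptions stated in this answer do not hold for your data, one possible variation is sketched below. It is not part of the original answer: the function name add_cycle_times and its parameters are hypothetical, pd.to_datetime is swapped in for astype('datetime64[ns]'), "ACD" columns without a matching "ECD" column are skipped rather than triggering an assert, and an existing target column raises an error instead of being silently overwritten.

```python
import pandas as pd

def add_cycle_times(df, actual="ACD", expected="ECD", suffix="Cycle Time"):
    """Hypothetical helper: add one '<name> Cycle Time' column per ACD/ECD pair."""
    out = df.copy()
    for col_acd in df.columns:
        if actual not in col_acd:
            continue
        col_ecd = col_acd.replace(actual, expected)
        if col_ecd not in df.columns:
            # No partner column: skip instead of failing an assert
            continue
        col_diff = col_acd.replace(actual, suffix)
        if col_diff in out.columns:
            # Refuse to overwrite an existing column
            raise ValueError(f"column {col_diff!r} already exists")
        out[col_diff] = pd.to_datetime(out[col_acd]) - pd.to_datetime(out[col_ecd])
    return out

df = pd.DataFrame({
    "Site Visit ECD": ["2018-06-01", "2017-02-22"],
    "Site Visit ACD": ["2018-06-04", "2017-03-02"],
    "Sold ACD (Loc A)": ["2018-07-06", "2017-03-02"],  # no matching ECD column
})
result = add_cycle_times(df)
print(result["Site Visit Cycle Time"].dt.days.tolist())
```

Returning a copy keeps the input DataFrame unchanged, which makes the helper safe to re-run in a notebook cell.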