我正在使用Pandas填充6个新变量,其值取决于其他数据变量。整个数据集包括大约70万行和14个变量(列),其中包括我新添加的那些。
我的第一种方法是使用itertuples()
,主要是因为这里的经验很少。这大约需要9600秒。
我通过使用apply()设法提高了效率(〜3500秒)。这是其中一个新变量的示例。
housing_df = utils.make_data_frame("data/source_data/housing_with_child.dta", "stata")
""" Populate new variables """
# setup new variables
housing_df["hh_n"] = ""
housing_df["bio_d"] = ""
housing_df["step_d"] = ""
housing_df["child_d"] = ""
housing_df["both_bio"] = ""
housing_df["hhtype"] = ""
housing_df["step"] = ""
# hh_n
def hh_n(row):
df = housing_df.loc[
(housing_df["pidp"] == row["pidp"]) & (housing_df["wave"] == row["wave"])
]
return str(len(df.index) + 1)
housing_df["hh_n"] = housing_df.apply(hh_n, axis=1)
每个新变量都遵循相同的模式,并且需要执行以下操作:
For each row
Get some of the rows existing data (eg pipd and wave)
Find rows in the whole data frame which have the same values (pidp and wave) and put these in a new dataframe
Count the rows we found in the new dataframe
Return the count value for the new variable (hh_n)
这是输出数据的一个小例子:
如果相关,这是我从第一行的stata文件创建数据框的方法:
# Method that returns data frame from stata file or csv
def make_data_frame(data, type):
if type is "stata":
reader = pd.read_stata(data, chunksize=100000)
if type is "csv":
reader = pd.read_csv(data, chunksize=100000)
df = pd.DataFrame()
for itm in reader:
df = df.append(itm)
return df
以下是前50行数据的示例
{'index': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16, 17: 17, 18: 18, 19: 19, 20: 20, 21: 21, 22: 22, 23: 23, 24: 24, 25: 25, 26: 26, 27: 27, 28: 28, 29: 29, 30: 30, 31: 31, 32: 32, 33: 33, 34: 34, 35: 35, 36: 36, 37: 37, 38: 38, 39: 39, 40: 40, 41: 41, 42: 42, 43: 43, 44: 44, 45: 45, 46: 46, 47: 47, 48: 48, 49: 49, 50: 50, 51: 51, 52: 52, 53: 53, 54: 54, 55: 55, 56: 56, 57: 57, 58: 58, 59: 59, 60: 60, 61: 61, 62: 62, 63: 63, 64: 64, 65: 65, 66: 66, 67: 67, 68: 68, 69: 69, 70: 70, 71: 71, 72: 72, 73: 73, 74: 74, 75: 75, 76: 76, 77: 77, 78: 78, 79: 79, 80: 80, 81: 81, 82: 82, 83: 83, 84: 84, 85: 85, 86: 86, 87: 87, 88: 88, 89: 89, 90: 90, 91: 91, 92: 92, 93: 93, 94: 94, 95: 95, 96: 96, 97: 97, 98: 98, 99: 99}, 'pidp': {0: 1367, 1: 2051, 2: 2727, 3: 2727, 4: 2727, 5: 2727, 6: 2727, 7: 3407, 8: 3407, 9: 3407, 10: 3407, 11: 3407, 12: 3407, 13: 3407, 14: 3407, 15: 3407, 16: 3407, 17: 3407, 18: 3407, 19: 3407, 20: 3407, 21: 3407, 22: 3407, 23: 3407, 24: 3407, 25: 3407, 26: 3407, 27: 3407, 28: 3407, 29: 4091, 30: 4091, 31: 4091, 32: 4091, 33: 4091, 34: 4091, 35: 4091, 36: 4091, 37: 4767, 38: 4767, 39: 4767, 40: 4767, 41: 4767, 42: 4767, 43: 5451, 44: 5451, 45: 5451, 46: 5451, 47: 5451, 48: 5451, 49: 6135, 50: 6135, 51: 6135, 52: 6135, 53: 6135, 54: 6807, 55: 6807, 56: 6844, 57: 6844, 58: 6844, 59: 6844, 60: 6844, 61: 6844, 62: 6844, 63: 6844, 64: 6844, 65: 6844, 66: 6844, 67: 6844, 68: 6844, 69: 6844, 70: 6844, 71: 6844, 72: 6844, 73: 6844, 74: 6844, 75: 6844, 76: 6844, 77: 6844, 78: 6844, 79: 6844, 80: 6844, 81: 6844, 82: 6844, 83: 6844, 84: 6844, 85: 6844, 86: 6844, 87: 6844, 88: 6844, 89: 6844, 90: 6844, 91: 6844, 92: 6844, 93: 6844, 94: 6844, 95: 6844, 96: 6844, 97: 6844, 98: 6844, 99: 6844}, 'hidp': {0: 20402, 1: 20402, 2: 5799043, 3: 60159608, 4: 45594010, 5: 53475212, 6: 15449616, 7: 102002, 8: 102002, 9: 30559202, 10: 30559202, 11: 6351204, 12: 6351204, 13: 2590808, 14: 13610, 15: 6812, 16: 13614, 17: 20416, 18: 21902818, 19: 36407220, 20: 36407220, 21: 20424, 22: 20424, 23: 20426, 24: 20426, 25: 9010028, 26: 9010028, 27: 18856430, 28: 18856430, 29: 136002, 30: 136002, 31: 30566002, 32: 30566002, 33: 6358004, 34: 6358004, 35: 20406, 36: 20406, 37: 210802, 38: 210802, 39: 30579602, 40: 30579602, 41: 30579602, 42: 6371604, 43: 210802, 44: 210802, 45: 30579602, 46: 30579602, 47: 30579602, 48: 6371604, 49: 210802, 50: 210802, 51: 30579602, 52: 30579602, 53: 30579602, 54: 285602, 55: 30593202, 56: 37073606, 57: 37073606, 58: 37073606, 59: 37073606, 60: 37073606, 61: 12580008, 62: 12580008, 63: 12580008, 64: 12580008, 65: 12580008, 66: 56052408, 67: 56052408, 68: 56052408, 69: 56052408, 70: 56052408, 71: 38848410, 72: 38848410, 73: 38848410, 74: 38848410, 75: 38848410, 76: 50014012, 77: 50014012, 78: 50014012, 79: 5467216, 80: 5467216, 81: 5467216, 82: 25547618, 83: 25547618, 84: 25547618, 85: 38610420, 86: 38610420, 87: 38610420, 88: 5025224, 89: 5025224, 90: 5025224, 91: 4637626, 92: 4637626, 93: 4637626, 94: 12784028, 95: 12784028, 96: 12784028, 97: 21719230, 98: 21719230, 99: 21719230}, 'apidp': {0: 2051, 1: 1367, 2: 752052125, 3: 752052125, 4: 752052125, 5: 752052125, 6: 752052125, 7: 545740805, 8: 612001365, 9: 545740805, 10: 612001365, 11: 545740805, 12: 612001365, 13: 612001365, 14: 612001365, 15: 612001365, 16: 612001365, 17: 612001365, 18: 612001365, 19: 612001365, 20: 612001369, 21: 612001365, 22: 612001369, 23: 612001365, 24: 612001369, 25: 612001365, 26: 612001369, 27: 612001365, 28: 612001369, 29: 68002045, 30: 68002049, 31: 68002045, 32: 68002049, 33: 68002045, 34: 68002049, 35: 68002045, 36: 68002049, 37: 5451, 38: 6135, 39: 5451, 40: 6135, 41: 5298579, 42: 5451, 43: 4767, 44: 6135, 45: 4767, 46: 6135, 47: 5298579, 48: 4767, 49: 4767, 50: 5451, 51: 4767, 52: 5451, 53: 5298579, 54: 4885131, 55: 4885131, 56: 13644, 57: 748469885, 58: 748469889, 59: 749992405, 60: 749992409, 61: 13644, 62: 748469885, 63: 748469889, 64: 749992405, 65: 749992409, 66: 13644, 67: 748469885, 68: 748469889, 69: 749992405, 70: 749992409, 71: 13644, 72: 748469885, 73: 748469889, 74: 749992405, 75: 749992409, 76: 13644, 77: 748469885, 78: 748469889, 79: 13644, 80: 748469885, 81: 748469889, 82: 13644, 83: 748469885, 84: 748469889, 85: 13644, 86: 748469885, 87: 748469889, 88: 13644, 89: 748469885, 90: 748469889, 91: 13644, 92: 748469885, 93: 748469889, 94: 13644, 95: 748469885, 96: 748469889, 97: 13644, 98: 748469885, 99: 748469889}, 'wave': {0: 2.0, 1: 2.0, 2: 2.0, 3: 8.0, 4: 9.0, 5: 10.0, 6: 11.0, 7: 2.0, 8: 2.0, 9: 3.0, 10: 3.0, 11: 4.0, 12: 4.0, 13: 7.0, 14: 8.0, 15: 9.0, 16: 10.0, 17: 11.0, 18: 12.0, 19: 13.0, 20: 13.0, 21: 14.0, 22: 14.0, 23: 15.0, 24: 15.0, 25: 16.0, 26: 16.0, 27: 17.0, 28: 17.0, 29: 2.0, 30: 2.0, 31: 3.0, 32: 3.0, 33: 4.0, 34: 4.0, 35: 5.0, 36: 5.0, 37: 2.0, 38: 2.0, 39: 3.0, 40: 3.0, 41: 3.0, 42: 4.0, 43: 2.0, 44: 2.0, 45: 3.0, 46: 3.0, 47: 3.0, 48: 4.0, 49: 2.0, 50: 2.0, 51: 3.0, 52: 3.0, 53: 3.0, 54: 2.0, 55: 3.0, 56: 6.0, 57: 6.0, 58: 6.0, 59: 6.0, 60: 6.0, 61: 7.0, 62: 7.0, 63: 7.0, 64: 7.0, 65: 7.0, 66: 8.0, 67: 8.0, 68: 8.0, 69: 8.0, 70: 8.0, 71: 9.0, 72: 9.0, 73: 9.0, 74: 9.0, 75: 9.0, 76: 10.0, 77: 10.0, 78: 10.0, 79: 11.0, 80: 11.0, 81: 11.0, 82: 12.0, 83: 12.0, 84: 12.0, 85: 13.0, 86: 13.0, 87: 13.0, 88: 14.0, 89: 14.0, 90: 14.0, 91: 15.0, 92: 15.0, 93: 15.0, 94: 16.0, 95: 16.0, 96: 16.0, 97: 17.0, 98: 17.0, 99: 17.0}, 'rel': {0: 'unrelated sharer', 1: 'unrelated sharer', 2: 'natural child', 3: 'natural child', 4: 'natural child', 5: 'natural child', 6: 'natural child', 7: 'lawful spouse', 8: 'natural child', 9: 'lawful spouse', 10: 'natural child', 11: 'lawful spouse', 12: 'natural child', 13: 'natural child', 14: 'natural child', 15: 'natural child', 16: 'natural child', 17: 'natural child', 18: 'natural child', 19: 'natural child', 20: 'daughter/son-in-law', 21: 'natural child', 22: 'daughter/son-in-law', 23: 'natural child', 24: 'daughter/son-in-law', 25: 'natural child', 26: 'daughter/son-in-law', 27: 'natural child', 28: 'daughter/son-in-law', 29: 'lawful spouse', 30: 'natural child', 31: 'lawful spouse', 32: 'natural child', 33: 'lawful spouse', 34: 'natural child', 35: 'lawful spouse', 36: 'natural child', 37: 'unrelated sharer', 38: 'natural child', 39: 'other', 40: 'natural child', 41: 'daughter/son-in-law', 42: 'other', 43: 'unrelated sharer', 44: 'natural child', 45: 'other', 46: 'natural child', 47: 'daughter/son-in-law', 48: 'other', 49: 'natural parent', 50: 'natural parent', 51: 'natural parent', 52: 'natural parent', 53: 'lawful spouse', 54: 'natural brother/sister', 55: 'natural brother/sister', 56: 'natural child', 57: 'lawful spouse', 58: 'natural child', 59: 'mother/father-in-law', 60: 'mother/father-in-law', 61: 'natural child', 62: 'lawful spouse', 63: 'natural child', 64: 'mother/father-in-law', 65: 'mother/father-in-law', 66: 'natural child', 67: 'lawful spouse', 68: 'natural child', 69: 'mother/father-in-law', 70: 'mother/father-in-law', 71: 'natural child', 72: 'lawful spouse', 73: 'natural child', 74: 'mother/father-in-law', 75: 'mother/father-in-law', 76: 'natural child', 77: 'lawful spouse', 78: 'natural child', 79: 'natural child', 80: 'lawful spouse', 81: 'natural child', 82: 'natural child', 83: 'lawful spouse', 84: 'natural child', 85: 'natural child', 86: 'lawful spouse', 87: 'natural child', 88: 'natural child', 89: 'lawful spouse', 90: 'natural child', 91: 'natural child', 92: 'lawful spouse', 93: 'natural child', 94: 'natural child', 95: 'lawful spouse', 96: 'natural child', 97: 'natural child', 98: 'lawful spouse', 99: 'natural child'}, 'is_child': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '1', 9: '', 10: '1', 11: '', 12: '1', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '1', 31: '', 32: '1', 33: '', 34: '1', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '1', 57: '', 58: '1', 59: '', 60: '', 61: '1', 62: '', 63: '1', 64: '', 65: '', 66: '1', 67: '', 68: '1', 69: '', 70: '', 71: '1', 72: '', 73: '1', 74: '', 75: '', 76: '1', 77: '', 78: '1', 79: '1', 80: '', 81: '1', 82: '1', 83: '', 84: '1', 85: '1', 86: '', 87: '1', 88: '1', 89: '', 90: '1', 91: '1', 92: '', 93: '1', 94: '1', 95: '', 96: '1', 97: '1', 98: '', 99: '1'}, 'hh_n': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'bio_d': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'step_d': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'child_d': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'both_bio': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'hhtype': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}, 'step': {0: '', 1: '', 2: '', 3: '', 4: '', 5: '', 6: '', 7: '', 8: '', 9: '', 10: '', 11: '', 12: '', 13: '', 14: '', 15: '', 16: '', 17: '', 18: '', 19: '', 20: '', 21: '', 22: '', 23: '', 24: '', 25: '', 26: '', 27: '', 28: '', 29: '', 30: '', 31: '', 32: '', 33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 48: '', 49: '', 50: '', 51: '', 52: '', 53: '', 54: '', 55: '', 56: '', 57: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 65: '', 66: '', 67: '', 68: '', 69: '', 70: '', 71: '', 72: '', 73: '', 74: '', 75: '', 76: '', 77: '', 78: '', 79: '', 80: '', 81: '', 82: '', 83: '', 84: '', 85: '', 86: '', 87: '', 88: '', 89: '', 90: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 97: '', 98: '', 99: ''}}
答案 0 :(得分:0)
如所示,考虑将groupby().transform()
用于任何分组的行内聚合,以将新列分配给当前数据帧。由于OP的需求实际上是在计算(即从逻辑条件返回索引的长度或行数),因此请明确指定count
聚合。
因此,下面的.apply()
(隐藏循环,每次迭代都会构建一个辅助数据框架)
def hh_n(row):
# BOOLEAN INDEXING + LEN()
df = housing_df.loc[
(housing_df["pidp"] == row["pidp"]) & (housing_df["wave"] == row["wave"])
]
return str(len(df.index) + 1)
housing_df["hh_n"] = housing_df.apply(hh_n, axis=1)
可以调整为:
housing_df["hh_n"] = housing_df.groupby(['pidp', 'wave']).transform('count') + 1