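Add rows based on the numerical value of a column

Date: 2017-06-28 17:34:21

Tags: python pandas

I may be going about this the wrong way, but I plan to analyze my data and need one entry per application.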

My data frame looks like this:

ID   Job Title  Number Applied  Hired  Feature(Math)
 1  Accountant               3      2              1
 2   Marketing               1      1              0
 3     Finance               1      1              1

I need it to look like this (1 = yes, 0 = no):

ID   Job Title  Number Applied  Hired  Feature(Math)
 1  Accountant               1      0              1
 2  Accountant               1      1              1
 3  Accountant               1      1              1
 4   Marketing               1      1              0
 5     Finance               1      1              1

I need to add a row for each applicant. Number Applied should always be 1. Once that is done, the Number Applied column can be dropped.

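I have left out the other features. The point of the analysis is to apply a machine learning algorithm that predicts whether a person will get a job based on their skill set. My current data frame does not work because, when I convert Hired to yes or no, it counts only 2 people with math skills instead of 3.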
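A minimal sketch of the miscount described above, assuming the aggregated frame is held in a variable named `df` and built by hand from the sample table. The variable names `rows_with_math` and `people_with_math` are illustrative only.

```python
import pandas as pd

# The aggregated frame from the question: one row per job, not per applicant.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Job Title": ["Accountant", "Marketing", "Finance"],
    "Number Applied": [3, 1, 1],
    "Hired": [2, 1, 1],
    "Feature(Math)": [1, 0, 1],
})

# Collapsing Hired to yes/no treats each row as a single person,
# so only the two rows with the math feature are counted.
hired_flag = (df["Hired"] > 0).astype(int)
rows_with_math = int((hired_flag * df["Feature(Math)"]).sum())

# Weighting by the hire count gives the number of hired people with the skill.
people_with_math = int((df["Hired"] * df["Feature(Math)"]).sum())

print(rows_with_math, people_with_math)  # prints: 2 3
```

The gap between the two numbers is what a one-row-per-applicant layout removes.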

2 Answers:

Answer 0: (score: 1)

This is a method I have used before to "unroll" a set of aggregated samples.

import pandas as pd

def iterdicts(df):
    """
    Utility to iterate over rows of a data frame as dictionaries.
    """
    col = df.columns
    for row in df.itertuples(name=None, index=False):
        yield dict(zip(col, row))

def deaggregate(dicts, *columns):
    """
    Deaggregate an iterable of dictionaries `dicts` where the numbers in `columns`
    are assumed to be aggregated counts.
    """
    for row in dicts:
        # emit one output row per unit of the largest count in this row
        for i in range(max(row[c] for c in columns)):
            d = dict(row)

            # replace each count by a 0/1 indicator
            d.update({c: int(i < row[c]) for c in columns})
            yield d

def unroll(df, *columns):
    return pd.DataFrame(deaggregate(iterdicts(df), *columns))

Then you can do

unroll(df, 'Number Applied', 'Hired')
   ID   Job Title  Number Applied  Hired  Feature(Math)
0   1  Accountant               1      1              1
1   1  Accountant               1      1              1
2   1  Accountant               1      0              1
3   2   Marketing               1      1              0
4   3     Finance               1      1              1

Answer 1: (score: 1)

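# repeat each row once per applicant
d1 = df.loc[df.index.repeat(df['Number Applied'])]

# within each job title, flag the last `Hired` copies as hired (1), the rest as 0
hired = (
    d1.groupby('Job Title').cumcount() >=
        d1['Number Applied'] - d1['Hired']
).astype(int)

# returns a new frame; d1 itself is left unchanged
d1.assign(**{'Number Applied': 1, 'Hired': hired})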

   ID   Job Title  Number Applied  Hired  Feature(Math)
0   1  Accountant               1      0              1
0   1  Accountant               1      1              1
0   1  Accountant               1      1              1
1   2   Marketing               1      1              0
2   3     Finance               1      1              1
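
A self-contained sketch that combines the repeat-and-flag idea above with the cleanup the question asks for (dropping Number Applied and renumbering ID). The function name `expand_applicants` and its parameters are illustrative assumptions, not part of either answer. It groups by the original row index rather than by Job Title, so two rows that share a job title would still be expanded separately.

```python
import pandas as pd

def expand_applicants(df, applied="Number Applied", hired="Hired"):
    """Hypothetical helper: one row per applicant, `hired` as a 0/1 flag."""
    # Repeat each row once per applicant; the index keeps the original row label.
    out = df.loc[df.index.repeat(df[applied])]
    # Position of each copy within its original row: 0, 1, 2, ...
    pos = out.groupby(level=0).cumcount().to_numpy()
    out = out.reset_index(drop=True)
    # The last `hired` copies of each original row are flagged as hired.
    not_hired = (out[applied] - out[hired]).to_numpy()
    out[hired] = (pos >= not_hired).astype(int)
    # Every row now stands for one applicant, so the count column is redundant.
    return out.drop(columns=applied)

# Sample frame copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Job Title": ["Accountant", "Marketing", "Finance"],
    "Number Applied": [3, 1, 1],
    "Hired": [2, 1, 1],
    "Feature(Math)": [1, 0, 1],
})

result = expand_applicants(df)
# Renumber ID so each applicant gets a distinct value, as in the desired table.
result["ID"] = range(1, len(result) + 1)
print(result)
```

Which copies within a job are marked as hired is arbitrary here, as in the answers above; only the per-job totals carry information.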