Question

I have a dataset with a column whose entries look like this:

'< 1 year'
'10+ years'
'6 years'
etc

I need to convert it to integer format, i.e. '< 1 year' -> 0, '10+ years' -> 10, entries like '6 years' -> 6, and so on. There are 500,000 entries. I wrote the following script to clean it:

temp = data.X11
for i in range(len(temp)):
    if ~is_nan(temp[i]):
        if isinstance(temp[i], six.string_types):
            b= temp[i].split(" ")
            if len(b) == 3 and (b[0])=='<':
                temp[i] = 0
            elif len(b) == 2:
                if b[0] == '10+':
                    temp[i] = 10
                else:
                    temp[i] = int(b[0])
        else:
            if isinstance(temp[i], float):
                temp[i] = math.floor(temp[i])
            if isinstance(temp[i], int):
                if temp[i] >= 10:
                    temp[i] = 10
                elif temp[i] < 1 and temp[i] >= 0:
                    temp[i] = 0
                elif temp[i] < 0:
                    temp[i] = -10
                else:
                    pass


    else:
        temp[i] = -10

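It works. The drawback is that it is extremely slow (it takes hours to finish). My question is how to improve the performance of this code.

Any suggestions or help with code snippets would be greatly appreciated.

Thanks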

Answer 1

With pandas you can create a dictionary and then use it to map the column of your data frame:

dico = {'< 1 year': 0, '10+ years': 10, '6 years': 6}  # '< 1 year' maps to 0, as the question asks
df['New_var'] = df.var1.map(dico)

It only takes a few seconds.
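The dictionary does not have to be typed out by hand. As a minimal sketch, assuming the column holds only labels of the three shapes shown in the question plus missing values, it can be built from the distinct labels, so the parsing runs once per distinct label rather than once per row. The helper `parse_label` and the sample Series `s` are hypothetical names used for illustration; `-10` is the sentinel the question uses for missing entries.

```python
import pandas as pd

def parse_label(label):
    """Turn one label such as '10+ years' into an integer year count."""
    first = label.split()[0]
    if first == '<':                        # '< 1 year' means less than one year
        return 0
    return min(int(first.rstrip('+')), 10)  # '10+' -> 10, '6' -> 6

# Hypothetical sample standing in for the asker's column data.X11.
s = pd.Series(['< 1 year', '10+ years', '6 years', None, '6 years'])

# Parse each distinct label once, then map the whole column in one call.
mapping = {label: parse_label(label) for label in s.dropna().unique()}

# Labels missing from the dictionary (here: None) become NaN, then -10.
cleaned = s.map(mapping).fillna(-10).astype(int)
```

With only a handful of distinct labels in 500,000 rows, nearly all of the work is the single `map` call.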

Answer 2

I think the culprit is this line:

math.floor(temp[i])

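In Python 2 it returns a float, which uses more bits than a standard integer. Converting the result of that operation to an integer may improve performance.

Another solution is to upgrade to Python 3.x.x, since in those versions floor and ceil both return integers.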
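A small sketch of the conversion described above, which behaves the same on Python 2 and Python 3. The name `floor_to_int` is made up for illustration, and whether the conversion makes a measurable difference to the running time is not something this snippet shows.

```python
import math

def floor_to_int(value):
    """Floor a number and always hand back a built-in int."""
    # On Python 3 math.floor already returns an int for a float argument;
    # wrapping it in int() also covers Python 2, where it returns a float.
    return int(math.floor(value))

years = floor_to_int(6.7)
```

A NaN has to be filtered out before this call, because it cannot be converted to an integer.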

Answer 3

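I'm not sure how much you can do here. You can try to avoid the temp[i] lookups by iterating over the values directly. You can also append the new values to the end of another list (fast) instead of modifying values in place (not as fast).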

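new_temp = list()
for temp_i in data.X11:
    # "not" instead of "~": on a plain Python bool, ~ gives -1 or -2, both truthy
    if not is_nan(temp_i):
        if isinstance(temp_i, six.string_types):
            b = temp_i.split(" ")
            if len(b) == 3 and b[0] == '<':
                new_temp.append(0)
            elif len(b) == 2 and b[0] == '10+':
                new_temp.append(10)
            elif len(b) == 2:
                new_temp.append(int(b[0]))
            else:
                # unrecognised string: keep it so the list stays aligned with the column
                new_temp.append(temp_i)
        elif isinstance(temp_i, float):
            new_temp.append(math.floor(temp_i))
        elif isinstance(temp_i, int):
            if temp_i >= 10:
                new_temp.append(10)
            elif temp_i < 0:
                new_temp.append(-10)
            else:
                # 0 through 9 are already valid
                new_temp.append(temp_i)
        else:
            # any other type is passed through unchanged
            new_temp.append(temp_i)
    else:
        new_temp.append(-10)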

string.split may be slow.
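One possible way to do less work per string is `str.partition`, which stops at the first space instead of splitting the whole label. This is only a sketch: the function name `years_from_label` and the sample list are made up for illustration, it assumes every string has one of the three shapes from the question, and whether it actually beats `split` on this data is an assumption worth checking with `timeit`.

```python
def years_from_label(label):
    """Map '< 1 year' -> 0, '10+ years' -> 10, '6 years' -> 6."""
    head, _, _ = label.partition(' ')  # cut at the first space only
    if head == '<':
        return 0
    if head == '10+':
        return 10
    return int(head)

# Hypothetical sample standing in for the string entries of data.X11.
labels = ['< 1 year', '10+ years', '6 years']
new_temp = [years_from_label(label) for label in labels]
```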

If possible, you could also try running the code with pypy, or rewriting it so that it is compatible with cython.

How to improve the performance of the following code in Python

3 answers: