I have a pandas DataFrame, loaded as follows:
data=pd.read_csv("training-set-org.csv",sep=',', header = None)
Printing it gives the following output:
print(data.head())
0 1 2 3 4 5 6 7 \
0 22.896448 33.1366 18.738063 26.846212 6 4242 50257 131962
1 22.896448 33.1366 18.738063 26.846212 6 4242 50257 68719
2 22.896448 33.1366 18.738063 26.846212 6 4242 50257 171647
3 22.896448 33.1366 18.738063 26.846212 6 4242 50257 246620
4 22.896448 33.1366 18.738063 26.846212 6 4242 50257 64072
Now I drop column 4:
data.drop(data.columns[4],axis=1,inplace=True)
As far as I understand, data.columns[4] refers to the column labeled 4, which is correct.
Now, when I print the DataFrame, I get:
printing data: 0 1 2 3 5 6 7
0 22.896448 33.1366 18.738063 26.846212 4242 50257 131962
1 22.896448 33.1366 18.738063 26.846212 4242 50257 68719
2 22.896448 33.1366 18.738063 26.846212 4242 50257 171647
3 22.896448 33.1366 18.738063 26.846212 4242 50257 246620
4 22.896448 33.1366 18.738063 26.846212 4242 50257 64072
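As you can see, label 4 is missing.
How can I relabel the DataFrame so that each column label shifts left, i.e. the columns are labeled 0, 1, 2, 3, 4, ..., 6 instead of running up to 7? I want to work with the reduced DataFrame and process its columns in a loop using data.iloc[:, i]. How can I do that? I am still at an early stage with Python, so any help is appreciated.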
Answer 0 (score: 1)
You can assign the default columns created by RangeIndex:
data.columns = pd.RangeIndex(len(data.columns))
print (data)
0 1 2 3 4 5 6
0 22.896448 33.1366 18.738063 26.846212 4242 50257 131962
1 22.896448 33.1366 18.738063 26.846212 4242 50257 68719
2 22.896448 33.1366 18.738063 26.846212 4242 50257 171647
3 22.896448 33.1366 18.738063 26.846212 4242 50257 246620
4 22.896448 33.1366 18.738063 26.846212 4242 50257 64072
Or use range:
data.columns = range(len(data.columns))
print (data)
0 1 2 3 4 5 6
0 22.896448 33.1366 18.738063 26.846212 4242 50257 131962
1 22.896448 33.1366 18.738063 26.846212 4242 50257 68719
2 22.896448 33.1366 18.738063 26.846212 4242 50257 171647
3 22.896448 33.1366 18.738063 26.846212 4242 50257 246620
4 22.896448 33.1366 18.738063 26.846212 4242 50257 64072
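For a self-contained check, here is a minimal sketch of the same idea on a small made-up DataFrame. The values and the 3x8 shape are assumptions for illustration only, not the training set from the question.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: 3 rows, 8 columns labeled 0..7.
data = pd.DataFrame([[r * 10 + c for c in range(8)] for r in range(3)])

# Drop the column whose label sits at position 4 (here that label is also 4).
data = data.drop(columns=data.columns[4])
labels_after_drop = list(data.columns)

# Reassign consecutive integer labels 0..n-1.
data.columns = pd.RangeIndex(len(data.columns))
labels_after_reset = list(data.columns)

print(labels_after_drop)
print(labels_after_reset)
```

After the drop the labels skip 4; after the reassignment they are consecutive again, and the underlying values are unchanged.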
Timings, just for fun :)
In [126]: %timeit data.columns = range(len(data.columns))
The slowest run took 4.70 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 23.4 µs per loop
In [127]: %timeit data.columns = pd.RangeIndex(len(data.columns))
The slowest run took 4.61 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 14.4 µs per loop
In [128]: %timeit data.columns = np.arange(len(data.columns))
The slowest run took 8.52 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 45.2 µs per loop
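The %timeit magic only exists inside IPython. To try a similar comparison from a plain script, one option is the standard timeit module, as in this rough sketch. The helper function names and the 5x7 frame of zeros are assumptions for illustration, and the absolute numbers will vary with machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical small frame; only the number of columns matters here.
data = pd.DataFrame(np.zeros((5, 7)))


def with_range():
    data.columns = range(len(data.columns))


def with_rangeindex():
    data.columns = pd.RangeIndex(len(data.columns))


def with_arange():
    data.columns = np.arange(len(data.columns))


# Best of 3 repeats, 1000 calls each, reported as seconds per call.
results = {
    f.__name__: min(timeit.repeat(f, number=1000, repeat=3)) / 1000
    for f in (with_range, with_rangeindex, with_arange)
}

for name, seconds in results.items():
    print(f"{name}: {seconds * 1e6:.1f} us per loop")
```

At this scale the differences are a few microseconds, so any of the three assignments is fine in practice.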
Answer 1 (score: 0)
If your column labels are just integers, you can use the following code:
import numpy as np
data.columns = np.arange(len(data.columns))
Answer 2 (score: 0)
Very simple, try this:
data.columns = range(7)
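Hard-coding 7 only works while the frame has exactly seven columns. A slightly more general sketch derives the count from the frame itself. It also shows that data.iloc[:, i] selects by position, so the loop from the question behaves the same with or without relabeling. The sample frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose labels have a gap, as after dropping a column.
data = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=[0, 1, 3])

# iloc is positional, so the gap in the labels does not matter here.
sums_before = [int(data.iloc[:, i].sum()) for i in range(data.shape[1])]

# Relabel using the actual number of columns instead of a literal.
data.columns = range(data.shape[1])
sums_after = [int(data.iloc[:, i].sum()) for i in range(data.shape[1])]

print(list(data.columns))
print(sums_before == sums_after)
```

Relabeling matters only if later code looks columns up by label, for example data[4] or data.loc[:, 4].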