Pandas: aggregate slices of a column into arrays

Time: 2020-07-31 18:59:40

Tags: python pandas pandas-apply

I have a Pandas dataframe that looks like this:

                       Scaled
Date
2020-07-01 02:40:00  0.604511
2020-07-01 02:45:00  0.640577
2020-07-01 02:50:00  0.587683
2020-07-01 02:55:00  0.491515
....

I am trying to add a new column called X that should look like this, where each row holds the two values before it as an array:

                       Scaled                    X
Date
2020-07-01 02:40:00  0.604511                  nan
2020-07-01 02:45:00  0.640577                  nan
2020-07-01 02:50:00  0.587683  [0.604511 0.640577]
2020-07-01 02:55:00  0.491515  [0.640577 0.587683]
...

I am trying to do this with a for loop, but it did not work as expected, and I don't think it is the most elegant or efficient way anyway. Any suggestions on how to do this in pandas?

4 answers:

Answer 0 (score: 1)

To do this in pandas, you can use a list comprehension together with concat and shift:

window_size = 2
df['X'] = (pd.concat([df.Scaled.shift(-i) for i in range(window_size)], axis=1)
             .shift(window_size).values.tolist())

Out[213]:
     Scaled                               X
0  0.604511                      [nan, nan]
1  0.640577                      [nan, nan]
2  0.587683  [0.604511, 0.6405770000000001]
3  0.491515  [0.6405770000000001, 0.587683]
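For readers who want to run this end to end, here is a minimal sketch of the same concat/shift idea wrapped in a function. The helper name `previous_window` and the sample frame are assumptions made for illustration; they are not part of the original answer.

```python
import math

import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({'Scaled': [0.604511, 0.640577, 0.587683, 0.491515]})


def previous_window(series, window_size):
    """Return one list per row holding the `window_size` values before that row.

    Hypothetical helper: rows without a full window of predecessors
    get a list filled with NaN, as in the output shown above.
    """
    # Column i holds the series shifted up by i rows, so each row of the
    # concatenated frame is the window that starts at that row.
    windows = pd.concat([series.shift(-i) for i in range(window_size)], axis=1)
    # Shifting the whole frame down aligns each window with the row after it.
    return windows.shift(window_size).values.tolist()


df['X'] = previous_window(df['Scaled'], 2)
print(df)
```

Changing the second argument should give wider windows, at the cost of more leading rows that contain only NaN.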

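Answer 1 (score: 0)

Using a for loop is the right idea.

First, you have to initialize the new column, which you can do with .apply() on the dataframe.

Then you can iterate over the dataframe's index with .iterrows(), building the desired array as you go through the rows.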

import pandas as pd

df = pd.DataFrame(data={'Date': ['2020-07-01 02:40:00', '2020-07-01 02:45:00', '2020-07-01 02:50:00', '2020-07-01 02:55:00'], 'Scaled': [0.604511, 0.640577, 0.587683, 0.491515]})

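# Initialize the new column; object dtype lets each cell hold a list
df['New_col'] = df['Scaled'].apply(lambda x: float("nan")).astype(object)

for i, val in df.iterrows():
    if i == 0 or i == 1:
        # The first two rows have no two preceding values
        scaled_a = None
        scaled_b = None
    else:
        scaled_a = df['Scaled'][i-2]
        scaled_b = df['Scaled'][i-1]
    # .at writes directly into df and avoids chained assignment
    df.at[i, 'New_col'] = [scaled_a, scaled_b]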

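For each row, simply take the values of the dataframe's Scaled column at the two preceding indices and store them in the new column as an array. Hope this helps!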

    Date                Scaled      New_col
0   2020-07-01 02:40:00 0.604511    [None, None]
1   2020-07-01 02:45:00 0.640577    [None, None]
2   2020-07-01 02:50:00 0.587683    [0.604511, 0.640577]
3   2020-07-01 02:55:00 0.491515    [0.640577, 0.587683]

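The result should look like the above. ^^

Answer 2 (score: 0)

Updated; same output. This is a pandas implementation. It uses numpy to build the lists that become a pandas column of df, which is very efficient.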

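import datetime as dt
import random
from decimal import Decimal

import numpy as np
import pandas as pd

# Sample data: one day of 15-minute timestamps with random values
d = list(pd.date_range(dt.datetime(2020,7,1), dt.datetime(2020,7,2), freq="15min"))
df = pd.DataFrame({"Date":d, 
      "Scaled":[round(Decimal(random.uniform(0, 1)),6) for x in d]})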


# generate two new arrays that are shifted version of *scaled*
a1 = np.roll(df["Scaled"],1)
a1[0:2] = None
a2 = np.roll(df["Scaled"],2)
a2[0:2] = None
# combine them into a list and put back into df
df['X'] = np.vstack((a2, a1)).T.tolist()

print(df[:10].to_string(index=False))

Output

               Date    Scaled                     X
2020-07-01 00:00:00  0.396534          [None, None]
2020-07-01 00:15:00  0.890777          [None, None]
2020-07-01 00:30:00  0.241534  [0.396534, 0.890777]
2020-07-01 00:45:00  0.800615  [0.890777, 0.241534]
2020-07-01 01:00:00  0.161382  [0.241534, 0.800615]
2020-07-01 01:15:00  0.727410  [0.800615, 0.161382]
2020-07-01 01:30:00  0.146833  [0.161382, 0.727410]
2020-07-01 01:45:00  0.925441  [0.727410, 0.146833]
2020-07-01 02:00:00  0.770211  [0.146833, 0.925441]
2020-07-01 02:15:00  0.310082  [0.925441, 0.770211]
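Because the data above is random, the result is hard to check by hand. The following is a small deterministic sketch of the same np.roll idea on the four values from the question. It assumes plain float data and uses NaN instead of None as the filler, which differs slightly from the answer's Decimal-based setup.

```python
import numpy as np
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({"Scaled": [0.604511, 0.640577, 0.587683, 0.491515]})

scaled = df["Scaled"].to_numpy(dtype=float)

# np.roll wraps around: the last values reappear at the start of the array,
# so the first two positions of each rolled copy are overwritten with NaN.
prev1 = np.roll(scaled, 1)  # prev1[i] is scaled[i - 1]
prev2 = np.roll(scaled, 2)  # prev2[i] is scaled[i - 2]
prev1[:2] = np.nan
prev2[:2] = np.nan

# Pair the two rolled copies row by row and store each pair as a list
df["X"] = np.column_stack((prev2, prev1)).tolist()
print(df)
```

The overwrite step matters: without it, the first rows would silently contain values taken from the end of the column.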

Answer 3 (score: 0)

Here is a version without a for loop. First, create the dataframe:

from io import StringIO

import numpy as np
import pandas as pd

data = '''Date  Scaled
2020-07-01 02:40:00  0.604511
2020-07-01 02:45:00  0.640577
2020-07-01 02:50:00  0.587683
2020-07-01 02:55:00  0.491515
'''
# Columns are separated by two or more whitespace characters
df = pd.read_csv(StringIO(data), sep=r'\s\s+', engine='python')

Next, use shift() to get the previous values; a lambda function then builds a 2-element list or yields a single NaN:

f = lambda a, b: np.nan if np.isnan(a) or np.isnan(b) else [a, b]

window_size = 2

t = (pd.concat([df['Scaled'].shift(window_size).rename('a'), 
                df['Scaled'].shift(window_size - 1).rename('b')], axis=1
          )
       .apply(lambda x: f(x['a'].round(6), x['b'].round(6)), axis=1))

print(t)

0                     NaN
1                     NaN
2    [0.604511, 0.640577]
3    [0.640577, 0.587683]
dtype: object
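The snippet above builds exactly two shifted columns, so it only covers a window of two. As a rough sketch of how the same output could be produced for any window size, the hypothetical helper below (`preceding_lists` is an assumed name, not from the answer) slices the underlying array directly. It does use a comprehension internally, so it trades the shift-based approach for simplicity.

```python
import numpy as np
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({'Scaled': [0.604511, 0.640577, 0.587683, 0.491515]})


def preceding_lists(series, window_size):
    """Return a Series whose rows hold the `window_size` preceding values.

    Hypothetical helper: rows without enough predecessors get a single NaN,
    matching the output format shown above.
    """
    values = series.to_numpy()
    cells = [
        values[i - window_size:i].tolist() if i >= window_size else np.nan
        for i in range(len(values))
    ]
    # object dtype is needed because the cells mix floats and lists
    return pd.Series(cells, index=series.index, dtype=object)


t = preceding_lists(df['Scaled'], 2)
print(t)
```

Assigning the result with `df['X'] = t` would attach it as a column, since the returned Series keeps the original index.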