Question

I have a DataFrame that looks like this:

df
    A B C
    0 2 5A5A5A5A5A5A5A5A5A5A5A5A5A5A5A5A

I want to generate the following:

A B C         Offset
0 2 5A5A5A5A  0
0 2 5A5A5A5A  1
0 2 5A5A5A5A  2
0 2 5A5A5A5A  3

This is my current solution, which is slow and does not scale when applied to millions of rows:

import numpy as np
import pandas as pd

def splitequal(my_str):
    # Split the string into consecutive 8-character chunks
    splits = [my_str[x:x+8] for x in range(0, len(my_str), 8)]
    return splits

def tondata(row):
    # Pick the chunk that matches this row's offset
    offset = row['Offset']
    return row['Splits'][offset]

d = {'A': [0],
     'B': [2],
     'C': ["5A5A5A5A5A5A5A5A5A5A5A5A5A5A5A5A"]}

df = pd.DataFrame(d, columns=['A', 'B', 'C'])

# Replicate the row 4 times (as_matrix() was removed in pandas 1.0; .values is equivalent here)
df2 = pd.DataFrame(np.repeat(df.values, 4, axis=0), columns=['A', 'B', 'C'])

# Create the offset column to create 4 substrings
df2['Offset'] = df2.reset_index()['index'] % 4

# Split the string and create an array of 4 strings
df2['Splits'] = df2['C'].apply(splitequal)

# Assign each substring in the array to the 4 different offsets
df2['C'] = df2.apply(tondata, axis=1)

del df2['Splits']
print(df2)

   A  B         C  Offset
0  0  2  5A5A5A5A       0
1  0  2  5A5A5A5A       1
2  0  2  5A5A5A5A       2
3  0  2  5A5A5A5A       3

Is there a faster way to do this?

Answer 1

You can try the following:

# Get unique index on the data frame
df = df.reset_index()

# Slice the column, concatenate the results together and rename the columns
splitted = pd.concat([
    df["C"].str.slice(i * 8, (i + 1) * 8) for i in range(4)
], axis=1)
splitted.columns = [0, 1, 2, 3]

# Unstack to get a single column with offsets as first index level
unstacked = splitted.unstack()

# Make the new index level an ordinary column
with_offset_col = unstacked.reset_index(level=0)

# Merge this together with the original frame again
result = pd.merge(df, with_offset_col, left_index=True, right_index=True)

On my machine, this code runs in 4.1 s.
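
For comparison, here is a rough sketch of a different way to get exactly the layout asked for in the question. It is not the method used in this answer: it repeats each row with `Index.repeat` and then slices every string at its own offset. The helper name `split_rows` and its `col` and `width` parameters are made up for illustration. The sketch assumes a unique index and that every string has the same length, which is a multiple of `width`. It has not been timed on millions of rows.

```python
import numpy as np
import pandas as pd


def split_rows(df, col="C", width=8):
    """Hypothetical helper: one output row per fixed-width chunk of df[col]."""
    # Number of chunks per row (assumption: all strings have the same length
    # and that length is a multiple of `width`).
    n_chunks = len(df[col].iloc[0]) // width

    # Repeat each row n_chunks times; copies of a row stay next to each other.
    out = df.loc[df.index.repeat(n_chunks)].reset_index(drop=True)

    # Offsets cycle 0..n_chunks-1 once per original row.
    out["Offset"] = np.tile(np.arange(n_chunks), len(df))

    # Replace each repeated string with the chunk at that row's offset.
    out[col] = [
        s[o * width:(o + 1) * width] for s, o in zip(out[col], out["Offset"])
    ]
    return out


df = pd.DataFrame({"A": [0], "B": [2], "C": ["5A" * 16]})
result = split_rows(df)
print(result)
```

The result has the columns `A`, `B`, `C`, `Offset`, with `C` holding the 8-character chunk for each offset, so no merge or column cleanup is needed afterwards.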
