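I have a DataFrame in the format shown below. It has 907 rows and 2 columns, named Audio and sentence. The Audio column contains a list of lists, as you can see. The total length of this list is 10000.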
Audio sentence
[[-0.32357552647590637], [-0.4721883237361908],.....],the kind of them is a relative all the little old lady is it to confide in them and head for buying them hate it consists of a vertical schrock
[[-0.32357552647590637],[-0.4721883237361908],.....]]the kind of them is a relative all the little old lady is it to confide in them and head for buying them hate it consists of a vertical schrock
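I tried to convert the list to a DataFrame, but it splits the value into individual characters, which is not what I want.
aa = pd.DataFrame.from_records(X_tra)
It produced something like this.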
0 1 2 3 4 5 6 7 8 9 ... 269990 269991 269992 269993 269994 269995 269996 269997 269998 269999
0 [ [ 0 . 0 0 3 9 1 1 ... None None None None None None None None None None
The character-per-column frame shown above is the actual output. The expected output is as follows.
Audio Audio1 sentence
-0.32357552647590637 -0.4721883237361908 ..... the kind of them is a relative all the little old lady is it to confide in them and head for buying them hate it consists of a vertical schrock
-0.32357552647590637 -0.4721883237361908 ......the kind of them is a relative all the little old lady is it to confide in them and head for buying them hate it consists of a vertical schrock
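I want to use this output to train a neural network, so the sentence column will be Y and the rest of the DataFrame will be X.
Answer 0 (score: 1)
How about this solution?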
import numpy as np
import pandas as pd

# Two sample rows, each holding a list of one-element lists
data = pd.DataFrame({'Audio': [[[-0.32357552647590637], [-0.4721883237361908]],
                               [[-0.32357552647590637], [-0.4721883237361908]]],
                     'sentence': ['the kind of them is a relative all the little old', 'More text']})
# Flatten each nested list to 1-D, then expand it into one column per value
audios = data.Audio.apply(lambda x: np.ravel(np.array(x))).apply(pd.Series)
audios.columns = ['Audio' + str(i) for i in range(len(audios.columns))]
# Re-attach the target column
audios['sentence'] = data['sentence']
The sample data is:
Audio sentence
0 [[-0.32357552647590637], [-0.4721883237361908]] the kind of them is a relative all the little old
1 [[-0.32357552647590637], [-0.4721883237361908]] More text
The result (in the DataFrame audios) is:
Audio0 Audio1 sentence
0 -0.323576 -0.472188 the kind of them is a relative all the little old
1 -0.323576 -0.472188 More text
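The one-character-per-column output in the question suggests that each Audio cell may actually be a string rather than a list, for example after loading the data from a CSV file. That is an assumption, not something the question confirms. If it holds, a minimal sketch would parse each cell with ast.literal_eval before flattening. The names raw, parsed, flat and wide are made up for illustration, and the sketch assumes every row has the same number of values.

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical sample: each Audio cell is a *string*, as it might be after a
# round trip through CSV (an assumption about the original data).
raw = pd.DataFrame({
    "Audio": ["[[-0.32357552647590637], [-0.4721883237361908]]",
              "[[-0.32357552647590637], [-0.4721883237361908]]"],
    "sentence": ["the kind of them is a relative all the little old", "More text"],
})

# Turn each string into a real nested list, then flatten it to 1-D.
parsed = raw["Audio"].apply(ast.literal_eval)
flat = np.array([np.ravel(row) for row in parsed])  # shape: (rows, values per row)

# One column per audio value, plus the untouched sentence column.
wide = pd.DataFrame(flat,
                    columns=[f"Audio{i}" for i in range(flat.shape[1])],
                    index=raw.index)
wide["sentence"] = raw["sentence"]
```

If the cells are already real lists, the literal_eval step is unnecessary and the flattening alone is enough.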
Answer 1 (score: 0)
As a first step, I would generate the list of column names:
N = 10000
colNames = ["Audio" + str(i) for i in range(N)]
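Then I would create a second DataFrame, df2, from your existing DataFrame df like this:
# Expand each row's outer list into N columns, keeping the original index
df2 = pd.DataFrame(df["Audio"].values.tolist(), index=df.index, columns=colNames)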
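This should be very close to what you want, except that each value is still wrapped in a list. The result should look similar to this: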
>>> df2
Audio0 Audio1 Audio2
0 [-0.32357552647590637] [-0.4721883237361908] ...
1 [-0.32357552647590637] [-0.4721883237361908] ...
2 ...
Hope this helps.
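Since every cell still holds a one-element list at this point, one possible follow-up is to unwrap the cells column by column. This is only a sketch under the assumption that each inner list has exactly one float. The sample frame and N = 2 are stand-ins for the real 907 x 10000 data.

```python
import pandas as pd

# Small stand-in for the real data (2 values per row instead of 10000).
df = pd.DataFrame({
    "Audio": [[[-0.32357552647590637], [-0.4721883237361908]],
              [[-0.32357552647590637], [-0.4721883237361908]]],
    "sentence": ["the kind of them is a relative all the little old", "More text"],
})

N = 2
col_names = ["Audio" + str(i) for i in range(N)]

# Each cell of df2 is a one-element list such as [-0.3235...].
df2 = pd.DataFrame(df["Audio"].tolist(), index=df.index, columns=col_names)

# Unwrap: take the single element of every cell, then store as floats.
df2 = df2.apply(lambda col: col.map(lambda cell: cell[0])).astype(float)
df2["sentence"] = df["sentence"]
```

For the full 10000 columns a NumPy-based flattening is likely faster than mapping over every cell, but the idea is the same.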
Answer 2 (score: 0)
What you can do is flatten each entry of df.Audio and construct a new DataFrame with the right column names.
# Flatten list in each row
audio_list_flat = []
for nested_list in list(df["Audio"]):
    audio_list_flat.append([y for x in nested_list for y in x])
# Get row with max length, assuming the length of Audio could be different
max_len = max([len(x) for x in audio_list_flat])
# Construct new dataframe
flat_df = pd.DataFrame(audio_list_flat,
                       columns=[f"Audio{i}" for i in range(max_len)],
                       index=df.index)
flat_df["sentence"] = df.sentence
This way you can solve the problem with pure pandas, without adding any more dependencies.
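The question mentions using the sentence column as Y and everything else as X. Assuming a flattened frame like the one built above, the split could be sketched as follows. The tiny flat_df here is a stand-in, and float32 is just a common choice for neural network inputs, not something the question specifies.

```python
import numpy as np
import pandas as pd

# Stand-in for the flattened frame (the real one would have 10000 Audio columns).
flat_df = pd.DataFrame({
    "Audio0": [-0.32357552647590637, -0.32357552647590637],
    "Audio1": [-0.4721883237361908, -0.4721883237361908],
    "sentence": ["the kind of them is a relative all the little old", "More text"],
})

# X: every column except the target, as a 2-D float array.
X = flat_df.drop(columns="sentence").to_numpy(dtype=np.float32)
# Y: the raw sentences; these still need to be encoded before training.
y = flat_df["sentence"].to_numpy()
```

How the sentences are turned into numeric targets depends on the model and is left out of this sketch.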