Question

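I have a data frame in this format. It has 907 rows in total and 2 columns, named Audio and sentence. The Audio column contains a list of lists, as you can see. The total length of this list is 10000.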

Audio                                                      sentence
[[-0.32357552647590637], [-0.4721883237361908], .....]]    the kind of them is a relative all the little old lady is it to confide in them and head for buying them hate it consists of a vertical schrock
[[-0.32357552647590637], [-0.4721883237361908], .....]]    the kind of them is a relative all the little old lady is it to confide in them and head for buying them hate it consists of a vertical schrock

I tried to convert the list to a dataframe, but it separates every character, which is not my goal.

aa= pd.DataFrame.from_records(X_tra)

It did something like this.

0   1   2   3   4   5   6   7   8   9   ...     269990  269991  269992  269993  269994  269995  269996  269997  269998  269999
0   [   [   0   .   0   0   3   9   1   1   ...     None    None    None    None    None    None    None    None    None    None


The output shown above, with one character per column, is the actual output. The expected output is as follows.

Audio                  Audio1                    sentence
-0.32357552647590637 -0.4721883237361908 ..... the kind of them is a relative all the little old lady is it to confide in them and head for buying them hate it consists of a vertical schrock
-0.32357552647590637 -0.4721883237361908 ......the kind of them is a relative all the little old lady is it to confide in them and head for buying them hate it consists of a vertical schrock

I want to use this output to train a neural network, so my sentence column will be Y and the rest of the data frame will be X.
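
A note on the actual output in the question: one character per column (`[`, `[`, `0`, `.`, ...) is what pandas produces when each Audio cell is a string such as "[[-0.32], [-0.47]]" rather than a real Python list, for example after a round trip through a CSV file. That is an inference from the output, not something the question confirms. If it is the case, the strings need to be parsed before any flattening. A minimal sketch using the standard-library `ast.literal_eval`, with made-up sample values:

```python
import ast

import pandas as pd

# Hypothetical sample: Audio cells stored as text, not as Python lists
df = pd.DataFrame({
    "Audio": ["[[-0.32], [-0.47]]", "[[0.11], [0.25]]"],
    "sentence": ["first sentence", "second sentence"],
})

# Turn each string into a real nested list; cells that already are lists are left alone
df["Audio"] = df["Audio"].apply(
    lambda cell: ast.literal_eval(cell) if isinstance(cell, str) else cell
)

print(type(df["Audio"].iloc[0]))
```

Once the cells are real lists, the approaches in the answers below can be applied unchanged.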

Answer 1

How about this solution?

import pandas as pd
import numpy as np

data = pd.DataFrame({'Audio':[[[-0.32357552647590637],[-0.4721883237361908]], [[-0.32357552647590637], [-0.4721883237361908]]],
        'sentence':['the kind of them is a relative all the little old', 'More text']})

audios = data.Audio.apply(lambda x: np.ravel(np.array(x))).apply(pd.Series)
audios.columns = ['Audio'+ str(i) for i in range(len(audios.columns))]

audios['sentence'] = data['sentence']

The example data is:


                  Audio                                    sentence
0   [[-0.32357552647590637], [-0.4721883237361908]] the kind of them is a relative all the little old
1   [[-0.32357552647590637], [-0.4721883237361908]] More text

The result (in the DataFrame audios) is:

    Audio0       Audio1      sentence
0   -0.323576   -0.472188   the kind of them is a relative all the little old
1   -0.323576   -0.472188   More text
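
With 907 rows of 10000 values each, `apply(pd.Series)` may become slow, because it builds one Series per row. A possible alternative, sketched here under the assumption that every row holds the same number of samples, is to convert the whole column to a NumPy array and reshape it in one step. The sample values are made up for illustration:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "Audio": [[[-0.32], [-0.47]], [[0.11], [0.25]]],
    "sentence": ["first sentence", "second sentence"],
})

# Assumption: every row holds the same number of samples, so the nested
# lists form a regular (rows, samples, 1) array that can be reshaped at once
values = np.array(data["Audio"].tolist(), dtype=float).reshape(len(data), -1)

audios = pd.DataFrame(
    values,
    columns=[f"Audio{i}" for i in range(values.shape[1])],
    index=data.index,
)
audios["sentence"] = data["sentence"]
```

If the rows differ in length, the conversion to a regular array fails, and a per-row approach is the safer choice.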

Answer 2

As a first step, I would generate the list of column names:

N = 10000
colNames = ["Audio" + str(i) for i in range(N)]

I would then use the following to create a second dataframe df2 from your previous dataframe df:

import pandas as pd

# Build the wide frame in one call; each row's list supplies one cell per column
df2 = pd.DataFrame(df["Audio"].values.tolist(), index=df.index, columns=colNames)

This should be very close to what you want, except that each value is still inside a list. So the result should look like this:

>>> df2
     Audio0                    Audio1                   Audio2
0    [-0.32357552647590637]    [-0.4721883237361908]    ...
1    [-0.32357552647590637]    [-0.4721883237361908]    ...
2    ...
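
To get plain numbers, the remaining one-element lists can be unwrapped cell by cell. A minimal sketch, using a small hypothetical stand-in for df2 with made-up values:

```python
import pandas as pd

# Hypothetical stand-in for df2: every cell is a one-element list
df2 = pd.DataFrame({
    "Audio0": [[-0.32], [0.11]],
    "Audio1": [[-0.47], [0.25]],
})

# Replace each one-element list with the number it contains
df2 = df2.apply(lambda col: col.map(lambda cell: cell[0]))
```

This assumes every cell really is a non-empty list; a missing value would raise an error and would need separate handling.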

Hope this helps.

Answer 3

What you can do is flatten each entry of "df.Audio" and construct a new DataFrame with the correct column names.

# Flatten list in each row
audio_list_flat = []
for nested_list in list(df["Audio"]):
    audio_list_flat.append([y for x in nested_list for y in x])

# Get row with max length, assuming the length of Audio could be different
max_len = max([len(x) for x in audio_list_flat])

# Construct new dataframe
flat_df = pd.DataFrame(audio_list_flat,
                       columns=[f"Audio{i}" for i in range(max_len)],
                       index=df.index)
flat_df["sentence"] = df.sentence

This way you can solve the problem with pure pandas, without adding more dependencies.
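
Since the question wants the sentence column as Y and everything else as X, the flattened frame can be split as follows. This is only a sketch: the small flat_df here is a hypothetical stand-in with made-up values, and the sentences are left as raw text, which a network would still need to have encoded.

```python
import pandas as pd

# Hypothetical stand-in for flat_df
flat_df = pd.DataFrame({
    "Audio0": [-0.32, 0.11],
    "Audio1": [-0.47, 0.25],
    "sentence": ["first sentence", "second sentence"],
})

# Everything except the sentence column becomes the numeric input matrix
X = flat_df.drop(columns="sentence").to_numpy(dtype=float)
# The sentence column is the target; it is still raw text at this point
y = flat_df["sentence"].to_numpy()

print(X.shape, y.shape)
```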

Convert a list of lists to a DataFrame
