I wanted to save a pandas Series containing a number of pandas DataFrames. It turns out each DataFrame got saved as if I had called df.to_string() on it.
From what I have observed so far, the resulting strings have extra spacing in places, and when a DataFrame has too many columns to fit on one line, the wrapped lines are marked with a trailing backslash and carry extra spacing as well.
Here is a more representative DataFrame:
import pandas as pd

df = pd.DataFrame(
    columns=["really long name that goes on for a while",
             "another really long string", "c"] * 6,
    data=[["some really long data", 2, 3] * 6,
          [4, 5, 6] * 6,
          [7, 8, 9] * 6],
)
The string I would like to convert back into a DataFrame looks like this:
# str(df)
' really long name that goes on for a while another really long string c \\\n0 some really long data 2 3 \n1 4 5 6 \n2 7 8 9 \n\n really long name that goes on for a while another really long string c \\\n0 some really long data 2 3 \n1 4 5 6 \n2 7 8 9 \n\n really long name that goes on for a while another really long string c \\\n0 some really long data 2 3 \n1 4 5 6 \n2 7 8 9 \n\n really long name that goes on for a while another really long string c \\\n0 some really long data 2 3 \n1 4 5 6 \n2 7 8 9 \n\n really long name that goes on for a while another really long string c \\\n0 some really long data 2 3 \n1 4 5 6 \n2 7 8 9 \n\n really long name that goes on for a while another really long string c \n0 some really long data 2 3 \n1 4 5 6 \n2 7 8 9 '
How can I turn such a string back into a DataFrame?
Thanks
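For reference, the wrapping behavior described above can be reproduced deliberately by pinning the display options (a minimal sketch; the specific option values are assumptions that mimic the default 80-column terminal repr):

```python
import pandas as pd

df = pd.DataFrame(
    columns=["really long name that goes on for a while",
             "another really long string", "c"] * 6,
    data=[["some really long data", 2, 3] * 6,
          [4, 5, 6] * 6,
          [7, 8, 9] * 6],
)

# With a fixed display width, str(df) splits the wide frame into several
# column blocks and marks each continued row with a trailing backslash.
with pd.option_context("display.width", 80,
                       "display.expand_frame_repr", True,
                       "display.max_columns", 20):
    text = str(df)

assert "\\" in text  # the continuation markers are present
```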
Answer 0 (score: 2)
For your edited question, my best answer is to use to_csv instead of to_string. to_string simply does not support this use case the way to_csv does (and I don't see how you could avoid a bunch of conversions through StringIO instances...).
import pandas as pd
from io import StringIO

df = pd.DataFrame(
    columns=["really long name that goes on for a while",
             "another really long string", "c"] * 6,
    data=[["some really long data", 2, 3] * 6,
          [4, 5, 6] * 6,
          [7, 8, 9] * 6],
)

s = StringIO()
df.to_csv(s)
# To get the string, use `s.getvalue()`
# Warning: `s` is already at end-of-stream after the write,
# so wrap its value in a fresh StringIO before reading
pd.read_csv(StringIO(s.getvalue()))
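One wrinkle worth spelling out: to_csv writes the index as an unnamed first column, so passing index_col=0 on the way back makes the round trip lossless (a minimal sketch with a made-up frame):

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

s = StringIO()
df.to_csv(s)  # the index becomes the (unnamed) first CSV column

# index_col=0 turns that first column back into the index
restored = pd.read_csv(StringIO(s.getvalue()), index_col=0)

assert restored.equals(df)  # the round trip preserves values and dtypes
```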
I hope this update helps; I'll leave my original answer below for continuity.
In a rather cool twist, this answer also helps with reading the commonly pasted format of DataFrame output on Stack Overflow. Consider that we can read a df from a string like this:
data = """ 0 20 30 40 50
1 5 NaN 3 5 NaN
2 2 3 4 NaN 4
3 6 1 3 1 NaN"""
import pandas as pd
from io import StringIO
data = StringIO(data)
df = pd.read_csv(data, sep=r"\s+")
This produces the following df:
   0   20  30   40   50
1  5  NaN   3  5.0  NaN
2  2  3.0   4  NaN  4.0
3  6  1.0   3  1.0  NaN
You can read the output of to_string the same way:
pd.read_csv(StringIO(df.to_string()), sep=r"\s+")
and the resulting df is identical.
Answer 1 (score: 1)
Try this. I've updated it to include logic that works out the number of rows automatically: basically, I extract the largest value of the original DataFrame's index (the row numbers) from inside the big string.
To start, we convert the DataFrame to a string using the example you provided:
import pandas as pd

df = pd.DataFrame(
    columns=["really long name that goes on for a while",
             "another really long string", "c"] * 6,
    data=[["some really long data", 2, 3] * 6,
          [4, 5, 6] * 6,
          [7, 8, 9] * 6],
)
string = str(df)
import re
import numpy as np

lst = re.split('\n', string)
num_rows = int(lst[lst.index('') - 1][0]) + 1
col_names = []
lst = [i for i in lst if i != '']
for i in range(0, len(lst), num_rows + 1):
    col_names.append(lst[i])
new_col_names = []
for i in col_names:
    new_col_names.append(re.split(' ', i))
final_col_names = []
for i in new_col_names:
    final_col_names += i
final_col_names = [i for i in final_col_names if i != '']
final_col_names = [i for i in final_col_names if i != '\\']
for i in col_names:
    lst.remove(i)
new_lst = [re.split(r'\s{2,}', i) for i in lst]
new_lst = [i[1:-1] for i in new_lst]
newer_lst = []
for i in range(num_rows):
    sub_lst = []
    for j in range(i, len(final_col_names), num_rows):
        sub_lst += new_lst[j]
    newer_lst.append(sub_lst)
reshaped = np.reshape(newer_lst, (num_rows, len(final_col_names)))
fixed_df = pd.DataFrame(data=reshaped, columns=final_col_names)
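The key move in the code above is splitting each data row on runs of two or more spaces, since to_string pads cells with at least two spaces; a minimal illustration (the sample row is made up):

```python
import re

# A made-up data row in the shape produced by str(df) / to_string
row = "0  some really long data  2  3  "

# Cells are separated by runs of >= 2 spaces; single spaces inside a
# cell ("some really long data") survive the split. The leading index
# and a trailing empty string also appear in the result.
cells = re.split(r"\s{2,}", row)

assert cells == ["0", "some really long data", "2", "3", ""]
```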
My code runs a few loops, so if your original DataFrame has many thousands of rows this approach may take a while.
Answer 2 (score: 0)
I'm not sure how helpful this will be to anyone else, but I wrote a function (and a helper) to try to recover the data from the DataFrames I had mistakenly stored nested inside a pd.Series.
Here are the functions:
import re
import pandas as pd
from io import StringIO

def insertNan(substring):
    rows = substring.split('\n')
    headers = re.sub(r" \s+", " ", rows[0].replace("\\", "").strip()).split(" ")
    # The [2] below is a placeholder for the index. (Look in str(df), may appear like "\\\n1")
    # Notice that if your tables get past 100 rows, 2 needs to be 3, or be determined otherwise.
    boundaries = [0] + [2] + [rows[0].find(header) + len(header) for header in headers]
    values = []
    for i, row in enumerate(rows):
        values.append(row)
        # First row is just column headers. If no headers then don't use these functions
        if i == 0:
            continue
        for j, bound in enumerate(boundaries[:-1]):
            value = row[bound:boundaries[j + 1]].strip()
            if not value:
                newstring = list(values[i])
                newstring[boundaries[j + 1] - 3:boundaries[j + 1]] = "NaN"
                values[i] = ''.join(newstring)
            if " " in value:
                start = values[i].find(value)
                newvalue = re.sub(r" \s+", " ", value)
                values[i] = values[i][:start] + newvalue + values[i][start + len(value):]
    return '\n'.join(values)
def from_string(string):
    string = string.replace("\\", "")
    chunks = [insertNan(i).strip() for i in string.split("\n\n")]
    frames = [pd.read_csv(StringIO(chunk), sep=r" \s+", engine='python')
              for chunk in chunks]
    return pd.concat(frames, axis=1)
# Read file and loop through series. These two lines might have to be modified.
corrupted_results = pd.read_excel(fileio, squeeze=True)
results = [from_string(result) for result in corrupted_results.values]
This got me almost all the way back to the pd.Series (of results) I started with, except that some overly long text entries had been truncated with "...".
All in all, saving data as DataFrames nested inside a pd.Series is probably not a good idea. I have since decided to save a single concatenated DataFrame instead, built by concatenating the frames with an added "name" column, which lets me split them apart later with .groupby if needed.
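That concat-plus-groupby scheme might look like this (a sketch; the frame contents and the names "run_a"/"run_b" are made up):

```python
import pandas as pd

# Hypothetical frames that would otherwise be nested inside a pd.Series
frames = {"run_a": pd.DataFrame({"x": [1, 2]}),
          "run_b": pd.DataFrame({"x": [3, 4]})}

# Save one flat frame with an added "name" column instead of nesting
combined = pd.concat(
    [f.assign(name=key) for key, f in frames.items()],
    ignore_index=True,
)

# Later, recover the individual frames with groupby
parts = {key: group.drop(columns="name").reset_index(drop=True)
         for key, group in combined.groupby("name")}

assert parts["run_a"].equals(frames["run_a"])
assert parts["run_b"].equals(frames["run_b"])
```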
Note that if the DataFrames saved in the pd.Series have no headers, the functions I provided will probably not work without modification.
Special thanks to ColdSpeed, Charles Landau, and JamesD for their time, help, and kindness!