I wanted to save a pandas Series containing a number of pandas DataFrames. It turns out each DataFrame got saved as if I had called df.to_string() on it.
From what I have observed so far, the resulting strings have extra spacing in places, and when a DataFrame has too many columns to fit on one line, the wrapped lines are marked with a trailing backslash and carry extra spacing as well.
Here is a more representative DataFrame:
import pandas as pd

df = pd.DataFrame(
    columns=["really long name that goes on for a while",
             "another really long string", "c"] * 6,
    data=[["some really long data", 2, 3] * 6,
          [4, 5, 6] * 6,
          [7, 8, 9] * 6],
)
The string I would like to convert back into a DataFrame looks like this:
# str(df)
' really long name that goes on for a while another really long string c \\\n0 some really long data 2 3 \n1 4 5 6 \n2 7 8 9 \n\n really long name that goes on for a while another really long string c \\\n0 some really long data 2 3 \n1 4 5 6 \n2 7 8 9 \n\n really long name that goes on for a while another really long string c \\\n0 some really long data 2 3 \n1 4 5 6 \n2 7 8 9 \n\n really long name that goes on for a while another really long string c \\\n0 some really long data 2 3 \n1 4 5 6 \n2 7 8 9 \n\n really long name that goes on for a while another really long string c \\\n0 some really long data 2 3 \n1 4 5 6 \n2 7 8 9 \n\n really long name that goes on for a while another really long string c \n0 some really long data 2 3 \n1 4 5 6 \n2 7 8 9 '
How can I turn such a string back into a DataFrame?
Thanks
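For reference, the wrapping behavior described above can be reproduced deliberately by pinning the display options (a minimal sketch; the specific option values are assumptions that mimic the default 80-column terminal repr):

```python
import pandas as pd

df = pd.DataFrame(
    columns=["really long name that goes on for a while",
             "another really long string", "c"] * 6,
    data=[["some really long data", 2, 3] * 6,
          [4, 5, 6] * 6,
          [7, 8, 9] * 6],
)

# With a fixed display width, str(df) splits the wide frame into several
# column blocks and marks each continued row with a trailing backslash.
with pd.option_context("display.width", 80,
                       "display.expand_frame_repr", True,
                       "display.max_columns", 20):
    text = str(df)

assert "\\" in text  # the continuation markers are present
```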
Answer 0 (score: 2)
For your edited question, my best answer is to use to_csv instead of to_string. to_string simply does not support this use case the way to_csv does (and I don't see how you could avoid a bunch of conversions through StringIO instances...).
import pandas as pd
from io import StringIO

df = pd.DataFrame(
    columns=["really long name that goes on for a while",
             "another really long string", "c"] * 6,
    data=[["some really long data", 2, 3] * 6,
          [4, 5, 6] * 6,
          [7, 8, 9] * 6],
)

s = StringIO()
df.to_csv(s)
# To get the string, use `s.getvalue()`
# Warning: `s` is already at end-of-stream after the write,
# so wrap its value in a fresh StringIO before reading
pd.read_csv(StringIO(s.getvalue()))
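One wrinkle worth spelling out: to_csv writes the index as an unnamed first column, so passing index_col=0 on the way back makes the round trip lossless (a minimal sketch with a made-up frame):

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

s = StringIO()
df.to_csv(s)  # the index becomes the (unnamed) first CSV column

# index_col=0 turns that first column back into the index
restored = pd.read_csv(StringIO(s.getvalue()), index_col=0)

assert restored.equals(df)  # the round trip preserves values and dtypes
```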
I hope this update helps; I'll leave my original answer below for continuity.
In a rather cool twist, this answer also helps with reading the commonly pasted format of DataFrame output on Stack Overflow. Consider that we can read a df from a string like this:
data = """ 0 20 30 40 50
1 5 NaN 3 5 NaN
2 2 3 4 NaN 4
3 6 1 3 1 NaN"""
import pandas as pd
from io import StringIO
data = StringIO(data)
df = pd.read_csv(data, sep=r"\s+")
This produces the following df:
   0   20  30   40   50
1  5  NaN   3  5.0  NaN
2  2  3.0   4  NaN  4.0
3  6  1.0   3  1.0  NaN
You can read the output of to_string the same way:
pd.read_csv(StringIO(df.to_string()), sep=r"\s+")
and the resulting df is identical.
Answer 1 (score: 1)
Try this. I've updated it to include logic that works out the number of rows automatically: basically, I extract the largest value of the original DataFrame's index (the row numbers) from inside the big string.
To start, we convert the DataFrame to a string using the example you provided:
import pandas as pd

df = pd.DataFrame(
    columns=["really long name that goes on for a while",
             "another really long string", "c"] * 6,
    data=[["some really long data", 2, 3] * 6,
          [4, 5, 6] * 6,
          [7, 8, 9] * 6],
)
string = str(df)
import re
import numpy as np

lst = re.split('\n', string)
num_rows = int(lst[lst.index('') - 1][0]) + 1
col_names = []
lst = [i for i in lst if i != '']
for i in range(0, len(lst), num_rows + 1):
    col_names.append(lst[i])
new_col_names = []
for i in col_names:
    new_col_names.append(re.split(' ', i))
final_col_names = []
for i in new_col_names:
    final_col_names += i
final_col_names = [i for i in final_col_names if i != '']
final_col_names = [i for i in final_col_names if i != '\\']
for i in col_names:
    lst.remove(i)
new_lst = [re.split(r'\s{2,}', i) for i in lst]
new_lst = [i[1:-1] for i in new_lst]
newer_lst = []
for i in range(num_rows):
    sub_lst = []
    for j in range(i, len(final_col_names), num_rows):
        sub_lst += new_lst[j]
    newer_lst.append(sub_lst)
reshaped = np.reshape(newer_lst, (num_rows, len(final_col_names)))
fixed_df = pd.DataFrame(data=reshaped, columns=final_col_names)
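The key move in the code above is splitting each data row on runs of two or more spaces, since to_string pads cells with at least two spaces; a minimal illustration (the sample row is made up):

```python
import re

# A made-up data row in the shape produced by str(df) / to_string
row = "0  some really long data  2  3  "

# Cells are separated by runs of >= 2 spaces; single spaces inside a
# cell ("some really long data") survive the split. The leading index
# and a trailing empty string also appear in the result.
cells = re.split(r"\s{2,}", row)

assert cells == ["0", "some really long data", "2", "3", ""]
```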
My code runs a few loops, so if your original DataFrame has many thousands of rows this approach may take a while.
Answer 2 (score: 0)
I'm not sure how helpful this will be to anyone else, but I wrote a function (and a helper) to try to recover the data from the DataFrames I had mistakenly stored nested inside a pd.Series.
Here are the functions:
import re
import pandas as pd
from io import StringIO

def insertNan(substring):
    rows = substring.split('\n')
    headers = re.sub(r" \s+", " ", rows[0].replace("\\", "").strip()).split(" ")
    # The [2] below is a placeholder for the index. (Look in str(df), may appear like "\\\n1")
    # Notice that if your tables get past 100 rows, 2 needs to be 3, or be determined otherwise.
    boundaries = [0] + [2] + [rows[0].find(header) + len(header) for header in headers]
    values = []
    for i, row in enumerate(rows):
        values.append(row)
        # First row is just column headers. If no headers then don't use these functions
        if i == 0:
            continue
        for j, bound in enumerate(boundaries[:-1]):
            value = row[bound:boundaries[j + 1]].strip()
            if not value:
                newstring = list(values[i])
                newstring[boundaries[j + 1] - 3:boundaries[j + 1]] = "NaN"
                values[i] = ''.join(newstring)
            if " " in value:
                start = values[i].find(value)
                newvalue = re.sub(r" \s+", " ", value)
                values[i] = values[i][:start] + newvalue + values[i][start + len(value):]
    return '\n'.join(values)
def from_string(string):
    string = string.replace("\\", "")
    chunks = [insertNan(i).strip() for i in string.split("\n\n")]
    frames = [pd.read_csv(StringIO(chunk), sep=r" \s+", engine='python')
              for chunk in chunks]
    return pd.concat(frames, axis=1)
# Read file and loop through series. These two lines might have to be modified.
corrupted_results = pd.read_excel(fileio, squeeze=True)
results = [from_string(result) for result in corrupted_results.values]
This got me almost all the way back to the pd.Series (of results) I started with, except that some overly long text entries had been truncated with "...".
All in all, saving data as DataFrames nested inside a pd.Series is probably not a good idea. I have since decided to save a single concatenated DataFrame instead, built by concatenating the frames with an added "name" column, which lets me split them apart later with .groupby if needed.
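That concat-plus-groupby scheme might look like this (a sketch; the frame contents and the names "run_a"/"run_b" are made up):

```python
import pandas as pd

# Hypothetical frames that would otherwise be nested inside a pd.Series
frames = {"run_a": pd.DataFrame({"x": [1, 2]}),
          "run_b": pd.DataFrame({"x": [3, 4]})}

# Save one flat frame with an added "name" column instead of nesting
combined = pd.concat(
    [f.assign(name=key) for key, f in frames.items()],
    ignore_index=True,
)

# Later, recover the individual frames with groupby
parts = {key: group.drop(columns="name").reset_index(drop=True)
         for key, group in combined.groupby("name")}

assert parts["run_a"].equals(frames["run_a"])
assert parts["run_b"].equals(frames["run_b"])
```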
Note that if the DataFrames saved in the pd.Series have no headers, the functions I provided will probably not work without modification.
Special thanks to ColdSpeed, Charles Landau, and JamesD for their time, help, and kindness!