How do I use Pandas to read in multiple datasets from one file?

Time: 2016-07-11 19:51:27

Tags: python pandas

I have a file that has multiple sets of data separated by rows. It looks something like:

country1  
0.9  
1.3  
2.9  
1.1  
...  
country2  
4.1  
3.1  
0.2
...

I would like to use Pandas to read the whole file into multiple dataframes, where each dataframe corresponds to a country. Is there any easy way to do this? Each country has a different number of entries.

2 answers:

Answer 0 (score: 6)

You can build a mask with to_numeric and errors='coerce', which yields NaN on the rows that hold country names. Flag those rows with isnull, then number the groups with cumsum:

import pandas as pd
import io

temp=u"""country1
0.9
1.3
2.9
1.1
country2
4.1
3.1
0.2"""
#after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), index_col=None, header=None)
print (df)
          0
0  country1
1       0.9
2       1.3
3       2.9
4       1.1
5  country2
6       4.1
7       3.1
8       0.2
mask = pd.to_numeric(df.iloc[:,0], errors='coerce').isnull().cumsum()
print (mask)
0    1
1    1
2    1
3    1
4    1
5    2
6    2
7    2
8    2
Name: 0, dtype: int32

Finally, use a list comprehension to build the list of DataFrames:

dfs = [g[1:].rename(columns={0:g.iloc[0].values[0]}) for i, g in df.groupby(mask)]

print (dfs)

print (dfs[0])
  country1
1      0.9
2      1.3
3      2.9
4      1.1

print (dfs[1])
  country2
6      4.1
7      3.1
8      0.2

If you need to reset the index:

dfs = [g[1:].rename(columns={0:g.iloc[0].values[0]}).reset_index(drop=True) for i, g in df.groupby(mask)]

print (dfs)

print (dfs[0])
  country1
0      0.9
1      1.3
2      2.9
3      1.1
print (dfs[1])
  country2
0      4.1
1      3.1
2      0.2
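
As a variation on the approach above, here is a minimal sketch that stores the results in a dict keyed by country name rather than in a list, and converts the values to floats (after the split in the answer they are still strings, because the column was read as object dtype). The names by_country, numeric and groups are made up for illustration, and the sketch assumes every non-numeric line is a country header.

```python
import io
import pandas as pd

temp = u"""country1
0.9
1.3
2.9
1.1
country2
4.1
3.1
0.2"""

# Same reading step as in the answer: one unnamed column of strings.
df = pd.read_csv(io.StringIO(temp), header=None)

# Rows that cannot be parsed as numbers become NaN; the running count of
# those rows gives each country block its own group number.
numeric = pd.to_numeric(df[0], errors='coerce')
groups = numeric.isnull().cumsum()

by_country = {}
for _, g in df.groupby(groups):
    name = g.iloc[0, 0]  # first row of the block is the country name
    values = pd.to_numeric(g.iloc[1:, 0]).reset_index(drop=True)
    by_country[name] = values.to_frame(name)  # one-column float DataFrame

print(by_country['country1'])
```

Looking frames up by name (by_country['country2']) avoids having to remember the position of each country in the list.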

Answer 1 (score: 1)

Pandas supports standard file formats like csv and json, and this is not one of those. I'm going to assume reformatting the file by hand is a waste of time and suggest you parse the file yourself using with open(...) as f: and f.readlines() into python objects.

Say you've done that, and the result looks like data below. Then from_dict() should work, as long as every list has the same length (lists of unequal length raise a ValueError):

data = { "countryName1": [0.9, 1.3, ...], "countryName2": [...]} 
df = pd.DataFrame.from_dict(data)
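
The manual parsing step could look roughly like the sketch below. The helper parse_blocks is hypothetical, not part of pandas, and it assumes that any line float() cannot parse is a country header. An io.StringIO stands in for the open file here. Since the question says each country has a different number of entries, the sketch builds either one DataFrame per country or a single wide frame from pd.Series objects, which pads the shorter columns with NaN.

```python
import io
import pandas as pd

def parse_blocks(lines):
    """Hypothetical helper: map each header line to the numbers below it."""
    data = {}
    current = None
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # skip blank lines
        try:
            value = float(text)
        except ValueError:
            # Assumption: a line that is not a number starts a new block.
            current = text
            data[current] = []
        else:
            if current is None:
                raise ValueError("number found before any header line")
            data[current].append(value)
    return data

# Stand-in for: with open(filename) as f: data = parse_blocks(f)
sample = io.StringIO("country1\n0.9\n1.3\n2.9\n1.1\ncountry2\n4.1\n3.1\n0.2\n")
data = parse_blocks(sample)

# One DataFrame per country, as the question asks.
frames = {name: pd.DataFrame({name: values}) for name, values in data.items()}

# Or a single wide frame; pd.Series lets columns of different length align.
wide = pd.DataFrame({name: pd.Series(values) for name, values in data.items()})
print(wide)
```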