pandas: reading an undelimited text file into a DataFrame

Time: 2016-11-17 17:18:17

Tags: python pandas

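I am very new to pandas. So far I have been learning pandas with CSV files and Excel spreadsheets.

Now I am faced with converting a text file into a DataFrame. The text file is what I would call sequential data. The file has this format: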

State Name
City Name
State Name
City Name
City Name
City Name
...

All 50 states and the US territories are listed, but the number of cities per state varies. I need to convert this into a DataFrame like
[[State Name, City Name1],[State Name, City Name2],...]

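Using the pandas read_table() method, I was at least able to read the file into a DataFrame, but now I am not sure how to convert it into the proper state name / city name format.

I also have a dictionary mapping two-letter state abbreviations to state names. The dictionary has this format: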

{'OH':'OHIO', 'KY':'Kentucky',...}

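Is there a way to use this dictionary, loop over the file, and separate the states from the cities? Or is there a simpler way to achieve this?

Thanks

Edit - sample text file: A sample of the text file is listed below. Also, please note that I cannot change the file.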

Alabama[edit]  
Auburn (Auburn University)[1]
Florence (University of North Alabama) 
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2] 
Montevallo (University of Montevallo)[2] 
Troy (Troy University)[2] 
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4] 

Tuskegee (Tuskegee University)[5] 
Alaska[edit] 
Fairbanks (University of Alaska Fairbanks)[2] 
Arizona[edit] 
Flagstaff (Northern Arizona University)[6] 
Tempe (Arizona State University) 
Tucson (University of Arizona)

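2 Answers:

Answer 0 (score: 3)

I would create a list, cities, populated with (state_name, city_name) tuples, and then convert this list of tuples into a DataFrame.

To do this, you need a precompiled list of all the states that appear in the text file, so that we can tell whether the file cursor is on a state line or a city line.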

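import pandas

cities = []
# Entries must match the state lines exactly as they appear in the file
# (including any suffix such as '[edit]').
list_of_states = ['Alaska', ..., 'Ohio', ...]
state = None

with open('file.csv') as f:
    for line in f:
        line = line.strip()  # drop the trailing newline so the comparison can match
        if not line:
            continue  # skip blank lines
        if line in list_of_states:
            state = line  # remember the current state
        else:
            cities.append((state, line))

df = pandas.DataFrame(cities, columns=['state', 'city'])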
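
If maintaining a precompiled list of states is inconvenient, the sample in the question suggests another way to recognise state lines: they end in "[edit]". The following is a minimal sketch under that assumption. The `SAMPLE` text and the helper name `parse_states_and_cities` are hypothetical stand-ins for the real file and are not part of the original answer.

```python
import io
import re

import pandas as pd

# Hypothetical excerpt standing in for the real file.
SAMPLE = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)

Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
"""


def parse_states_and_cities(lines):
    """Return (state, city) tuples; a line ending in '[edit]' starts a new state."""
    rows = []
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.endswith("[edit]"):
            state = line[: -len("[edit]")].strip()
        elif state is not None:
            # Remove footnote markers such as [1] or [3][4]
            city = re.sub(r"\[\d+\]", "", line).strip()
            rows.append((state, city))
    return rows


rows = parse_states_and_cities(io.StringIO(SAMPLE))
df_pairs = pd.DataFrame(rows, columns=["state", "city"])
```

With a real file, the same function could be called on the open file object instead of the `io.StringIO` wrapper.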

Answer 1 (score: 3)

Suppose your column is named A. First find the states like this:

df.A.str.contains(r'\[edit\]')
Out[25]: 
0      True
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9      True
10    False
11     True
12    False
13    False
14    False

Use cumsum to define a group index for each state and its cities:

csum = df.A.str.contains(r'\[edit\]').cumsum()
csum
Out[26]: 
0     1
1     1
2     1
3     1
4     1
5     1
6     1
7     1
8     1
9     2
10    2
11    3
12    3
13    3
14    3
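
To see why the cumulative sum works as a group label, here is a small sketch on a made-up five-row Series (the name `toy` and its contents are illustrative only). Each True marks a state row, and since True counts as 1, the running total increases by one at every state and stays constant over the cities that follow it.

```python
import pandas as pd

# Made-up miniature version of column A
toy = pd.Series(["Alabama[edit]", "Auburn", "Troy", "Alaska[edit]", "Fairbanks"])

# regex=False treats "[edit]" as a literal string, so no escaping is needed
is_state = toy.str.contains("[edit]", regex=False)

# Running count of state rows seen so far: one label per state block
group_id = is_state.cumsum()
```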

Now you can get the states and the cities:

states = df.groupby(csum).first()
states
Out[38]: 
                 A
A                 
1  Alabama[edit]  
2    Alaska[edit] 
3   Arizona[edit] 

cities = df.groupby(csum).apply(lambda g: g[1:])
cities
Out[39]: 
                                                      A
A                                                      
1 1                       Auburn (Auburn University)[1]
  2             Florence (University of North Alabama) 
  3     Jacksonville (Jacksonville State University)[2]
  4         Livingston (University of West Alabama)[2] 
  5           Montevallo (University of Montevallo)[2] 
  6                          Troy (Troy University)[2] 
  7   Tuscaloosa (University of Alabama, Stillman Co...
  8                  Tuskegee (Tuskegee University)[5] 
2 10     Fairbanks (University of Alaska Fairbanks)[2] 
3 12        Flagstaff (Northern Arizona University)[6] 
  13                  Tempe (Arizona State University) 
  14                     Tucson (University of Arizona)

Now join the DataFrames:

states.join(cities, rsuffix='_cities')
Out[49]: 
                    A                                           A_cities
A                                                                       
1 1   Alabama[edit]                        Auburn (Auburn University)[1]
  2   Alabama[edit]              Florence (University of North Alabama) 
  3   Alabama[edit]      Jacksonville (Jacksonville State University)[2]
  4   Alabama[edit]          Livingston (University of West Alabama)[2] 
  5   Alabama[edit]            Montevallo (University of Montevallo)[2] 
  6   Alabama[edit]                           Troy (Troy University)[2] 
  7   Alabama[edit]    Tuscaloosa (University of Alabama, Stillman Co...
  8   Alabama[edit]                   Tuskegee (Tuskegee University)[5] 
2 10    Alaska[edit]      Fairbanks (University of Alaska Fairbanks)[2] 
3 12   Arizona[edit]         Flagstaff (Northern Arizona University)[6] 
  13   Arizona[edit]                   Tempe (Arizona State University) 
  14   Arizona[edit]                      Tucson (University of Arizona)
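
The joined result still carries the "[edit]" suffix, the footnote markers, and a two-level index. As a possible follow-up, here is a sketch of a different technique from the groupby/join above: it forward-fills the state name onto the city rows and then cleans the text. It assumes every city line has the form "City (University ...)" as in the sample; `df_raw` and `tidy` are hypothetical names, and the five rows are an excerpt of the sample data.

```python
import pandas as pd

# Excerpt of the sample, as read into a single column named A
df_raw = pd.DataFrame({"A": [
    "Alabama[edit]  ",
    "Auburn (Auburn University)[1]",
    "Troy (Troy University)[2] ",
    "Alaska[edit] ",
    "Fairbanks (University of Alaska Fairbanks)[2] ",
]})

is_state = df_raw["A"].str.contains("[edit]", regex=False)

# Keep the text only on state rows (NaN elsewhere), then carry it forward
state = df_raw["A"].where(is_state).ffill()

tidy = pd.DataFrame({
    # Strip the literal "[edit]" marker and surrounding whitespace
    "state": state.str.replace("[edit]", "", regex=False).str.strip(),
    # Drop everything from the opening parenthesis onward, leaving the city name
    "city": df_raw["A"].str.replace(r"\s*\(.*$", "", regex=True).str.strip(),
})

# Remove the state header rows and renumber
tidy = tidy[~is_state].reset_index(drop=True)
```

A city line without a parenthesised university would keep any trailing footnote marker under this regex, so the pattern may need adjusting for the full file.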