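I'm very new to pandas. So far I've been learning pandas with csv files and Excel spreadsheets.
Now I'm facing the problem of converting a text file into a DataFrame. The text file is what I'd call sequential data. The format of the file is: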
State Name
City Name
State Name
City Name
City Name
City Name
...
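All 50 states and the US territories are listed, but the number of cities under each one varies. I need to convert this into a DataFrame like
[[State Name, City Name1],[State Name, City Name2],...]
Using the pandas read_table() method I was at least able to read the file into a DataFrame, but now I'm not sure how to get it into the proper state name / city name format.
I also have a dictionary of state names and two-letter state abbreviations. The format of the dictionary is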
{'OH':'OHIO', 'KY':'Kentucky',...}
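Is there a way to use this dictionary to loop over the file and separate the states from the cities? Or is there an easier way to accomplish this?
Thanks
EDIT - text file example: a sample of the text file is listed below. Also, please note that I cannot change the file.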
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
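Answer 0 (score: 3)
I would create a list, cities, filled with (state_name, city_name) tuples, and then convert this list of tuples into a DataFrame.
For this you need a precompiled list of all the states that appear in the text file, so that we can tell whether the file cursor is on a state line or on a city line.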
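import pandas

cities = []
# Every state name, spelled exactly as it appears in the file
list_of_states = ['Alaska', ..., 'Ohio', ...]
state = None  # most recent state line seen so far
with open('file.csv') as f:
    for line in f:
        line = line.strip()  # drop the trailing newline so the lookup can match
        if line in list_of_states:
            state = line
        else:
            cities.append((state, line))
df = pandas.DataFrame(cities, columns=['State', 'City'])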
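With the sample file from the edit, the state lines read `Alabama[edit]` rather than `Alabama`, so a lookup in a list of plain state names would not match them. A minimal sketch of the same loop that keys on the `[edit]` suffix instead, assuming every state line (and only a state line) ends with it; `parse_state_city_lines` and the in-memory `sample` are hypothetical names used only for illustration:

```python
import io

import pandas as pd


def parse_state_city_lines(lines):
    """Return (state, city) tuples; a line ending in '[edit]' starts a new state."""
    marker = '[edit]'
    rows = []
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.endswith(marker):
            state = line[:-len(marker)]  # remember the state without its suffix
        else:
            rows.append((state, line))
    return rows


# Stand-in for the real file; with a file on disk, pass the open file object instead.
sample = io.StringIO(
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Florence (University of North Alabama)\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)[2]\n"
)
pairs = parse_state_city_lines(sample)
state_city_df = pd.DataFrame(pairs, columns=['State', 'City'])
```

This avoids maintaining a list of state names by hand, at the cost of depending on the `[edit]` marker being present in the file.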
Answer 1 (score: 3)
Suppose your column is named A. First find the states like this:
df.A.str.contains(r'\[edit\]')
Out[25]:
0 True
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 True
10 False
11 True
12 False
13 False
14 False
Use cumsum to define a group index for each state and its cities:
csum = df.A.str.contains(r'\[edit\]').cumsum()
csum
Out[26]:
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 2
10 2
11 3
12 3
13 3
14 3
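The reason this works is that `True` counts as 1 and `False` as 0 in a cumulative sum, so the running total goes up by one at each state line and stays constant over the city lines that follow. A small self-contained sketch on a shortened copy of the sample data (the five-row frame and the names `is_state` and `group_id` are assumptions made for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'Florence (University of North Alabama)',
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)[2]',
]})

# regex=False treats '[edit]' as a literal substring, so no escaping is needed.
is_state = df['A'].str.contains('[edit]', regex=False)

# Each True bumps the running total, giving one integer label per state block.
group_id = is_state.cumsum()
```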
Now you can get the states and the cities:
states = df.groupby(csum).first()
states
Out[38]:
   A
A
1  Alabama[edit]
2  Alaska[edit]
3  Arizona[edit]
cities = df.groupby(csum).apply(lambda g: g[1:])
cities
Out[39]:
      A
A
1 1   Auburn (Auburn University)[1]
  2   Florence (University of North Alabama)
  3   Jacksonville (Jacksonville State University)[2]
  4   Livingston (University of West Alabama)[2]
  5   Montevallo (University of Montevallo)[2]
  6   Troy (Troy University)[2]
  7   Tuscaloosa (University of Alabama, Stillman Co...
  8   Tuskegee (Tuskegee University)[5]
2 10  Fairbanks (University of Alaska Fairbanks)[2]
3 12  Flagstaff (Northern Arizona University)[6]
  13  Tempe (Arizona State University)
  14  Tucson (University of Arizona)
Now join the data frames:
states.join(cities, rsuffix='_cities')
Out[49]:
      A              A_cities
A
1 1   Alabama[edit]  Auburn (Auburn University)[1]
  2   Alabama[edit]  Florence (University of North Alabama)
  3   Alabama[edit]  Jacksonville (Jacksonville State University)[2]
  4   Alabama[edit]  Livingston (University of West Alabama)[2]
  5   Alabama[edit]  Montevallo (University of Montevallo)[2]
  6   Alabama[edit]  Troy (Troy University)[2]
  7   Alabama[edit]  Tuscaloosa (University of Alabama, Stillman Co...
  8   Alabama[edit]  Tuskegee (Tuskegee University)[5]
2 10  Alaska[edit]   Fairbanks (University of Alaska Fairbanks)[2]
3 12  Arizona[edit]  Flagstaff (Northern Arizona University)[6]
  13  Arizona[edit]  Tempe (Arizona State University)
  14  Arizona[edit]  Tucson (University of Arizona)
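A more compact variant of the same idea, offered as a sketch rather than as part of the answer above: keep the text only on the state rows, forward-fill it onto the city rows, then drop the state rows. It assumes the column is named A and that state lines end in `[edit]`. The clean-up of the `[edit]` suffix, the parenthesised university names, and the footnote markers is optional and also an assumption about what the final table should look like.

```python
import pandas as pd

df = pd.DataFrame({'A': [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'Florence (University of North Alabama)',
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)[2]',
]})

is_state = df['A'].str.endswith('[edit]')

# State text survives only on state rows; ffill copies it down to the cities below.
state = df['A'].where(is_state).ffill().str.replace('[edit]', '', regex=False)

# Optional: cut each city at the first ' (' or '[' to leave the bare name.
city = df['A'].str.replace(r'\s*[(\[].*$', '', regex=True)

result = (
    pd.DataFrame({'State': state, 'City': city})
    .loc[~is_state]            # keep the city rows only
    .reset_index(drop=True)
)
```

If the original strings are wanted instead of the cleaned names, use `df['A']` directly for the City column.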