From the list below, I need to create a DataFrame with "State" and "Region" columns.
Raw data:
Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
(The data is linked here.)
Desired output:
State Region
Alabama Auburn
Alabama Florence
Alabama Jacksonville
Alabama Livingston
Alabama Montevallo
Alabama Troy
Alabama Tuscaloosa
Alabama Tuskegee
Alaska Fairbanks
Arizona Flagstaff
Arizona Tempe
Code:
import pandas as pd

df = pd.DataFrame(columns=['State', 'RegionName'])
with open('university_towns.txt', 'r') as UniversityList:
    content = UniversityList.readlines()

state_row = []
region_row = []
for row in content:
    if '[edit]' in row:
        # Header line: this row is a state name
        state_row.append(row)
        region_row.append('region_to_be_repeated')
    else:
        # Any other line is a region belonging to the last state seen
        region_row.append(row)
        state_row.append('state_to_be_repeated')
How can I replace 'state_to_be_repeated' with the state that was appended the last time the `if` condition was True?
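Answer 0 (score: 0)
If I understand your question and the desired output correctly, you can do the following:
import pandas as pd

universitylist = []
with open('university_towns.txt', 'r') as file:
    for line in file:
        if '[edit]' in line:
            # Remember the current state until the next header line
            state = line
        else:
            universitylist.append([state, line])

df = pd.DataFrame(universitylist, columns=['State', 'RegionName'])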
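If you don't want the '[edit]' and '[1]' parts, you can change the code to:
import pandas as pd

universitylist = []
with open('university_towns.txt', 'r') as file:
    for line in file:
        if '[edit]' in line:
            # Keep only the text before the first '[' (drops '[edit]')
            state = line.split('[')[0].strip()
        else:
            # Same idea for regions: drops trailing markers such as '[1]'
            universitylist.append([state, line.split('[')[0].strip()])

df = pd.DataFrame(universitylist, columns=['State', 'RegionName'])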
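The bracket stripping can also be folded into one small helper. This is a minimal sketch rather than part of the original answer: `strip_annotations` is a hypothetical name, and it assumes every annotation starts at the first `[` on the line and that everything from there on can be discarded.

```python
import re

# Hypothetical helper (not from the original answer): drop everything from
# the first '[' onward, e.g. '[edit]' or '[1]', plus surrounding whitespace.
_ANNOTATION = re.compile(r"\s*\[.*")


def strip_annotations(line: str) -> str:
    # strip() first so the trailing newline does not get in the way
    return _ANNOTATION.sub("", line.strip())


print(strip_annotations("Alabama[edit]\n"))
print(strip_annotations("Auburn (Auburn University)[1]\n"))
print(strip_annotations("Florence (University of North Alabama)\n"))
```

Note that this keeps the parenthesised university names; removing those as well needs a further split on ' ('.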
Answer 1 (score: 0)
You can find an example of cleaning this dataset in the tutorial Pythonic Data Cleaning With NumPy and Pandas.
You can use a single for loop over the lines of the file, which loads the data in O(n) time:
import pandas as pd

university_towns = []
with open('input/university_towns.txt') as file:
    for line in file:
        edit_pos = line.find('[edit]')
        if edit_pos != -1:
            # Remember this `state` until the next is found
            state = line[:edit_pos]
        else:
            # Otherwise, we have a city; keep `state` as last-seen
            parens = line.find(' (')
            town = line[:parens] if parens != -1 else line.strip()
            university_towns.append((state, town))

towns_df = pd.DataFrame(university_towns,
                        columns=['State', 'RegionName'])
Alternatively, you can use pandas' .str accessor for the string processing:
import pandas as pd

university_towns = []
with open('input/university_towns.txt') as file:
    for line in file:
        if '[edit]' in line:
            # Remember this `state` until the next is found
            state = line
        else:
            # Otherwise, we have a city; keep `state` as last-seen
            university_towns.append((state, line))

towns_df = pd.DataFrame(university_towns,
                        columns=['State', 'RegionName'])

# regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
towns_df['State'] = towns_df['State'].str.replace(r'\[edit\]\n', '', regex=True)
towns_df['RegionName'] = (
    towns_df['RegionName']
    .str.strip()
    .str.replace(r' \(.*', '', regex=True)
    .str.replace(r'\[.*', '', regex=True)
)
Output:
>>> towns_df.head()
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
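The "remember the last state" bookkeeping can also be done without an explicit Python loop. The sketch below swaps in a different technique, forward fill (`ffill`), which is not what the code above uses. `sample_lines` is a small in-memory stand-in copied from the question's data, and the regular expression assumes the town name is everything before the first " (" or "[".

```python
import pandas as pd

# In-memory stand-in for the file contents (rows taken from the question)
sample_lines = [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
]

raw = pd.Series(sample_lines, name="raw")
is_state = raw.str.contains("[edit]", regex=False)

# State names only on header rows, missing elsewhere, then carried forward
state = raw.where(is_state).str.replace("[edit]", "", regex=False).ffill()

# Keep only the text before the first " (" or "["
region = raw.str.replace(r"\s*[\(\[].*", "", regex=True)

towns_df = pd.DataFrame({"State": state, "RegionName": region})
# Drop the header rows themselves; they only served to label the rows below
towns_df = towns_df[~is_state].reset_index(drop=True)
print(towns_df)
```

Whether this is preferable to the loop is a matter of taste; it mainly helps when the lines are already in a Series.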
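Answer 2 (score: 0)
The shortest version I can think of:
import pandas as pd

lst = []
with open('university_towns.txt', 'r', newline='\n') as infile:
    for line in infile:
        if '[edit]' in line:
            # Header line: keep the text before '[edit]' as the current state
            state = line.split('[')[0]
        else:
            # Split on ' (' rather than ' ' so multi-word town names stay intact
            lst.append([state, line.split(' (')[0].strip()])

df = pd.DataFrame(lst, columns=['State', 'RegionName'])
print(df)
On my machine (Python 3.6) this produces: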
State RegionName
0 Alabama Auburn
1 Alabama Florence
2 Alabama Jacksonville
3 Alabama Livingston
4 Alabama Montevallo
5 Alabama Troy
6 Alabama Tuscaloosa
7 Alabama Tuskegee
8 Alaska Fairbanks
9 Arizona Flagstaff
10 Arizona Tempe
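Because the loop only needs an iterable of lines, it can be tried without the file on disk. The following is an illustrative sketch under assumptions: `parse_towns` is a hypothetical name, the sample lines are copied from the question, and the town is taken to be the text before the first " (".

```python
def parse_towns(lines):
    """Return (state, town) pairs from lines in the university_towns format."""
    pairs = []
    state = None  # last state header seen so far
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if "[edit]" in line:
            state = line.split("[")[0]
        else:
            # Town is the text before " ("; also drop any "[n]" marker
            town = line.split(" (")[0].split("[")[0].strip()
            pairs.append((state, town))
    return pairs


sample = """Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
"""
pairs = parse_towns(sample.splitlines())
print(pairs)
# [('Alaska', 'Fairbanks'), ('Arizona', 'Flagstaff'), ('Arizona', 'Tempe')]
```

The resulting list can be passed straight to pd.DataFrame(pairs, columns=['State', 'RegionName']).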