Python for loop with if/else and append

Date: 2018-10-28 18:32:49

Tags: python python-3.x pandas

From the list below, I have to create a DataFrame with "State" and "Region" columns:

Raw data:

 Alabama[edit]
 Auburn (Auburn University)[1]
 Florence (University of North Alabama)
 Jacksonville (Jacksonville State University)[2]
 Livingston (University of West Alabama)[2]
 Montevallo (University of Montevallo)[2]
 Troy (Troy University)[2]
 Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
 Tuskegee (Tuskegee University)[5]
 Alaska[edit]
 Fairbanks (University of Alaska Fairbanks)[2]
 Arizona[edit]
 Flagstaff (Northern Arizona University)[6]
 Tempe (Arizona State University)

(The original post linked to the data here.)

Desired output:

State   Region
Alabama Auburn
Alabama Florence
Alabama Jacksonville
Alabama Livingston
Alabama Montevallo
Alabama Troy
Alabama Tuscaloosa
Alabama Tuskegee
Alaska  Fairbanks
Arizona Flagstaff
Arizona Tempe

Code:

    df = pd.DataFrame(columns=['State', 'RegionName'])
    with open('university_towns.txt', 'r') as UniversityList:
            content = UniversityList.readlines()
            state_row = []
            region_row = []
            for row in content:
                if '[edit]' in row:
                    state_row.append(row)
                    region_row.append('region_to_be_repeated')
                else:
                    region_row.append(row)
                    state_row.append('state_to_be_repeated')

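How can I replace 'state_to_be_repeated' with the value that was appended when the `if` condition was True?

3 Answers:

Answer 0: (score: 0)

If I understand your question and desired output correctly, you could do the following: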

import pandas as pd

universitylist = []
with open('university_towns.txt', 'r') as file:
    for line in file:
        if '[edit]' in line:
            # A state header: remember it for the town lines that follow
            state = line
        else:
            universitylist.append([state, line])

df = pd.DataFrame(universitylist, columns=['State', 'RegionName'])

If you don't want the '[edit]' and '[1]' parts, you can change the code to:

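universitylist = []
with open('university_towns.txt', 'r') as file:
    for line in file:
        if '[edit]' in line:
            # Split on '[' (no leading space) because the data reads 'Alabama[edit]'
            state = line.split('[')[0].strip()
        else:
            # Drops trailing markers such as '[1]' and the newline
            universitylist.append([state, line.split('[')[0].strip()])

df = pd.DataFrame(universitylist, columns=['State', 'RegionName'])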
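
As a possible alternative to `split`, the bracketed markers could also be removed with a regular expression. The sketch below is only an illustration and not part of the original answer: the helper name `strip_markers` is hypothetical, and it assumes every marker appears as text inside square brackets.

```python
import re

# Matches any bracketed marker such as "[edit]", "[1]" or "[3][4]".
MARKER = re.compile(r'\[[^\]]*\]')


def strip_markers(line):
    """Return `line` without bracketed markers or surrounding whitespace.

    `strip_markers` is a hypothetical helper name used for illustration.
    """
    return MARKER.sub('', line).strip()


print(strip_markers('Alabama[edit]\n'))
print(strip_markers('Auburn (Auburn University)[1]\n'))
```

Because the pattern removes every bracketed group, a line with several markers in a row (such as the Tuscaloosa entry) is cleaned in one call, while the university name in parentheses is left in place.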

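Answer 1: (score: 0)

You can find an example of cleaning this data set in the tutorial Pythonic Data Cleaning With NumPy and Pandas.

Option 1: String processing in "pure Python"

You can loop over the lines of the file once, carrying the last-seen state forward, which loads the data in O(n) time: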

import pandas as pd

university_towns = []

with open('input/university_towns.txt') as file:
    for line in file:
        edit_pos = line.find('[edit]')
        if edit_pos != -1:
            # Remember this `state` until the next is found
            state = line[:edit_pos]
        else:
            # Otherwise, we have a city; keep `state` as last-seen
            parens = line.find(' (')
            # Lines without " (" still end in a newline, so strip them
            town = line[:parens] if parens != -1 else line.strip()
            university_towns.append((state, town))

towns_df = pd.DataFrame(university_towns,
                        columns=['State', 'RegionName'])
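
To try the "remember the last-seen state" idea without the file on disk, the same loop can be run over an in-memory sample. This is a minimal sketch under stated assumptions: `SAMPLE` holds a few lines copied from the question, and `parse_towns` is a hypothetical function name, not something from the tutorial or the answer.

```python
import io

import pandas as pd

# A few lines copied from the question stand in for university_towns.txt.
SAMPLE = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
"""


def parse_towns(lines):
    """Pair every town line with the most recently seen state line.

    `parse_towns` is a hypothetical name; `lines` is any iterable of strings.
    """
    rows = []
    state = None  # no state header seen yet
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith('[edit]'):
            # State header: drop the marker and remember the name
            state = line[:-len('[edit]')]
        else:
            # Town line: keep the text before " (" and before any "[n]"
            town = line.split(' (')[0].split('[')[0]
            rows.append((state, town))
    return pd.DataFrame(rows, columns=['State', 'RegionName'])


sample_df = parse_towns(io.StringIO(SAMPLE))
print(sample_df)
```

Passing an open file object instead of `io.StringIO(SAMPLE)` should work the same way, since both yield one line per iteration.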

Option 2: String processing via the Pandas API

Alternatively, you can use Pandas's .str accessor for the string processing:

import pandas as pd

university_towns = []

with open('input/university_towns.txt') as file:
    for line in file:
        if '[edit]' in line:
            # Remember this `state` until the next is found
            state = line
        else:
            # Otherwise, we have a city; keep `state` as last-seen
            university_towns.append((state, line))

towns_df = pd.DataFrame(university_towns,
                        columns=['State', 'RegionName'])

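# regex=True is spelled out because pandas 2.0+ treats the pattern as a
# literal string by default
towns_df['State'] = towns_df.State.str.replace(r'\[edit\]\n', '', regex=True)
towns_df['RegionName'] = towns_df.RegionName\
    .str.strip()\
    .str.replace(r' \(.*', '', regex=True)\
    .str.replace(r'\[.*', '', regex=True)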

Output:

>>> towns_df.head()
     State    RegionName
0  Alabama        Auburn
1  Alabama      Florence
2  Alabama  Jacksonville
3  Alabama    Livingston
4  Alabama    Montevallo
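
The Python loop can also be avoided entirely. The sketch below shows a different technique from the one in this answer: a boolean mask combined with `Series.where` and `ffill` to copy each state down to its towns. The sample lines are taken from the question, and the variable names (`lines`, `is_state`, `ffill_df`) are assumptions made for illustration.

```python
import pandas as pd

# Sample lines from the question, already stripped of newlines.
lines = pd.Series([
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'Florence (University of North Alabama)',
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)[2]',
    'Arizona[edit]',
    'Flagstaff (Northern Arizona University)[6]',
    'Tempe (Arizona State University)',
])

# True on state header rows, False on town rows.
is_state = lines.str.contains('[edit]', regex=False)

# Keep the text only on header rows (NaN elsewhere), remove the marker,
# then forward-fill so each town row inherits the state above it.
state = lines.where(is_state).str.replace('[edit]', '', regex=False).ffill()

# Drop everything from " (" or "[" onwards to leave just the town name.
town = lines.str.replace(r'\s*[(\[].*', '', regex=True)

# Combine the columns and discard the header rows themselves.
ffill_df = pd.DataFrame({'State': state, 'RegionName': town})[~is_state]
ffill_df = ffill_df.reset_index(drop=True)
print(ffill_df)
```

With the real file, `lines` could be built from the file's lines after stripping whitespace; whether this is preferable to the explicit loop is mostly a matter of taste.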

Answer 2: (score: 0)

The shortest version I can come up with:

import pandas as pd

lst = list()

with open('university_towns.txt', 'r', newline='\n') as infile:
    for line in infile.readlines():
        if '[edit]' in line:
            state = line.split('[')[0]
        else:
            # Split on ' (' rather than ' ' so multi-word town names stay intact
            lst.append([state, line.split(' (')[0].strip()])

df = pd.DataFrame(lst, columns=['State', 'RegionName'])
print(df)

On my machine (Python 3.6) this produces:

      State    RegionName
0   Alabama        Auburn
1   Alabama      Florence
2   Alabama  Jacksonville
3   Alabama    Livingston
4   Alabama    Montevallo
5   Alabama          Troy
6   Alabama    Tuscaloosa
7   Alabama      Tuskegee
8    Alaska     Fairbanks
9   Arizona     Flagstaff
10  Arizona         Tempe