Creating pandas df columns from list items of uneven length?

Time: 2019-02-05 19:42:14

Tags: python pandas dataframe

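I have a list of addresses that I want to put into a dataframe, where each row is a new address and the columns are the parts of the address (title, street, city).

However, the list is structured so that some addresses are longer than others. For example: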

address = ['123 Some Street, City','45 Another Place, PO Box 123, City']

I have a pandas dataframe with the following columns:

Index     Court       Address                              Zipcode   Phone                           
0         Court 1     123 Court Dr, Springfield            12345     11111
1         Court 2     45 Court Pl, PO Box 45, Pawnee       54321     11111
2         Court 3     1725 Slough Ave, Scranton            18503     11111
3         Court 4     101 Court Ter, Unit 321, Eagleton    54322     11111

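I want to split the "Address" column into up to three columns, depending on how many comma separators the address contains, and fill with NaN wherever a value would be missing.

For example, I would like the data to look like this: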

Index     Court       Address          Address2     City           Zip  Phone                                          
0         Court 1     123 Court Dr     NaN          Springfield    ...   ...           
1         Court 2     45 Court Pl      PO Box 45    Pawnee         ...   ...
2         Court 3     1725 Slough Ave  NaN          Scranton       ...   ...
3         Court 4     101 Court Ter    Unit 321     Eagleton       ...   ...

I have researched this carefully and tried many different solutions from StackOverflow, but to no avail. The closest I have gotten is:

df2 = pd.concat([df, df['Address'].str.split(', ', expand=True)], axis=1)

But this returns a dataframe that appends the following three columns, structured like this:

...  0              1             2
... 123 Court Dr   Springfield   None
... 45 Court Pl    PO Box 45     Pawnee

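This is close, but as you can see, for the shorter entries the city ends up in the same column as the second address line of the longer entries.

Ideally, column 2 should hold a city in every row, and column 1 should alternate between None and the second address line.

I hope this makes sense; it is hard to put into words. Thanks!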

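3 answers:

Answer 0: (score: 0)

Addresses, especially human-entered ones, can be tricky. However, if your addresses only come in these two formats, you can use the following:

Note: if there are other formats you need to account for, the offending addresses will be printed.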

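def split_address(df):
    # Write through df.loc: the row yielded by iterrows() is a copy,
    # so assigning to it would not change the dataframe.
    for index, row in df.iterrows():
        full_address = row['Address']
        # Strip the space that follows each comma.
        split = [part.strip() for part in full_address.split(',')]
        if full_address.count(',') == 2:
            # Two commas: street, second address line, city.
            df.loc[index, 'address_1'] = split[0]
            df.loc[index, 'address_2'] = split[1]
            df.loc[index, 'city'] = split[2]
        elif full_address.count(',') == 1:
            # One comma: street, city.
            df.loc[index, 'address_1'] = split[0]
            df.loc[index, 'city'] = split[1]
        else:
            print("address does not fit known formats {0}".format(full_address))
    return df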

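Basically, the two things that should help you are the str.count() method, which tells you the number of commas in the string, and str.split(), which you have already found and which splits the input into a list. You can index into this list to assign each piece to the right column.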
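The count-and-split idea can be tried on a single string before touching the dataframe. The following is a minimal sketch rather than part of the original answer: the helper name `split_address_parts` is hypothetical, and it assumes every address contains either one or two commas.

```python
def split_address_parts(full_address):
    """Return (address_1, address_2, city); address_2 is None when absent."""
    # Strip the whitespace that follows each comma.
    parts = [part.strip() for part in full_address.split(',')]
    comma_count = full_address.count(',')
    if comma_count == 2:
        # street, second address line, city
        return parts[0], parts[1], parts[2]
    if comma_count == 1:
        # street, city
        return parts[0], None, parts[1]
    raise ValueError("address does not fit known formats: {0}".format(full_address))


addresses = ['123 Some Street, City', '45 Another Place, PO Box 123, City']
rows = [split_address_parts(address) for address in addresses]
print(rows)
# [('123 Some Street', None, 'City'), ('45 Another Place', 'PO Box 123', 'City')]
```

A list of tuples like `rows` could then be turned into columns with `pd.DataFrame(rows, columns=['address_1', 'address_2', 'city'])`, which avoids assigning row by row.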

Answer 1: (score: 0)

You can do the following:

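# Split on ', ' (comma plus space) so the pieces carry no leading whitespace.
df['Address1'] = df['Address'].str.split(', ').str[0]
# Only addresses with two separators match; the rest become NaN.
df['Address2'] = df['Address'].str.extract(', (.*), ', expand=False)
df['City'] = df['Address'].str.split(', ').str[-1]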
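As a variation on the same idea, the city can be peeled off from the right-hand side first, so it always lands in the same column. This is only a sketch and not the answerer's code: it assumes the last comma-separated part is always the city, that parts are separated by a comma plus a space, and it uses a two-row sample taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Court': ['Court 1', 'Court 2'],
    'Address': ['123 Court Dr, Springfield', '45 Court Pl, PO Box 45, Pawnee'],
})

# n=1 splits only at the last ', ', so column 1 is always the city.
street_city = df['Address'].str.rsplit(', ', n=1, expand=True)
df['City'] = street_city[1]

# What remains has one or two parts. reindex guarantees that column 1
# exists even if no address in the data has a second address line.
street = street_city[0].str.split(', ', n=1, expand=True).reindex(columns=[0, 1])
df['Address1'] = street[0]
df['Address2'] = street[1]

print(df[['Court', 'Address1', 'Address2', 'City']])
```

Rows without a second address line get a missing value in `Address2`, which is the alignment the question asks for.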

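Answer 2: (score: 0)

You could consider using the usaddress package to build the splitting functions. It was very helpful to me when I needed to split addresses into parts: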

import pandas as pd
import usaddress

df = pd.DataFrame(['123 Main St. Suite 100 Chicago, IL', '123 Main St. PO Box 100 Chicago, IL'], columns=['Address'])

Then create the functions that split the data:

def Address1(x):
    # Street line: house number + street name + street type, e.g. "123 Main St."
    try:
        data = usaddress.tag(x)[0]  # tag() returns (tagged parts, address type)
        if 'AddressNumber' in data and 'StreetName' in data and 'StreetNamePostType' in data:
            return data['AddressNumber'] + ' ' + data['StreetName'] + ' ' + data['StreetNamePostType']
    except Exception:  # e.g. usaddress.RepeatedLabelError on ambiguous input
        pass
    return None

def Address2(x):
    # Second address line: a suite/unit or a PO box, whichever was tagged.
    try:
        data = usaddress.tag(x)[0]
        if 'OccupancyType' in data and 'OccupancyIdentifier' in data:
            return data['OccupancyType'] + ' ' + data['OccupancyIdentifier']
        elif 'USPSBoxType' in data and 'USPSBoxID' in data:
            return data['USPSBoxType'] + ' ' + data['USPSBoxID']
    except Exception:
        pass
    return None

def PlaceName(x):
    # City name, if one was tagged.
    try:
        data = usaddress.tag(x)[0]
        if 'PlaceName' in data:
            return data['PlaceName']
    except Exception:
        pass
    return None

df['Address1'] = df['Address'].apply(Address1)
df['Address2'] = df['Address'].apply(Address2)
df['City'] = df['Address'].apply(PlaceName)

Out:

                               Address      Address1    Address2     City
0   123 Main St. Suite 100 Chicago, IL  123 Main St.   Suite 100  Chicago
1  123 Main St. PO Box 100 Chicago, IL  123 Main St.  PO Box 100  Chicago