Creating a single DataFrame from two columns that each contain lists

Time: 2016-01-29 20:02:59

Tags: python pandas dataframe

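I have a file that looks like this:

Location Code   Trait ID            Effective Date
WAU1            23984,24896,27576   06/05/2014 ,06/05/2014 ,06/12/2014
WAU2            126973,219332       06/05/2014 ,06/05/2014
WAU3            24375               06/05/2014
WAU4            23984               06/05/2014
WAU5            5199,23984          NULL
WAU6            12342,224123        06/05/2014

Note that columns 2 and 3 are "lists" of values. In some rows the two lists have exactly the same number of elements; in others the second list is shorter, or missing altogether (null). I need to create a dataframe that closely resembles the following:

    Location Code  Trait ID  Effective Date
0   WAU1           23984     06/05/2014
1   WAU1           24896     06/05/2014
2   WAU1           27576     06/12/2014
3   WAU2           126973    06/05/2014
4   WAU2           219332    06/05/2014
5   WAU3           24375     06/05/2014
6   WAU4           23984     06/05/2014
7   WAU5           5199      NaN
8   WAU5           23984     NaN
9   WAU6           12342     06/05/2014
10  WAU6           224123    NaN

I have been able to split each "list" column into a separate dataframe using the following:

import pandas as pd

# Split the comma-separated strings and stack them into one value per row
trait_id = df1['Trait ID'].str.split(',').apply(pd.Series,1).stack()
# Drop the inner index level so the values line up with the original rows
trait_id.index = trait_id.index.droplevel(-1)
trait_id.name = 'Trait ID'
# Replace the original column with the expanded values
del df1['Trait ID']
df1 = df1.join(trait_id)

This gives me something like:

  Location Code Trait ID
0          WAU1    23984
0          WAU1    24896
0          WAU1    27576
1          WAU2   126973
1          WAU2   219332
2          WAU3    24375
3          WAU4    23984
4          WAU5     5199
4          WAU5    23984
5          WAU6    12342
5          WAU6   224123

I can create another dataframe from the "Effective Date" list using the same logic.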

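I am struggling to find the right "function" in pandas (e.g. join, merge, concat) to combine the two dataframes into my desired output. I have a feeling it is some combination of them, with a reset_index() thrown in somewhere.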

1 Answer:

Answer 0: (score: 1)

Starting with:

  Location Code             Trait ID                    Effective Date
0          WAU1  23984, 24896, 27576  06/05/2014,06/05/2014,06/12/2014
1          WAU2       126973, 219332             06/05/2014,06/05/2014
2          WAU3                24375               2014-06-05 00:00:00
3          WAU4                23984               2014-06-05 00:00:00
4          WAU5          5199, 23984                               NaN
5          WAU6        12342, 224123               2014-06-05 00:00:00

you could use groupby('Location Code') and, for each group, apply str.split(',') with expand=True followed by stack() to both list columns, then combine the two results using concat:

df1.groupby('Location Code').apply(
    lambda x: pd.concat(
        [x['Trait ID'].str.split(',', expand=True).stack(),
         x['Effective Date'].str.split(',', expand=True).stack()],
        axis=1)
).reset_index([1, 2], drop=True)
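
For anyone who wants something to run end to end, here is a minimal sketch of a different route to the same shape of output. It does not use the groupby/stack approach above; instead it pairs the split values positionally with itertools.zip_longest from the standard library. The helper names split_cell and expand_lists are hypothetical (they are not part of pandas), and the sample frame assumes the NULL in the file has already been read in as a missing value and that every value is a plain string.

```python
from itertools import zip_longest

import pandas as pd


def split_cell(value):
    """Split a comma-separated cell into stripped strings.

    Hypothetical helper: a missing cell (None/NaN) yields an empty list.
    """
    if pd.isna(value):
        return []
    return [part.strip() for part in str(value).split(",") if part.strip()]


def expand_lists(df, key, list_cols):
    """Return one row per list position, padding shorter lists with None.

    Hypothetical helper: rows whose list columns are all empty are dropped.
    """
    rows = []
    for _, row in df.iterrows():
        parts = [split_cell(row[col]) for col in list_cols]
        # zip_longest pairs elements by position and pads the shorter list
        for values in zip_longest(*parts, fillvalue=None):
            rows.append((row[key], *values))
    return pd.DataFrame(rows, columns=[key, *list_cols])


# Sample data copied from the question (assumption: NULL was read as missing)
df1 = pd.DataFrame({
    "Location Code": ["WAU1", "WAU2", "WAU3", "WAU4", "WAU5", "WAU6"],
    "Trait ID": ["23984,24896,27576", "126973,219332", "24375",
                 "23984", "5199,23984", "12342,224123"],
    "Effective Date": ["06/05/2014 ,06/05/2014 ,06/12/2014 ",
                       "06/05/2014 ,06/05/2014 ", "06/05/2014 ",
                       "06/05/2014 ", None, "06/05/2014 "],
})

result = expand_lists(df1, "Location Code", ["Trait ID", "Effective Date"])
print(result)
```

Looping over rows is slower than a vectorised solution on large files, but the padding behaviour is explicit: a Trait ID with no matching date simply gets a missing Effective Date, which is what the desired output in the question shows for WAU5 and the second WAU6 row.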