Question

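I have a file that looks like this:

Location Code   Trait ID            Effective Date
WAU1            23984,24896,27576   06/05/2014 ,06/05/2014 ,06/12/2014
WAU2            126973,219332       06/05/2014 ,06/05/2014
WAU3            24375               06/05/2014
WAU4            23984               06/05/2014
WAU5            5199,23984          NULL
WAU6            12342,224123        06/05/2014

Note that columns 2 and 3 are "lists" of values. In some rows the two lists have exactly the same number of elements; in other rows the second list is missing elements, or is not there at all (NULL). I need to create a dataframe that closely resembles the following:

    Location Code  Trait ID  Effective Date
0   WAU1           23984     06/05/2014
1   WAU1           24896     06/05/2014
2   WAU1           27576     06/12/2014
3   WAU2           126973    06/05/2014
4   WAU2           219332    06/05/2014
5   WAU3           24375     06/05/2014
6   WAU4           23984     06/05/2014
7   WAU5           5199      NaN
8   WAU5           23984     NaN
9   WAU6           12342     06/05/2014
10  WAU6           224123    NaN

I have been able to split each "list" column out into a separate dataframe using the following:

import pandas as pd

# Split each comma-separated string into one column per element, then stack into a long Series.
trait_id = df1['Trait ID'].str.split(',').apply(pd.Series).stack()
# Drop the inner (position) level so the index matches df1's original row labels.
trait_id.index = trait_id.index.droplevel(-1)
trait_id.name = 'Trait ID'
# Replace the original list column with the long version.
del df1['Trait ID']
df1 = df1.join(trait_id)

Which gives me something like this:

  Location Code Trait ID
0          WAU1    23984
0          WAU1    24896
0          WAU1    27576
1          WAU2   126973
1          WAU2   219332
2          WAU3    24375
3          WAU4    23984
4          WAU5     5199
4          WAU5    23984
5          WAU6    12342
5          WAU6   224123

I can use the same logic to create another dataframe from the "Effective Date" lists.

I am struggling to find the right "function" in pandas (e.g. join, merge, concat) to combine the two dataframes into my desired output. I have a feeling it is some combination of them, with a reset_index() thrown in there.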

Answer 1

Starting with:

   Location Code             Trait ID                    Effective Date
0           WAU1  23984, 24896, 27576  06/05/2014,06/05/2014,06/12/2014
1           WAU2       126973, 219332             06/05/2014,06/05/2014
2           WAU3                24375               2014-06-05 00:00:00
3           WAU4                23984               2014-06-05 00:00:00
4           WAU5          5199, 23984                               NaN
5           WAU6        12342, 224123               2014-06-05 00:00:00

You could groupby('Location Code') and, for each group, apply str.split(',') with expand=True followed by stack() to both list columns, then line the two results up side by side using concat:

df1.groupby('Location Code').apply(
    lambda x: pd.concat(
        [x['Trait ID'].str.split(',', expand=True).stack(),
         x['Effective Date'].str.split(',', expand=True).stack()],
        axis=1)
).reset_index([1, 2], drop=True)
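
For comparison, here is a minimal self-contained sketch of the same split-and-stack idea without the groupby. It is not the code from the answer above. The sample frame is typed in by hand from the question's file (with None standing in for NULL), and the names df, split_long, pairs and result are hypothetical. Each list column becomes a long Series indexed by (row, position), so concat can align Trait IDs and dates by position and leave NaN where a date is missing.

```python
import pandas as pd

# Hand-typed copy of the question's sample file (assumption: None represents NULL).
df = pd.DataFrame({
    "Location Code": ["WAU1", "WAU2", "WAU3", "WAU4", "WAU5", "WAU6"],
    "Trait ID": ["23984,24896,27576", "126973,219332", "24375",
                 "23984", "5199,23984", "12342,224123"],
    "Effective Date": ["06/05/2014 ,06/05/2014 ,06/12/2014 ",
                       "06/05/2014 ,06/05/2014 ", "06/05/2014 ",
                       "06/05/2014 ", None, "06/05/2014"],
})


def split_long(col):
    """Turn a comma-separated column into a Series indexed by (row, position)."""
    wide = col.str.split(",", expand=True)   # one column per list position
    long = wide.stack().dropna()             # keep only the positions that exist
    return long.str.strip().rename(col.name)


# Outer-align the two long Series on (row, position); missing dates become NaN.
pairs = pd.concat(
    [split_long(df["Trait ID"]), split_long(df["Effective Date"])],
    axis=1,
).sort_index()

# Look up the Location Code for each original row label, then flatten the index.
rows = pairs.index.get_level_values(0)
result = pairs.reset_index(drop=True)
result.insert(0, "Location Code", df.loc[rows, "Location Code"].to_numpy())

print(result)
```

Aligning on the position within each list, rather than on the row label alone, is what keeps the second Trait ID of WAU6 paired with NaN instead of repeating the single date.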

Create a single dataframe from two columns, each containing lists
