I have a file that looks like this:

Location Code  Trait ID           Effective Date
WAU1           23984,24896,27576  06/05/2014 ,06/05/2014 ,06/12/2014
WAU2           126973,219332      06/05/2014 ,06/05/2014
WAU3           24375              06/05/2014
WAU4           23984              06/05/2014
WAU5           5199,23984         NULL
WAU6           12342,224123       06/05/2014

Note that columns 2 and 3 are "lists" of values. In some rows the two lists have exactly the same number of elements; in others some elements are missing, or the list is not there at all (null). I need to create a DataFrame that closely resembles the following:

    Location Code  Trait ID  Effective Date
0   WAU1           23984     06/05/2014
1   WAU1           24896     06/05/2014
2   WAU1           27576     06/12/2014
3   WAU2           126973    06/05/2014
4   WAU2           219332    06/05/2014
5   WAU3           24375     06/05/2014
6   WAU4           23984     06/05/2014
7   WAU5           5199      NaN
8   WAU5           23984     NaN
9   WAU6           12342     06/05/2014
10  WAU6           224123    NaN

I have been able to split each "list" column into a separate DataFrame using the following:

import pandas as pd

# Split each 'Trait ID' list into one row per element, keeping the original row index
trait_id = df1['Trait ID'].str.split(',').apply(pd.Series).stack()
trait_id.index = trait_id.index.droplevel(-1)
trait_id.name = 'Trait ID'
# Replace the list column with the long-format values
del df1['Trait ID']
df1 = df1.join(trait_id)

Which gives me something like this:

   Location Code  Trait ID
0  WAU1           23984
0  WAU1           24896
0  WAU1           27576
1  WAU2           126973
1  WAU2           219332
2  WAU3           24375
3  WAU4           23984
4  WAU5           5199
4  WAU5           23984
5  WAU6           12342
5  WAU6           224123

I can create another DataFrame from the "Effective Date" lists using the same logic as above.
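I am struggling to find the proper "function" in pandas (e.g. join, merge, concat) to combine the two DataFrames into my desired output. I have a feeling it is a combination of them, with a reset_index() in there somewhere.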
Answer 0 (score: 1)
Starting with:

   Location Code  Trait ID             Effective Date
0  WAU1           23984, 24896, 27576  06/05/2014,06/05/2014,06/12/2014
1  WAU2           126973, 219332       06/05/2014,06/05/2014
2  WAU3           24375                2014-06-05 00:00:00
3  WAU4           23984                2014-06-05 00:00:00
4  WAU5           5199, 23984          NaN
5  WAU6           12342, 224123        2014-06-05 00:00:00

You could groupby('Location Code') and, for each group, use str.split(',') with expand=True and stack(), then pivot the results using concat to get one row per list element, indexed by Location Code:

df1.groupby('Location Code').apply(
    lambda x: pd.concat(
        [x['Trait ID'].str.split(',', expand=True).stack(),
         x['Effective Date'].str.split(',', expand=True).stack()],
        axis=1)
).reset_index([1, 2], drop=True)
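If the groupby/apply route feels opaque, the same pairing can be sketched row by row. This is only a minimal illustration, not the approach above: it assumes NULL cells load as NaN and that every cell is either NaN or a comma-separated string, and the helper names split_cell and pair_lists are hypothetical, not part of pandas. It pairs the two lists by position with itertools.zip_longest and pads the shorter one with NaN.

```python
from itertools import zip_longest

import numpy as np
import pandas as pd

# Sample data mirroring the question; NULL is assumed to load as NaN.
df1 = pd.DataFrame({
    "Location Code": ["WAU1", "WAU2", "WAU3", "WAU4", "WAU5", "WAU6"],
    "Trait ID": ["23984,24896,27576", "126973,219332", "24375", "23984",
                 "5199,23984", "12342,224123"],
    "Effective Date": ["06/05/2014 ,06/05/2014 ,06/12/2014",
                       "06/05/2014 ,06/05/2014", "06/05/2014", "06/05/2014",
                       np.nan, "06/05/2014"],
})


def split_cell(cell):
    """Hypothetical helper: split a comma-separated cell; NaN gives []."""
    if pd.isna(cell):
        return []
    return [item.strip() for item in str(cell).split(",")]


def pair_lists(df, key, left, right):
    """Hypothetical helper: one row per list position, padded with NaN."""
    rows = []
    for code, a, b in zip(df[key], df[left], df[right]):
        # zip_longest pads the shorter list so unmatched items keep a row
        for x, y in zip_longest(split_cell(a), split_cell(b), fillvalue=np.nan):
            rows.append({key: code, left: x, right: y})
    return pd.DataFrame(rows, columns=[key, left, right])


result = pair_lists(df1, "Location Code", "Trait ID", "Effective Date")
print(result)
```

On the sample data this gives 11 rows with a fresh 0..10 index, and NaN in Effective Date wherever a trait has no matching date. The values stay as strings, so converting the dates with pd.to_datetime would be a separate step.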