I have a dataframe, user_df, with roughly 500,000 rows in the following format:
| id | other_ids |
|------|--------------|
| 1 |['abc', 'efg']|
| 2 |['bbb'] |
| 3 |['ccc', 'ddd']|
I also have a list, other_ids_that_clicked, containing about 5,000 other IDs:
['abc', 'efg', 'ccc']
I want to check other_ids_that_clicked against user_df by adding another column to the df that flags when a value from other_ids_that_clicked appears in user_df['other_ids'], like so:
| id | other_ids | clicked |
|------|--------------|-----------|
| 1 |['abc', 'efg']| 1 |
| 2 |['bbb'] | 0 |
| 3 |['ccc', 'ddd']| 1 |
The way I check this is by looping over other_ids_that_clicked for every row in user_df.
def otheridInList(row):
    isin = False
    for other_id in other_ids_that_clicked:
        if other_id in row['other_ids']:
            isin = True
            break
        else:
            isin = False
    if isin:
        return 1
    else:
        return 0
This is taking forever, so I'm looking for suggestions on the best way to approach it.
Thanks!
Answer 0 (score: 5)
You can actually speed this up quite a bit. Take the column out, convert it into its own dataframe, and then use df.isin to do the checking:
l = ['abc', 'efg', 'ccc']
df['clicked'] = pd.DataFrame(df.other_ids.tolist()).isin(l).any(axis=1).astype(int)
id other_ids clicked
0 1 [abc, efg] 1
1 2 [bbb] 0
2 3 [ccc, ddd] 1
Details
First, convert other_ids to a list of lists:
i = df.other_ids.tolist()
i
[['abc', 'efg'], ['bbb'], ['ccc', 'ddd']]
Now, load it into a new dataframe:
j = pd.DataFrame(i)
j
0 1
0 abc efg
1 bbb None
2 ccc ddd
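The None in row 1 above comes from padding: the lists have different lengths, so the shorter rows are filled out to the width of the longest one. A minimal sketch of that behaviour, using the same sample data (the variable names rows and padded are only illustrative):

```python
import pandas as pd

rows = [['abc', 'efg'], ['bbb'], ['ccc', 'ddd']]

# The frame is as wide as the longest inner list; shorter rows are padded.
padded = pd.DataFrame(rows)
print(padded.shape)

# Padded cells count as missing values...
print(padded.isna())

# ...and never match a real ID, so they cannot produce a false positive.
print(padded.isin(['abc', 'efg', 'ccc']))
```

One consequence worth keeping in mind: if a single row holds a very long list, every row is padded to that width, so memory use is driven by the longest list rather than the average one.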
Use isin:
k = j.isin(l)
k
0 1
0 True True
1 False False
2 True False
clicked can then be computed by using df.any to check whether True is present in each row. The result is converted to integers.
k.any(axis=1).astype(int)
0 1
1 0
2 1
dtype: int64
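Putting the steps above together, here is a self-contained sketch that can be run as-is. The helper name add_clicked_flag and its column parameter are my own for illustration, not part of pandas. It also passes index=df.index when building the intermediate frame, on the assumption that user_df might not have a default 0..n-1 index; without that, the assignment back to the original frame could misalign.

```python
import pandas as pd


def add_clicked_flag(df, clicked_ids, column="other_ids"):
    """Return a copy of df with a 0/1 'clicked' column (hypothetical helper)."""
    # Expand the list column into one cell per ID, keeping df's own index
    # so the result lines up with the original rows.
    expanded = pd.DataFrame(df[column].tolist(), index=df.index)
    # True wherever a cell holds one of the clicked IDs, reduced per row.
    flags = expanded.isin(list(clicked_ids)).any(axis=1).astype(int)
    return df.assign(clicked=flags)


user_df = pd.DataFrame({
    "id": [1, 2, 3],
    "other_ids": [["abc", "efg"], ["bbb"], ["ccc", "ddd"]],
})
other_ids_that_clicked = ["abc", "efg", "ccc"]

result = add_clicked_flag(user_df, other_ids_that_clicked)
print(result)
```

Because the membership test runs once over the whole expanded frame instead of once per row in Python, this should scale much better than the row-wise loop, though the actual speedup on 500,000 rows will depend on how long the lists are.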
Answer 1 (score: 3)
Use my_future<void> r = some_thread_pool.add_task([&x] {foo(x)});
import requests
import bs4
import webbrowser

def display(content):
    # to see this HTML in web browser
    with open('temp.html', 'wb') as f:
        f.write(content)
    webbrowser.open('temp.html')

with requests.session() as r:
    LOGIN = ""
    PASSWORD = ""
    login_url = "https://www.ouac.on.ca/apply/nonsecondary/intl/en_CA/user/login"
    profile_url = "https://www.ouac.on.ca/apply/nonsecondary/intl/en_CA/profile/"
    # session needs it only once and it will remember it
    r.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"
    })
    # load page with form - to get cookies and `csrf` from HTML
    response = r.get(login_url)
    #display(response.content)
    # get `csrf` from HTML
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    csrf = soup.find('input', {'name': 'csrf'}).attrs['value']
    print('csrf:', csrf)
    # cookies are not part of form so you don't use in form_data,
    # session will use cookies from previous request so you don't have to copy them
    form_data = {
        'login': LOGIN,
        'password': PASSWORD,
        'submitButton': "Log In",
        'csrf': csrf,
    }
    # send form data to server
    response = r.post(login_url, data=form_data)
    print('status_code:', response.status_code)
    print('history:', response.history)
    print('url:', response.url)
    #display(response.content)
    response = r.get(profile_url)
    display(response.content)