Question

I have a DataFrame, user_df, with roughly 500,000 rows in the following format:

|  id  |  other_ids     |
|------|----------------|
|  1   | ['abc', 'efg'] |
|  2   | ['bbb']        |
|  3   | ['ccc', 'ddd'] |

I also have a list, other_ids_that_clicked, containing about 5,000 other IDs:

 ['abc', 'efg', 'ccc']

I want to de-dupe other_ids_that_clicked against user_df by adding another column to the df that flags whether any value from other_ids_that_clicked appears in user_df['other_ids'], like so:

|  id  |  other_ids     |  clicked  |
|------|----------------|-----------|
|  1   | ['abc', 'efg'] |     1     |
|  2   | ['bbb']        |     0     |
|  3   | ['ccc', 'ddd'] |     1     |

The way I currently check this is to loop over other_ids_that_clicked for every row in user_df.

def otheridInList(row):
    # Return 1 if any clicked ID appears in this row's other_ids, else 0.
    for other_id in other_ids_that_clicked:
        if other_id in row['other_ids']:
            return 1
    return 0

This takes forever, so I'm looking for suggestions on the best way to do it.

Thanks!

Answer 1

You can actually speed this up quite a bit. Pull the column out, convert it into its own DataFrame, and then use df.isin to do the check -

l = ['abc', 'efg', 'ccc']
df['clicked'] = pd.DataFrame(df.other_ids.tolist()).isin(l).any(axis=1).astype(int)

   id   other_ids  clicked
0   1  [abc, efg]        1
1   2       [bbb]        0
2   3  [ccc, ddd]        1
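
To reproduce the result above end to end, here is a minimal sketch that assumes the three-row sample from the question. The name `expanded` and the `index=df.index` argument are additions for illustration; passing the index keeps the assignment aligned if the frame does not have a default 0..n-1 index.

```python
import pandas as pd

# Sample data copied from the question; the real frame has ~500,000 rows.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "other_ids": [["abc", "efg"], ["bbb"], ["ccc", "ddd"]],
})
l = ["abc", "efg", "ccc"]

# Expand the list column into a wide frame (shorter lists are padded with None),
# test every cell for membership in l, then collapse each row with any().
expanded = pd.DataFrame(df["other_ids"].tolist(), index=df.index)
df["clicked"] = expanded.isin(l).any(axis=1).astype(int)

print(df["clicked"].tolist())  # [1, 0, 1]
```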

Details

First, convert other_ids into a list of lists -

i = df.other_ids.tolist()

i
[['abc', 'efg'], ['bbb'], ['ccc', 'ddd']]

Now, load it into a new DataFrame -

j = pd.DataFrame(i)

j
     0     1
0  abc   efg
1  bbb  None
2  ccc   ddd

Perform the check using isin -

k = j.isin(l)

k
       0      1
0   True   True
1  False  False
2   True  False

clicked can then be computed by using df.any to check whether True is present anywhere in each row. The result is converted to an integer.

k.any(axis=1).astype(int)

0    1
1    0
2    1
dtype: int64
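
If the lists in other_ids vary a lot in length, the intermediate wide DataFrame can become large. A possible alternative, not part of the answer above, is to turn the clicked IDs into a set and test each row against it with `set.isdisjoint`. This is a sketch under the assumption that the data looks like the sample in the question; `clicked_set` is a name introduced here for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "other_ids": [["abc", "efg"], ["bbb"], ["ccc", "ddd"]],
})
other_ids_that_clicked = ["abc", "efg", "ccc"]

# Build the set once so each membership test is a hash lookup.
clicked_set = set(other_ids_that_clicked)

# isdisjoint() is True when a row shares no ID with the set, so negate it.
df["clicked"] = [
    int(not clicked_set.isdisjoint(ids)) for ids in df["other_ids"]
]

print(df["clicked"].tolist())  # [1, 0, 1]
```

Each row then costs time proportional to the length of its own list, rather than to the 5,000 entries in other_ids_that_clicked.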

Answer 2

Use my_future<void> r = some_thread_pool.add_task([&x] {foo(x)});

import requests
import bs4
import webbrowser

def display(content):
    # to see this HTML in web browser
    with open('temp.html', 'wb') as f:
        f.write(content)
    # open the file only after the with-block has closed (and flushed) it
    webbrowser.open('temp.html')

with requests.session() as r:

    LOGIN = ""
    PASSWORD = ""

    login_url = "https://www.ouac.on.ca/apply/nonsecondary/intl/en_CA/user/login"
    profile_url="https://www.ouac.on.ca/apply/nonsecondary/intl/en_CA/profile/"

    # session need it only once and it will remember it
    r.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"
    })

    # load page with form - to get cookies and `csrf` from HTML
    response = r.get(login_url)

    #display(response.content)

    # get `csrf` from HTML
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    csrf = soup.find('input', {'name': 'csrf'}).attrs['value']

    print('csrf:', csrf)

    # cookies are not part of form so you don't use in form_data,
    # session will use cookies from previous request so you don't have to copy them
    form_data = {
        'login': LOGIN,
        'password': PASSWORD,
        'submitButton': "Log In",
        'csrf': csrf,
    }

    # send form data to server
    response = r.post(login_url, data=form_data)

    print('status_code:', response.status_code)
    print('history:', response.history)
    print('url:', response.url)

    #display(response.content)

    response = r.get(profile_url)

    display(response.content)

Python: Efficiently check whether values in a list are in another list
