Question

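I am working with a fairly large dataframe, histdf (20M rows, 3 columns). The fields are Visitor_ID, content and time. The dataframe will be used for a URL recommender system, where Visitor_ID is the unique visitor identity, content is the URL visited, and time is the timestamp.

With this structure, each unique visitor has multiple URLs, but some visitors should be discarded because they do not yield significant information (i.e. they visited too few URLs).

So I created a new variable called user_visits, holding the number of rows for each unique value in histdf.Visitor_ID, and then filtered it for counts greater than 10: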

user_visits = histdf.Visitor_ID.value_counts()
mask_user = user_visits > 10

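mask_user is a pandas Series. The index is Visitor_ID and the values are booleans (True if the original dataframe has more than 10 rows with that Visitor_ID).

Now I want to add a new column, heavyuser, to histdf, holding the True or False value from mask_user.

What I have done so far is set the values in the dataframe with the following code: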

for index in histdf.index:
    temp = histdf.loc[index, 'Visitor_ID']
    temp2 = mask_user[temp]
    histdf.set_value(index, 'heavyuser', temp2)

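This kind of works. It is much faster than using iterrows or other kinds of row-by-row iteration. However, it is still slow, taking more than 1 hour to process.

I wonder if there is another option with better performance. In summary: count the rows for each individual Visitor_ID and, if there are fewer than the threshold (10 in this case), put False in a new dataframe column, or eliminate those rows altogether.

Any hints would be appreciated. Thanks.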

Answer 1

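Your instinct to first extract the visitor IDs of heavy users is good, but once you have them you do not need to iterate over the dataframe.

Here is how you can do it: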

import pandas as pd

histdf = pd.DataFrame({'Visitor_ID': [1, 1, 2, 2, 2, 3],
                       'content': ["url" + str(x) for x in range(6)],
                       'time': ["timestamp n° " + str(x) for x in range(6)]})

# At first we consider that no user is a heavy user
histdf['heavy user'] = False

# Then we extract the IDs of heavy users
# (the threshold is 1 for this toy data; the question uses 10)
user_visits = histdf.Visitor_ID.value_counts()
id_heavy_users = user_visits[user_visits > 1].index

# Finally we consider those users as heavy users in the corresponding column
histdf.loc[histdf['Visitor_ID'].isin(id_heavy_users), 'heavy user'] = True

Output:

   Visitor_ID  content            time  heavy user
0           1     url0  timestamp n° 0        True
1           1     url1  timestamp n° 1        True
2           2     url2  timestamp n° 2        True
3           2     url3  timestamp n° 3        True
4           2     url4  timestamp n° 4        True
5           3     url5  timestamp n° 5       False
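
As a variation on the approach above, here is a minimal sketch that reuses the boolean Series mask_user from the question instead of building a list of IDs. It assumes the same toy data; the name threshold and the column heavyuser_alt are introduced here only for illustration. Series.map looks up each row's Visitor_ID in the index of mask_user, and the groupby/transform line is an equivalent way to get the per-row visit count, so neither needs a Python-level loop.

```python
import pandas as pd

histdf = pd.DataFrame({
    "Visitor_ID": [1, 1, 2, 2, 2, 3],
    "content": ["url" + str(x) for x in range(6)],
    "time": ["timestamp n° " + str(x) for x in range(6)],
})

threshold = 1  # toy value for this small example; the question uses 10

# Boolean Series indexed by Visitor_ID, built the same way as in the question
mask_user = histdf["Visitor_ID"].value_counts() > threshold

# Look up every row's Visitor_ID in mask_user in one vectorized call
histdf["heavyuser"] = histdf["Visitor_ID"].map(mask_user)

# Equivalent alternative: broadcast each group's size back onto its rows
histdf["heavyuser_alt"] = (
    histdf.groupby("Visitor_ID")["Visitor_ID"].transform("size") > threshold
)
```

Which variant is fastest on 20M rows probably depends on the number of distinct visitors, so it is worth timing them on a sample of the real data.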

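If you only want to keep the heavy users, as mentioned at the end of your question, you do not need to create an extra column. You can do it like this: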

histdf = pd.DataFrame({'Visitor_ID': [1, 1, 2, 2, 2, 3],
                       'content': ["url" + str(x) for x in range(6)],
                       'time': ["timestamp n° " + str(x) for x in range(6)]})

user_visits = histdf.Visitor_ID.value_counts()
id_heavy_users = user_visits[user_visits > 1].index

heavy_users = histdf[histdf['Visitor_ID'].isin(id_heavy_users)]

print(heavy_users)

Output:

   Visitor_ID  content            time
0           1     url0  timestamp n° 0
1           1     url1  timestamp n° 1
2           2     url2  timestamp n° 2
3           2     url3  timestamp n° 3
4           2     url4  timestamp n° 4
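
The filtering step can also be wrapped in a small reusable function. The sketch below is only one possible way to package it: keep_heavy_users and its parameters are hypothetical names, not part of pandas, and it uses groupby/transform rather than value_counts/isin to get the per-row visit count.

```python
import pandas as pd


def keep_heavy_users(df, threshold, id_col="Visitor_ID"):
    """Return only the rows whose visitor appears more than `threshold` times.

    Hypothetical helper for illustration; `id_col` names the visitor column.
    """
    # Number of rows per visitor, aligned with the original row index
    visits_per_row = df.groupby(id_col)[id_col].transform("size")
    return df[visits_per_row > threshold]


histdf = pd.DataFrame({
    "Visitor_ID": [1, 1, 2, 2, 2, 3],
    "content": ["url" + str(x) for x in range(6)],
    "time": ["timestamp n° " + str(x) for x in range(6)],
})

heavy_users = keep_heavy_users(histdf, threshold=1)      # drops visitor 3
heavier_users = keep_heavy_users(histdf, threshold=2)    # keeps visitor 2 only
```

For the data in the question, the call would be keep_heavy_users(histdf, threshold=10), which matches the user_visits > 10 mask.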
