I have a DataFrame built from a NumPy array:
import numpy
import pandas as pd
import scipy.sparse

matrix = scipy.sparse.rand(5, 3, density=0.2, format='lil')
array = numpy.array(matrix.toarray())
users = {5: 0, 10: 1, 15: 2, 20: 3, 25: 4}
games = {1: 0, 4: 1, 6: 2}
dataframe = pd.DataFrame(data=array, index=list(users.keys()), columns=list(games.keys()))
What I need now is to get a list from that dataframe, where each cell from the matrix is represented as a tuple of the following format:
userID, gameID, value
userID, gameID, value
userID, gameID, value
...
to use it with http://surprise.readthedocs.io/en/stable/getting_started.html#load-custom
Is there an efficient way of doing that?
Answer 0 (score: 0)
Use stack to reshape first, then append the values as a third level of the MultiIndex and convert the index to a list of tuples:
L = dataframe.stack().to_frame('a').set_index('a', append=True).index.tolist()
Alternatively, use reset_index with a list comprehension:
L = [tuple(x) for x in dataframe.stack().reset_index().values]
print (L)
[(5, 1, 0.8797632578062221), (5, 4, 0.0),
(5, 6, 0.8996885724198237), (10, 1, 0.0), (10, 4, 0.0),
(10, 6, 0.0), (15, 1, 0.0), (15, 4, 0.07758205674008478),
(15, 6, 0.0), (20, 1, 0.0), (20, 4, 0.0), (20, 6, 0.0),
(25, 1, 0.0), (25, 4, 0.0), (25, 6, 0.0)]
If you want only the non-zero values, filter them with query:
L = [tuple(x) for x in dataframe.stack().reset_index(name='a').query('a != 0').values]
print (L)
[(5.0, 1.0, 0.87976325780622211),
(5.0, 6.0, 0.8996885724198237),
(15.0, 4.0, 0.077582056740084782)]
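Note that `.values` converts the mixed frame to a single float array, which is why the ids in the filtered output above come back as 5.0 and 1.0. A minimal sketch that keeps the ids as integers, assuming a small hand-written array in place of the random sparse matrix (the column names `userID`, `gameID` and `value` are taken from the tuple format in the question; `long_df` and `nonzero` are illustrative names):

```python
import numpy as np
import pandas as pd

# Hand-written stand-in for the random sparse matrix in the question.
array = np.array([[0.9, 0.0, 0.5],
                  [0.0, 0.0, 0.0],
                  [0.0, 0.25, 0.0]])
dataframe = pd.DataFrame(array, index=[5, 10, 15], columns=[1, 4, 6])

# Long format: one row per (userID, gameID) cell.
long_df = dataframe.stack().reset_index()
long_df.columns = ["userID", "gameID", "value"]

# Keep only the non-zero cells.
nonzero = long_df[long_df["value"] != 0]

# itertuples reads each column separately, so the integer ids
# are not upcast to float the way they are with .values.
L = list(nonzero.itertuples(index=False, name=None))
print(L)
```

A three-column frame in this order (user, item, rating) is also the shape that Surprise's `Dataset.load_from_df` takes, together with a `Reader` describing the rating scale, so the intermediate `nonzero` frame may be usable directly without building the list.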
Answer 1 (score: 0)
l = []
for row in dataframe.itertuples():
    for col in dataframe.columns:
        l.append((row.Index, col, dataframe.loc[row.Index, col]))
This iterates over each row and then each column, appending the resulting tuple to a list. In my test on the 5x3 frame this was faster than the previous answer, but the result depends on the number of rows and columns you have.
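The scalar `.loc` lookup inside the inner loop is likely the costly part, since `itertuples` already yields each row's values. A sketch of the same loop without the lookup, assuming a small hand-written frame in place of the random one from the question:

```python
import numpy as np
import pandas as pd

# Hand-written stand-in for the question's frame.
dataframe = pd.DataFrame(np.array([[0.9, 0.0], [0.0, 0.25]]),
                         index=[5, 10], columns=[1, 4])

l = []
# name=None yields plain tuples: (index, value_col0, value_col1, ...)
for row in dataframe.itertuples(name=None):
    user, values = row[0], row[1:]
    # Pair each column label with the value already carried by the row.
    for col, value in zip(dataframe.columns, values):
        l.append((user, col, value))
print(l)
```

The tuples come out in the same row-by-row order as in the loop above; only the per-cell lookup is dropped.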
%%timeit
l = []
for row in dataframe.itertuples():
    for col in dataframe.columns:
        l.append((row.Index, col, dataframe.loc[row.Index, col]))
594 µs ± 1.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
L = dataframe.stack().to_frame('a').set_index('a', append=True).index.tolist()
L = [tuple(x) for x in dataframe.stack().reset_index().values]
2.25 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As requested, here are the timings for 1000 rows:
matrix = scipy.sparse.rand(1000, 3, density=0.2, format='lil')
array = numpy.array(matrix.toarray())
index = list(range(1000))
dataframe = pd.DataFrame(data=array, index=index)
%%timeit
l = []
for row in dataframe.itertuples():
    for col in dataframe.columns:
        l.append((row.Index, col, dataframe.loc[row.Index, col]))
17 ms ± 38 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
L = dataframe.stack().to_frame('a').set_index('a', append=True).index.tolist()
L = [tuple(x) for x in dataframe.stack().reset_index().values]
5.08 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
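`%%timeit` only works inside IPython or Jupyter. To repeat the comparison in a plain script, the standard `timeit` module can be used. The sketch below does that and also adds a third, NumPy-only variant based on `repeat` and `tile`, which is not from either answer. The helper names `triples_stack` and `triples_numpy` are made up for illustration, and the random frame is generated with NumPy alone rather than `scipy.sparse.rand`. No timings are quoted here because they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Roughly 20% non-zero cells, similar in spirit to density=0.2.
rng = np.random.default_rng(0)
array = rng.random((1000, 3))
array[array < 0.8] = 0.0
dataframe = pd.DataFrame(array, index=range(1000))


def triples_stack(df):
    # The stack/MultiIndex approach from the first answer.
    return df.stack().to_frame('a').set_index('a', append=True).index.tolist()


def triples_numpy(df):
    # Build the three columns of the result directly with NumPy.
    n_rows, n_cols = df.shape
    users = np.repeat(df.index.to_numpy(), n_cols)   # 0,0,0,1,1,1,...
    games = np.tile(df.columns.to_numpy(), n_rows)   # 0,1,2,0,1,2,...
    values = df.to_numpy().ravel()                   # row-major cell values
    return list(zip(users.tolist(), games.tolist(), values.tolist()))


# Both variants should produce the same list of tuples.
assert triples_numpy(dataframe) == triples_stack(dataframe)

for fn in (triples_stack, triples_numpy):
    seconds = timeit.timeit(lambda: fn(dataframe), number=20)
    print(f"{fn.__name__}: {seconds / 20 * 1e3:.2f} ms per call")
```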