Question

I want to find all possible combinations within an Iterable object.

My input is

Object1|DrDre|1.0
Object1|Plane and a Disaster|2.0
Object1|Tikk Takk Tikk|3.5
Object1|Tennis Dope|5.0
Object2|DrDre|11.0
Object2|Plane and a Disaster|14.0
Object2|Just My Luck|2.0
Object2|Tennis Dope|45.0

The expected output would look like this:

[(('DrDre', 'Plane and a Disaster'), (11.0, 14.0, 1.0, 2.0)),
(('DrDre', 'Tikk Takk Tikk'), (1.0, 3.5)),
(('DrDre', 'Tennis Dope'), (11.0, 45.0, 1.0, 5.0)),
(('Plane and a Disaster', 'Tikk Takk Tikk'), (2.0, 3.5)),
(('Plane and a Disaster', 'Tennis Dope'), (14.0, 45.0, 2.0, 5.0)),
(('Tikk Takk Tikk', 'Tennis Dope'), (3.5, 45.0)),
(('DrDre', 'Just My Luck'), (11.0, 2.0)),
(('Plane and a Disaster', 'Just My Luck'), (14.0, 2.0)),
(('Just My Luck', 'Tennis Dope'), (2.0, 45.0))]

Here is my current code, which ends up not giving me the correct combinations.

def iterate(iterable):
    r = []
    for v1_iterable in iterable:
        for v2 in v1_iterable:
            r.append(v2)

    return tuple(r)

def parseVector(line):
    '''
    Parse each line of the specified data file, assuming a "|" delimiter.
    Converts each rating to a float
    '''
    line = line.split("|")
    return line[0],(line[1],float(line[2]))

def FindPairs(object_id,items_with_usage):
    '''
    For each objects, find all item-item pairs combos. (i.e. items with the same user) 
    '''
    for item1,item2 in combinations(items_with_usage,2):
        return (item1[0],item2[0]),(item1[1],item2[1])


''' 
Obtain the sparse object-item matrix:
    user_id -> [(object_id_1, rating_1),
               [(object_id_2, rating_2),
                ...]
'''
object_item_pairs = lines.map(parseVector).groupByKey().map(
    lambda p: sampleInteractions(p[0],p[1],500)).cache()


'''
Get all item-item pair combos:
    (item1,item2) ->    [(item1_rating,item2_rating),
                         (item1_rating,item2_rating),
                         ...]
'''

pairwise_objects = object_item_pairs.filter(
    lambda p: len(p[1]) > 1).map(
    lambda p: findItemPairs(p[0],p[1])).groupByKey()



x = pairwise_objects.mapValues(iterate)
x.collect()

This only gives me back the first pair and nothing else.

[(('DrDre', 'Plane and a Disaster'), (11.0, 14.0, 1.0, 2.0))]

Have I misunderstood what the combinations() function does?

Thanks for your input.

Answer 1

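I think you can rework your FindPairs like this. The return statement inside your for loop exits the function on the first iteration, which is why you only ever get the first pair.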

from itertools import combinations

def FindPairs(object_id,items_with_usage):
    '''
    For each object, find all item-item pair combos. (i.e. items with the same user)
    '''
    t = []
    for item1,item2 in combinations(items_with_usage,2):
        t.append(((item1[0],item2[0]),(item1[1],item2[1])))
    return t

Now your function returns a list containing all the combination pairs.
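To see the difference outside Spark, here is a minimal plain-Python sketch. The names `first_pair_only` and `all_pairs` and the three-item sample list are made up for illustration; only `itertools.combinations` comes from the original code.

```python
from itertools import combinations

# Hypothetical sample: (item, rating) tuples for a single object.
items = [("DrDre", 1.0), ("Plane and a Disaster", 2.0), ("Tikk Takk Tikk", 3.5)]

def first_pair_only(items_with_usage):
    # Same shape as the question's function: `return` inside the loop
    # leaves the function during the first iteration.
    for item1, item2 in combinations(items_with_usage, 2):
        return (item1[0], item2[0]), (item1[1], item2[1])

def all_pairs(items_with_usage):
    # Collect every pair before returning.
    return [((item1[0], item2[0]), (item1[1], item2[1]))
            for item1, item2 in combinations(items_with_usage, 2)]

print(first_pair_only(items))  # a single (names, ratings) pair
print(len(all_pairs(items)))   # 3 items taken 2 at a time
```

So combinations() itself does produce every pair; the early return simply discards all but the first one.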

Then

pairwise_objects= pairwise_objects.filter(lambda p: len(p[1]) > 1)
pairwise_objects= pairwise_objects.map(lambda p: FindPairs(p[0],p[1]))

[[(('DrDre', 'Plane and a Disaster'), (1.0, 2.0)),
(('DrDre', 'Tikk Takk Tikk'), (1.0, 3.5)),
(('DrDre', 'Tennis Dope'), (1.0, 5.0)),
(('Plane and a Disaster', 'Tikk Takk Tikk'), (2.0, 3.5)),
(('Plane and a Disaster', 'Tennis Dope'), (2.0, 5.0)),
(('Tikk Takk Tikk', 'Tennis Dope'), (3.5, 5.0))], # end of the first line of the RDD
[(('DrDre', 'Plane and a Disaster'),(11.0, 14.0)),
(('DrDre', 'Just My Luck'), (11.0, 2.0)),
(('DrDre', 'Tennis Dope'), (11.0, 45.0)), 
(('Plane and a Disaster', 'Just My Luck'), (14.0, 2.0)),
(('Plane and a Disaster', 'Tennis Dope'), (14.0, 45.0)),
(('Just My Luck', 'Tennis Dope'), (2.0, 45.0))]]

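Use flatMap before grouping the RDD and applying your function, so that all the pairs end up in one flat collection instead of one list per object: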

pairwise_objects=pairwise_objects.flatMap(lambda p: p).groupByKey().mapValues(iterate)

Final output:

[(('DrDre', 'Tennis Dope'), (1.0, 5.0, 11.0, 45.0)),
(('DrDre', 'Plane and a Disaster'), (1.0, 2.0, 11.0, 14.0)), 
(('Plane and a Disaster', 'Tennis Dope'), (2.0, 5.0, 14.0, 45.0)), 
(('Plane and a Disaster', 'Just My Luck'), (14.0, 2.0)),
(('Plane and a Disaster', 'Tikk Takk Tikk'), (2.0, 3.5)),
(('DrDre', 'Tikk Takk Tikk'), (1.0, 3.5)),
(('Tikk Takk Tikk', 'Tennis Dope'), (3.5, 5.0)), 
(('DrDre', 'Just My Luck'), (11.0, 2.0)), 
(('Just My Luck', 'Tennis Dope'), (2.0, 45.0))]
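If you want to check the logic without a Spark cluster, the flatMap / groupByKey / mapValues(iterate) steps can be imitated in plain Python. This is only a rough simulation under that assumption, not Spark code: the helper names `parse_vector` and `find_pairs` are illustrative, and the dictionaries stand in for the RDD operations.

```python
from collections import defaultdict
from itertools import combinations

lines = [
    "Object1|DrDre|1.0",
    "Object1|Plane and a Disaster|2.0",
    "Object1|Tikk Takk Tikk|3.5",
    "Object1|Tennis Dope|5.0",
    "Object2|DrDre|11.0",
    "Object2|Plane and a Disaster|14.0",
    "Object2|Just My Luck|2.0",
    "Object2|Tennis Dope|45.0",
]

def parse_vector(line):
    object_id, item, rating = line.split("|")
    return object_id, (item, float(rating))

def find_pairs(items_with_usage):
    return [((a[0], b[0]), (a[1], b[1]))
            for a, b in combinations(items_with_usage, 2)]

# Stand-in for map(parseVector).groupByKey(): object -> [(item, rating), ...]
by_object = defaultdict(list)
for line in lines:
    object_id, item_rating = parse_vector(line)
    by_object[object_id].append(item_rating)

# Stand-in for filter(...).map(FindPairs).flatMap(lambda p: p):
# one flat list of ((item1, item2), (rating1, rating2)) records.
flat = [pair
        for items in by_object.values() if len(items) > 1
        for pair in find_pairs(items)]

# Stand-in for groupByKey().mapValues(iterate): concatenate ratings per key.
grouped = defaultdict(tuple)
for key, ratings in flat:
    grouped[key] += ratings

result = dict(grouped)
for key, ratings in result.items():
    print(key, ratings)
```

Note that the key is an ordered tuple, so ('A', 'B') and ('B', 'A') would be grouped separately. Here both objects list their shared items in the same relative order; with real data it may be worth sorting each object's items before calling combinations().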
