scala - How to pass one RDD into another RDD through .map


I have 2 RDDs, and I want to do a computation on the items of rdd2 for each item of rdd1. So I am passing rdd2 inside a user-defined function as shown below, but I am getting an error that rdd1 cannot be passed inside an RDD. How can I achieve this if I want to perform operations on 2 RDDs?

For example:

rdd1.map(line => function(line, rdd2))

Nesting RDDs is not supported by Spark, as the error says. You will have to work around it by redesigning your algorithm.

How to do that depends on your actual use case, on what happens inside function, and on its output.

Sometimes an rdd1.cartesian(rdd2), doing operations per tuple and then reducing by key, will work. Sometimes, if you have a (K, V) type, a join between both RDDs will work.
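
For illustration, a minimal sketch of the join pattern; I'm assuming rdd1 and rdd2 are the RDD[String]s from your question, and keyOf is a purely hypothetical keying function:

import org.apache.spark.rdd.RDD

// Hypothetical keying function: here each record is simply keyed by itself.
def keyOf(line: String): String = line

val kv1: RDD[(String, String)] = rdd1.map(line => (keyOf(line), line))
val kv2: RDD[(String, String)] = rdd2.map(line => (keyOf(line), line))

// Pairs up records from both RDDs that share the same key.
val joined: RDD[(String, (String, String))] = kv1.join(kv2)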

If rdd2 is small you can collect it on the driver, make it a broadcast variable and use that variable in function instead of rdd2.

@Edit:

For example's sake, let's assume your RDDs hold strings and that function counts how many times a given record from rdd1 occurs in rdd2:

import org.apache.spark.rdd.RDD

def function(line: String, rdd: RDD[String]): (String, Int) = {
  (line, rdd.filter(_ == line).count.toInt)   // count returns a Long, hence the toInt
}

This would return an RDD[(String, Int)].

Idea 1

You can try creating a Cartesian product using the RDD's cartesian method.

val cartesianProduct = rdd1.cartesian(rdd2)            // creates RDD[(String, String)]
  .map { case (r1, r2) => (r1, function2(r1, r2)) }    // creates RDD[(String, Int)]
  .reduceByKey((c1, c2) => c1 + c2)                    // final RDD[(String, Int)]

Here function2 takes r1 and r2 (which are strings) and returns 1 if they are equal and 0 if not. The final reduceByKey results in an RDD of tuples where the key is the record r1 and the value is the total count.
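
function2 isn't spelled out here, so as an assumption it could be as simple as:

// Assumed helper: returns 1 when the two records match, 0 otherwise.
def function2(r1: String, r2: String): Int =
  if (r1 == r2) 1 else 0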

Problem 1: this will not work if you have duplicate strings in rdd1, though. You'd have to think about that. If the rdd1 records have unique ids it would be perfect.
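
As a sketch of that (tagging with zipWithUniqueId is just one assumed way to get unique ids, not something required by this approach):

// Tag each rdd1 record with a unique id so duplicates are counted separately.
val rdd1WithId = rdd1.zipWithUniqueId()                          // RDD[(String, Long)]
val countsById = rdd1WithId.cartesian(rdd2)
  .map { case ((r1, id), r2) => ((id, r1), function2(r1, r2)) }  // key by (id, record)
  .reduceByKey(_ + _)                                            // RDD[((Long, String), Int)]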

Problem 2: this creates a lot of pairs (for 1 mln records in each RDD it will create on the order of a trillion pairs), which will be slow and will result in a lot of shuffling.

Idea 2

I didn't quite understand your comment regarding rdd2's size (lacs), so this might or might not work:

val rdd2Array = sc.broadcast(rdd2.collect())
// Note: function as defined above expects an RDD; see the array-based variant below.
val result = rdd1.map(line => function(line, rdd2Array))

Problem: this might blow up the memory. collect() is called on the driver, and all records from rdd2 are loaded into memory on the driver node.
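
To make the types line up, you'd need an array-based counterpart of function; functionLocal below is my own naming, just a sketch:

// Assumed counterpart of `function` that works on the collected, broadcast
// copy of rdd2 (a plain Array[String]) instead of an RDD.
def functionLocal(line: String, records: Array[String]): (String, Int) =
  (line, records.count(_ == line))

val result = rdd1.map(line => functionLocal(line, rdd2Array.value))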

Idea 3

Depending on your use case there are other ways to overcome this; for instance, a brute-force similarity search algorithm addresses a similar (pun not intended) use case. One of the alternative solutions there is locality sensitive hashing.

