hadoop - Spark execution of a TB file in memory


Let's assume I have a 1 TB data file, and each node in my ten-node cluster has 3 GB of memory.

I want to process the file using Spark. How does 1 terabyte fit in memory?

Will it throw an out-of-memory exception?

How does this work?

As Thilo mentioned, Spark does not need to load everything in memory to be able to process it. This is because Spark partitions the data into smaller blocks and operates on these separately. The number of partitions, and their sizes, depend on several things:

  • Where the file is stored. Commonly used options for Spark store the file in a bunch of blocks rather than as one single big piece of data. If it's stored in an HDFS instance, for example, by default these blocks are 64 MB and they are distributed (and replicated) across your nodes. For files stored in S3 you get blocks of 32 MB. This is defined by the Hadoop filesystem used to read the files, and it applies to files from other filesystems for which Spark uses Hadoop to read them.
  • Any repartitioning you do. You can call repartition(n) or coalesce(n) on an RDD or DataFrame to change the number of partitions and thereby their size. coalesce is preferred for reducing the number of partitions without shuffling data across the nodes, whereas repartition allows you to specify a new way to split up the data, i.e. to have better control over which parts of the data are processed on the same node.
  • Some more advanced transformations you may apply. For instance, the configuration setting spark.sql.shuffle.partitions, by default set to 200, determines how many partitions a DataFrame resulting from joins and aggregations will have (a short sketch follows this list).
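
To make this concrete, here is a minimal sketch of inspecting and changing partition counts; the path and the partition numbers are made up for illustration:

    val rdd = sc.textFile("hdfs:///data/big-file") // one partition per HDFS block
    println(rdd.partitions.length)                 // roughly 16000 for 1 TB in 64 MB blocks

    val fewer = rdd.coalesce(1000)     // fewer partitions, avoids a full shuffle
    val evenly = rdd.repartition(4000) // full shuffle, but evenly sized partitions

The shuffle partition count for DataFrames is a configuration setting, so it would typically be passed on the command line, e.g. spark-submit --conf spark.sql.shuffle.partitions=400 ...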

Caching

The previous bit relates to the standard processing of data in Spark, but I feel you may be left with the wrong ideas because of Spark being advertised as 'in-memory', so I wanted to address that a bit. By default there is nothing in Spark that is more 'in-memory' than any other data processing tool: a simple example like sc.textFile(foo).map(mapFunc).saveAsTextFile(bar) reads a file (block by block, distributed over your nodes), does the mapping in memory (like any computer program) and saves it to storage again. Spark's use of memory becomes more interesting in the following (in Scala as I'm more familiar with it, but the concepts and method names are exactly the same in Python):

    val rdd = sc.textFile(foo)
    // do some preprocessing, such as parsing lines
    val preprocessed = rdd.map(preprocessFunc)
    // tell Spark to cache the preprocessed data (by default in memory)
    preprocessed.cache()
    // perform some mapping and save the output
    preprocessed.map(mapFunc1).saveAsTextFile(outFile1)
    // perform a different mapping and save somewhere else
    preprocessed.map(mapFunc2).saveAsTextFile(outFile2)

The idea here is to use cache() so that the preprocessing doesn't have to be done twice (possibly); by default Spark doesn't save any intermediate results, but calculates the full chain for each separate action, where the 'actions' here are the saveAsTextFile calls.
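
One consequence worth spelling out (my own illustration, not part of the original code): without the cache() call, each of the two saveAsTextFile actions would re-read foo and re-run preprocessFunc from scratch. And once the second action has finished, the cached blocks can be released explicitly:

    // after the last action that uses it, free the cached partitions
    preprocessed.unpersist()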

I said 'possibly' because the ability to actually cache the data is limited by the memory in your nodes. Spark reserves a certain amount of memory for cache storage, separate from the work memory (see http://spark.apache.org/docs/latest/configuration.html#memory-management for how the sizes of these parts of memory are managed), and it can cache only as much as that amount can hold.
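
As a sketch, assuming the unified memory manager of Spark 1.6+ (property names differ in older versions), these are the settings that control that split; the values shown are just the defaults:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.memory.fraction", "0.6")        // share of the heap for execution + storage
      .set("spark.memory.storageFraction", "0.5") // part of that share protected for caching
    val sc = new SparkContext(conf)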

Depending on your partitioning it may be even less though. Let's say you have 2 GB of storage memory on each of your 3 nodes, and the data in preprocessed is 6 GB. If this data has 3 partitions, it will fit exactly and all input data to mapFunc2 will be loaded from memory. But if you have, say, 4 partitions of 1.5 GB each, only 1 partition can be cached on each node; the 4th partition won't fit in the 0.5 GB still left on each machine, so this partition has to be recalculated for the second mapping and only 3/4 of your preprocessed data will be read from memory.

So in this sense it is better to have many small partitions, to make caching as efficient as possible, but this may have other downsides: more overhead, huge delays if you happen to use Mesos fine-grained mode, and tons of small output files (if you don't coalesce before saving) as Spark saves each partition as a separate file.
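
For example, a hedged one-liner (the partition count is chosen arbitrarily) to avoid thousands of tiny output files:

    // merge into 100 partitions right before writing, producing 100 output files
    preprocessed.map(mapFunc1).coalesce(100).saveAsTextFile(outFile1)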

As Durga mentioned, there is also the possibility to have data that does not fit in memory spill to disk; you can follow his link for that :)
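
For completeness, a minimal sketch of that option: use persist() with a storage level that allows spilling, instead of the memory-only default that cache() uses:

    import org.apache.spark.storage.StorageLevel

    // partitions that don't fit in storage memory are written to local disk
    // and read back, rather than being recomputed from the lineage
    preprocessed.persist(StorageLevel.MEMORY_AND_DISK)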

