python - Is this an efficient / appropriate use of a bloom filter? -


i have api data comes in, lot of redundant (can determined id). have bloom filter built few million entries start.

i using this library handle implementation. reading online:

bloom filter compact data structure probabilistic representation of set of variables ensure whether elements in set present or definitely not present in set. source

if have pseudocode this

newdata #some dataset row in newdata:     #filter.add() returns true if in set, want not in set     if not filter.add(row.id):         #do stuff fresh data     

this function called every time set of data comes in, 200 new entries/sec.

is efficient way use bloom filter?

a bloom filter can, given element, give 2 answers. can "i sure have never seen element before" or "i believe have seen element before". in latter case, may bloom filter incorrect (that is, it's saying "i think i've seen before" when, in fact, hasn't).

that is, question answers isn't "i have never seen before" "i think have seen before" (there's no way element has seen answer "never seen", there may cases answers "probably seen" elements hasn't seen).

that means answer can rely on being 100% true "never seen", you're interested in other case, never skip processing new element , isn't guarantee bloom filter gives you.

if processing expensive, use bloom filter gate more expensive cache lookup.

pseudocode:

if not bloom.seen(element):   # have never seen   bloom.add(element)   value = process(element)   more_expensive_cache.set(element, value)   return value else:   value, seen = more_expensive_cache.lookup(element)   if not seen:     value = process(element)     more_expensive_cache.set(element, value)   return value 

Comments

Popular posts from this blog

c - How to retrieve a variable from the Apache configuration inside the module? -

c# - Constructor arguments cannot be passed for interface mocks -

python - malformed header from script index.py Bad header -