python - Is this an efficient / appropriate use of a bloom filter? -
i have api data comes in, lot of redundant (can determined id). have bloom filter built few million entries start.
i using this library handle implementation. reading online:
bloom filter compact data structure probabilistic representation of set of variables ensure whether elements in set present or definitely not present in set. source
if have pseudocode this
newdata #some dataset row in newdata: #filter.add() returns true if in set, want not in set if not filter.add(row.id): #do stuff fresh data
this function called every time set of data comes in, 200 new entries/sec.
is efficient way use bloom filter?
a bloom filter can, given element, give 2 answers. can "i sure have never seen element before" or "i believe have seen element before". in latter case, may bloom filter incorrect (that is, it's saying "i think i've seen before" when, in fact, hasn't).
that is, question answers isn't "i have never seen before" "i think have seen before" (there's no way element has seen answer "never seen", there may cases answers "probably seen" elements hasn't seen).
that means answer can rely on being 100% true "never seen", you're interested in other case, never skip processing new element , isn't guarantee bloom filter gives you.
if processing expensive, use bloom filter gate more expensive cache lookup.
pseudocode:
if not bloom.seen(element): # have never seen bloom.add(element) value = process(element) more_expensive_cache.set(element, value) return value else: value, seen = more_expensive_cache.lookup(element) if not seen: value = process(element) more_expensive_cache.set(element, value) return value
Comments
Post a Comment