Cassandra distinct counting

- August 15, 2015

I need to count a bunch of "things" in Cassandra. I need to increase ~100-200 counters every few seconds or so.

However, I need to count distinct "things".

In order not to count them twice, I am setting a key in a CF, which the program reads before it increases the counter, e.g. something like:

    result = cf[key];
    if (result == null) {
        set  cf[key][x] = 1;
        incr counter_cf[key][x];
    }

However, this read operation slows down the cluster a lot. I tried to decrease the number of reads by using several columns, e.g. something like:

    result = cf[key];
    if (result[key1] == null) {
        set  cf[key1][x] = 1;
        incr counter_cf[key1][x];
    }
    if (result[key2] == null) {
        set  cf[key2][x] = 1;
        incr counter_cf[key2][x];
    }
    // etc.

With this I reduced the reads from 200+ to 5-6, but it still slows down the cluster.

I do not need exact counting, but I cannot use bit-masks or bloom filters, because there will be 1M+ counters, and some of them may go above 4 000 000 000.

I am aware of HyperLogLog counting, but I do not see an easy way to use it with that many counters (1M+) either.

At the moment I am thinking of using Tokyo Cabinet as an external key/value store, but this solution, if it works, will not be as scalable as Cassandra.
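As a rough check of the sizes mentioned above, here is the back-of-the-envelope arithmetic for plain bit-masks. It assumes one bit per possible value, values up to about 4 000 000 000 and about 1M counters; the constant names are made up for illustration.

```python
# Assumed sizes, taken from the figures in the question.
MAX_VALUE = 4_000_000_000     # largest value a counter may need to track
NUM_COUNTERS = 1_000_000      # "1M+" counters

# A plain bit-mask needs one bit per possible value.
bytes_per_bitmap = MAX_VALUE // 8
total_bytes = bytes_per_bitmap * NUM_COUNTERS

print(bytes_per_bitmap / 1e6, "MB per counter")   # 500.0 MB per counter
print(total_bytes / 1e12, "TB in total")          # 500.0 TB in total
```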

Using Cassandra for distinct counting is not ideal when the number of distinct values is big. Any time you need to do a read before a write, you should ask yourself whether Cassandra is the right choice.

If the number of distinct items is smaller, you can just store them as column keys and do a count. A count is not free, since Cassandra still has to assemble the row to count the number of columns, but if the number of distinct values is in the order of thousands it's probably going to be OK. I assume you've already considered this option and it's not feasible for you; I just thought I'd mention it.

The way people typically do it is by having the HLLs or bloom filters in memory and then flushing them to Cassandra periodically, i.e. not doing the actual operations in Cassandra, just using it for persistence. It's a complex system, but there's no easy way of counting distinct values, especially if you have a massive number of counters.

Even if you switched to something else, for example to something where you can do bit operations on values, you would still need to guard against race conditions. I suggest that you simply bite the bullet and do all of your counting in memory. Shard the increment operations over your processing nodes by key and keep the whole counter state (both incremental and distinct) in memory on those nodes. Periodically flush the state to Cassandra and ack the increment operations when you do. When a node gets an increment operation for a key it does not have in memory, it loads that state from Cassandra (or creates a new state if there's nothing in the database). If a node crashes, its operations have not been acked and will be redelivered (you need a good message queue in front of the nodes to take care of this). Since you shard the increment operations, you can be sure that a counter state is only ever touched by one node.
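To make the in-memory HLL idea a bit more concrete, here is a minimal sketch in Python. It is a hand-rolled HyperLogLog, not any particular library's API: the `HyperLogLog` class, the `sketches` map and the `flush` function are hypothetical names, and the `write` callback only stands in for whatever Cassandra write you would actually use.

```python
import hashlib
import math
from collections import defaultdict


class HyperLogLog:
    """Tiny HyperLogLog: p index bits give m = 2**p one-byte registers."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = bytearray(self.m)

    def add(self, item):
        # 64-bit hash taken from the first 8 bytes of SHA-1.
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # constant valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# One sketch per counter key, kept purely in memory.
sketches = defaultdict(HyperLogLog)


def flush(sketches, write):
    """Persist every sketch; `write(key, blob)` is an assumed storage callback."""
    for key, hll in sketches.items():
        write(key, bytes(hll.registers))
```

Adding the same item twice leaves the registers unchanged, so no read-before-write is needed. With p=10 each sketch is 1 KB (roughly 1 GB for 1M counters) and the standard error is about 1.04/sqrt(1024), i.e. around 3%; a smaller p trades accuracy for memory.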
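The scheme above could be sketched roughly like this. Everything here is an assumption for illustration: `shard_for`, `CounterNode`, and the plain dict `store` that stands in for Cassandra are hypothetical, the message queue is reduced to message ids that `flush` returns for acking, and the distinct state is an exact set (you would swap in an HLL for big counters).

```python
import zlib


def shard_for(key, num_nodes):
    # Stable hash, so a given key is always routed to the same node.
    return zlib.crc32(key.encode("utf-8")) % num_nodes


class CounterNode:
    def __init__(self, store):
        self.store = store     # dict standing in for Cassandra (assumption)
        self.state = {}        # key -> {"total": int, "seen": set}
        self.dirty = set()     # keys changed since the last flush
        self.pending = []      # message ids not yet acked

    def handle(self, msg_id, key, item):
        if key not in self.state:
            # Load the saved state, or create a new one if nothing is stored.
            saved = self.store.get(key)
            if saved is None:
                self.state[key] = {"total": 0, "seen": set()}
            else:
                self.state[key] = {"total": saved["total"], "seen": set(saved["seen"])}
        st = self.state[key]
        st["total"] += 1       # incremental counter
        st["seen"].add(item)   # distinct counter
        self.dirty.add(key)
        self.pending.append(msg_id)

    def flush(self):
        # Persist changed keys, then return the message ids that are safe to ack.
        for key in self.dirty:
            st = self.state[key]
            self.store[key] = {"total": st["total"], "seen": frozenset(st["seen"])}
        self.dirty.clear()
        acked, self.pending = self.pending, []
        return acked
```

If a node dies before `flush`, its in-memory changes are lost together with the unacked messages, so a replacement node that reloads from the store and receives the redelivered messages ends up with the same state.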
