How to do de-duplication on records from AWS Kinesis Firehose to Redshift?
I read the official AWS Kinesis Firehose documentation, and it doesn't mention how to handle duplicated events. Does anyone have experience with this? I googled and found suggestions to use ElastiCache for filtering; does that mean I need to use AWS Lambda to encapsulate such filtering logic? Is there a simple way for Firehose to ingest data into Redshift while providing "exactly once" semantics? Thanks a lot!
You can have duplication on both sides of a Kinesis stream: producers might put the same event into the stream twice, and consumers might read the same event twice.
On the producer side, duplication can happen if you try to put an event into the Kinesis stream, are for some reason not sure whether it was actually written, and decide to put it again. On the consumer side, it can happen if a worker fetches a batch of events, starts processing them, and crashes before it manages to checkpoint its location; the next worker then picks up the same batch of events from the Kinesis stream, based on the last checkpointed sequence-id.
Before you start solving this problem, you should evaluate how often such duplication actually occurs and what its business impact is; not every system is handling financial transactions that can't tolerate duplication. Nevertheless, if you decide that you do need de-duplication, the common way to solve it is to attach an event-id to every event and track whether you have already processed that event-id.
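For the event-id approach to work, the producer has to attach the id once, at event-creation time, and reuse the exact same record on any retry. A minimal sketch (the record shape and field names here are illustrative assumptions, not a Firehose-defined schema):

```python
# Sketch: stamp each record with an event-id at creation time, so a
# retried put resends the same id and downstream consumers can
# de-duplicate. The payload layout is an assumption for illustration.
import json
import uuid

def make_record(payload: dict) -> bytes:
    """Attach a unique event_id once; reuse this exact record on retries."""
    record = {"event_id": str(uuid.uuid4()), **payload}
    return (json.dumps(record) + "\n").encode("utf-8")

record = make_record({"user": "alice", "action": "login"})
# On a failed or uncertain put, resend *record* unchanged rather than
# rebuilding it; rebuilding would mint a new event_id and defeat dedup.
decoded = json.loads(record)
assert "event_id" in decoded and decoded["user"] == "alice"
```

The key design point is that the id is generated exactly once per logical event, not once per put attempt.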
ElastiCache for Redis is a good place to track event-ids. Every time you pick up an event for processing, check whether its id is already present in Redis: if you find it, skip the event; if you don't, add it (with a TTL based on the maximum time window in which such duplication is possible).
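The check-and-record step can be done atomically with Redis's SET NX EX, which redis-py exposes as `set(key, value, nx=True, ex=ttl)`. A sketch, with an in-memory stand-in for the Redis client so the example is self-contained:

```python
# Sketch: de-duplication by event-id with TTL'd keys. In production the
# store would be an ElastiCache Redis client (redis-py's set() supports
# nx= and ex= with the same semantics); InMemoryStore is a stand-in so
# this example runs without a Redis server.
import time

class InMemoryStore:
    """Minimal stand-in for redis.Redis, supporting SET key value NX EX."""
    def __init__(self):
        self._expiry = {}
    def set(self, key, value, nx=False, ex=None):
        now = time.time()
        expiry = self._expiry.get(key)
        if nx and expiry is not None and expiry > now:
            return None  # key already exists and is live -> not set
        self._expiry[key] = now + (ex if ex is not None else float("inf"))
        return True

def seen_before(store, event_id, ttl_seconds=3600):
    """Return True if event_id was already processed within the TTL window.

    SET NX is atomic in Redis, so two concurrent consumers cannot both
    'win' for the same event_id. The TTL bounds memory to the time
    window in which duplicates can plausibly arrive.
    """
    return store.set(f"dedup:{event_id}", 1, nx=True, ex=ttl_seconds) is None

store = InMemoryStore()
assert seen_before(store, "evt-42") is False  # first delivery: process it
assert seen_before(store, "evt-42") is True   # replay: skip it
```

Sizing the TTL is a judgment call: it should comfortably exceed the longest realistic retry/re-read window, but not so long that the key set grows unbounded.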
If you choose to use Kinesis Firehose (instead of Kinesis Streams), you no longer have control over the consumer application and can't implement this process there. Therefore, you can either run the de-duplication logic on the producer side, switch to Kinesis Streams and run your own code in Lambda or the KCL, or settle for de-duplication functions in Redshift (see below).
If you are not too sensitive to duplication, you can use functions in Redshift such as COUNT(DISTINCT ...), or LAST_VALUE in a window function, to de-duplicate at query time.