Kafka's distributed, high-throughput, high-availability design and its various message consumption modes can ensure safe message consumption in a multi-node cluster environment, that is, no message is missed or consumed more than once. In particular, the exactly-once consumption strategy ensures that each message is consumed only once. In other words, Kafka can guarantee the uniqueness of message consumption in a distributed computing environment.
However, guaranteeing the uniqueness of message reading is not enough. If message processing also runs in a distributed computing environment, it still faces data integrity problems. For example, if each message is an instruction to update the amount in a bank account, processing multiple messages for the same account in parallel can corrupt the balance. This is the focus of this article.
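To make the integrity risk concrete, here is a minimal sketch of a lost update, simulated deterministically in plain Scala. The `LostUpdateDemo` object, the starting balance and the two deltas are all hypothetical, and the interleaving is written out by hand rather than produced by real threads.

```scala
object LostUpdateDemo {
  // Hypothetical starting balance and two update instructions for the same account.
  val initialBalance = 100
  val deposit = 50
  val withdrawal = -30

  // Serialized processing: each update reads the balance left by the previous one.
  def serial: Int = {
    var balance = initialBalance
    balance = balance + deposit      // 150
    balance = balance + withdrawal   // 120
    balance
  }

  // Interleaved processing: both workers read the balance before either writes back.
  def interleaved: Int = {
    var balance = initialBalance
    val readByA = balance            // worker A reads 100
    val readByB = balance            // worker B reads 100
    balance = readByA + deposit      // A writes 150
    balance = readByB + withdrawal   // B writes 70, overwriting A's deposit
    balance
  }
}
```

`LostUpdateDemo.serial` yields 120, while `LostUpdateDemo.interleaved` yields 70: the deposit is silently lost even though each message was consumed exactly once.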
Let's look at the following code:
    kfkSource
      .async
      .mapAsync(parallelism = 8) { msg => updateAccount(msg.value()) }
      .toMat(Sink.fold(0) { (accu, e) => if (e) accu + 1 else accu })(Keep.right)
      .run()
In the example above, messages read one by one from the Kafka queue may be processed in parallel (up to 8 in flight at once). If several of those messages carry the same account number, data integrity problems can occur. What if we write the following instead:
    kfkSource
      .async
      .mapAsync(parallelism = 1) { msg => updateAccount(msg.value()) }
      .toMat(Sink.fold(0) { (accu, e) => if (e) accu + 1 else accu })(Keep.right)
      .run()
With parallelism = 1, messages are processed one at a time, at some cost in efficiency. Does that solve the problem? The answer is: apparently yes, on a single server. But our goal is to process data in a multi-node cluster environment, which is the reason for using Kafka in the first place. In a distributed environment, the code above runs on several nodes at the same time, which causes the same problems as multithreaded parallel processing.
Clearly, the core of the problem is repeated message content: in the example above, the same bank account number appearing in multiple messages. If all messages for the same account are processed in the same thread, the problem is avoided. An Akka actor executes the messages in its mailbox one at a time, so the problem is solved if we can ensure that messages with the same content are always sent to the same actor. Cluster sharding lets us route messages to actors in this targeted way: in an Akka cluster, each sharded entity is a named actor addressed by its entityId.

Another question is what to do when a very large number of unique account or item numbers is involved, say more than one million. As noted above, all that matters is that messages with the same number always reach the same entity; messages with different numbers may share an entity. So for a large set of numbers we can reduce the number of distinct entity IDs with a hash algorithm, as follows:
    def hashItemCode(code: String): String = {
      val arrCode = code.toCharArray
      // occur(i) counts the characters that fall into character class i
      val occur: Array[Int] = Array.fill(8)(0)
      arrCode.foreach {
        case x if (x >= '0' && x <= '2') => occur(0) = occur(0) + 1
        case x if (x >= '3' && x <= '5') => occur(1) = occur(1) + 1
        case x if (x >= '6' && x <= '8') => occur(2) = occur(2) + 1
        case x if (x == '9' || x == '-' || x == '_' || x == ':') => occur(3) = occur(3) + 1
        case x if ((x >= 'a' && x <= 'g') || (x >= 'A' && x <= 'G')) => occur(4) = occur(4) + 1
        case x if ((x >= 'h' && x <= 'n') || (x >= 'H' && x <= 'N')) => occur(5) = occur(5) + 1
        case x if ((x >= 'o' && x <= 't') || (x >= 'O' && x <= 'T')) => occur(6) = occur(6) + 1
        case x if ((x >= 'u' && x <= 'z') || (x >= 'U' && x <= 'Z')) => occur(7) = occur(7) + 1
        case _ => occur(7) = occur(7) + 1
      }
      occur.mkString
    }
hashItemCode returns a string made up of the number of times each character class occurs in the original code. This string is used as the entityId for sharding.
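A small worked example may help show what these entity IDs look like. The sketch below restates hashItemCode in condensed form so that it is self-contained (it is meant to behave the same as the version above); the `HashItemCodeDemo` object, the `bucketOf` helper and the sample codes are hypothetical.

```scala
object HashItemCodeDemo {
  // Map each character to one of the 8 character classes used by hashItemCode.
  private def bucketOf(c: Char): Int = c match {
    case x if x >= '0' && x <= '2' => 0
    case x if x >= '3' && x <= '5' => 1
    case x if x >= '6' && x <= '8' => 2
    case '9' | '-' | '_' | ':' => 3
    case x if (x >= 'a' && x <= 'g') || (x >= 'A' && x <= 'G') => 4
    case x if (x >= 'h' && x <= 'n') || (x >= 'H' && x <= 'N') => 5
    case x if (x >= 'o' && x <= 't') || (x >= 'O' && x <= 'T') => 6
    case _ => 7
  }

  // Count the characters per class and concatenate the 8 counts.
  def hashItemCode(code: String): String = {
    val occur = Array.fill(8)(0)
    code.foreach(c => occur(bucketOf(c)) += 1)
    occur.mkString
  }

  // Hypothetical item codes.
  val a: String = hashItemCode("AB-12") // "20012000"
  val b: String = hashItemCode("BA-21") // "20012000", same counts as "AB-12"
  val c: String = hashItemCode("XY-99") // "00030002"
}
```

Because only the per-class counts matter, different codes such as "AB-12" and "BA-21" share an entity. That is harmless for correctness: what matters is that the same code always maps to the same entityId, so all of its messages are handled by one actor.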
Each message read from Kafka is then sent to a sharded entity according to the hashItemCode result. The following is a practical example:
    def toStockWorker(jsonDoc: String) = {
      val bizDoc = fromJson[BizDoc](jsonDoc)
      val plu = bizDoc.pluCode
      val entityId = DocModels.hashItemCode(plu)
      log.step(s"CurStk-toStockWorker: sending CalcStock to ${entityId} with message: $jsonDoc")
      val entityRef = sharding.entityRefFor(StockCalculator.EntityKey, entityId)
      entityRef ! StockCalculator.CalcStock(jsonDoc)
    }
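The effect of routing by key can be tried out without an Akka cluster. The following is only a sketch of the idea, not Akka's API: JDK single-threaded executors stand in for sharded entities and their mailboxes, and `ShardedWorkersDemo`, `shardOf`, `send` and the account names are all hypothetical. Several producer threads send updates concurrently, yet every account is only ever touched by one worker thread.

```scala
import java.util.concurrent.{ExecutorService, Executors, TimeUnit}
import scala.collection.mutable

object ShardedWorkersDemo {
  val numShards = 4

  // One single-threaded executor per shard plays the role of an actor mailbox:
  // tasks submitted to it run one at a time, in submission order.
  private val workers: Array[ExecutorService] =
    Array.fill(numShards)(Executors.newSingleThreadExecutor())

  // Each shard owns its own balances; only that shard's thread updates them.
  private val balances: Array[mutable.Map[String, Long]] =
    Array.fill(numShards)(mutable.Map.empty[String, Long])

  // The same account always maps to the same shard.
  def shardOf(account: String): Int =
    java.lang.Math.floorMod(account.hashCode, numShards)

  // Route an update to the shard that owns the account.
  def send(account: String, delta: Long): Unit = {
    val shard = shardOf(account)
    workers(shard).execute(new Runnable {
      def run(): Unit = {
        val m = balances(shard)
        m.update(account, m.getOrElse(account, 0L) + delta)
      }
    })
  }

  // Stop accepting work, wait for queued updates, then read the final balances.
  def shutdownAndCollect(): Map[String, Long] = {
    workers.foreach(_.shutdown())
    workers.foreach(_.awaitTermination(10, TimeUnit.SECONDS))
    balances.iterator.flatMap(_.iterator).toMap
  }

  // Four producer threads each send 1000 increments to each of three accounts.
  // Call once only: the executors are shut down at the end.
  def run(): Map[String, Long] = {
    val accounts = Seq("acc-1", "acc-2", "acc-3")
    val producers = (1 to 4).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = for (_ <- 1 to 1000; a <- accounts) send(a, 1L)
      })
    }
    producers.foreach(_.start())
    producers.foreach(_.join())
    shutdownAndCollect()
  }
}
```

Every account ends at exactly 4000 because its read-modify-write steps are serialized on one thread. Cluster sharding gives the same per-entity guarantee across nodes, which local executors cannot do.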
Below is the source code of an exactly-once reader, provided as a reference:
    (1 to numReaders).toList.map { _ =>
      RestartSource
        .onFailuresWithBackoff(restartSource) { () => mergedSource }
        // .viaMat(KillSwitches.single)(Keep.right)
        .async.mapAsync(1) { msg =>             // only one message is uniqueness-checked at a time
          for {                                 // and then flows downstream
            newtxn <- curStk.isUniqStkTxns(msg.value())
            _ <- FastFuture.successful {
              log.step(s"ExactlyOnceReaderGroup-futStkTxnExists is ${!newtxn}: ${msg.value()}")
            }
          } yield (newtxn, msg)
        }
        .async.mapAsyncUnordered(8) { rmsg =>   // messages passed down
          for {                                 // can be processed in parallel
            cmt <- if (rmsg._1)
                stkTxns.stkTxnsWithRetry(rmsg._2.value(), rmsg._2.partition(), rmsg._2.offset())
                  .toFuture().map(_ => "Completed")
              else FastFuture.successful { "stktxn exists!" }
            pmsg <- FastFuture.successful {
              log.step(s"ExactlyOnceReaderGroup-stkTxnsWithRetry: committed transaction-$cmt")
              rmsg
            }
          } yield pmsg
        }
        .async.mapAsyncUnordered(8) { rmsg =>
          for {
            _ <- if (rmsg._1) FastFuture.successful { curStk.toStockWorker(rmsg._2.value()) }
              else FastFuture.successful(false)
            pmsg <- FastFuture.successful {
              log.step(s"ExactlyOnceReaderGroup-updateStk...")
              rmsg
            }
          } yield pmsg
        }
        .async.mapAsyncUnordered(8) { rmsg =>
          for {
            _ <- if (rmsg._1) FastFuture.successful { pcmTxns.toPcmAggWorker(rmsg._2.value()) }
              else FastFuture.successful(false)
            pmsg <- FastFuture.successful {
              log.step(s"ExactlyOnceReaderGroup-AccumulatePcm...")
            }
          } yield "Completed"
        }
        .toMat(Sink.seq)(Keep.left)
        .run()
    }
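The first stage of this pipeline decides whether a message is new before anything downstream acts on it. The source does not show how `isUniqStkTxns` is implemented, so the following is only a sketch of the general check-then-process idea, under the assumption that a message can be identified by its partition and offset. `DedupDemo`, `isNew` and `processOnce` are hypothetical names, and the in-memory set stands in for whatever durable store a real system would use.

```scala
import java.util.concurrent.ConcurrentHashMap

object DedupDemo {
  // Hypothetical in-memory record of processed messages, keyed by "partition-offset".
  // A real system would keep this in durable storage shared by all nodes.
  private val seen = ConcurrentHashMap.newKeySet[String]()

  // Returns true only the first time a given (partition, offset) pair is offered.
  def isNew(partition: Int, offset: Long): Boolean =
    seen.add(s"$partition-$offset")

  // Run the update only for messages not seen before; report whether it ran.
  def processOnce(partition: Int, offset: Long)(update: () => Unit): Boolean = {
    val fresh = isNew(partition, offset)
    if (fresh) update()
    fresh
  }
}
```

A redelivered message with the same partition and offset is then skipped, which mirrors the `if (rmsg._1) ... else ...` branches in the stages above.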