
How to Move a Banking Product to Real-Time

Building a real-time pre-approved offers system at an enterprise bank. Apache Flink, Tarantool, Kafka — and the lessons you won't find in the docs.

Wednesday, 2:47 AM. The Dashboard Is Red.

Spring 2024. We’ve passed functional and integration testing, we’re going to production — and that’s when the real load testing begins, except now it’s live traffic. Tarantool starts dropping requests at the twenty-minute mark under real load. Not immediately — first one timeout out of a thousand, then two, then ten. A beautiful exponential on the chart that in thirty minutes will turn the system into a pumpkin.

I’m sitting in front of my monitor on a late-night call with two developers, the third cup of coffee tonight growing cold on the desk. One thought keeps running through my head: the business demo is in six weeks. And we’ve just discovered that the foundation of our architecture is cracking.

But that comes later. First — why we decided to break something that was working.

Why Batch Is Yesterday

The bank’s pre-approved offers system had been running in batch mode for years. Models run overnight, lists are formed in the morning, customers get SMS during the day. Conversion — fractions of a percent. By the time someone sees the offer, they’ve long forgotten why they opened the app.

The brief sounded simple: customer takes an action — the bank responds within a second. Not tomorrow, not in an hour. Now.

Here’s what the difference looks like on paper:

| Parameter | Batch | Real-time |
|---|---|---|
| Latency | 12-24 hours | < 1 second |
| Data freshness | Yesterday's snapshot | Current moment |
| Infrastructure cost | Lower (overnight windows) | Higher (24/7 clusters) |
| Architecture complexity | Linear (ETL pipeline) | Exponential (event mesh) |
| Error handling | Restart the batch | Dead Letter Queue, retries, alerts |
| Scaling | Vertical (bigger server) | Horizontal (more nodes) |
| Debugging | Logs from one run | Distributed tracing |

On paper — a table. In practice — a complete overhaul of architecture, data, processes, and the team’s mindset. People who spent five years writing batch jobs now need to think in terms of events, windows, and watermarks. This isn’t learning a technology — it’s rewiring the brain.

[Diagram: real-time pre-approved offers architecture]

Architecture Decisions: Why This Stack

Choosing the streaming engine is a decision that will cost you nerves for the next three years. On the table: Apache Spark Streaming, Kafka Streams, and Apache Flink.

Spark Streaming was out in a minute — micro-batch with second-level latency didn’t fit our SLA. We seriously considered Kafka Streams: it’s lighter, built into the ecosystem, doesn’t require a separate cluster. But we needed stateful processing with windowing functions and complex enrichment logic. Kafka Streams was weak on that front at the time.

Flink won on three criteria. First — native event time support. When events arrive with delay (and from banking systems, they always do), this isn’t just nice-to-have — it’s the difference between making the right and wrong decision about a customer. Second — exactly-once semantics out of the box. In a banking context, a lost event isn’t a metric on a dashboard, it’s a potential customer complaint and regulatory attention. Third — mature checkpoint/savepoint for fault tolerance. The system must survive a node failure without losing state.
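To make the event-time point concrete, here is a minimal sketch of the bounded-out-of-orderness watermarking idea in plain Java, without the Flink API: the watermark trails the highest event timestamp seen so far by a fixed allowed lateness, which is what lets a job assign delayed events to the correct window. Class and method names here are illustrative, not Flink's.

```java
import java.util.List;

// Simplified sketch of Flink-style bounded-out-of-orderness watermarking:
// the watermark trails the highest event timestamp seen so far by a fixed
// allowed lateness, so delayed events can still be handled correctly.
class WatermarkTracker {
    private final long maxOutOfOrdernessMs;
    private long maxTimestampSeen = Long.MIN_VALUE;

    WatermarkTracker(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    // Called for every incoming event; returns the current watermark.
    long onEvent(long eventTimestampMs) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMs);
        return currentWatermark();
    }

    long currentWatermark() {
        return maxTimestampSeen - maxOutOfOrdernessMs;
    }

    // An event is "late" if it arrives behind the watermark.
    boolean isLate(long eventTimestampMs) {
        return eventTimestampMs < currentWatermark();
    }
}

public class WatermarkDemo {
    public static void main(String[] args) {
        WatermarkTracker tracker = new WatermarkTracker(5_000); // allow 5s of disorder
        long wm = 0;
        for (long ts : List.of(1_000L, 4_000L, 2_000L, 9_000L)) {
            wm = tracker.onEvent(ts);
        }
        System.out.println(wm);                    // 9000 - 5000 = 4000
        System.out.println(tracker.isLate(3_500)); // true: behind the watermark
    }
}
```

In real Flink, `WatermarkStrategy.forBoundedOutOfOrderness` encapsulates exactly this trade-off: a larger lateness bound catches more stragglers at the cost of later window results.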

The price: on the Russian market, you can literally count Flink specialists on your fingers. Recruiting becomes less of a funnel and more of a hunt.

Tarantool: Speed and Betrayal

Every event needs enrichment: customer profile, transaction history, current limits, product parameters. The classic path — a relational database. But when you have thousands of events per second and a processing SLA of tens of milliseconds, PostgreSQL becomes a bottleneck.

Tarantool solved this problem elegantly: in-memory storage with sub-millisecond latency. Data from core systems replicated with minimal delay, Flink jobs querying for enrichment.

And then came that night.

The bug manifested only under near-production load. In the test environment — flawless. At half the production traffic — flawless. At seventy percent — sporadic timeouts. At target volumes — degradation that would turn the system into a vegetable within thirty minutes.

First week of debugging — false leads. We blamed the network, the JVM, the Flink configuration. Second week — we isolated the problem to a specific pattern of concurrent requests in Tarantool. The specific scenario: when parallel read and write requests hit the same data region at a certain frequency, the in-memory engine started degrading.

I remember the meeting where we decided what to do. Two options on the table: wait for a community fix (indefinitely) or an emergency pivot of part of the architecture. There was no time. We documented the bug, reported it to the community, and over the weekend rewrote the critical path to a PostgreSQL + caching combo.

Latency went up from microseconds to milliseconds. The architecture got dirtier. But stable. In banking, predictability beats elegance.
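The "PostgreSQL + caching combo" was essentially a cache-aside pattern. A minimal sketch, assuming a TTL-based in-process cache; the database loader is a lambda stand-in for the real PostgreSQL query, and all names are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache-aside sketch: a read-through cache in front of PostgreSQL.
// Entries expire after a TTL so enrichment data stays reasonably fresh.
class EnrichmentCache {
    private record Entry(String value, long loadedAtMs) {}

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final Function<String, String> dbLoader; // stand-in for a PostgreSQL query
    private final long ttlMs;

    EnrichmentCache(Function<String, String> dbLoader, long ttlMs) {
        this.dbLoader = dbLoader;
        this.ttlMs = ttlMs;
    }

    String get(String customerId) {
        long now = System.currentTimeMillis();
        Entry e = cache.get(customerId);
        if (e == null || now - e.loadedAtMs() > ttlMs) {
            // Cache miss or stale entry: hit the database and repopulate.
            String fresh = dbLoader.apply(customerId);
            cache.put(customerId, new Entry(fresh, now));
            return fresh;
        }
        return e.value();
    }
}

public class CacheAsideDemo {
    public static void main(String[] args) {
        int[] dbHits = {0};
        EnrichmentCache cache = new EnrichmentCache(id -> {
            dbHits[0]++;                  // count real database round-trips
            return "profile-of-" + id;
        }, 60_000);

        cache.get("42");
        cache.get("42"); // served from cache, no second DB hit
        System.out.println(dbHits[0]); // 1
    }
}
```

The milliseconds-level latency quoted above comes from the cache-hit path; only misses and expirations pay the full database round-trip.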

Kafka as the Nervous System

Apache Kafka — the transport layer of the entire system. Every microservice publishes and reads events through topics. Three things Kafka solves: decoupling services (each only knows about its own topics), replay (you can re-read events for debugging), and scaling through partitioning.
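The scaling-through-partitioning point rests on one property: events with the same key always land in the same partition, which preserves per-customer ordering. A simplified sketch — real Kafka's default partitioner uses murmur2 over the serialized key; plain `hashCode` is used here only to keep the example self-contained:

```java
// Simplified sketch of key-based partitioning: events with the same key
// always land in the same partition, which preserves per-customer ordering.
// Real Kafka uses murmur2 over the serialized key; hashCode is a stand-in.
public class PartitionDemo {
    static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the result is always a valid partition index.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 12;
        int p1 = partitionFor("customer-42", partitions);
        int p2 = partitionFor("customer-42", partitions);
        // Same key, same partition: per-customer event order is preserved.
        System.out.println(p1 == p2);
    }
}
```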

Kafka, by the way, was the only technology in the stack that never surprised us. Runs like a Swiss watch, as long as you don’t mess with retention settings too often.

Four Languages in One Project — Deliberate Chaos

Java, Scala, Lua, SQL. Sounds like a bad joke. But behind each choice — pragmatism, not aesthetics.

Java — the backbone of microservices. Spring Boot, a massive developer pool, mature Kafka libraries.

Scala — Flink jobs. Flink’s Scala API is an order of magnitude more expressive. Writing windowed aggregations in Scala is a pleasure. In Java — a punishment.

Lua — stored procedures in Tarantool. LuaJIT as an embedded language — the only option for complex enrichment logic.

SQL — analytical queries and rule configuration in the decision engine. Business users modify rules without developers.

The price: finding a developer fluent in both Java and Scala is already a quest. And Lua developers for Tarantool are a rare commodity that doesn’t exist on the open market. We grew our own. There was no other option.

Event-Driven: Beautiful in Theory, Painful in Practice

Event Sourcing for critical business events. Every change is an event. Full history, ability to reproduce any state at any point in time. A must-have for banking audits that show up at the worst possible moment and ask what exactly happened with a customer on February 17th at 2:23 PM.
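The audit scenario maps directly onto the core event-sourcing move: state is never stored as a snapshot, it is rebuilt by folding the event log up to a chosen instant. A minimal sketch with a hypothetical `LimitChanged` event:

```java
import java.util.List;

// Event-sourcing sketch: state is rebuilt by replaying the event log up to
// a chosen point in time — exactly what an audit question like
// "what was the customer's limit on Feb 17 at 2:23 PM?" requires.
public class ReplayDemo {
    record LimitChanged(long timestampMs, long newLimit) {}

    // Fold the log up to (and including) asOfMs to recover the state.
    static long limitAsOf(List<LimitChanged> log, long asOfMs) {
        long limit = 0;
        for (LimitChanged e : log) {
            if (e.timestampMs() <= asOfMs) limit = e.newLimit();
        }
        return limit;
    }

    public static void main(String[] args) {
        List<LimitChanged> log = List.of(
            new LimitChanged(100, 50_000),
            new LimitChanged(200, 75_000),
            new LimitChanged(300, 60_000)
        );
        System.out.println(limitAsOf(log, 250)); // state as of t=250 -> 75000
    }
}
```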

CQRS for separating streams. Writes through the event pipeline, reads from projections in Tarantool and PostgreSQL. Independent scaling of load.

Dead Letter Queue — our best friend in the first weeks after launch. An event that fails processing doesn’t get lost — it goes to a separate topic for analysis. In the first week in production, the DLQ swelled to volumes we hadn’t planned for. Turns out legacy systems were sending events in a format that didn’t match the specification. Not occasionally — in five percent of cases. Five percent of millions of events is a lot of garbage in the DLQ.
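The DLQ mechanics are simple to sketch: wrap processing in a try/catch, and on failure publish the raw payload plus the error to a separate topic instead of dropping it. Here the "topic" is a plain list to keep the example self-contained; the malformed-payload check stands in for the spec-violating legacy formats:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// DLQ sketch: a failed event is never lost; it is published to a separate
// dead-letter topic together with the error, so it can be analyzed and replayed.
public class DlqDemo {
    record DeadLetter(String payload, String error) {}

    static void process(String payload,
                        Consumer<String> handler,
                        List<DeadLetter> deadLetterTopic) {
        try {
            handler.accept(payload);
        } catch (Exception e) {
            deadLetterTopic.add(new DeadLetter(payload, e.getMessage()));
        }
    }

    public static void main(String[] args) {
        List<DeadLetter> dlq = new ArrayList<>();
        Consumer<String> handler = p -> {
            if (!p.startsWith("{")) throw new IllegalArgumentException("not JSON: " + p);
        };

        process("{\"id\":1}", handler, dlq); // ok
        process("id=2", handler, dlq);       // legacy format -> DLQ
        System.out.println(dlq.size()); // 1
    }
}
```

The crucial part is keeping the original payload verbatim: at five percent of millions of events, the DLQ is not just an error bin but the evidence you need to negotiate a contract fix with the legacy teams.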

The main pitfall — eventual consistency. In an event-driven system, there’s no guarantee that all services see the same state simultaneously. For banking products, this is frightening: you can’t offer a loan if the customer’s current debt data hasn’t reached the scoring service yet. The solution — explicit ordering through timestamps and watermarks in Flink. It works, but adds a layer of complexity that needs to be explained to every new developer.
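What "explicit ordering through timestamps and watermarks" looks like in practice: out-of-order events are buffered and released in timestamp order only once the watermark guarantees nothing earlier can still arrive, so a scoring decision is never made ahead of a pending debt update. A stdlib-only sketch of that buffering step (the event names are illustrative):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Ordering sketch: buffer out-of-order events and release them in timestamp
// order only once the watermark has passed them, so downstream logic never
// acts on a state that an earlier, still-in-flight event would change.
public class OrderingDemo {
    record Event(long timestampMs, String payload) {}

    private final PriorityQueue<Event> buffer =
        new PriorityQueue<>(Comparator.comparingLong(Event::timestampMs));

    List<Event> onEvent(Event e, long watermarkMs) {
        buffer.add(e);
        List<Event> ready = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek().timestampMs() <= watermarkMs) {
            ready.add(buffer.poll()); // safe to process: watermark has passed
        }
        return ready;
    }

    public static void main(String[] args) {
        OrderingDemo op = new OrderingDemo();
        System.out.println(op.onEvent(new Event(300, "debt-update"), 100).size()); // 0: held back
        System.out.println(op.onEvent(new Event(100, "purchase"), 100).size());    // 1: ts <= watermark
        System.out.println(op.onEvent(new Event(200, "login"), 400).size());       // 2: watermark advanced
    }
}
```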

Failures That Taught Us

Failure one: synthetic data lies. We spent three months testing on synthetic data and were confident in the system. First day on real data — a cascade of errors. Legacy systems sent dates in four different formats. Fields that were “required” per the spec arrived empty. IDs that were supposed to be unique — duplicated. The test environment is a map, but the territory looks different.

Failure two: we shipped monitoring too late. For the first two months, we built business logic. Monitoring was “we’ll add it later.” As a result, when problems started, we were reading tea leaves. After that, I established a rule: no service goes to review without metrics, tracing, and alerts. Monitoring isn’t the cherry on top — it’s the foundation.

Failure three: we discovered canary deployments too late. The first releases went out to all traffic. One of them tanked conversion by thirty percent — a bug in scoring logic that only manifested on a specific customer segment. Four hours to detect, one hour to roll back. After that — canary only: first one percent of traffic, then ten, then fifty.
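The one-percent-then-ten-percent rollout needs deterministic assignment: the same customer must always hit the same version, or per-variant metrics turn to noise. A sketch of hash-based bucketing, assuming plain `hashCode` as the bucketing function:

```java
// Canary sketch: deterministic hash-based bucketing sends a fixed percentage
// of customers to the new release. The same customer always gets the same
// version, so conversion metrics per variant stay clean.
public class CanaryDemo {
    static boolean inCanary(String customerId, int canaryPercent) {
        int bucket = (customerId.hashCode() & 0x7fffffff) % 100;
        return bucket < canaryPercent;
    }

    public static void main(String[] args) {
        int canaryCount = 0;
        for (int i = 0; i < 10_000; i++) {
            if (inCanary("customer-" + i, 10)) canaryCount++;
        }
        // Roughly 10% of customers land on the canary.
        System.out.println(canaryCount > 500 && canaryCount < 1500);
    }
}
```

The same bucketing also makes widening the rollout trivial: raising `canaryPercent` from 1 to 10 only adds customers, it never reshuffles the ones already on the new version.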

Failure four: we underestimated backpressure. Monday morning, peak load. An upstream system dumps a weekend’s worth of accumulated events in one batch. Kafka handles it — it always does. Flink starts choking: checkpoints grow, latency creeps up, and within twenty minutes the consumer lag balloons to millions of messages. We hadn’t built a backpressure mechanism to gracefully throttle consumption under overload. We had to manually stop jobs, take a savepoint, and restart with read speed limits. After that incident, a backpressure strategy became part of the architecture review for every new Flink job.
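The "restart with read speed limits" fix is, in essence, a token bucket in front of the consumer. A minimal sketch of that throttling idea, independent of the Kafka client API:

```java
// Backpressure sketch: a token bucket caps how many events per second the
// consumer will pull, so a weekend backlog drains at a controlled rate
// instead of blowing up checkpoint sizes downstream.
public class ThrottleDemo {
    private final long capacity;     // max burst size
    private final long refillPerSec; // sustained rate
    private double tokens;
    private long lastRefillNanos;

    ThrottleDemo(long capacity, long refillPerSec) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        tokens = Math.min(capacity,
            tokens + (now - lastRefillNanos) / 1e9 * refillPerSec);
        lastRefillNanos = now;
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false; // over budget: leave the event in Kafka for now
    }

    public static void main(String[] args) {
        ThrottleDemo limiter = new ThrottleDemo(100, 100);
        int accepted = 0;
        for (int i = 0; i < 10_000; i++) {
            if (limiter.tryAcquire()) accepted++;
        }
        // Only about the burst capacity gets through instantly; the rest of
        // the backlog would drain at the sustained rate afterwards.
        System.out.println(accepted <= 200);
    }
}
```

Declining an event here is cheap precisely because Kafka holds the backlog: the limiter only decides when to pull, never whether the event survives.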

What Remained After the Project

Architectural decisions in enterprise are always a trade-off between ideal and possible. The perfect system exists in conference talks. The real one is a balance between performance, reliability, maintenance cost, and time-to-market.

Technology is ten percent of the problem. Flink, Kafka, Tarantool work. Ninety percent is coordination between teams, aligning on API contracts, integrating with legacy, and managing the expectations of a business that wants “like Google, but in three months.”

Technical debt in a real-time system is not an abstract metric. Every delay in event processing affects business metrics right now, in real time. We allocated twenty percent of each sprint to tech debt. It wasn’t enough.

The concrete result — the one everything was built for: time from customer action to personalized offer dropped from 12-24 hours to 800 milliseconds at the 95th percentile. Conversion to target action increased dramatically compared to the batch approach. I can’t share exact numbers — NDA — but the difference was sufficient that the business stakeholder who spent six months skeptically asking “why do we need real-time?” came back after the first month in production with a request to expand to four more product scenarios. The best proof that the architecture works is when the business asks for more.

And also — production smells different than a test environment. Anomalies, spikes, corrupted data from systems that were last updated during a different era. All of this only shows up in battle. And if you’re not ready for it — the battle will come for you.


Original article: How to Move a Banking Product to Real-Time on Habr

Based on a talk at Saint HighLoad++ 2024, Saint Petersburg.

— Vladimir Lovtsov
