The Case of the Missing Model

Three reasons this paper matters, starting with the least obvious.

First, it takes one of control theory's cleanest math problems and drags it into the messy alley where real data lives. Second, it does not panic when noise shows up wearing a fake mustache. Third, it gives reinforcement learning a job it might actually be qualified for: learning a good controller without first pretending it knows the whole machine.

The paper is called "A Fully Data-Driven Value Iteration for Stochastic LQR: Convergence, Robustness, and Stability," by Cui, Jiang, Kolm, and Macqueron. That title sounds like it was assembled in a lab after midnight, but the central idea is surprisingly plain. You have a system you want to control. Maybe it's a cooling plant in a data center. Maybe it's a portfolio that keeps trying to behave like a caffeinated squirrel. Normally, you'd build a mathematical model of the system first, then design the controller. These authors skip the middleman. They learn the control policy directly from data [Cui et al., 2026].

That is the seductive part of data-driven control. No painstaking model identification. No pretending your equations captured every leak, delay, and gremlin in the pipes. Just data in, policy out. Very modern. Very efficient. Also a little terrifying, because raw data is noisy, and noise is where many elegant algorithms go to die.

Bellman in a Raincoat

Under the hood, this is about LQR, short for linear-quadratic regulator. That's a classic control setup: the system evolves linearly, and you pay a quadratic penalty for states going off the rails and for using too much control effort. The Bellman equation sits in the center like the detective who already knows who did it but wants to see if the evidence agrees. Value iteration keeps updating the "cost-to-go" estimate until it settles on the optimal answer.
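For readers who want the equations behind the metaphor, here is a minimal sketch in one common textbook form: discrete time, no noise. The symbols A, B, Q, R, P_j, and K_j are assumptions introduced for illustration, and the paper's stochastic formulation differs in its details.

```latex
\begin{aligned}
x_{k+1} &= A x_k + B u_k, \qquad
J = \sum_{k=0}^{\infty} \left( x_k^\top Q x_k + u_k^\top R u_k \right), \\
V(x) &= \min_{u} \left\{ x^\top Q x + u^\top R u + V(Ax + Bu) \right\}, \\
P_{j+1} &= Q + A^\top P_j A
          - A^\top P_j B \left( R + B^\top P_j B \right)^{-1} B^\top P_j A, \\
K_j &= \left( R + B^\top P_j B \right)^{-1} B^\top P_j A .
\end{aligned}
```

The first line is the linear system and the quadratic bill. The second is the Bellman equation. The third is what value iteration turns into once you guess a quadratic cost-to-go, V_j(x) = x^T P_j x, and the last line gives the matching feedback rule u = -K_j x.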

The catch is that most model-free versions of this game need favorable assumptions to avoid face-planting. Some require a decent initial controller. Some behave nicely only when the world behaves nicely, which is not a habit the world has ever shown. Cui and colleagues prove something stronger: in the noise-free case, their value iteration is globally exponentially stable for any positive semidefinite initial value matrix. Translation: you get much more freedom in how you start, and the method still heads toward the right answer with purpose instead of wandering around like a GPS in a tunnel [Cui et al., 2026].
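To make "model-free" and "start anywhere" concrete, here is a minimal Python sketch of a classic Q-function value iteration fitted by least squares. It is not the authors' algorithm. The plant (A_TRUE, B_TRUE), the weights Q and R, the sample count, and every function name are hypothetical choices for illustration, and the data here is noise-free. The learner only ever sees (x, u, x_next) samples.

```python
import numpy as np

# Hypothetical plant, used ONLY to generate data and to check the result.
# The learner below never reads A_TRUE or B_TRUE.
A_TRUE = np.array([[1.1, 0.3], [0.0, 0.9]])  # open-loop unstable
B_TRUE = np.array([[0.0], [1.0]])
Q = np.eye(2)           # state penalty (assumed)
R = np.array([[1.0]])   # control penalty (assumed)


def quad_features(Z):
    """Map each row z to the monomials z_a * z_b with a <= b."""
    rows, cols = np.triu_indices(Z.shape[1])
    return Z[:, rows] * Z[:, cols]


def theta_to_sym(theta, m):
    """Rebuild the symmetric H satisfying z' H z = quad_features(z) @ theta."""
    H = np.zeros((m, m))
    H[np.triu_indices(m)] = theta
    return 0.5 * (H + H.T)  # halves the off-diagonal coefficients


def quad_form(X, M):
    """Row-wise x' M x."""
    return np.einsum("ki,ij,kj->k", X, M, X)


def data_driven_vi(X, U, X_next, Q, R, P0, iters=200):
    """Q-function value iteration using only (x, u, x_next) samples."""
    n, m = X.shape[1], X.shape[1] + U.shape[1]
    Phi = quad_features(np.hstack([X, U]))
    stage_cost = quad_form(X, Q) + quad_form(U, R)
    P = P0
    for _ in range(iters):
        # Bellman target: stage cost plus current cost-to-go at the next state.
        y = stage_cost + quad_form(X_next, P)
        theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        H = theta_to_sym(theta, m)
        Hxx, Hxu, Huu = H[:n, :n], H[:n, n:], H[n:, n:]
        K = np.linalg.solve(Huu, Hxu.T)  # greedy policy: u = -K x
        P = Hxx - Hxu @ K                # minimize the fitted Q over u
    return P, K


# Collect samples from the black box (random states and inputs).
rng = np.random.default_rng(0)
N = 60
X = rng.standard_normal((N, 2))
U = rng.standard_normal((N, 1))
X_next = X @ A_TRUE.T + U @ B_TRUE.T

# Two very different positive semidefinite starting guesses.
P_from_zero, K_zero = data_driven_vi(X, U, X_next, Q, R, P0=np.zeros((2, 2)))
P_from_big, K_big = data_driven_vi(X, U, X_next, Q, R, P0=50.0 * np.eye(2))

print("max gap between the two runs:", np.max(np.abs(P_from_zero - P_from_big)))
print("learned gain K (u = -K x):", K_zero)
```

Both runs should land on the same value matrix even though one starts at zero and the other at fifty times the identity. That is the flavor of the global convergence claim, shown on a toy problem rather than proved.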

Noise Always Shows Up

The more interesting part is the paper's attitude toward disturbances. Real systems cough. Sensors drift. Actuators sulk. Market data lies to your face. The authors show that when disturbances are small enough, the algorithm remains input-to-state stable and converges near the optimum rather than blowing up theatrically. That matters because "works perfectly in clean simulations" is research-speak for "my demo survived because reality was not invited."
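A rough way to picture input-to-state stability is to corrupt every value-iteration update with a small error and watch where the iterates settle. The Python sketch below does that with an exact, model-based Riccati step plus a random symmetric error of fixed size. It stands in for estimation error from noisy data and is not the paper's algorithm. The plant, the weights, the error sizes, and the function names are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical plant and cost weights, chosen only for illustration.
A = np.array([[1.1, 0.3], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])


def riccati_step(P):
    """One exact value-iteration update for the plant above."""
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return Q + A.T @ P @ A - A.T @ P @ B @ K


def run(eps, steps=300, seed=0):
    """Value iteration with a symmetric error of spectral norm eps per step."""
    rng = np.random.default_rng(seed)
    P = np.zeros((2, 2))
    for _ in range(steps):
        E = rng.standard_normal((2, 2))
        E = 0.5 * (E + E.T)               # keep the iterate symmetric
        E *= eps / np.linalg.norm(E, 2)   # rescale to spectral norm eps
        P = riccati_step(P) + E
    return P


P_star = run(eps=0.0)  # undisturbed run, used as the reference optimum
errors = {eps: np.linalg.norm(run(eps) - P_star, 2) for eps in (1e-4, 1e-3, 1e-2)}
for eps, err in errors.items():
    print(f"per-step error {eps:.0e} -> final distance from optimum {err:.2e}")
```

The pattern to look for is that the final distance from the optimum shrinks along with the per-step error instead of growing without bound. Small disturbance in, small miss out.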

This is where the paper earns its keep. Recent work in the area has been pushing direct data-driven LQR from several angles: broader surveys of data-driven control methods [Bazanella et al., 2023; Liu et al., 2025], value-iteration methods for unknown stochastic-parameter systems [Fan et al., 2024], and direct policy optimization schemes like DeePO that learn LQR policies from data with convergence guarantees [Zhao et al., 2023; Zhao et al., 2024]. There are also newer robustness-focused approaches that attack noisy stochastic systems head-on [Esmzad and Modares, 2025]. This paper fits that lineup like the grizzled cop who says, "Fine, but what happens when the signal is dirty and your initial guess is lousy?"

Two Beat Cops: Cooling and Finance

The authors test the method on data center cooling and dynamic portfolio allocation. Good choices. Data center cooling is a control problem with money attached to every bad decision and electricity bills large enough to make a CFO develop a nervous twitch. Industry interest in AI-based cooling control has been climbing, with both research and commercial pilots exploring reinforcement learning for HVAC and thermal optimization in large facilities [Google DeepMind, 2018; Data Center Dynamics, 2024; Li et al., 2024]. Portfolio allocation is a different kind of chaos, but the same basic lesson applies: when you cannot trust a neat textbook model, direct learning from data starts to look less like academic mischief and more like a practical survival skill.

The Part Where Nobody Gets Carried Away

This does not mean reinforcement learning has solved control. Not even close. The theory here lives in the structured world of stochastic LQR, not arbitrary nonlinear systems with half the sensors broken and an intern leaning on the emergency stop button. Data quality still matters. Disturbances must be "sufficiently small," which is math-speak for "do not bring a hurricane to a knife fight." And proving stability in a tidy class of systems is not the same as shipping a universally reliable controller for every industrial plant on Earth.

Still, this paper sharpens an important point. If you want direct, model-free control to be taken seriously outside a poster session, you need guarantees about convergence, robustness, and stability. Not vibes. Not a heroic random seed. Guarantees. Cui and colleagues hand over exactly that, then show the method working in places where bad control gets expensive fast.

That is why this paper matters. It gives data-driven RL for control a little less swagger and a little more backbone. In this neighborhood, that's the difference between a suspect and a witness.

References

Cui L, Jiang ZP, Kolm PN, Macqueron G. A Fully Data-Driven Value Iteration for Stochastic LQR: Convergence, Robustness, and Stability. IEEE Transactions on Neural Networks and Learning Systems. 2026. DOI: 10.1109/TNNLS.2026.3675892. arXiv: 2505.02970

Bazanella AS, Campestrini L, Eckhard D. The data-driven approach to classical control theory. Annual Reviews in Control. 2023;56:100906. DOI: 10.1016/j.arcontrol.2023.100906

Fan W, Xiong J, Xiong Y. Value iteration for LQR control of unknown stochastic-parameter linear systems. Systems & Control Letters. 2024;185:105731. DOI: 10.1016/j.sysconle.2024.105731

Zhao F, Dörfler F, You K. Data-enabled Policy Optimization for the Linear Quadratic Regulator. 2023. arXiv: 2303.17958

Zhao F, Dörfler F, Chiuso A, You K. Data-Enabled Policy Optimization for Direct Adaptive Learning of the LQR. 2024. arXiv: 2401.14871

Esmzad R, Modares H. Direct Data-Driven Discounted Infinite Horizon Linear Quadratic Regulator with Robustness Guarantees. Automatica. 2025;176:112197. DOI: 10.1016/j.automatica.2025.112197. arXiv: 2409.10703

Liu W, Zhang M, Xu Q, Xie L. Survey on data-driven control and its application in cyber-physical energy systems. Cyber-Physical Energy Systems. 2025. DOI: 10.1016/j.cpes.2025.08.004

Li X et al. An Alternative Reinforcement Learning control strategy for data center air-cooled HVAC systems. Energy. 2024. DOI: 10.1016/j.energy.2024.132977

Google DeepMind. Safety-first AI for autonomous data centre cooling and industrial control. 2018. https://deepmind.google/en/blog/safety-first-ai-for-autonomous-data-centre-cooling-and-industrial-control/

Data Center Dynamics. STT GDC to pilot Phaidra's AI data center cooling control system. 2024. https://www.datacenterdynamics.com/en/news/stt-gdc-to-pilot-phaidras-ai-data-center-cooling-control-system/

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.