publications
2025
- Modified Loss of Momentum Gradient Descent: Fine-Grained Analysis. Matias D. Cattaneo and Boris Shigida. arXiv preprint arXiv:2509.08483, 2025.
We analyze gradient descent with heavy-ball momentum (HB) whose fixed momentum parameter \(\beta \in (0, 1)\) provides exponential decay of memory. Building on Kovachki and Stuart (2021), we prove that on an exponentially attractive invariant manifold the algorithm is exactly plain gradient descent with a modified loss, provided that the step size \(h\) is small enough. Although the modified loss does not admit a closed-form expression, we describe it with arbitrary precision and prove global (finite "time" horizon) approximation bounds \(O(h^{R})\) for any finite order \(R \geq 2\). We then conduct a fine-grained analysis of the combinatorics underlying the memoryless approximations of HB; in particular, we uncover a rich family of hidden polynomials in \(\beta\) that contains the Eulerian and Narayana polynomials. We derive continuous modified equations of arbitrary approximation order (with rigorous bounds) and the principal flow that approximates the HB dynamics, generalizing Rosca et al. (2023). Approximation theorems cover both full-batch and mini-batch HB. Our theoretical results shed new light on the main features of gradient descent with heavy-ball momentum and outline a road map for similar analyses of other optimization algorithms.
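To make the setting concrete, here is a minimal numerical sketch (not the paper's construction): heavy-ball momentum on a toy quadratic, next to plain gradient descent with the rescaled step \(h/(1-\beta)\), which is the usual leading-order memoryless approximation; the paper's modified loss adds higher-order corrections on top of this. The quadratic objective, the HB parameterization, and all hyperparameters below are assumptions chosen only for illustration.

import numpy as np

# Heavy-ball (HB) momentum on a toy quadratic f(x) = 0.5 * x^T A x.
# The parameterization x_{k+1} = x_k - h * grad f(x_k) + beta * (x_k - x_{k-1})
# is one common convention; the paper's exact convention may differ.
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x

h, beta, steps = 0.01, 0.9, 500
x_prev = x = np.array([1.0, 1.0])
for _ in range(steps):
    x, x_prev = x - h * grad(x) + beta * (x - x_prev), x

# Plain gradient descent with rescaled step h / (1 - beta): the usual
# leading-order memoryless surrogate; the paper's modified loss adds O(h)
# correction terms on top of this.
y = np.array([1.0, 1.0])
for _ in range(steps):
    y = y - (h / (1 - beta)) * grad(y)

print("HB iterate              :", x)
print("leading-order surrogate :", y)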
@article{cattaneo2025modifiedlossmomentum,
  title = {Modified Loss of Momentum Gradient Descent: Fine-Grained Analysis},
  author = {Cattaneo, Matias D. and Shigida, Boris},
  journal = {arXiv preprint arXiv:2509.08483},
  year = {2025},
  url = {https://arxiv.org/abs/2509.08483},
}

- How Memory in Optimization Algorithms Implicitly Modifies the Loss. Matias D. Cattaneo and Boris Shigida. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
In modern optimization methods used in deep learning, each update depends on the history of previous iterations, often referred to as memory, and this dependence decays fast as the iterates go further into the past. For example, gradient descent with momentum has exponentially decaying memory through exponentially averaged past gradients. We introduce a general technique for identifying a memoryless algorithm that approximates an optimization algorithm with memory. It is obtained by replacing all past iterates in the update by the current one, and then adding a correction term arising from memory (also a function of the current iterate). This correction term can be interpreted as a perturbation of the loss, and the nature of this perturbation can inform how memory implicitly (anti-)regularizes the optimization dynamics. As an application of our theory, we find that Lion does not have the kind of implicit anti-regularization induced by memory that AdamW does, providing a theory-based explanation for Lion's recently documented better generalization performance. Empirical evaluations confirm our theoretical findings.
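As a hedged illustration of the idea (the quadratic loss and all parameters are assumptions, not the paper's example), the sketch below runs momentum with an exponentially averaged gradient, then replaces every past iterate in the average by the current one, which collapses the average to the current gradient; the leftover difference plays the role of the memory correction term interpreted as a loss perturbation.

import numpy as np

# Momentum update with memory vs. its memoryless collapse on a toy quadratic.
A = np.diag([1.0, 5.0])
grad = lambda x: A @ x

h, beta = 0.05, 0.9
x, m = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(50):
    m = beta * m + (1 - beta) * grad(x)    # exponential average of PAST gradients
    x = x - h * m

memoryless = grad(x)                        # all past iterates replaced by the current one
correction = m - memoryless                 # what memory adds on top
print("averaged gradient :", m)
print("current gradient  :", memoryless)
print("memory correction :", correction)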
@inproceedings{cattaneo2025howmemory,
  title = {How Memory in Optimization Algorithms Implicitly Modifies the Loss},
  author = {Cattaneo, Matias D. and Shigida, Boris},
  booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year = {2025},
  url = {https://openreview.net/forum?id=2qd4lpXz7u},
}

- Uniform Estimation and Inference for Nonparametric Partitioning-Based M-Estimators. Matias D. Cattaneo, Yingjie Feng, and Boris Shigida. Under revision at the Annals of Statistics, 2025.
This paper presents uniform estimation and inference theory for a large class of nonparametric partitioning-based M-estimators. The main theoretical results include: (i) uniform consistency for convex and non-convex objective functions; (ii) rate-optimal uniform Bahadur representations; (iii) rate-optimal uniform (and mean square) convergence rates; (iv) valid strong approximations and feasible uniform inference methods; and (v) extensions to functional transformations of underlying estimators. Uniformity is established over both the evaluation point of the nonparametric functional parameter and a Euclidean parameter indexing the class of loss functions. The results also account explicitly for the smoothness degree of the loss function (if any), and allow for a possibly non-identity (inverse) link function. We illustrate the theoretical and methodological results in four examples: quantile regression, distribution regression, \(L_p\)-regression, and logistic regression. Many other possibly non-smooth, nonlinear, generalized, robust M-estimation settings are covered by our results. We provide detailed comparisons with the existing literature and demonstrate substantive improvements: we achieve the best (in some cases optimal) known results under improved (in some cases minimal) requirements in terms of regularity conditions and side rate restrictions. The supplemental appendix reports complementary technical results that may be of independent interest, including a novel uniform strong approximation result based on Yurinskii's coupling.
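To illustrate what a partitioning-based M-estimator can look like in its simplest form (a toy, not the paper's general framework), the sketch below fits a piecewise-constant quantile regression: the covariate range is split into equal-width cells and the check loss is minimized cell by cell, which for a constant basis reduces to the within-cell empirical quantile. The data-generating process and the number of cells are assumptions chosen for the demo.

import numpy as np

# Piecewise-constant quantile regression as a toy partitioning-based M-estimator.
rng = np.random.default_rng(0)
n, num_cells, tau = 2000, 10, 0.5
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + rng.standard_normal(n)

edges = np.linspace(0.0, 1.0, num_cells + 1)
cell = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, num_cells - 1)
fit = np.array([np.quantile(y[cell == j], tau) for j in range(num_cells)])

print("estimated conditional median, cell by cell:", np.round(fit, 2))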
@article{cattaneo2025uniformestimation,
  title = {Uniform Estimation and Inference for Nonparametric Partitioning-Based M-Estimators},
  author = {Cattaneo, Matias D. and Feng, Yingjie and Shigida, Boris},
  journal = {Under revision: Annals of Statistics},
  year = {2025},
  url = {https://arxiv.org/abs/2409.05715},
}
2024
- On the Implicit Bias of Adam. Matias D. Cattaneo, Jason M. Klusowski, and Boris Shigida. In Forty-first International Conference on Machine Learning, 2024.
In previous literature, backward error analysis was used to find ordinary differential equations (ODEs) approximating the gradient descent trajectory. It was found that finite step sizes implicitly regularize solutions because terms appearing in the ODEs penalize the two-norm of the loss gradients. We prove that the existence of similar implicit regularization in RMSProp and Adam depends on their hyperparameters and the training stage, but with a different "norm" involved: the corresponding ODE terms either penalize the (perturbed) one-norm of the loss gradients or, conversely, impede its reduction (the latter case being typical). We also conduct numerical experiments and discuss how the proven facts can influence generalization.
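A small numerical aside on the backward-error-analysis viewpoint (the toy loss, the \(h/4\) coefficient from the standard gradient-descent derivation, and the unperturbed one-norm shown here are simplifications; the paper's expressions depend on hyperparameters and the training stage):

import numpy as np

# For plain GD the first-order modified loss penalizes the squared two-norm of
# the gradient (f + (h/4) * ||grad f||_2^2 in the standard derivation); for
# RMSProp/Adam the relevant quantity is instead a (perturbed) one-norm of the
# gradient, which the corresponding ODE terms may penalize or anti-penalize.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

h = 0.01
x = np.array([1.0, 1.0])
g = grad(x)
print("loss                      :", f(x))
print("GD modified-loss penalty  :", (h / 4) * np.dot(g, g))
print("one-norm of the gradient  :", np.abs(g).sum())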
@inproceedings{cattaneo2024onimplicitbias,
  title = {On the Implicit Bias of Adam},
  author = {Cattaneo, Matias D. and Klusowski, Jason M. and Shigida, Boris},
  booktitle = {Forty-first International Conference on Machine Learning},
  year = {2024},
  url = {https://openreview.net/forum?id=y8YovS0lOg},
}
2021
- Discrete-time model of company capital dynamics with investment of a certain part of surplus in a non-risky asset for a fixed period. Ekaterina V. Bulinskaya and Boris Shigida. Methodology and Computing in Applied Probability, 2021.
A periodic-review insurance model is studied under the following assumptions. One-period insurance claims form a sequence of independent identically distributed nonnegative random variables with a finite mean. At the beginning of each period a quota δ of the company surplus is invested in a non-risky asset for m periods. Theoretical expressions for finite-time and ultimate ruin probabilities, in terms of multiple integrals, are presented and applied to the particular case where claims are exponential. Dividend problems are also considered. Numerical results obtained by means of simulation are provided and other algorithmic approaches are discussed. Sensitivity analysis of the ruin probability is carried out for the case of exponential claims.
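A rough Monte Carlo sketch of one reading of these dynamics follows (the timing of cash flows, the interest mechanism, and all parameter values are assumptions made for illustration; the paper derives exact multiple-integral expressions):

import numpy as np

# Periodic-review surplus with exponential claims: each period a quota delta of
# the current surplus is locked into a non-risky asset for m periods at rate r,
# premium c comes in, and an Exp(mean mu) claim goes out.
rng = np.random.default_rng(1)

def ruined(u0, c, mu, delta, m, r, horizon):
    surplus, locked = u0, [0.0] * m              # locked[j] matures in j+1 periods
    for _ in range(horizon):
        surplus += locked.pop(0) * (1 + r) ** m  # maturing investment with interest
        surplus += c - rng.exponential(mu)       # premium in, claim out
        if surplus < 0:
            return True
        invest = delta * surplus                 # lock a quota of the surplus
        surplus -= invest
        locked.append(invest)
    return False

trials = 20000
prob = np.mean([ruined(u0=5.0, c=1.1, mu=1.0, delta=0.2, m=3, r=0.03, horizon=20)
                for _ in range(trials)])
print("estimated finite-time ruin probability:", prob)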
@article{bulinsk2021discretetime,
  title = {Discrete-time model of company capital dynamics with investment of a certain part of surplus in a non-risky asset for a fixed period},
  author = {Bulinskaya, Ekaterina V. and Shigida, Boris},
  journal = {Methodology and Computing in Applied Probability},
  year = {2021},
  url = {https://link.springer.com/article/10.1007/s11009-020-09843-5},
}
2019
- Sensitivity Analysis of Some Applied Probability Models. Ekaterina V. Bulinskaya and Boris Shigida. Journal of Mathematical Sciences (English version), 2019.
During the last two decades, new models have been developed in the actuarial sciences. Different notions of insurance company ruin (bankruptcy) and other objective functions evaluating company performance have been introduced. Several types of decision (such as dividend payment, reinsurance, and investment) are used to optimize company functioning. Therefore, it is necessary to be sure that the model under consideration is stable with respect to parameter fluctuations and perturbations of the underlying stochastic processes. The aim of this paper is to describe methods for investigating these problems and to present recent results concerning some insurance models. Numerical results are also included.
@article{bulinsk2019sensitivity,
  title = {Sensitivity Analysis of Some Applied Probability Models},
  author = {Bulinskaya, Ekaterina V. and Shigida, Boris},
  journal = {English version: Journal of Mathematical Sciences},
  year = {2019},
  url = {https://link.springer.com/article/10.1007/s10958-021-05318-1},
}

- Modeling and asymptotic analysis of insurance company performance. Ekaterina V. Bulinskaya and Boris Shigida. Communications in Statistics Part B: Simulation and Computation, 2019.
We consider a classical Cramér-Lundberg model with dividends. It is additionally supposed that the claim amounts have an exponential distribution. Moreover, we are interested in a barrier dividend strategy with Parisian implementation delay: a payment is made only if the company surplus stays above the barrier for at least a time interval of length h. The mean expected discounted dividends paid before Parisian ruin are chosen as the objective function. Optimization is carried out. The results are compared with those obtained previously by the authors for the no-delay case. Statistical estimation, stability problems, and simulation are tackled as well.
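For intuition only, here is a discretized Monte Carlo sketch of discounted dividends under a barrier strategy with Parisian implementation delay (the time discretization, the simplified ruin handling, and all parameter values are assumptions; the paper works with the exact continuous-time model):

import numpy as np

# Cramer-Lundberg surplus simulated on a time grid; the excess above barrier b
# is paid as a dividend only once the surplus has stayed above b for at least
# time delay_h without interruption.
rng = np.random.default_rng(2)

def discounted_dividends(u0, c, lam, mu, b, delay_h, r, T, dt):
    surplus, paid, above = u0, 0.0, 0.0
    for k in range(int(T / dt)):
        surplus += c * dt
        if rng.random() < lam * dt:              # a claim arrives in this step
            surplus -= rng.exponential(mu)
        if surplus < 0:
            break                                # ruin handling deliberately simplified
        if surplus > b:
            above += dt
            if above >= delay_h:                 # delay has elapsed: pay the excess
                paid += np.exp(-r * k * dt) * (surplus - b)
                surplus = b
        else:
            above = 0.0
    return paid

vals = [discounted_dividends(u0=5.0, c=1.2, lam=1.0, mu=1.0, b=8.0,
                             delay_h=0.5, r=0.05, T=50.0, dt=0.01)
        for _ in range(500)]
print("mean discounted dividends before ruin:", np.mean(vals))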
@article{bulinsk2019modelingandasymp,
  title = {Modeling and asymptotic analysis of insurance company performance},
  author = {Bulinskaya, Ekaterina V. and Shigida, Boris},
  journal = {Communications in Statistics Part B: Simulation and Computation},
  year = {2019},
  url = {https://www.tandfonline.com/doi/abs/10.1080/03610918.2019.1612911},
}