Revisiting Residual Connections

Orthogonal Updates for Stable and Efficient Deep Networks

NeurIPS 2025 Poster
Giyeong Oh, Woohyun Cho, Siyeol Kim, Suhwan Choi, Youngjae Yu
Yonsei University · Maum.AI · Seoul National University
Corresponding author

Our poster distills Orthogonal Residual Update into a plug-and-play drop-in for residual blocks, along with reproducible training recipes and diagnostics.

What we set out to test

  • Hypothesis: Standard residual updates mostly rescale or flip the stream because their outputs align with the current representation; forcing updates to be orthogonal should unlock unused capacity.
  • Intervention: Project each block's output onto the subspace perpendicular to the incoming stream and add only the orthogonal component f⊥ back to the stream.
  • Evaluation: Train ResNetV2 and ViT models from scratch on CIFAR-10/100, Tiny ImageNet, and ImageNet-1k, tracking accuracy, training efficiency, and residual-stream geometry.

Highlights

Residual streams stay balanced

Orthogonal updates avoid runaway norms and prevent layers from collapsing into a single direction, yielding steadier training trajectories.

Accuracy gains where depth matters

Vision Transformers benefit the most: ViT-B improves ImageNet-1k Acc@1 by +3.78 points with the same training recipe.

Poster-ready diagnostics

The poster features block-wise cosine and norm traces, an ablation on the update probability, and throughput measurements that help explain how the mechanism works.

Method Sketch

Each block decomposes its output into components parallel and orthogonal to the incoming stream vector. We discard the parallel part and add only the orthogonal component to the residual pathway, which acts like a rotation on the representation manifold. The original module is untouched and still learns freely.

Figure: Orthogonal Residual Update keeps only the novelty term f⊥.
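For concreteness, here is a minimal PyTorch sketch of a feature-wise orthogonal residual update. The class name and wiring are illustrative, not the repository's API; refer to the official implementation for exact behavior.

```python
import torch
import torch.nn as nn

class OrthogonalResidualBlock(nn.Module):
    """Adds only the component of f(x) orthogonal to the incoming stream x.

    Illustrative sketch with a feature-wise projection along the last dimension.
    """

    def __init__(self, module: nn.Module, eps: float = 1e-6):
        super().__init__()
        self.module = module  # e.g. an attention or MLP sub-block
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.module(x)                                # block output f(x)
        dot = (f * x).sum(dim=-1, keepdim=True)           # <f, x> per token
        x_norm_sq = (x * x).sum(dim=-1, keepdim=True)     # ||x||^2 per token
        f_parallel = (dot / (x_norm_sq + self.eps)) * x   # component of f along x
        f_orth = f - f_parallel                           # novelty component f_perp
        return x + f_orth                                 # update with only f_perp

# Usage sketch: wrap a Transformer MLP sub-block (hypothetical shapes).
mlp = nn.Sequential(nn.LayerNorm(768), nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
block = OrthogonalResidualBlock(mlp)
tokens = torch.randn(8, 197, 768)   # (batch, sequence, features)
out = block(tokens)                  # same shape as the input stream
```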

Key Results

Accuracy numbers report Val Acc@1 averaged over five runs (the five best epochs of each run). No pre-training is involved: every model is trained from scratch with the baseline recipe.

Architecture | Connection   | CIFAR-10     | CIFAR-100    | Tiny ImageNet | ImageNet-1k
ViT-S        | Linear       | 89.82 ± 0.34 | 71.92 ± 0.24 | 51.30 ± 0.40  | 70.76 ± 0.26
ViT-S        | Orthogonal-F | 90.61 ± 0.21 | 73.86 ± 0.31 | 52.57 ± 0.71  | 72.53 ± 0.49
ViT-B        | Linear       | 87.28 ± 0.41 | 68.25 ± 0.88 | 55.29 ± 0.71  | 73.27 ± 0.58
ViT-B        | Orthogonal-F | 88.73 ± 6.06 | 75.07 ± 0.43 | 57.87 ± 0.37  | 77.05 ± 0.21
ResNetV2-18  | Linear       | 95.06 ± 0.15 | 77.67 ± 0.28 | 62.04 ± 0.29  | –
ResNetV2-18  | Orthogonal-F | 95.26 ± 0.12 | 77.87 ± 0.27 | 62.65 ± 0.14  | –
ResNetV2-34  | Linear       | 95.49 ± 0.09 | 78.92 ± 0.31 | 64.61 ± 0.24  | –
ResNetV2-34  | Orthogonal-F | 95.75 ± 0.13 | 78.97 ± 0.04 | 65.46 ± 0.30  | –

Orthogonal-G behaves similarly on CNNs (see paper for full table). The biggest jumps appear in ViT-B, where residual streams are both long and high-dimensional.

Training Dynamics

Orthogonal updates accelerate convergence and improve time-to-accuracy on ImageNet-1k with ViT-B/16. Stream norms stay near-constant, so steps act like rotations instead of magnifying the residual.
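Why "rotation-like"? A short derivation, under the idealized assumption that the added component is exactly orthogonal to the incoming stream h_l:

```latex
% Norm growth under an exactly orthogonal update h_{l+1} = h_l + f_\perp(h_l):
\|h_{l+1}\|^2
  = \|h_l\|^2 + 2\,\langle h_l, f_\perp \rangle + \|f_\perp\|^2
  = \|h_l\|^2 + \|f_\perp\|^2,
\qquad \text{since } \langle h_l, f_\perp \rangle = 0.
```

Because the cross term vanishes, the parallel (pure rescaling) pathway is removed, and the stream norm can grow only through the orthogonal energy added at each block.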

Figure: Training loss vs. iterations (ViT-B, ImageNet-1k).
Figure: Validation Acc@1 vs. wall-clock time (ViT-B, ImageNet-1k).

Efficiency at a Glance

The orthogonal projection adds O(s·d) FLOPs per Transformer block (sequence length s, feature dimension d). Measured throughput overhead is small for ViTs and moderate for CNNs.

Architecture | Linear (img/s) | Ortho-F (img/s) | Overhead
ResNetV2-34  | 1737.2         | 1634.0          | 5.94%
ResNetV2-50  | 1002.8         | 876.7           | 12.58%
ViT-S        | 3476.1         | 3466.3          | 0.28%
ViT-B        | 1270.1         | 1246.2          | 1.88%

Feature-wise projection is vectorization-friendly, and its cost remains negligible compared with the attention and FFN FLOPs. See the paper for FLOP breakdowns and mixed-precision implementation tips.
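As a rough sanity check of the O(s·d) claim, the back-of-envelope estimate below compares the projection cost to a standard 4x-expansion FFN for ViT-B-like shapes; the constants (a handful of multiply-adds per feature, s = 197, d = 768) are illustrative assumptions, not measured numbers.

```python
# Rough FLOP comparison for one Transformer block (illustrative constants).
s, d = 197, 768                      # tokens, feature dimension (ViT-B/16-like)

# Feature-wise orthogonal projection per token: dot product, squared norm,
# scale-and-subtract -- roughly a few multiply-adds per feature.
proj_flops = 4 * s * d               # ~O(s*d); exact constant depends on the kernel

# FFN with the usual 4x expansion: two matmuls of size d x 4d.
ffn_flops = 2 * 2 * s * d * (4 * d)  # ~16 * s * d^2

print(f"projection ≈ {proj_flops / 1e6:.2f} MFLOPs, FFN ≈ {ffn_flops / 1e9:.2f} GFLOPs")
print(f"projection / FFN ≈ {proj_flops / ffn_flops:.5%}")
```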

Inside the Residual Stream

We log how stream norms and parallel energy evolve per block. Linear updates suppress the parallel component and eventually shrink the residual magnitude; orthogonal updates keep energy balanced across depth.
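Below is a minimal sketch of how such traces can be collected, assuming access to each block's input x and output f (for example via forward hooks). The metric names and averaging are illustrative, not the repository's logging code.

```python
import torch
import torch.nn.functional as F

def stream_diagnostics(x: torch.Tensor, f: torch.Tensor, eps: float = 1e-6) -> dict:
    """Per-block diagnostics: stream norm, alignment of f with x, and parallel energy.

    x: incoming residual stream, f: block output, both shaped (..., d).
    Returns scalars averaged over batch and tokens (illustrative definitions).
    """
    x_norm = x.norm(dim=-1)                                  # ||x|| per token
    cos = F.cosine_similarity(f, x, dim=-1)                  # alignment of f with x
    f_parallel_norm = (f * x).sum(-1).abs() / (x_norm + eps) # |<f, x>| / ||x||
    return {
        "stream_norm": x_norm.mean().item(),
        "cos(f, x)": cos.mean().item(),
        "parallel_norm": f_parallel_norm.mean().item(),
    }

# Usage sketch: call inside a block's forward pass, or from a forward hook that
# sees both the block input x and output f, and log the values per block index.
```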

Figure: Stream norm, MLP blocks (ViT-S, Tiny ImageNet).
Figure: Parallel component norm, MLP blocks.
Figure: Stream norm, attention blocks.
Figure: Parallel component norm, attention blocks.

Across both MLP and attention paths, linear updates drive the parallel energy toward zero after the transition point, while orthogonal updates keep it active and stabilize stream norms.

Figure: Applying orthogonal updates more frequently improves Tiny ImageNet accuracy (ViT-S, N=3).
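One way to realize the update-probability knob is to gate the projection stochastically per forward pass. The sketch below is an assumption for illustration, not necessarily the paper's exact sampling scheme.

```python
import torch

def maybe_orthogonal_update(x: torch.Tensor, f: torch.Tensor,
                            p_ortho: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """With probability p_ortho keep only the orthogonal part of f; otherwise add f as-is.

    Illustrative gating for the update-probability ablation.
    """
    if torch.rand(()).item() < p_ortho:
        coeff = (f * x).sum(-1, keepdim=True) / ((x * x).sum(-1, keepdim=True) + eps)
        f = f - coeff * x            # strip the component parallel to the stream
    return x + f
```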

Resources

Implementation

The GitHub repository documents the full PyTorch implementation, training recipes, and diagnostics we used for the poster. Follow the README workflows for ImageNet-1k and Tiny ImageNet reproductions.

Trained Models

Hugging Face hosts ViT-B checkpoints and logs under ortho-vit-b-imagenet1k-hf. Toggle the ortho config to switch between feature-wise and global projections.

Reproducibility

Training scripts, configs, and evaluation notes in the GitHub repo make the experiments fully reproducible end-to-end. Refer to the repository issues and wiki for ongoing updates.

BibTeX

@misc{oh2025revisiting,
  title        = {Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks},
  author       = {Giyeong Oh and Woohyun Cho and Siyeol Kim and Suhwan Choi and Youngjae Yu},
  year         = {2025},
  eprint       = {2505.11881},
  archivePrefix= {arXiv},
  primaryClass = {cs.LG}
}

Acknowledgements & Licensing

This project page follows the presentation style pioneered by the Nerfies and LLaVA websites, both shared under the CC BY-SA 4.0 license. We thank the open-source community for releasing tooling and templates that made this work possible.

Usage and License Notices: All code, data pointers, and checkpoints linked from the Orthogonal Residual Update project are intended for research use only. Please ensure your usage complies with the licenses of upstream datasets (e.g., CIFAR, Tiny ImageNet, ImageNet-1k) and any referenced frameworks or pretrained models.