Residual streams stay balanced
Orthogonal updates avoid runaway norms and prevent layers from collapsing into a single direction, yielding steadier training trajectories.
Accuracy gains where depth matters
Vision Transformers benefit the most: ViT-B improves ImageNet-1k Acc@1 by +3.78 points with the same training recipe.
Poster-ready diagnostics
The poster features block-wise cosine and norm traces, ablations on update probability, and throughput measurements that illuminate the mechanism.
Each block decomposes its output into components parallel and orthogonal to the incoming stream vector. We discard the parallel part and add only the orthogonal component to the residual pathway, which acts like a rotation on the representation manifold. The original module is untouched and still learns freely.
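A minimal PyTorch sketch of this update (our illustration of the feature-wise variant; the function name and the eps guard are ours, not the repository's):

```python
import torch

def orthogonal_update(x: torch.Tensor, f_x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Add only the component of the module output f(x) that is orthogonal to
    the incoming stream x. The projection runs along the last (feature) axis,
    i.e. per token for a Transformer stream of shape (batch, seq, d)."""
    coeff = (f_x * x).sum(dim=-1, keepdim=True) / (x.pow(2).sum(dim=-1, keepdim=True) + eps)
    f_perp = f_x - coeff * x   # discard the parallel component
    return x + f_perp          # norm-preserving to first order: a rotation-like step

# Usage inside a pre-norm Transformer block (illustrative):
#   x = orthogonal_update(x, attn(norm1(x)))
#   x = orthogonal_update(x, mlp(norm2(x)))
```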
Formally, with stream x_n and block output f(x_n), the update is x_{n+1} = x_n + f⊥(x_n), where f⊥(x_n) = f(x_n) − (⟨f(x_n), x_n⟩ / ‖x_n‖²) x_n.
Accuracy numbers report Val Acc@1 averaged over five runs (five best epochs each). Gains do not rely on pre-training: every model is trained from scratch with the baseline recipe.
| Architecture | Connection | CIFAR-10 | CIFAR-100 | Tiny ImageNet | ImageNet-1k |
|---|---|---|---|---|---|
| ViT-S | Linear | 89.82 ± 0.34 | 71.92 ± 0.24 | 51.30 ± 0.40 | 70.76 ± 0.26 |
| ViT-S | Orthogonal-F | 90.61 ± 0.21 | 73.86 ± 0.31 | 52.57 ± 0.71 | 72.53 ± 0.49 |
| ViT-B | Linear | 87.28 ± 0.41 | 68.25 ± 0.88 | 55.29 ± 0.71 | 73.27 ± 0.58 |
| ViT-B | Orthogonal-F | 88.73 ± 6.06 | 75.07 ± 0.43 | 57.87 ± 0.37 | 77.05 ± 0.21 |
| ResNetV2-18 | Linear | 95.06 ± 0.15 | 77.67 ± 0.28 | 62.04 ± 0.29 | — |
| ResNetV2-18 | Orthogonal-F | 95.26 ± 0.12 | 77.87 ± 0.27 | 62.65 ± 0.14 | — |
| ResNetV2-34 | Linear | 95.49 ± 0.09 | 78.92 ± 0.31 | 64.61 ± 0.24 | — |
| ResNetV2-34 | Orthogonal-F | 95.75 ± 0.13 | 78.97 ± 0.04 | 65.46 ± 0.30 | — |
Orthogonal-G, the global-projection variant, behaves similarly on CNNs (see the paper for the full table). The biggest jumps appear in ViT-B, where residual streams are both long and high-dimensional.
Orthogonal updates accelerate convergence and improve time-to-accuracy on ImageNet-1k with ViT-B/16. Stream norms stay near-constant, so steps act like rotations instead of magnifying the residual.
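A one-line check of the rotation intuition, using the update defined above: because f⊥(x_n) is orthogonal to x_n, the Pythagorean identity gives

$$\lVert x_{n+1}\rVert^2 = \lVert x_n + f_\perp(x_n)\rVert^2 = \lVert x_n\rVert^2 + \lVert f_\perp(x_n)\rVert^2,$$

so the stream norm can only grow, and only to second order when ‖f⊥‖ ≪ ‖x_n‖: each step mostly changes the stream's direction, not its magnitude.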
            
            
The orthogonal projection adds O(s·d) FLOPs per Transformer block (sequence length s, width d). Measured throughput overhead is small for ViTs and moderate for CNNs.
| Architecture | Linear (img/s) | Ortho-F (img/s) | Overhead | 
|---|---|---|---|
| ResNetV2-34 | 1737.2 | 1634.0 | 5.94% | 
| ResNetV2-50 | 1002.8 | 876.7 | 12.58% | 
| ViT-S | 3476.1 | 3466.3 | 0.28% | 
| ViT-B | 1270.1 | 1246.2 | 1.88% | 
Feature-wise projection is vectorization-friendly and remains negligible compared with attention/FFN FLOPs. See the paper for FLOP breakdowns and mixed-precision implementation tips.
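To see why, here is a back-of-the-envelope estimate (our rough accounting, assuming ViT-B/16 shapes and counting multiply-adds only; norms, softmax, and biases are ignored):

```python
# Rough per-token multiply-add counts for one ViT-B/16 block (our estimate).
d, s = 768, 197                   # model width; tokens at 224 px (196 patches + CLS)

ffn  = 2 * d * (4 * d)            # the two linear layers of the 4x MLP
attn = 4 * d * d + 2 * s * d      # q/k/v/out projections + score/value matmuls
proj = 2 * (3 * d)                # orthogonal projection after each of the two sublayers

print(f"projection share: {proj / (ffn + attn + proj):.4%}")  # ≈ 0.06%
```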
We log how stream norms and parallel energy evolve per block. Linear updates suppress the parallel component and eventually shrink the residual magnitude; orthogonal updates keep energy balanced across depth.
Across both MLP and attention paths, linear updates drive the parallel energy toward zero after the transition point, while orthogonal updates keep it active and stabilize stream norms.
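A sketch of how such traces can be collected (function and key names are ours; the repository's logging differs in detail):

```python
import torch

@torch.no_grad()
def stream_diagnostics(x: torch.Tensor, f_x: torch.Tensor) -> dict:
    """Per-block traces for a stream x and module output f_x, both of shape
    (batch, seq, d): stream norm, update/stream cosine, and the fraction of
    update energy parallel to the stream."""
    cos = torch.cosine_similarity(f_x, x, dim=-1)     # alignment of update with stream
    return {
        "stream_norm": x.norm(dim=-1).mean().item(),
        "update_cos": cos.mean().item(),
        "parallel_energy": cos.pow(2).mean().item(),  # (f·x̂)² / ‖f‖²
    }
```

Hooked after each attention/MLP sublayer, this yields the kind of block-wise cosine and norm traces shown on the poster.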
The GitHub repository documents the full PyTorch implementation, training recipes, and diagnostics used for the poster. Follow the README workflows for the ImageNet-1k and Tiny ImageNet reproductions.
Hugging Face hosts ViT-B checkpoints and logs under ortho-vit-b-imagenet1k-hf. Toggle the ortho config to switch between feature-wise and global projections.
Training scripts, configs, and evaluation notes in the repository make the experiments reproducible end-to-end. Refer to the repository issues and wiki for ongoing updates.
@misc{oh2025revisiting,
  title        = {Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks},
  author       = {Giyeong Oh and Woohyun Cho and Siyeol Kim and Suhwan Choi and Youngjae Yu},
  year         = {2025},
  eprint       = {2505.11881},
  archivePrefix= {arXiv},
  primaryClass = {cs.LG}
}
          This project page follows the presentation style pioneered by the Nerfies and LLaVA websites, both shared under the CC BY-SA 4.0 license. We thank the open-source community for releasing tooling and templates that made this work possible.
Usage and License Notices: All code, data pointers, and checkpoints linked from the Orthogonal Residual Update project are intended for research use only. Please ensure your usage complies with the licenses of upstream datasets (e.g., CIFAR, Tiny ImageNet, ImageNet-1k) and any referenced frameworks or pretrained models.