AIb2.io - AI Research Decoded

Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns

SWAP-Score judges a neural network by the sample-wise activation patterns it produces, which means it tries to spot a promising model before training has even had time to set the GPU fan screaming.[1]

There is a quiet elegance to that. Neural Architecture Search, or NAS, usually asks a rude question in the most expensive way possible: what if we trained thousands of candidate networks and kept only the winner? Very refined. Very sustainable. Very much the computational equivalent of tasting every soup in the restaurant and then ordering fries. Zero-shot metrics were invented to avoid that mess, but most of them have been fussy, brittle, or loyal to one architecture family only.[2][3]

Peng and colleagues propose something cleaner. Instead of asking how a network performs after training, they ask how its neurons light up across a small batch of samples right now, at initialization. That pattern, they argue, tells you something about the model's expressivity - its room to learn interesting functions rather than just sit there like a decorative houseplant with matrix multiplication.

The Tiny Fingerprint Test

Here is the basic idea.

Feed a mini-batch of inputs through an untrained network. Record, for each sample, which neurons activate and which stay silent. Those on-off patterns form a kind of fingerprint. If the fingerprints vary richly across samples, the network may have more capacity to separate inputs and represent useful structure later.[1][2]

That is the heart of SWAP: Sample-Wise Activation Patterns. Its derived metric, SWAP-Score, turns those patterns into one number you can use to rank candidate architectures. No labels required. No training loop. No three-day GPU vigil with you checking the logs like a Victorian doctor waiting for news from upstairs.
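For the hands-on reader, here is a minimal sketch of the fingerprint idea in Python with NumPy. It is not the authors' implementation: the toy ReLU network, the random weights, and the function names `activation_fingerprints` and `count_distinct_patterns` are all assumptions made for illustration. The exact definition of SWAP-Score is in the paper;[1] this sketch only shows what an on-off matrix looks like and two simple ways to count distinct patterns in it.

```python
import numpy as np

def activation_fingerprints(x, weights):
    """Return a (samples x neurons) 0/1 matrix for a toy ReLU MLP.

    Hypothetical helper: `weights` is a list of dense weight matrices.
    """
    patterns = []
    h = x
    for w in weights:
        pre = h @ w                    # pre-activations of this layer
        patterns.append(pre > 0)       # True where ReLU would fire
        h = np.maximum(pre, 0.0)       # ReLU output feeds the next layer
    return np.concatenate(patterns, axis=1).astype(np.uint8)

def count_distinct_patterns(bits, axis=0):
    """Count distinct rows (axis=0) or distinct columns (axis=1)."""
    return np.unique(bits, axis=axis).shape[axis]

# Untrained toy network: 16 inputs, two hidden layers of 32 neurons each.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))       # mini-batch of 8 unlabeled samples
weights = [rng.standard_normal((16, 32)), rng.standard_normal((32, 32))]

bits = activation_fingerprints(x, weights)
per_sample = count_distinct_patterns(bits, axis=0)   # at most 8 (one per sample)
per_neuron = count_distinct_patterns(bits, axis=1)   # at most 64 (one per neuron)
print(bits.shape, per_sample, per_neuron)
```

Note that no labels and no gradient steps appear anywhere: one forward pass is enough. A count close to its upper bound means few patterns collide, which is the loose intuition behind "richer" fingerprints.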

This is where the paper gets especially neat. Many earlier zero-cost proxies worked better for CNNs or better for Transformers, but not both. SWAP-Score is designed to be architecture-agnostic enough to cross that border.[1][2] In the reported results, the Spearman rank correlation between SWAP-Score and true validation accuracy reached 0.93 for DARTS CNNs on CIFAR-10 and 0.71 for FlexiBERT Transformers on GLUE tasks.[1] In plain English: when SWAP ranked a model highly, the fully trained version often ended up ranking highly too.
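If "Spearman correlation" is a fuzzy memory from a statistics class, it is the ordinary Pearson correlation applied to ranks rather than raw values. A small sketch of that calculation follows. The helper names `ranks` and `spearman` and every number in it are invented for illustration; none of them come from the paper.

```python
import numpy as np

def ranks(values):
    """Rank values from 0 (smallest) upward. Assumes there are no ties."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    ra, rb = ranks(np.asarray(a)), ranks(np.asarray(b))
    return float(np.corrcoef(ra, rb)[0, 1])

# Hypothetical data: five candidate architectures, made-up scores.
proxy_scores = [12.0, 30.5, 18.2, 45.1, 27.3]   # zero-shot proxy, no training
trained_acc = [0.81, 0.88, 0.86, 0.93, 0.90]    # accuracy after full training

rho = spearman(proxy_scores, trained_acc)
print(round(rho, 3))   # 0.9 for these made-up numbers
```

Here the proxy orders four of the five candidates sensibly but swaps two neighbors, and the correlation drops from 1.0 to 0.9. Only the ordering matters, which suits architecture search: the goal is to rank candidates, not to predict their accuracy to the decimal.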

Why This Feels Fresh

A lot of zero-shot NAS work has been a hunt for cheap signals: gradients, Jacobians, synaptic flow, linear regions, trainability hints, and so on.[3][4][5] Useful, yes. Universal, not always.

SWAP's appeal is its restraint. It leaves space - a little ma, if we borrow the Japanese design idea - and reads the structure already present in the network's reactions instead of forcing a full training drama. You are not asking, "Can this architecture win the marathon?" You are asking, "Does it at least walk like someone who has seen a marathon before?"

That matters because architecture search is still expensive, and industry keeps chasing better models for smaller devices, faster inference, and lower power budgets. Efficient model design is not an academic side quest anymore. It shows up in edge deployments, compact language models, and automated search pipelines that need to prune huge design spaces without burning a hole through the electricity bill.[6][7]

The paper also points out a practical bonus: SWAP-Score is label-independent. That means you can use it before downstream fine-tuning, which is particularly attractive for language models. If you can estimate which backbone looks promising before task-specific training, you save time, money, and perhaps one research intern's remaining faith in hyperparameter sweeps.[1]

The Calm Part Where We Do Not Hype Ourselves Into Orbit

Now for the adult supervision.

A strong correlation is not magic. Zero-shot proxies rank candidates; they do not replace full evaluation. Search spaces matter. Tasks matter. Robustness matters. A proxy that predicts clean accuracy well may still miss how a model behaves under distribution shift or adversarial noise.[8]
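One common way to respect that limit is to let the proxy shortlist candidates and then train only the finalists. The sketch below shows that pattern under loud assumptions: `shortlist`, `search`, and the lookup tables standing in for a proxy and a training run are all hypothetical, not part of SWAP or any library.

```python
def shortlist(candidates, proxy_score, k):
    """Return the k candidates with the highest proxy score."""
    return sorted(candidates, key=proxy_score, reverse=True)[:k]

def search(candidates, proxy_score, train_and_evaluate, k=3):
    """Rank cheaply with the proxy, then fully evaluate only the top k."""
    finalists = shortlist(candidates, proxy_score, k)
    results = {c: train_and_evaluate(c) for c in finalists}
    best = max(results, key=results.get)
    return best, results

# Hypothetical stand-ins: lookup tables instead of a real proxy and trainer.
fake_proxy = {"a": 10.0, "b": 9.0, "c": 2.0, "d": 1.0}
fake_accuracy = {"a": 0.90, "b": 0.92, "c": 0.95, "d": 0.50}

best, results = search(
    candidates=["a", "b", "c", "d"],
    proxy_score=fake_proxy.get,
    train_and_evaluate=fake_accuracy.get,
    k=2,
)
print(best, results)
```

With these invented numbers the proxy's favorite is "a", full evaluation of the two finalists picks "b" instead, and the truly best candidate "c" never makes the shortlist. That is the trade in miniature: the proxy saves training runs, and the final evaluation catches some of its mistakes but not all of them.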

Recent work makes that pretty clear. Surveys and benchmarks still describe zero-shot NAS as useful but uneven, especially when methods jump across architectures or objectives.[3][4] ETAS focused on Transformer trainability and expressivity.[5] LPZero tried automating the design of the proxy itself for language models.[6] L-SWAG extended the activation-pattern line of thinking with gradient information for vision transformers.[7] Translation: the field is lively because nobody has found the one metric to rule them all. Anyone claiming otherwise is probably selling a benchmark with suspiciously convenient weather.

Still, SWAP feels like one of the cleaner ideas in this corner of ML. It takes a messy problem - choosing good networks without training them all - and attacks it with a small, sharp observation about how networks respond to data. Not flashy. Just well-composed.

That is often where the best engineering lives.

References

  1. Peng Y, Song A, Fayek HM, Ciesielski V, Chang X. Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2026. DOI: 10.1109/TPAMI.2026.3691075. PubMed: 42096383.
  2. Peng Y, Song A, Fayek HM, Ciesielski V, Chang X. SWAP-NAS: Sample-Wise Activation Patterns for Ultra-fast NAS. ICLR 2024 Spotlight. OpenReview: tveiUXU2aa. arXiv: 2403.04161.
  3. Wu MT, Tsai CW. Training-free neural architecture search: A review. ICT Express. 2024;10(1):213-231. DOI: 10.1016/j.icte.2023.11.001.
  4. Serianni A, Pilanci M. Training-free Neural Architecture Search for RNNs and Transformers. arXiv: 2306.00288, 2023.
  5. Yang J, Liu Y. ETAS: Zero-Shot Transformer Architecture Search via Network Trainability and Expressivity. Findings of ACL 2024. DOI: 10.18653/v1/2024.findings-acl.405.
  6. Dong P, Li L, Liu X, Tang Z, Liu X, Wang Q, Chu X. LPZero: Language Model Zero-cost Proxy Search from Zero. arXiv: 2410.04808, 2024.
  7. Casarin S, Escalera S, Lanz O. L-SWAG: Layer-Sample Wise Activation with Gradients information for Zero-Shot NAS on Vision Transformers. arXiv: 2505.07300, 2025.
  8. Biedenkapp A, Reuther M, Hutter F, Lindauer M. An Evaluation of Zero-Cost Proxies - from Neural Architecture Performance Prediction to Model Robustness. International Journal of Computer Vision. 2025. DOI: 10.1007/s11263-024-02265-7.

Disclaimer: This blog post is a simplified summary of published research for educational purposes. The accompanying illustration is artistic and does not depict actual model architectures, data, or experimental results. Always refer to the original paper for technical details.