Introduction
Artificial Intelligence (AI) did not begin with today’s data-hungry deep networks. It began in the 1950s with a symbolic wager: that intelligence could be captured in explicit rules and logical formalisms. Over decades, that wager was challenged by the return of neural networks, reinvigorated by the mathematics of backpropagation. The breakthrough of deep learning in the 2010s, culminating in AlexNet, shifted the paradigm decisively towards learning systems. Then, in 2017, the Transformer architecture redefined natural language processing and general-purpose AI.
Today, we stand at another inflection point: the move from predictive models to Agentic AI — systems capable of planning, using tools, and self-improving.
This blog traces that arc, weaving technical depth with historical narrative.
1. Symbolic AI: The Age of Logic and Rules (1950s–1970s)
The Dartmouth Conference of 1956, led by John McCarthy, Marvin Minsky, Claude Shannon, and others, framed AI as a symbolic enterprise. The central claim — the Physical Symbol System Hypothesis — argued that symbol manipulation was both necessary and sufficient for intelligence (Newell & Simon, 1976).
McCarthy’s LISP (1960) provided a language for symbolic reasoning. Expert systems (1970s–80s) codified knowledge into “if–then” rules.
These systems worked well in narrow, well-structured domains. But they struggled with ambiguity, noise, and the messy realities of the physical world.
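The flavour of those "if–then" systems can be sketched in a few lines. This is an illustrative forward-chaining rule engine, not any historical system; the rules and facts here are invented for the example:

```python
# A minimal forward-chaining rule engine in the spirit of 1970s expert
# systems. Each rule is (set of required facts, fact to conclude).
rules = [
    ({"has_fever", "has_cough"}, "possible_flu"),
    ({"possible_flu", "short_of_breath"}, "see_doctor"),
]

def forward_chain(facts, rules):
    """Repeatedly fire any rule whose conditions are all satisfied,
    until no rule adds a new fact."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

print(forward_chain({"has_fever", "has_cough", "short_of_breath"}, rules))
```

The brittleness is visible even here: an input fact with a slightly different spelling, or a situation not anticipated by the rule author, simply fails to fire anything.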
2. Neural Networks: From Perceptrons to the “AI Winter”
In parallel, Frank Rosenblatt’s Perceptron (1958) showed that machines could learn patterns from data. Yet the optimism was short-lived.
Minsky & Papert’s Perceptrons (1969) mathematically demonstrated that single-layer perceptrons could not represent even simple functions like XOR. Without efficient algorithms to train multilayer networks, funding dried up. Neural nets entered a decades-long winter.
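The XOR limitation is easy to see empirically. The brute-force grid search below is an illustration, not Minsky and Papert's proof: it checks that no single linear threshold unit on a coarse weight grid reproduces XOR:

```python
import itertools

# XOR truth table: inputs and targets.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]

def predicts_xor(w1, w2, b):
    """True if a single linear threshold unit reproduces XOR exactly."""
    return all((w1 * x1 + w2 * x2 + b > 0) == bool(t)
               for (x1, x2), t in zip(X, y))

# Search a grid of weights from -2.0 to 2.0 in steps of 0.25.
grid = [i / 4 for i in range(-8, 9)]
found = any(predicts_xor(w1, w2, b)
            for w1, w2, b in itertools.product(grid, repeat=3))
print(found)  # False: no single-layer unit on this grid solves XOR
```

XOR requires a curved (non-linear) decision boundary, which a single layer cannot produce; a hidden layer fixes this, but in 1969 there was no practical way to train one.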
3. Backpropagation and the Return of Learning (1980s–1990s)
The renaissance came with backpropagation, popularised by Rumelhart, Hinton & Williams (1986). Using the chain rule, backprop efficiently calculated gradients through multiple layers, enabling multilayer perceptrons (MLPs) to learn distributed internal representations.
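The chain-rule mechanics can be made concrete with a tiny two-layer network. This is a minimal sketch (random weights, squared-error loss), verified against a numerical gradient rather than trained to convergence:

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer MLP: x -> tanh(W1 x) -> W2 h, squared-error loss.
W1 = rng.normal(size=(3, 2))
W2 = rng.normal(size=(1, 3))
x = np.array([0.5, -1.0])
t = np.array([1.0])

def loss(W1, W2):
    h = np.tanh(W1 @ x)
    y = W2 @ h
    return 0.5 * np.sum((y - t) ** 2)

# Backprop: apply the chain rule layer by layer, from output to input.
h = np.tanh(W1 @ x)
y = W2 @ h
dy = y - t                            # dL/dy
dW2 = np.outer(dy, h)                 # dL/dW2
dh = W2.T @ dy                        # dL/dh
dW1 = np.outer(dh * (1 - h ** 2), x)  # dL/dW1, through tanh'(z) = 1 - h^2

# Sanity check against a finite-difference gradient for one entry of W1.
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
numeric = (loss(W1p, W2) - loss(W1, W2)) / eps
print(abs(numeric - dW1[0, 0]) < 1e-4)  # True
```

The key insight of Rumelhart, Hinton & Williams was that this backward pass costs about the same as the forward pass, regardless of depth.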
Challenges soon emerged:
- Vanishing and exploding gradients (Bengio et al., 1994) limited depth.
- Recurrent nets failed to capture long-term dependencies, until LSTMs (Hochreiter & Schmidhuber, 1997) mitigated gradient decay.
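The vanishing-gradient problem follows directly from the chain rule: each sigmoid layer multiplies the gradient by a factor of at most 0.25, so the product shrinks geometrically with depth. A few lines make the decay visible:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# sigma'(z) = sigma(z) * (1 - sigma(z)) is at most 0.25 (at z = 0), so a
# gradient passing through many sigmoid layers shrinks geometrically.
z = 0.5
local_grad = sigmoid(z) * (1 - sigmoid(z))
for depth in (1, 10, 30):
    print(depth, local_grad ** depth)
```

At 30 layers the surviving gradient is on the order of 1e-19: effectively zero, which is why early deep networks simply failed to learn in their lower layers.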
Though promising, neural nets still lacked the compute and data to rival symbolic approaches.
4. The Deep Learning Breakthrough: AlexNet and Beyond (2006–2015)
By the mid-2000s, three conditions aligned:
- Compute: GPUs enabled large-scale parallel training.
- Data: datasets like ImageNet provided millions of labelled examples.
- Optimisation tricks: ReLU activations (Nair & Hinton, 2010), Dropout (Srivastava et al., 2014), BatchNorm (Ioffe & Szegedy, 2015), and ResNets (He et al., 2016).
The watershed moment was AlexNet (Krizhevsky, Sutskever & Hinton, 2012), which cut the ImageNet top-5 error rate from 26.2% to 15.3% and shocked the computer vision community. It proved that end-to-end learned features could outperform hand-engineered pipelines.
Case Study – ImageNet & AlexNet:
Before AlexNet, vision systems relied on handcrafted features like SIFT and HOG. AlexNet demonstrated that a deep convolutional architecture, trained on GPUs with ReLU and Dropout, could learn superior features. This convinced industry — from Google to Facebook — to invest heavily in deep learning.
5. Transformers: Attention as Architecture (2017–Present)
The leap in language came not from convolution or recurrence, but from attention.
5.1 How Transformers Work
The 2017 paper Attention Is All You Need (Vaswani et al.) introduced the Transformer, which replaced recurrence with self-attention.
- Input embeddings: Words (or subwords) are mapped into vectors.
- Positional encoding: Since attention is order-agnostic, sinusoidal position encodings inject sequence order.
- Self-attention mechanism: Each token generates three vectors: Query (Q), Key (K), and Value (V). Attention scores are computed as:

  $$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

  This allows each token to "attend" to all others, capturing long-range dependencies.
- Multi-head attention: Multiple parallel attention heads learn different types of relationships.
- Feed-forward layers: After attention, position-wise feed-forward networks add nonlinearity.
- Residual connections + LayerNorm: Stabilise deep training.
Because every token attends to every other, self-attention's cost grows quadratically with sequence length; but, crucially, it allows parallel training across all tokens, unlike RNNs.
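The attention formula above translates almost line for line into code. This is a single-head NumPy sketch for illustration, not the paper's implementation (no masking, batching, or learned projections):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) similarity matrix
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8
Q, K, V = (rng.normal(size=(n, d)) for d in (d_k, d_k, d_v))
out, w = attention(Q, K, V)
print(out.shape, w.sum(axis=-1))  # (4, 8), each row of weights sums to 1.0
```

Each output row is a weighted mixture of all value vectors, which is exactly the "every token attends to every other" behaviour described above.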
5.2 Scaling Laws and Foundation Models
Transformers scale elegantly: more parameters, more data, more compute → reliably better performance (Kaplan et al., 2020).
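Kaplan et al. characterise this relationship as a power law. Schematically, when parameter count $N$ is the binding constraint, test loss falls as:

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
```

where $N_c$ and $\alpha_N$ are empirically fitted constants (the paper reports $\alpha_N \approx 0.076$ for their setup). Analogous laws hold for dataset size and compute, which is what made scaling a predictable engineering strategy rather than a gamble.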
Case Study – BERT (2018):
BERT used masked language modelling and bidirectional attention to achieve state-of-the-art across NLP benchmarks. It proved that pre-training on vast corpora followed by fine-tuning on small datasets could generalise across tasks.
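The masked-language-modelling objective is simple to sketch. The "[MASK]" token and the 15% masking rate follow the BERT paper, but this is a simplification: real BERT also replaces some selected tokens with random or unchanged tokens (the 80/10/10 scheme), which is omitted here:

```python
import random

random.seed(0)

def mask_tokens(tokens, rate=0.15):
    """Hide ~15% of tokens; the model must predict them from both the
    left and right context (hence 'bidirectional')."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < rate:
            targets[i] = tok          # ground truth the model must recover
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

tokens = "time flies like an arrow and fruit flies like a banana".split()
print(mask_tokens(tokens))
```

Because the supervision signal comes from the text itself, any large corpus becomes training data, which is what made pre-training at scale feasible.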
Case Study – GPT (2018–2023):
OpenAI’s GPT series scaled Transformers to hundreds of billions of parameters, showing emergent abilities: in-context learning, reasoning, and coding.
6. Why Neural Nets Triumphed Over Symbolic AI
- Representation learning: Neural nets discover abstractions directly from data.
- Scalability: Transformers leverage parallel hardware and internet-scale data.
- Robustness: Learned features adapt where brittle rules collapse.
- Generalisation: Foundation models are general-purpose priors, adaptable with minimal task-specific tuning.
Symbolic AI never disappeared — constraint solvers, planners, and verifiers are still embedded within hybrid neuro-symbolic systems — but the locus of progress shifted to learning.
7. Towards Agentic AI (2022–2025)
The next step is not merely larger models, but agentic behaviours layered on top:
- ReAct (Yao et al., 2022): interleaves reasoning (chain-of-thought) with actions (API calls, tool use).
- Toolformer (Schick et al., 2023): self-supervised discovery of when and how to use external tools.
- Tree-of-Thoughts (Yao et al., 2023): structured search across reasoning branches.
- Reflexion (Shinn et al., 2023): self-improvement via feedback loops.
- Voyager (Wang et al., 2023): autonomous skill acquisition in Minecraft.
Case Study – ReAct Agents:
Where a plain GPT model can only predict the next token, ReAct agents plan step by step:
1. Think ("I need today's stock price").
2. Act (call an API).
3. Reflect (was the answer coherent?).

This tightens the loop from prediction to autonomy.
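The think–act–reflect loop can be sketched with stubbed components. Here `llm_think` and `call_tool` are hypothetical stand-ins for a language model and an external API; the stock price is invented for the example:

```python
def llm_think(question, observations):
    # Stub "reasoning" step: a real agent would prompt an LLM here to
    # decide the next step from the question and observations so far.
    if not observations:
        return ("act", "stock_price", "ACME")      # Thought -> Action
    return ("finish", f"ACME trades at {observations[-1]}")

def call_tool(tool, arg):
    # Stub tool: a real agent would call an external API here.
    return {"stock_price": "$123.45"}[tool]

def react(question, max_steps=5):
    """Alternate reasoning and acting until the model decides to finish."""
    observations = []
    for _ in range(max_steps):
        step = llm_think(question, observations)
        if step[0] == "finish":
            return step[1]                         # Reflect and answer
        _, tool, arg = step
        observations.append(call_tool(tool, arg))  # Act, then observe

print(react("What is ACME's stock price today?"))
# -> ACME trades at $123.45
```

The essential design choice is that tool outputs are fed back into the reasoning context, so each thought is grounded in fresh observations rather than the model's parameters alone.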
Conclusion: From Symbols to Agents
The arc of AI is a story of representation.
- Symbolic AI represented knowledge as explicit rules.
- Neural nets represented patterns in distributed weights.
- Transformers represented context and meaning through attention across massive corpora.
- Agentic AI represents a new phase: embedding goals, plans, memory, and action into these powerful learners.
If the 1950s were about describing intelligence, and the 2010s about predicting with intelligence, the 2020s may be about acting with intelligence — responsibly, safely, and aligned with human values.
References
- McCarthy, J. (1960). Recursive functions of symbolic expressions and their computation by machine, Part I. Communications of the ACM.
- Newell, A., & Simon, H. A. (1976). Computer science as empirical inquiry: Symbols and search. Communications of the ACM.
- Minsky, M., & Papert, S. (1969). Perceptrons. MIT Press.
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature.
- Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks.
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation.
- Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. ICML.
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. NeurIPS.
- Srivastava, N., et al. (2014). Dropout: A simple way to prevent neural networks from overfitting. JMLR.
- Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML.
- He, K., et al. (2016). Deep residual learning for image recognition. CVPR.
- Vaswani, A., et al. (2017). Attention is all you need. NeurIPS.
- Devlin, J., et al. (2018). BERT: Pre-training of deep bidirectional Transformers for language understanding. NAACL.
- Kaplan, J., et al. (2020). Scaling laws for neural language models. arXiv.
- Yao, S., et al. (2022). ReAct: Synergizing reasoning and acting in language models. arXiv.
- Schick, T., et al. (2023). Toolformer: Language models can teach themselves to use tools. arXiv.
- Yao, S., et al. (2023). Tree of Thoughts: Deliberate problem solving with large language models. arXiv.
- Shinn, N., et al. (2023). Reflexion: Language agents with verbal reinforcement learning. arXiv.
- Wang, G., et al. (2023). Voyager: An open-ended embodied agent with large language models. arXiv.

