Implications of AI Convergence
Larger AI models will eventually become commodities, and this creates significant opportunity
...In that Empire, the Art of Cartography attained such Perfection that the map of a single Province occupied the entirety of a City, and the map of the Empire, the entirety of a Province. In time, those Unconscionable Maps no longer satisfied, and the Cartographers Guilds struck a Map of the Empire whose size was that of the Empire, and which coincided point for point with it. The following Generations, who were not so fond of the Study of Cartography as their Forebears had been, saw that that vast Map was Useless, and not without some Pitilessness was it, that they delivered it up to the Inclemencies of Sun and Winters. In the Deserts of the West, still today, there are Tattered Ruins of that Map, inhabited by Animals and Beggars; in all the Land there is no other Relic of the Disciplines of Geography.
—Suarez Miranda, Viajes de varones prudentes, Libro IV, Cap. XLV, Lerida, 1658
Scaling Laws and AI Convergence
Recent developments in AI have followed the principle of “Scaling Laws”: bigger is always better. So far this has turned out to be the case. Bigger models, consuming more compute and trained on more data, have invariably delivered better performance and greater generalizability. Gone are the days of building a domain-specific text or image classifier; a GPT-4 or DALL-E 3 can do it all. This development, however, can only be sustained up to a point. What is the usefulness of an AI whose scale requires all the world's computational power to work?
The core challenge in AI today lies in translating the impressive capabilities of frontier models into practical applications that significantly boost productivity. While scaling laws have driven progress, and the cost of a model at any given level of capability falls over time, the cost of training each new frontier model rises, and often the cost of using these models outweighs their economic benefit. The limitations to AI advancement are also increasingly rooted in political and economic factors. The finite nature of compute, electricity, and data constrains the "bigger is better" approach. As the value of these resources becomes more apparent, access becomes restricted, with companies like NVIDIA controlling GPU availability and platforms like Reddit monetizing their data through APIs.
What is needed are products or workflows that make models more efficient: maps that usefully explain the territory without being exact replicas of it. To achieve this, we need to peek under the hood, into the latent space. A recent paper, the Platonic Representation Hypothesis, suggests that bigger, better models are more likely to converge in their understanding of the world, i.e. their representations in the latent space, regardless of how the model was built, what training data was used, and what objectives the model was given. This finding holds even when the modality of the data differs, with the best models capturing similar representations of the world whether they were trained on text or on images. This convergence in model behavior has significant implications for the future of AI development and adoption.
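To make the idea of representational convergence concrete, here is a minimal sketch, in Python, of the kind of mutual nearest-neighbor comparison used in this line of work to check whether two models organize the same inputs similarly in latent space. The function names and dimensions are illustrative, and toy random vectors stand in for real model embeddings; this is not the paper's exact metric.

```python
import numpy as np

def knn_indices(emb: np.ndarray, k: int) -> np.ndarray:
    """For each row, return the indices of its k nearest neighbors (cosine similarity)."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T
    np.fill_diagonal(sims, -np.inf)          # exclude self-matches
    return np.argsort(-sims, axis=1)[:, :k]

def mutual_knn_alignment(emb_a: np.ndarray, emb_b: np.ndarray, k: int = 10) -> float:
    """Fraction of nearest neighbors the two models agree on, averaged over inputs.
    Higher values mean the two latent spaces arrange the same inputs similarly."""
    nn_a, nn_b = knn_indices(emb_a, k), knn_indices(emb_b, k)
    overlaps = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))

# Toy usage: rows are embeddings of the same 100 inputs from two different models.
rng = np.random.default_rng(0)
emb_text_model = rng.normal(size=(100, 768))    # stand-in for a text encoder
emb_image_model = rng.normal(size=(100, 512))   # stand-in for an image encoder
print(mutual_knn_alignment(emb_text_model, emb_image_model))
```

With real embeddings, the hypothesis predicts this score rising as the two models become larger and more capable, even when they differ in architecture, training data, or modality.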
To illustrate the concept of convergence, consider the analogy of specialists versus generalists in science. Specialists, equipped with their specific toolkit, often view problems through the lens of their expertise. Physicists might reduce everything to physical phenomena, while psychologists might interpret everything through a psychological framework. Their expertise is well tuned to problems squarely within their domain, but not to general, unified theories. Scientific revolutions tend to be instigated not by specialists but by generalists, who see relationships between fields that are hard to notice within narrow specialties and develop new methods by which to attack phenomena (think of the history of computing and the role of generalists like Leibniz and von Neumann in its advance).
Before the advent of Transformers in 2017, AI development mirrored the specialist approach, with models tailored to specific tasks and domains. However, Transformers, with their ability to process and generate diverse forms of data, have ushered in an era of generalist AI models. Unsurprisingly, this has led to remarkable progress in AI capabilities.
The scale and generalizability of foundation models at the frontier of AI, like GPT-4 or Gemini, may raise worries about the concentration of power that AI advancement enables, making some hesitant to accelerate AI adoption. The observed trend that the more capable and generalist models become, the more they converge in their understanding of the world has several implications that run counter to the concentration narrative. Convergence offers optimism that foundation models developed by AI leaders will eventually become commodities, as evidenced both by the increasing similarity between frontier models and by the shrinking timeline between the release of a frontier model and its open-source equivalent. Appreciating the insights of convergence provides an opportunity to increase AI adoption in the short term through the proliferation of more numerous models.
Natural vs Social Facts
The examples used to identify model convergence are primarily factual, focusing on tangible aspects of the world. However, numerous domains require understanding beyond mere facts. In fields like law, the interpretation of terms and their operational logic are context-dependent and shaped by legal precedents and interpretations. Generalist models, while adept at recognizing patterns in data across domains, struggle to grasp the nuances of specific domains where meaning is not solely derived from objective reality.
Furthermore, certain concepts exist solely in the context of human relationships and social structures. For instance, understanding poverty involves not only its economic definition but also its social implications and the relative experiences of individuals within a society. This "social ontology," the idea that the world is made up of more than just objects and that social concepts in some sense really exist, is unlikely to be fully captured in a singular one-size-fits-all model.
Existing approaches such as Retrieval-Augmented Generation (RAG) and fine-tuning aim to supply the context- and domain-specific understanding lacking in current frontier models. As models become more capable, the ability to take advantage of these approaches only increases, with the underlying architectures required to run RAG workloads becoming more commoditized. Vertical AI, the idea of turning generalist foundation models into specialist approaches tailored to a vertical domain, would, given the limits imposed by social ontology, continue to add value on top of increasingly performant foundation models.
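A minimal sketch of the RAG pattern just described: retrieve the most relevant documents from a vertical corpus and fold them into the prompt sent to a general model. Everything here is illustrative; the hashing-based embed function stands in for a real embedding model, the legal snippets are fictional, and no actual LLM call is made.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy hashing-based embedding; a production system would use a trained encoder.
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Rank corpus documents by similarity to the query and keep the top k.
    doc_vecs = np.stack([embed(d) for d in corpus])
    scores = doc_vecs @ embed(query)
    return [corpus[i] for i in np.argsort(-scores)[:k]]

def build_prompt(query: str, corpus: list[str]) -> str:
    # Ground the generalist model in domain context before asking the question.
    context = "\n".join(retrieve(query, corpus))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

legal_corpus = [
    "Precedent X holds that 'reasonable notice' depends on contract duration.",
    "Statute Y defines poverty thresholds for eligibility purposes.",
    "Ruling Z interprets 'material breach' narrowly in service agreements.",
]
print(build_prompt("What counts as reasonable notice?", legal_corpus))
```

The point of the sketch is that the domain-specific understanding lives in the corpus and the retrieval step, not in the weights of the foundation model, which is what lets vertical products ride on top of increasingly capable (and commoditized) general models.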
In scenarios where a single vertical AI is insufficient, users may need to choose among various AIs based on factors like performance, domain specificity, or cost. This presents a significant opportunity for products that incorporate effective routing between AIs, selecting the most suitable model based on the specific task and the relevant social ontology. Such products could enhance the overall utility and effectiveness of AI applications by ensuring that the right tool is used for the right job.
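A sketch of what such routing might look like in practice, using an invented model registry: prefer a domain specialist when one qualifies, otherwise fall back to a general model, trading off quality and cost. The model names, scores, and prices are placeholders, not real offerings.

```python
from dataclasses import dataclass

@dataclass
class ModelSpec:
    name: str
    domains: set[str]          # verticals the model is tuned for
    cost_per_1k_tokens: float
    quality: float             # rough benchmark score, higher is better

REGISTRY = [
    ModelSpec("general-frontier", {"general"}, 0.0300, 0.95),
    ModelSpec("legal-vertical",   {"law"},     0.0100, 0.90),
    ModelSpec("small-local",      {"general"}, 0.0005, 0.70),
]

def route(domain: str, min_quality: float) -> ModelSpec:
    """Pick the cheapest model that covers the domain and meets the quality bar,
    preferring domain specialists over general fallbacks."""
    candidates = [m for m in REGISTRY
                  if (domain in m.domains or "general" in m.domains)
                  and m.quality >= min_quality]
    candidates.sort(key=lambda m: (domain not in m.domains, m.cost_per_1k_tokens))
    return candidates[0] if candidates else REGISTRY[0]

print(route("law", min_quality=0.85).name)      # -> legal-vertical
print(route("poetry", min_quality=0.92).name)   # -> general-frontier
```

In a real product the routing signal would come from classifying the task and its social ontology rather than a hand-labeled domain string, but the economics are the same: the router ensures the right tool is used for the right job at the lowest acceptable cost.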
Another potential implication of increased AI performance leading to a convergent ontology is that the success of AIs that encode certain ontologies can be used to test how true those ontologies are. This would provide a scientific means of testing certain philosophical positions with a measurable mechanism: the performance of an AI.
Any to Any Models
The convergence of AI capabilities across modalities suggests the potential for "any-to-any" models that can seamlessly process and generate various forms of data and media. Imagine AI models that not only understand and generate text and images but also effortlessly transition between audio, video, and other formats. Whereas convergence on understanding is limited by social ontology, necessitating vertical AI solutions for specific domains, convergence across modalities unlocks a horizontal AI opportunity that could revolutionize user experiences, enabling more intuitive and fluid interactions with technology.
TikTok, arguably the most effective AI-driven platform to date, offers a glimpse into this future. One of the mechanisms behind TikTok's high engagement is its auto-generated transcripts for video uploads. The dynamic and visually appealing text overlays allow users to grasp the video's content even without audio, increasing the app's overall engagement. This simplification of cross-modality content production demonstrates how AI can increase the total addressable time individuals spend interacting with a product.
True cross-modality would allow AI models to interact with people more often, increasing the surface area of their impact. This could manifest as personalized recommendations, content generation tailored to individual tastes, or even AI-powered assistants that anticipate our next move. Such ubiquitous AI would become embedded into our daily lives, influencing everything from our entertainment choices to our work processes. The rise of any-to-any models would also raise important questions about the nature of media and its impact on society. As the lines between different forms of media blur, we may need to rethink traditional notions of authorship, intellectual property, and the role of media in shaping our perceptions of the world.
The rise of any-to-any models also creates strategic questions for businesses building experiences, questions that likely prevent any one incumbent from dominating all markets. Marshall McLuhan's famous dictum, "the medium is the message," suggests that the form in which information is conveyed inherently shapes its meaning and impact. In a world of any-to-any AI, the distinction between different media becomes increasingly fluid.
Consider the choice between Amazon's Audible and Spotify audiobooks. The former is a bet that content matters (I am reading a physical book I bought from Amazon and want to continue it on my drive, so I use Audible), while the latter is a bet that medium matters (I feel like listening to something, and I choose between a song and an audiobook). While both offer the same content, the platform itself influences the user experience and potentially the way the content is perceived. In a future where AI can seamlessly translate between audio, text, and visual formats, platform design and experience will play more of a role than content or even AI capabilities.
Much of the technical capacity for this already exists, and products that design effective platform mechanics will only benefit from increasingly performant frontier models. Many dismiss the current crop of AI products as simply wrappers around frontier models like GPT-4, but some of these products, which have worked around the weaknesses of a previous generation of foundation models through things like system prompts, have the opportunity to build better platform mechanics that increase AI penetration.
The Individual Experience
As the best models converge, enabling better vertical products built for specific domains and horizontal products that move across modalities, the opportunity for AI to penetrate a greater share of a person’s day increases. Aligning the use of AI to an individual’s needs and expectations, however, is important if these tools are to be more retentive, engaging, and impactful.
An opportunity for individualizing the AI experience is presented by the field of representational alignment, an approach to aligning AI models' understanding with human values and intentions. The convergence of AI capabilities in frontier models, combined with advancements in representational alignment, offers a promising path towards addressing a significant barrier to widespread adoption: prompt engineering. Communicating with AIs feels unnatural for most tasks because the model's understanding does not match each user's, and this has been a major driver of low retention.
Each individual operates in the world not solely on Platonic Representations but on a personal understanding of the world around them. That understanding is not static either; the world is its own model, shaping understanding through each person’s social and physical environment. We see the need to adapt one’s conceptual understanding in the prevalence of the term “code-switching” in management. Lacking a complete explicit formulation of how an individual understands the world, representational alignment offers a way to learn from their behaviors and customize models to respond in a way that feels natural to each specific user.
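As a rough illustration of how alignment to a single user might work, the sketch below learns per-dimension weights over a model's embedding space from user-provided triplet judgments ("item B is more like item A than item C is"), so that weighted distances better match that user's sense of similarity. This is a toy formulation under simplifying assumptions, not a method drawn from the representational alignment literature.

```python
import numpy as np

def align_to_user(embeddings: np.ndarray, triplets: list[tuple[int, int, int]],
                  lr: float = 0.05, steps: int = 200) -> np.ndarray:
    """Learn per-dimension weights so that, for each user-labeled triplet
    (anchor, judged-similar, judged-different), the weighted space pulls the
    similar item closer to the anchor than the different one (hinge loss)."""
    w = np.ones(embeddings.shape[1])
    for _ in range(steps):
        grad = np.zeros_like(w)
        for a, p, n in triplets:
            d_pos = (w * (embeddings[a] - embeddings[p])) ** 2
            d_neg = (w * (embeddings[a] - embeddings[n])) ** 2
            margin = d_pos.sum() - d_neg.sum() + 1.0
            if margin > 0:  # only violated triplets contribute to the gradient
                grad += 2 * w * ((embeddings[a] - embeddings[p]) ** 2
                                 - (embeddings[a] - embeddings[n]) ** 2)
        w = np.clip(w - lr * grad / max(len(triplets), 1), 0.0, None)
    return w

# Toy usage: 5 items in a 4-d latent space, one user judgment.
rng = np.random.default_rng(1)
emb = rng.normal(size=(5, 4))
user_triplets = [(0, 1, 2)]   # "item 1 is more like item 0 than item 2 is"
print(align_to_user(emb, user_triplets))
```

The appeal of this framing is that the judgments come from ordinary behavior (clicks, edits, rejections) rather than explicit prompt engineering, so the model's representation drifts towards the user's instead of the other way around.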
The more cross-modal data that can be collected on a user’s values, beliefs, and understanding, the more performant the tailoring to that user can be. By incorporating a router that intelligently selects the most appropriate model based on context, AI-powered personal assistants can adapt to individuals' diverse needs and preferences throughout their day. This adaptability can lead to more personalized and effective AI interactions, ultimately increasing user satisfaction and adoption.
The recent Apple Intelligence announcements indicate the benefits of taking an individualized approach, particularly in terms of privacy and cost-effectiveness. If larger models can be leveraged to develop high-performing vertical AIs capable of handling specific tasks without extensive computational resources for inference, and a router can intelligently determine when to utilize a particular model, the potential for embedding AI into various products expands significantly.
These approaches to personalization have mostly been explored in language models, but convergence suggests that models of every modality behave in similar ways. Representationally aligning image models, for example, would dramatically improve their usefulness in marketing domains. Most marketers I’ve spoken to who have tested image models for creatives have complained that the outputs differ markedly from what they pictured. Certain products have leveraged system prompts to create styles that make some of this work easier, but there are limits to how well that performs. What if image generators could learn to align with individual representations the way LLMs increasingly are? The ability to chain these models across modalities would increase, and the opportunity for productivity impacts would explode.
Conclusion
The growing convergence of AI capabilities at the frontier suggests that there is some objective reality that models are moving towards capturing as they get bigger and more performant. There is much hubbub around the nearness of AGI, super-capable models that can perform any task a human can and then some, but one need not believe in this to appreciate the importance of accelerating AI adoption. The capabilities are there to deliver large value, and not enough of that value has been realized in the economy yet. Understanding what current AI research suggests about future capabilities provides a guide to building, right now, products and services that will only deliver more value over time.
While the "bigger is better" approach has driven substantial progress, AI's immediate future lies in harnessing these models' emergent properties to develop more efficient, specialized, and user-friendly solutions. This entails tailoring frontier models vertically to specific domains with unique social ontologies, such as law and medicine, where understanding context and nuance is paramount. This involves expanding AI capabilities horizontally across modalities, taking advantage of the cross-modal convergence to enable seamless interaction with various forms of data and media. Furthermore, focusing on personalization and developing lightweight models that can be easily deployed on various devices can democratize access to AI and accelerate its adoption across different sectors.
By embracing these strategies, we can move beyond the resource-intensive approach of scaling models and focus on creating AI solutions that are not only powerful but also practical, maps that guide but do not cover the territory.