This Week In AI (March 3rd - 9th, 2025)
This week in AI: Microsoft challenges OpenAI, a Pokémon AI agent rises, RL pioneers win big, memory-efficient Adam debuts, and RL gains sample efficiency!

Table of Contents
Microsoft Steps Out of OpenAI’s Shadow
Claude Plays Pokémon
Larry Page's New AI Venture
Hugging Face's Chief Scientist Warns of AI Becoming 'Yes-Men'
Prime Video Tests AI-Aided Dubbing to Enhance Global Accessibility
Pioneers of Reinforcement Learning Awarded 2024 Turing Award
SlimAdam: A Low-Memory Alternative to Adam
PokéChamp: An LLM-Powered AI Achieves Expert-Level Pokémon Battling
How Random Reshuffling Can Improve Reinforcement Learning
This week in AI, Microsoft attempts to challenge OpenAI in the AI market, a Pokémon-playing AI agent is making waves, and the pioneers of RL are awarded a major prize. In research, Pokémon is again at the forefront this week, a memory-efficient version of Adam is introduced to allow for better scalability, and random reshuffling is shown to improve the sample efficiency of RL agents.
Microsoft Steps Out of OpenAI’s Shadow
Microsoft, historically a significant supporter and collaborator of OpenAI, is now intensifying its efforts to develop its own advanced AI models. This strategic move aims to reduce reliance on OpenAI's technology and enhance Microsoft's proprietary AI capabilities. Reports indicate that Microsoft's AI division, under the leadership of Mustafa Suleyman, has developed a series of in-house AI models known as "MAI". These models reportedly perform comparably to leading models from OpenAI and Anthropic on standard benchmarks, though these reports remain unconfirmed. Microsoft is experimenting with integrating the MAI models into its Copilot services and is considering releasing them as APIs for external developers by the end of the year.
In addition to internal development, Microsoft is exploring AI models from other companies, such as xAI, Meta, and DeepSeek, as potential alternatives to OpenAI's technology in its Microsoft 365 Copilot. This diversification strategy seeks to lower costs and reduce dependence on OpenAI's GPT-4, initially used in 365 Copilot.
The relationship between Microsoft and OpenAI has experienced some tension, particularly regarding technical collaborations. OpenAI's recent partnership with Oracle and SoftBank to develop new data centres in the U.S., known as the Stargate project, signifies a shift away from its exclusive reliance on Microsoft's Azure platform.
Claude Plays Pokémon
An exciting new Twitch channel is taking gaming and AI communities by storm. Launched recently, the Twitch channel "ClaudePlaysPokemon" features Claude 3.7 Sonnet, Anthropic's latest AI model, as it navigates the challenges of Pokémon Red. The AI's performance has been commendable; it has successfully earned three badges from Pokémon gym leaders, a notable improvement over its predecessor, Claude 3.5 Sonnet, which struggled to move beyond the starting area of Pallet Town. However, the journey isn't without challenges. Claude occasionally misinterprets game elements, such as confusing non-player characters for gym leaders or struggling with navigation. These moments of AI uncertainty add an element of unpredictability, engaging viewers who are eager to see how Claude overcomes obstacles.

At the core of this setup is a knowledge base, which functions as Claude’s persistent memory, storing key information such as past battles, rival encounters, and strategic preferences. This allows Claude to recall past events and make decisions based on accumulated experience, rather than treating every encounter as new. Claude interacts with the game using a set of tools, including an emulator interface that allows it to press buttons, a navigator that reads game tile data to determine walkable paths, and a state parser that extracts critical information directly from the game’s RAM. The AI operates within a structured loop: it composes a prompt using past interactions, executes tool-based commands, evaluates its success, and then updates its memory accordingly. Since the game requires ongoing decisions, progressive summarization is used to manage context length, ensuring that Claude doesn’t forget important details while avoiding context overflow. This system enables Claude to play Pokémon in a way that resembles how a human would—learning from past experiences, navigating the game world dynamically, and adapting to challenges as they arise.
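For those curious about the mechanics, here is a minimal Python sketch of how such a memory-plus-tools loop could be wired together. Everything here is an illustrative placeholder (the KnowledgeBase class, the state_parser, navigator, and emulator tools, and the llm.complete call), not the actual ClaudePlaysPokemon harness.

```python
# Minimal sketch of a tool-using agent loop with persistent memory and
# progressive summarization. All names are hypothetical illustrations,
# not the real ClaudePlaysPokemon code.

class KnowledgeBase:
    """Persistent memory: past battles, rival encounters, strategic notes."""
    def __init__(self):
        self.entries = []

    def add(self, note: str):
        self.entries.append(note)

    def summarize(self, max_entries: int = 50) -> str:
        # Progressive summarization: keep recent detail, compress older notes
        # so the prompt never exceeds the model's context window.
        recent = self.entries[-max_entries:]
        older = self.entries[:-max_entries]
        summary = [f"[{len(older)} older events compressed]"] if older else []
        return "\n".join(summary + recent)

def agent_step(llm, tools, memory: KnowledgeBase):
    # 1. Compose a prompt from persistent memory and the current game state.
    state = tools["state_parser"].read_ram()             # parse state from RAM
    walkable = tools["navigator"].walkable_tiles(state)  # read tile data
    prompt = (
        f"Memory:\n{memory.summarize()}\n"
        f"State: {state}\nWalkable tiles: {walkable}\n"
        "Choose the next button presses and explain why."
    )

    # 2. Ask the model for an action plan, then execute it via the emulator.
    plan = llm.complete(prompt)                    # hypothetical LLM interface
    result = tools["emulator"].press_buttons(plan.buttons)

    # 3. Evaluate the outcome and update memory for future turns.
    memory.add(f"Tried {plan.buttons}; outcome: {result.outcome}")
    return result
```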
Larry Page's New AI Venture
Larry Page, co-founder of Google, has embarked on a new entrepreneurial journey with the establishment of Dynatomics, an AI startup focused on revolutionizing product manufacturing. The company aims to harness AI to generate highly optimized product designs, streamlining the transition from conceptualization to factory production.
This move aligns with a broader industry trend where AI is increasingly integrated into manufacturing to improve efficiency, quality, and customization. Companies like Orbital Materials and PhysicsX are also exploring AI applications in material discovery and engineering simulations, respectively. Dynatomics' focus on AI-driven design automation positions it to contribute significantly to this evolving landscape.
Hugging Face's Chief Scientist Warns of AI Becoming 'Yes-Men'
Thomas Wolf, the chief science officer and co-founder of Hugging Face, has expressed concerns that current AI systems are evolving into "overly compliant helpers" rather than catalysts for groundbreaking scientific discoveries. In a recent essay, Wolf argues that while AI excels at following instructions and synthesizing existing information, it lacks the capacity to challenge prevailing assumptions or propose novel hypotheses—qualities essential for true scientific innovation.
Wolf emphasizes that real scientific breakthroughs often stem from questioning established knowledge and venturing into uncharted intellectual territories. He cautions that without a shift in AI research focus towards fostering these capabilities, AI risks becoming mere "yes-men on servers," unable to drive the revolutionary advancements that some proponents anticipate.
What do you think? Do you believe AI research should focus more on discovery and self-reasoning, or do you think that Wolf may be concerned about nothing? Let us know in the comments!
Prime Video Tests AI-Aided Dubbing to Enhance Global Accessibility
Amazon's Prime Video has initiated a pilot program employing AI to dub select movies and television series, aiming to make its extensive content library more accessible to a global audience. This initiative introduces AI-assisted dubbing in English and Latin American Spanish for 12 licensed titles that previously lacked dubbed versions, including films like "El Cid: La Leyenda" and "Mi Mamá Lora."
The traditional process of dubbing involves hiring voice actors to record new dialogue tracks, which can be both time-consuming and costly. By integrating AI into this process, Amazon seeks to reduce these barriers, enabling a broader range of content to be available in multiple languages. The company emphasizes a hybrid approach, combining AI-generated voiceovers with oversight from localization professionals to ensure the quality and naturalness of the dubbed audio.
The use of AI-generated dubbing raises ethical concerns for voice actors and localization artists, who risk losing work to automated systems. While AI can make content more accessible, replacing human performers with synthetic voices undermines the artistic nuance and emotional depth that professional dubbing actors bring to a performance. If studios prioritize cost-cutting over quality, it could erode opportunities for skilled artists, sparking debates about fair compensation and the role of AI in creative industries.
Pioneers of Reinforcement Learning Awarded 2024 Turing Award
As a reinforcement learning researcher myself, I am so happy to share with you that Andrew Barto and Richard Sutton have been honoured with the 2024 A.M. Turing Award for their foundational contributions to reinforcement learning. Their work has significantly influenced modern AI applications, including game-playing algorithms and language models.

Reinforcement learning enables machines to learn optimal behaviours through trial and error, guided by rewards and penalties. This approach has been instrumental in developing systems like Google's AlphaGo, which defeated world champion Go players, and OpenAI's ChatGPT, which utilizes reinforcement learning from human feedback to enhance its responses.
Barto and Sutton's collaboration began in the late 1970s, leading to the co-authorship of "Reinforcement Learning: An Introduction" in 1998, a seminal textbook that remains a standard reference in the field.
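For readers new to the field, the trial-and-error idea their textbook formalizes can be captured in a few lines. Below is a minimal tabular Q-learning sketch on a toy environment; the hyperparameters and environment interface are illustrative assumptions, not a specific system from their work.

```python
import random
from collections import defaultdict

# Tabular Q-learning: the agent tries actions, receives rewards, and nudges
# its value estimates toward better behaviour over time.
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount, exploration
Q = defaultdict(float)                   # Q[(state, action)] -> value estimate

def choose_action(state, actions):
    # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, actions):
    # Temporal-difference update: move Q toward reward + discounted best future value.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```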
SlimAdam: A Low-Memory Alternative to Adam
Researchers from the University of Maryland this week introduced SlimAdam, a memory-efficient variation of the widely used Adam optimizer that retains Adam’s performance and stability while reducing the optimizer's memory usage by up to 98%.
The key innovation behind SlimAdam is its adaptive compression strategy, which applies SNR analysis to determine which second-moment tensors can be safely replaced with their mean values. By analyzing architecture, training hyperparameters, and dataset properties, SlimAdam decides whether compression is beneficial or harmful for each layer of a model. This ensures that memory savings do not compromise optimization quality, maintaining performance equivalent to Adam while significantly reducing memory usage.
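To make the idea concrete, here is a rough Python sketch of a per-layer SNR test in the spirit of SlimAdam. The axis choice, threshold, and epsilon are assumptions for illustration; the paper's exact compression rule may differ.

```python
import torch

# Sketch of SlimAdam's compression idea: if a layer's second-moment statistics
# have a high signal-to-noise ratio along some dimension, store only their mean
# instead of the full tensor. Threshold and axis are illustrative assumptions.
def maybe_compress_second_moment(v: torch.Tensor, dim: int = -1,
                                 snr_threshold: float = 1.0):
    mean = v.mean(dim=dim, keepdim=True)
    var = v.var(dim=dim, keepdim=True)
    snr = (mean ** 2 / (var + 1e-12)).mean()   # how "flat" v is along `dim`
    if snr > snr_threshold:
        # Values are close to their mean: keep only the mean (big memory saving).
        return mean, True
    # Low SNR: per-element information matters, so keep the full tensor.
    return v, False

# Inside an Adam-style step, a compressed moment is broadcast back, e.g.:
# denom = compressed_v.sqrt().add_(eps)   # broadcasting restores the full shape
```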
As AI models continue to grow in size, efficient memory utilization is becoming a critical bottleneck. SlimAdam offers a scalable solution, making it possible to train larger models on existing hardware without additional computational costs. It allows for more memory to be allocated to model parameters or activations, leading to better utilization of GPU/TPU resources.
PokéChamp: An LLM-Powered AI Achieves Expert-Level Pokémon Battling
A new AI agent that integrates large language models (LLMs) with game-theoretic planning can achieve expert-level performance in competitive Pokémon battles.
PokéChamp is built around a minimax tree search framework, commonly used in AI for strategic games like chess and Go. However, traditional minimax search becomes computationally infeasible in Pokémon battles due to the immense number of possible game states—an estimated 10^354 possible board configurations in the first turn alone. To address this, the researchers replace three key components of standard minimax search with LLM-driven decision-making. Firstly, instead of brute-force searching all possible moves, PokéChamp uses an LLM to suggest viable strategic actions, reducing the decision space. Then, it predicts what its opponent is likely to do, using battle history and metagame data from over 3 million competitive matches. Finally, instead of running an exhaustive search, the AI stops at a certain depth and asks the LLM to evaluate the game state, estimating which player is in a stronger position.
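Here is a stripped-down Python sketch of that depth-limited search with the three LLM-driven substitutions. The helper functions (llm_propose_actions, llm_predict_opponent, llm_evaluate_state, simulate) are hypothetical placeholders, not the authors' code.

```python
# Depth-limited minimax-style search where an LLM (i) proposes candidate
# actions, (ii) models the opponent, and (iii) evaluates leaf positions.
# All helper functions below are hypothetical placeholders.

def pokechamp_search(state, depth, llm):
    if depth == 0 or state.is_terminal():
        # (iii) LLM value function: estimate which player is ahead.
        return llm_evaluate_state(llm, state), None

    best_value, best_action = float("-inf"), None
    # (i) LLM action sampler: a handful of plausible moves, not every legal one.
    for action in llm_propose_actions(llm, state):
        # (ii) LLM opponent model conditioned on battle history / metagame data.
        opponent_action = llm_predict_opponent(llm, state)
        next_state = simulate(state, action, opponent_action)
        value, _ = pokechamp_search(next_state, depth - 1, llm)
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action
```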

The researchers tested PokéChamp in Pokémon Showdown, the premier online Pokémon battle simulator, across various game formats. Against the best existing LLM-powered bot (PokéLLMon, which runs on GPT-4o), PokéChamp achieved a 76% win rate. It also outperformed the strongest rule-based heuristic bot (Abyssal) with an 84% win rate. Even when running on an open-source 8-billion-parameter Llama 3.1 model, PokéChamp still defeated PokéLLMon 64% of the time. In ladder-based online play, PokéChamp achieved an Elo rating between 1300 and 1500, placing it in the top 10% to 30% of human players. This is a remarkable feat, given that PokéChamp never received additional fine-tuning on Pokémon-specific data—it relied entirely on its general LLM capabilities, structured prompting, and real-time decision-making.
How Random Reshuffling Can Improve Reinforcement Learning
A new paper proposes a novel way to enhance experience replay, a fundamental technique in reinforcement learning (RL). It introduces a method that adapts random reshuffling (RR)—a strategy commonly used in supervised learning—to reinforcement learning environments. The approach improves sample efficiency and stability by ensuring that each transition in the replay buffer is used more systematically, reducing unnecessary variance in training updates.
Experience replay helps RL agents learn from past experiences by storing transitions (state, action, reward, next state) in a replay buffer and sampling from this buffer during training. However, standard implementations typically sample transitions with replacement, meaning that some experiences are overused, while others are never seen at all. This randomness introduces variance in updates, making training less efficient and less stable. In contrast, random reshuffling (RR) is widely used in supervised learning, where an entire dataset is shuffled at the start of each epoch and processed sequentially. Studies have shown that RR leads to faster convergence than sampling with replacement, yet it has rarely been explored in RL due to the dynamic nature of replay buffers, where new transitions constantly overwrite old ones.
To bridge this gap, the researchers propose two new sampling methods that integrate random reshuffling into experience replay. The first, Random Reshuffling with a Circular Buffer (RR-C), reshuffles the indices of the circular replay buffer at the start of each epoch instead of sampling transitions directly. This ensures that each transition is sampled exactly once per epoch, eliminating unnecessary variance in sample counts and making updates more consistent. The second, Random Reshuffling by Masking (RR-M), is designed for prioritized experience replay: it dynamically masks overused transitions so that frequently sampled experiences do not dominate training, approximating epoch-level without-replacement sampling even as priorities evolve.
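Here is a simplified Python sketch of the circular-buffer variant (RR-C). How freshly written transitions interact with an in-progress epoch is simplified here relative to the paper; the class and its interface are illustrative assumptions.

```python
import random

# Sketch of Random Reshuffling with a Circular Buffer (RR-C): the buffer's
# indices are reshuffled once per "epoch", and each stored transition is then
# sampled exactly once before reshuffling again.
class RRCircularBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.storage = [None] * capacity
        self.write_pos = 0
        self.size = 0
        self._epoch_order = []

    def add(self, transition):
        self.storage[self.write_pos] = transition        # overwrite oldest slot
        self.write_pos = (self.write_pos + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size: int):
        assert self.size > 0, "buffer is empty"
        batch = []
        while len(batch) < batch_size:
            if not self._epoch_order:                    # start a new epoch
                self._epoch_order = list(range(self.size))
                random.shuffle(self._epoch_order)
            idx = self._epoch_order.pop()
            batch.append(self.storage[idx])
        return batch
```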
RR-C outperformed standard replay sampling in most Atari games, leading to faster learning and better final scores in algorithms like DQN and C51. RR-M provided stability improvements in prioritized experience replay, reducing the dominance of a small set of high-priority transitions. While the gains were less dramatic than in uniform replay, RR-M still outperformed traditional prioritized sampling in 8 out of 10 games.

The research opens new questions about how RL algorithms can be made more sample-efficient by rethinking how they handle past experiences. As RL continues to be applied to real-world problems—from robotics to autonomous systems—improving stability and efficiency will be critical, and random reshuffling may become a new standard in reinforcement learning.