
This Week In AI: (January 20th - 26th 2025)

This week in AI: Trump unveils $500B Stargate project; OpenAI launches ChatGPT Operator; DeepSeek's R1 chatbot rises; new AI benchmark HLE introduced.

Daniel Bethell

27 Jan 2025 — 4 min read

Generated by DALL-E using the article below as a prompt.

Trump announces Project Stargate

ChatGPT Operator

DeepSeek R1

Humanity's Last Exam

Enhancing Image Generation with Chain-of-Thought Reasoning

This week in AI saw major developments, including Trump’s announcement of Project Stargate, a $500 billion initiative to bolster U.S. AI infrastructure. OpenAI introduced ChatGPT Operator, a web-based AI assistant, while DeepSeek R1 emerged as a strong competitor to ChatGPT with its cost-efficient training and rapid adoption. In research, Humanity's Last Exam (HLE) set a new benchmark for LLM capabilities, emphasizing graduate-level reasoning challenges, and research on Chain-of-Thought reasoning demonstrated breakthroughs in enhancing image generation quality.

Trump announces Project Stargate

The Stargate project, announced by President Donald Trump on January 21, 2025, is a $500 billion initiative to enhance U.S. AI infrastructure. Spearheaded by OpenAI, Oracle, and SoftBank, it aims to build state-of-the-art data centres nationwide, fostering AI innovation and, according to the announcement, creating over 100,000 jobs. The project prioritizes high-speed connectivity and advanced hardware to keep the U.S. a leader in AI development. It seeks to address computational bottlenecks, though it remains uncertain whether academic researchers will get access to the infrastructure. It marks a significant move to strengthen the nation's competitive edge in AI.

Since the announcement, Elon Musk (CEO of SpaceX and Tesla) has publicly criticized the project, claiming that its backers do not actually have the money they have pledged.

ChatGPT Operator

OpenAI has recently introduced "Operator," an AI agent designed to perform various web-based tasks on behalf of users. Currently available to ChatGPT Pro subscribers in the U.S. at $200 per month, Operator can handle activities such as ordering groceries, booking travel, and filling out online forms. It interacts with web pages through a browser, utilizing GPT-4o's vision capabilities combined with advanced reasoning. OpenAI is collaborating with companies like Instacart, Uber, and eBay to enhance Operator's functionality. While promising, the tool is in its early stages and may encounter challenges with complex web interfaces. OpenAI emphasizes user control and has implemented safeguards to ensure privacy and security.

DeepSeek R1

Chinese AI company DeepSeek made headlines with the launch of its chatbot, R1, which achieved rapid success by topping Apple's App Store rankings and entering the top 10 in UC Berkeley's Chatbot Arena. Remarkably, DeepSeek reportedly trained R1 for just $5.6 million, significantly less than typical AI development budgets. This low reported cost has sparked concerns among U.S. investors about China's growing AI competitiveness and its potential impact on the valuation of American tech stocks.

Humanity's Last Exam

In research, Humanity's Last Exam (HLE) introduces a challenging benchmark to evaluate the capabilities of advanced language models. With over 3,000 rigorously designed questions across diverse subjects like mathematics, humanities, and sciences, HLE aims to address the limitations of current benchmarks, which many models have saturated. Questions are designed to resist retrieval-based solutions and require graduate-level expertise. Current leading AI models, including GPT-4 variants, show less than 10% accuracy on HLE, highlighting a significant gap between AI and expert human performance. The benchmark also emphasizes accurate calibration, as models often respond incorrectly with high confidence. HLE's public release provides a critical tool for assessing the progress and limitations of AI in closed-ended academic tasks, enabling informed discussions in research and policy.

Compared to the saturation of some existing benchmarks, HLE accuracy remains low across several frontier models, demonstrating its effectiveness for measuring advanced, closed-ended, academic capabilities.

HLE should pose a significant challenge for LLM research going forward. In many LLM papers and related benchmark papers, most models easily achieve high scores, which makes it hard to tell them apart. These graduate-level questions expose the flaws that remain in current models.
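To make the calibration point concrete, here is a minimal Python sketch of how accuracy and a calibration gap can be computed from a model's answers and its stated confidences. It uses a simple binned expected calibration error, which is one common way to quantify miscalibration; the exact metric used by HLE may differ. The `Answer` type, the function names, and the example data are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    correct: bool      # whether the model's answer matched the reference
    confidence: float  # the model's stated confidence, in [0, 1]


def accuracy(answers):
    """Fraction of answers that are correct."""
    return sum(a.correct for a in answers) / len(answers)


def expected_calibration_error(answers, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for a in answers:
        # Clamp so that confidence == 1.0 falls into the last bin.
        idx = min(int(a.confidence * n_bins), n_bins - 1)
        bins[idx].append(a)

    total = len(answers)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        bin_acc = sum(a.correct for a in bucket) / len(bucket)
        bin_conf = sum(a.confidence for a in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(bin_acc - bin_conf)
    return ece


# Made-up results: a model that is usually wrong but very confident.
overconfident = [Answer(False, 0.9)] * 9 + [Answer(True, 0.9)]
print(f"accuracy: {accuracy(overconfident):.2f}")
print(f"calibration error: {expected_calibration_error(overconfident):.2f}")
```

A model that answers few questions correctly while reporting high confidence shows a large gap between the two numbers, which is the failure mode described above.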

Enhancing Image Generation with Chain-of-Thought Reasoning

This paper explores the application of Chain-of-Thought (CoT) reasoning strategies to autoregressive image generation, aiming to improve the alignment and quality of generated images. The authors introduce the Potential Assessment Reward Model (PARM) and its enhanced variant, PARM++, which adaptively assess and refine image generation in a step-by-step process. These reward models address limitations in existing Outcome Reward Models (ORM) and Process Reward Models (PRM) by enabling fine-grained, step-wise evaluations and incorporating a reflection mechanism for iterative self-correction. Experiments on the GenEval benchmark demonstrate significant improvements in image-text alignment and generation quality, surpassing baseline models by 24% and outperforming advanced systems like Stable Diffusion 3 by 15%. The findings highlight the potential of integrating CoT reasoning with image generation for more robust and accurate results.
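The general idea of scoring candidates step by step, as opposed to only at the end, can be sketched in a few lines of Python. This is a toy illustration and not the authors' implementation of PARM: the candidate representation (a list of numbers standing in for a partially generated image), the function names, and the threshold are all assumptions, and the reflection step of PARM++ is not modelled.

```python
import random
from typing import Callable, List

Candidate = List[float]  # hypothetical stand-in for a partially generated image


def generate_with_stepwise_reward(
    step_fn: Callable[[Candidate, random.Random], Candidate],
    potential_fn: Callable[[Candidate], float],
    final_score_fn: Callable[[Candidate], float],
    n_candidates: int = 4,
    n_steps: int = 5,
    threshold: float = 0.0,
    seed: int = 0,
) -> Candidate:
    """Advance several candidates step by step, dropping unpromising ones."""
    rng = random.Random(seed)
    candidates: List[Candidate] = [[] for _ in range(n_candidates)]
    for _ in range(n_steps):
        # Extend every surviving candidate by one generation step.
        advanced = [step_fn(c, rng) for c in candidates]
        # Step-wise assessment: keep only candidates judged promising.
        survivors = [c for c in advanced if potential_fn(c) >= threshold]
        # Always keep the single best one so the search never runs empty.
        candidates = survivors or [max(advanced, key=potential_fn)]
    # Outcome-level selection among the candidates that made it to the end.
    return max(candidates, key=final_score_fn)


def toy_step(candidate: Candidate, rng: random.Random) -> Candidate:
    """Toy generation step: append one random value."""
    return candidate + [rng.uniform(-1.0, 1.0)]


def toy_score(candidate: Candidate) -> float:
    """Toy reward: the sum of the values generated so far."""
    return sum(candidate)


best = generate_with_stepwise_reward(toy_step, toy_score, toy_score)
print(len(best), round(toy_score(best), 3))
```

The design point is that a purely outcome-based reward would only run the final `max`, while the step-wise check inside the loop prunes weak candidates early.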