News Dashboard

US

Officials warn of 'moderate' risk after TB outbreak in SF high school - SFGATE

Three active cases have been reported at Archbishop Riordan High School since last November.
Read more →

S&P 500 futures are little changed as traders weigh tech giants' earnings: Live updates - CNBC

"Magnificent Seven" companies Meta Platforms, Microsoft and Tesla posted earnings results after Wednesday's close.
Read more →

Bruce Springsteen sings out against Trump in ‘Streets of Minneapolis’ - AP News

Bruce Springsteen has released a new song, “Streets of Minneapolis,” criticizing President Donald Trump's immigration enforcement. The song describes Minneapolis as “a city aflame” under “King Trump’s private army.” Springsteen says he wrote and recorded it o…
Read more →

Trump administration finds California’s ban on ‘forced outing’ of students violates federal law - Politico

Federal officials threatened to pull education funding unless the state takes steps to amend its rules.
Read more →

'Halide' Co-Founder Sebastiaan de With Joins Apple's Design Team - MacRumors

Sebastiaan de With, co-founder of the popular iPhone camera app Halide, today announced that he has joined the Human Interface Design team at Apple. ...
Read more →

Satena: Colombia launches search for missing plane carrying 15 people - BBC

State airline Satena says its aircraft carrying 13 passengers and two crew "suffered a fatal accident".
Read more →

OpenAI Wants To Create Biometric Social Network To Kill X’s Bot Problem - Forbes

OpenAI is quietly building a social network and considering using biometric verification like World’s eyeball scanning orb or Apple’s Face ID to ensure its users are people, not bots.
Read more →

First pediatric flu death in Washington state highlights rising cases across the state - komonews.com

A school-age teenager has died after becoming ill from influenza last week, marking the first pediatric influenza death in the state this season.
Read more →

Brandon Sanderson’s Literary Fantasy Universe ‘Cosmere’ Picked Up by Apple TV (Exclusive) - hollywoodreporter.com

It's an unprecedented deal for the author, whose 'Mistborn' series and 'The Stormlight Archive' are being eyed for film and television adaptation, respectively.
Read more →

A Seat on Trump’s “Board of Peace” Costs $1 Billion. Guess Who Gets the Money. - Slate

The leaders of China, India, and Russia are among those who haven’t yet responded.
Read more →

Home Depot to cut 800 corporate jobs, require workers back to office full time - ajc.com

Home Depot says it is eliminating about 800 corporate jobs tied to its Vinings headquarters.
Read more →

Where a nor’easter will bring heavy snow, strong winds and waves this weekend - The Washington Post

A potent system will bring the potential for blizzard-like conditions and coastal flooding from parts of the Southeast to New England.
Read more →

Trump's National Guard deployments could cost over $1 billion this year, CBO projects - NPR

The operation in Washington, D.C. alone is projected to cost upwards of $660 million if it runs through the end of this year as expected, according to new data released by the nonpartisan Congressional Budget Office.
Read more →

What it’s like each day in Minneapolis - CNN

Residents of the Twin Cities region share their personal accounts of what it’s like to live in the midst of an ICE surge.
Read more →

DEBRIEF: What happened on Day 3 of the Barcelona Shakedown? - F1 - The Official Home of Formula 1® Racing

With the third day of the Barcelona Shakedown done and dusted, F1.com has the lowdown on which teams ran and what the drivers said.
Read more →

The U.S. measles outbreaks. - Tangle News

A closer look at the rise in measles cases across the country.
Read more →

A Comprehensive Network for the Discovery and Characterization of Interstellar Objects Like… - Avi Loeb – Medium

Inspired by the unresolved anomalies displayed by the latest interstellar visitor 3I/ATLAS (as listed here), I co-authored a new paper with…
Read more →

Portland Fire reveals home and away jerseys for 2026 season - oregonlive.com

The 2026 Portland Fire jerseys are here.
Read more →

Business

Fed holds rates steady, signaling risks to economy are dropping - The Washington Post

The central bank opted for its first pause after three rate cuts last year, as officials wait for clearer signs that inflation is cooling.
Read more →

Futures: Meta Jumps, Microsoft Skids; Tesla CapEx To Soar - Investor's Business Daily

Fed chief Jerome Powell said rates may be on pause for longer.
Read more →

Microsoft’s Earnings Surge Is Overshadowed by Data-Center Spending - The Wall Street Journal

No content available
Read more →

Tesla profits slumped 46% last year, as it lost its crown as the top EV seller - NPR

The company announced it was ending production of its higher-end Model S and Model Y, and turning that production space over to making humanoid robots.
Read more →

Amazon announces 16,000 corporate job cuts, shaking Seattle's economy - komonews.com

Amazon has announced mass layoffs that will affect nearly one in 10 members of its corporate workforce, cutting about 16,000 jobs in a move that is already shaking Seattle's economy.
Read more →

Zuckerberg teases agentic commerce tools and major AI rollout in 2026 - TechCrunch

Mark Zuckerberg says 2026 will be "a big year for delivering personal super intelligence."
Read more →

Meta Platforms Stock Investors Just Got Fantastic News from CEO Mark Zuckerberg - The Motley Fool

The social media giant continues to fire on all cylinders.
Read more →

Carvana Dives After Short Seller Criticizes Ties To Lenders - Investor's Business Daily

Carvana stock fell about 20% on Wednesday after a short-seller report alleged the company's earnings depended on shaky loans.
Read more →

Tesla to invest $2B in Elon Musk’s xAI - TechCrunch

Elon Musk's AI company xAI disclosed earlier this month it had raised $20 billion.
Read more →

Bitcoin price news: BTC stuck at $89,000 as gold surges to fresh record - CoinDesk

Gold fans rushed in to buy as the Fed chair said he took no macro signal from the raging bull market in precious metals.
Read more →

OpenAI Wants To Create Biometric Social Network To Kill X’s Bot Problem - Forbes

OpenAI is quietly building a social network and considering using biometric verification like World’s eyeball scanning orb or Apple’s Face ID to ensure its users are people, not bots.
Read more →

ASML Stock Surges On Strong Sales Forecast For 2026 - Investor's Business Daily

No content available
Read more →

Home Depot to cut 800 corporate jobs, require workers back to office full time - ajc.com

Home Depot says it is eliminating about 800 corporate jobs tied to its Vinings headquarters.
Read more →

Starbucks scraps $250,000 cap on boss's use of company jet - BBC

The coffee chain changes Brian Niccol's travel budget due to media attention and "credible threat actors".
Read more →

Bank of America, JPMorgan Chase to contribute $1,000 to Trump Accounts for their employees - CBS News

Two of the biggest U.S. banks said they would match a $1,000 federal contribution for employees who open a Trump Account, touting the plan as a way to save money.
Read more →

The economy's pressure relief valve: the U.S. Dollar - Axios

The U.S. dollar index is down 3.2% since Jan. 16.
Read more →

Trump proposal signals Medicare austerity - Politico

Health insurers who thought Trump would rescue their Medicare businesses got a rude awakening Monday.
Read more →

Anthropic, Apple, OpenAI CEOs condemn ICE violence, praise Trump - TechCrunch

Anthropic's Dario Amodei and OpenAI's Sam Altman spoke out against ICE enforcement tactics following Minneapolis violence, with one addressing concerns publicly and the other in an internal message.
Read more →

UPS Is Firing Its Biggest Customer -- And Wall Street Finally Understands Why - The Motley Fool

UPS is doubling down on reducing its relationship with Amazon.
Read more →

Technology

9front OS

Comments
Read more →

‘Backseat Software’

Mike Swanson: What if your car worked like so many apps? You’re driving somewhere important…maybe running a little bit late. A few minutes into the drive, your car pulls over to the side of the road and asks: “How are you enjoying your drive so far?” Annoyed by the interruption, and even more behind schedule, you dismiss the prompt and merge back into traffic. A minute later it does it again. “Did you know I have a new feature? Tap here to learn more.” It blocks your speedometer with an overlay tutorial about the turn signal. It highlights the wiper controls and refuses to go away until you demonstrate mastery. Ridiculous, of course. And yet, this is how a lot of modern software behaves. Not because it’s broken, but because we’ve normalized an interruption model that would be unacceptable almost anywhere else. I’ve started to think of this as backseat software: the slow shift from software as a tool you operate to software as a channel that operates on you. Once a product learns it can talk back, it’s remarkably hard to keep it quiet. This post is about how we got here. Not overnight, but slowly. One reasonable step at a time. If that lede pulls you in, like it did for me, you’re going to love the rest of the essay. This is one for the ages. It’s so good. ★
Read more →

Let’s Keep an Eye on Apple’s Own iOS Adoption Numbers

When I wrote last week about the false narrative that iOS 26 is seeing bizarrely low adoption rates compared to previous years, I neglected one source: Apple itself. Apple’s Developer site publishes a page with iOS and iPadOS usage for devices that “transacted on the App Store”. The hitch is that they only seem to update those numbers twice a year — once right around now, and once again right before WWDC. As of today, those numbers are still from 4 June 2025. Last year, going from the Internet Archive, the numbers were still from iOS 17 (June 2024) on 23 January last year, but were updated for iOS 18 on 24 January. Here are those iOS 18 numbers from one year ago this week.

iPhones released in the previous four years: iOS 18: 76%; iOS 17: 19%; iOS < 17: 5%
All iPhones: iOS 18: 68%; iOS 17: 19%; iOS < 17: 13%
iPads released in the previous four years: iPadOS 18: 63%; iPadOS 17: 27%; iPadOS < 17: 10%
All iPads: iPadOS 18: 53%; iPadOS 17: 28%; iPadOS < 17: 19%

(Apple itself manages to present these statistics without ever using the plurals iPhones or iPads, instead referring only to “devices”.) I presume, or at least hope, that they’ll update these numbers for iOS 26 any day now. ★
Read more →

Box Office Expectations for ‘Melania’

Jeremy Fuster, reporting for TheWrap: But save for some theaters in Republican-heavy states, the film is unlikely to leave much of an impact at a slumping box office, with theatrical sources telling TheWrap that “Melania” is projected for an opening of around $3 million this weekend. That would put it below the last right-wing documentary, the Daily Wire-produced Matt Walsh film “Am I Racist?,” which opened to $4.5 million from 1,517 locations in September 2024, finishing with a $12.3 million total that made it the highest-grossing doc that year. The highest projections are coming from NRG with an estimate of around $5 million, though audience interest polls from the company have 30% saying they are “definitely not” interested in watching the film, an unusually high count for any wide release. These projections are with a $35 million promotional campaign, for a movie Amazon paid $40 million to purchase. (Via Taegan Goddard.) ★
Read more →

Amazon’s Spending on ‘Melania’ Is a Barely Concealed Bribe

Nicole Sperling and Brooks Barnes, reporting for The New York Times: Amazon paid Ms. Trump’s production company $40 million for the rights to “Melania,” about $26 million more than the next closest bidder, Disney. The fee includes a related docuseries that is scheduled to air later this year. The budget for “Melania” is unknown, but documentaries that follow a subject for a limited amount of time usually cost less than $5 million to produce. The $35 million for marketing is 10 times what some other high-profile documentaries have received. All of which has a lot of Hollywood questioning whether Amazon’s push is anything more than the company’s attempt to ingratiate itself with President Trump. This is a good story, with multiple industry sources with experience making political documentaries, but the Times’s own subhead downplays Amazon’s spending on the film: “The tech giant is spending $35 million to promote its film about the first lady, far more than is typical for documentaries.” They’re spending $35 million now, to promote it, but they already paid $40 million for the rights to the film, $28 million of which is believed to have gone to Melania Trump herself. A $35 million total spend would be a lot compared to other high-profile documentaries, but it’s a $75 million total spend. This is not just a little fishy — it’s a veritable open air seafood market. Back to the Times: To grasp just how uncustomary Amazon’s marketing push for “Melania” is, consider how Magnolia Pictures handled “RBG,” a portrait of Ruth Bader Ginsburg during her 25th year as a Supreme Court justice, in 2018. CNN Films produced “RBG” for around $1 million. The promotional budget, including an awards campaign that helped it land two Oscar nominations, totaled about $3 million. The film debuted in 34 theaters and expanded into 432 locations over several weeks. It ultimately collected $14 million, enough to rank as the year’s No. 1 political documentary. And: On Friday, “Melania” will also be released in 1,600 theaters overseas, where FilmNation, a New York company, is handling distribution in more than 20 countries. International ticket sales are expected to be weak, according to box office analysts. Shocker. ★
Read more →

Kickstarter for Ollie’s Arcade Expansion

Ged Maheux, The Iconfactory: This week we announced a new Kickstarter that’s aimed at expanding the game offerings of Ollie’s Arcade, the fun, ad-free retro gaming app we introduced back in 2023. Ollie’s Arcade has always been a great way to escape doomscrolling, even if just for a little while, and now we have an opportunity to bring these retro games to even more people on iOS. The Kickstarter aims to raise enough money to make all of the in-app purchase games in the app completely free for everyone to enjoy. We also want to bring our beloved puzzle game, Frenzic, to life once again. Frenzic was one of the very first games available on iOS back in 2008, then was reborn as Frenzic: Overtime on Apple Arcade. Since it left, people have been asking us for a new version that they can just pick up and play. We couldn’t agree more! I linked to the Kickstarter for the original Ollie’s Arcade project back in 2023, which was a big success. And I first linked to Frenzic all the way back in 2008, when the App Store was only a few months old. It’s just a great concept for a casual game on a small screen, implemented with all of The Iconfactory’s exquisite attention to detail. That’s true for all the games in Ollie’s Arcade, but Frenzic is special. This new Kickstarter for the Ollie’s Arcade expansion has already hit its funding goal, but it’s approaching the stretch goal for an additional game. There are a zillion games for iOS, but it’s sad how few are ad-free and don’t require a subscription. If you think well-crafted fun games that you can pay for once (for a very reasonable price) should be rewarded, you should join me (and others) in backing this Kickstarter. ★
Read more →

Comparing the Classic and Unified Views in iOS 26’s Phone App

Adam Engst, back in November, at TidBITS: Did you know that, regardless of view, you can now swipe left on any call to reveal a blue clock icon that lets you create a reminder to call back in 1 hour, tonight, tomorrow, or at any custom time (below left, slightly doctored)? Reminders appear at the top of the Calls list and in your default Reminders list. You can also touch and hold a call associated with a contact to connect with them in other ways (below right), or touch and hold a call from an unknown caller to add them to Contacts. I did not know this, until I read Engst’s article. One criticism I’ve seen a few times (but to be clear, not from Engst) ever since Apple debuted the new Unified interface for the Phone app back at WWDC, is that it’s somehow wrong that Apple offers it as option alongside the Classic interface. “When does Apple ever offer options like this?” I’d argue that Apple used to offer options like this all the time. The Music app on the original iPhone (which app was actually named “iPod” for a while) let you customize all the tabs at the bottom. All of Apple’s good Mac apps (the AppKit ones, primarily) still let you customize the entire toolbar. The problem isn’t that Apple now offers two very different interfaces for the Phone app. The problem is that Apple stopped offering users ways to significantly tailor apps to their own needs and tastes — and the proof that they stopped is that so many people now think it’s so strange that they’re offering two options for how the Phone app should look and work. Overall, I like the new Unified layout in the Phone app. But what I love is there remains an option for those who don’t, and that you can switch between the two in a very obvious, easily discoverable (dare I say, hard to miss) way right in the app itself. No need to dig two or three levels deep into the Settings app. You can just switch right there in the main screen of the Phone app itself. It’s things like this that keep me optimistic that Apple is still capable of great new work in UI design. ★
Read more →

Software is mostly all you need

Comments
Read more →

Grid: Forever free, local-first, browser-based 3D printing/CNC/laser slicer

Comments
Read more →

Cutting Up Curved Things

Comments
Read more →

Backseat Software

Comments
Read more →

The WiFi only works when it's raining (2024)

Comments
Read more →

The Hallucination Defense

Comments
Read more →

Gemini CLI gets its hooks into the agentic development loop

Google has added hooks to Gemini CLI, its terminal-based competitor to Anthropic’s Claude Code. Hooks ensure that Gemini CLI runs a given script or program inside of the agentic loop and bring a larger degree of control to the agentic development loop. These could be used, for example, to run security scanners or compliance checks, log tool interactions, inject relevant information into the context window, or even adjust the model’s parameters on the fly. As the Gemini CLI team notes in the announcement, “efficiency in the age of agents isn’t just about writing code faster; it’s about building custom tools that adapt to your specific environment.”

[Figure: Hooks in Gemini CLI (Credit: Google)]

While a developer could try to instruct the agent to run a specific script at certain times within the loop in the prompt or AGENTS.md file, given the non-deterministic nature of those agent models, there’s no guarantee that this will actually happen or that the agent won’t forget about this instruction over time.

Claude Code did it first

If this sounds familiar, it’s likely because you already know about Claude Code Hooks, which first introduced this idea last September (though there is also a GitHub issue from July 2025 that proposes this feature). Google’s implementation is not quite a one-to-one match to Anthropic’s, but it should only take a few minutes to adapt an existing Claude hook to Gemini CLI.

Setting up hooks

Like with hooks in Claude Code, Gemini CLI also implements roughly a dozen lifecycle events where a hook can fire. That may be right at the session start, after the user submits a prompt but before the agent starts planning (to add context, for example), before tools are selected (to optimize the tool selection or filter available tools), and similar moments in the agent loop.

[Figure: Defining a Gemini CLI hook (Credit: Google)]

The hooks are defined as JSON files that describe when they are invoked and which script they should run. Those scripts are standard Bash scripts and Google notes that it is essential to keep those hooks fast because they do run synchronously and delays in the script will also delay the agent response. Google recommends that developers use parallel operations and caching when possible to keep the operations fast. One interesting use case for hooks is to utilize the ‘AfterAgent’ hook, which fires when the agent loop ends, to force the agent into a continuous loop to work on a difficult task — while also refreshing the context between those runs to avoid context rot.

As for security, it’s important to stress that hooks will have the user’s privileges, and Google notes that developers should review the source code of any third-party hooks. Hooks, which are now available as part of the Gemini CLI v0.26.0 update, can also be packaged inside Gemini CLI extensions. That’s Google’s format for packaging prompts, MCP servers, sub-agents, and agent skills — and now hooks — into a single sharable package.

The post Gemini CLI gets its hooks into the agentic development loop appeared first on The New Stack.
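To make the lifecycle-hook idea more concrete, here is a minimal Python sketch of the pattern the article describes: a config maps lifecycle events to scripts that run synchronously inside the agent loop. The event names and config fields are illustrative assumptions, not Gemini CLI's documented schema; the real hooks are defined in JSON files and invoke Bash scripts, as noted above.

```python
import json
import subprocess

# Illustrative hook config; the field names ("event", "command") are assumptions,
# not Gemini CLI's documented schema -- consult Google's docs for the real format.
HOOKS_CONFIG = json.loads("""
{
  "hooks": [
    {"event": "BeforeTool", "command": "./scan_security.sh"},
    {"event": "AfterAgent", "command": "./log_session.sh"}
  ]
}
""")

def run_hooks(event: str, payload: dict) -> None:
    """Run every hook registered for `event`, synchronously.
    As the article notes, slow hooks directly delay the agent's response."""
    for hook in HOOKS_CONFIG["hooks"]:
        if hook["event"] == event:
            # Pass the event payload to the script on stdin as JSON.
            subprocess.run(hook["command"], input=json.dumps(payload),
                           text=True, shell=True, check=False)

def agent_loop(prompt: str) -> str:
    """Toy agent loop showing where lifecycle hooks could fire."""
    run_hooks("SessionStart", {"prompt": prompt})
    # ... plan, select tools ...
    run_hooks("BeforeTool", {"tool": "shell", "args": ["pytest"]})
    # ... execute the tool, generate a response ...
    response = "draft answer"
    run_hooks("AfterAgent", {"response": response})
    return response
```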
Read more →

Flameshot

Comments
Read more →

Meet Gravitino, a geo-distributed, federated metadata lake

In the new world of agentic AI, the discussion has revolved around data: governance, storage, and compute. But what about metadata — the data about data? Metadata has been a second-class citizen, according to Junping (JP) Du, founder and CEO of Datastrato, a data and AI infrastructure company. AI is changing how data — and metadata — is consumed, understood, and governed, so Datastrato created Apache Gravitino, an open source project that serves as a high-performance, geo-distributed, federated metadata lake. The project is designed to be a single, engine-neutral control plane for metadata and governance, tailored to the needs of multimodal, multi-engine AI workloads.

Last year was a big one for Gravitino. In June, it graduated as an Apache Top Level Project. In December, it delivered its first major stable release, version 1.1.0. At the start of 2026, it joined the brand new Agentic AI Foundation.

Gravitino, Du says in this episode of The New Stack Makers, is a “catalog of catalogs, because we try to solve the problems of running the data and AI platforms more safely and consistently.” In the age of AI, Du says, “We need more engine-friendly or agent-friendly metadata and try to unify everything together and [provide] the technical metadata to the engine support as a first-class citizen.”

Gravitino builds a unified data catalog, regardless of whether the data is traditional, structured, or multi-modal. “We all take [these] kind of data formats, and we allow the multiple engines to access this kind of data, so there’s no data silo anymore,” Du says. “And also it can be easy to consume by AI agents — instead of previously, having to be building everything to be at the data warehouse and consume from there.”

Tackling metadata’s governance problem

Du — who spent about 15 years building data infrastructure for the Apache Hadoop project — and Jerry (Saisai) Shao, co-founder and CTO of Datastrato, leaned on their long experience in building cloud data warehouses and lake houses in creating Gravitino. As data and AI systems grew in complexity, engineers encountered recurring problems. “The first [problem] is actually data: It’s spread across multiple engines like Spark, Trino, or even some runtimes like Ray, PyTorch.

“And another problem is the metadata … It’s a siloed catalog instead of a unified catalog to know everything. So, that means the governance, access controls, and even the semantics are hard to build in efficient ways.”

Metadata, Du adds, can be duplicated or inconsistent. AI makes the problem worse, he says, “especially for unstructured data, because it’s hard to manage in a typical way.” In a production environment, especially at enterprise scale, he added, it’s hard to find a single point of truth to define what data exists, how it can be accessed, and how it can be governed. Gravitino was designed to solve those issues. It was built with Java, but supports Python clients.

The use cases for Gravitino include multi-cloud data consolidation, Du says. One of Datastrato’s customers is among the largest internet technology companies in the United States. “They have tons of data,” he says, including a lot of abstracted data. “The data is distributed on-prem and to public clouds. So their compute resources, especially a GPU resource, are distributed over, you know, several clouds and regions. They want the same data, right? It’s available for all these kinds of clouds and regions, so then they can trigger the training jobs or inference jobs or their applications anywhere.” Therefore, “A unified data catalog is very critical, right in this case, to make sure all this data is secure and consistent right across all the locations.”

Check out the full episode to learn more about Gravitino’s use cases, how it fits into the existing commercial and open source tooling landscape, and why the project’s founders decided to donate it to the Agentic AI Foundation.

The post Meet Gravitino, a geo-distributed, federated metadata lake appeared first on The New Stack.
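For a sense of how that "catalog of catalogs" surfaces to client code, here is a rough sketch assuming the Apache Gravitino Python client exposes a GravitinoClient with list/load catalog calls; the endpoint, metalake name, and method details below are placeholders, so treat it as illustrative and check the project docs for the actual API.

```python
# Rough sketch of the "catalog of catalogs" idea described above.
# Assumption: the Apache Gravitino Python client exposes GravitinoClient with
# list_catalogs() and load_catalog(); exact package, constructor, and return
# types may differ from the released API.
from gravitino import GravitinoClient

client = GravitinoClient(
    uri="http://gravitino.internal:8090",  # placeholder metadata-lake endpoint
    metalake_name="enterprise_metalake",   # one metalake spanning clouds and regions
)

# One control plane fronting many underlying catalogs (Hive, Iceberg, Kafka,
# filesets...), so Spark, Trino, Ray, or PyTorch data loaders all resolve the
# same names and governance rules live in one place instead of per-engine silos.
for catalog in client.list_catalogs():
    print(catalog)

lakehouse = client.load_catalog("lakehouse")  # assumed catalog name, for illustration
```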
Read more →

Ramp’s Inspect shows closed-loop AI agents are software’s future

The recent release of the background coding agent Inspect by Ramp’s engineering team serves as a definitive proof point that closed-loop agentic systems are the future of software development. It has transformed coding agents into truly autonomous engineering partners, and it is fundamentally changing the way agents deliver software. Whether teams use a custom cloud development environment (CDE) like Ramp or another approach, the signal is clear: Teams need to solve for this kind of autonomy or risk getting left behind.

Modern engineers need access to coding agents that do not just generate code but also run it, verify the output, and iterate on the solution until it works. This distinction represents a fundamental shift. The industry has been focused on optimizing the “brain” of agents, solving for context windows and reasoning. Ramp’s success validates that the “body” matters just as much. The ability to interact with a runtime environment is what transforms code from a hypothesis into a solution. This verification loop separates truly autonomous coding agents from those that rely on humans to validate their work.

The open-loop bottleneck

Modern coding agents are impressive. They can plan complex refactors and generate thousands of lines of code. However, these agents typically operate in an open loop. They rely on the developer to act as the runtime environment. The agent proposes a solution. The human must compile, test, and interpret error messages or feed them back to the agent. The cognitive load of verification remains with the user.

This workflow caps developer velocity. The speed of the agent is irrelevant if the verification process is slow. We have optimized code generation to be near instantaneous, but verification remains bound by human bandwidth and linear CI pipelines.

Inspect demonstrates that closing that loop unlocks a new category of velocity. By giving the agent access to a sandbox to run builds and tests, the agent transitions from text generator to task completer. It hands off a verified solution rather than a draft. The impact is measurable. Ramp reported vertical internal adoption charts. Within months, approximately 30% of all pull requests merged to its frontend and backend repositories were written by Inspect. This penetration suggests closed-loop agents are a step function change in productivity, not a marginal improvement.

The economics of curiosity

The value proposition of closed-loop agents is not just delivering code faster. It is about the parallelization of solution discovery. In traditional workflows, exploring refactors or library upgrades is expensive. It requires context switching, stashing work and fighting dependency conflicts. Because experimentation costs are high, we experiment less. We stick to safe patterns to avoid the time sink of failure. Background agents change the economics of curiosity. If an engineer can spin up 10 concurrent agent sessions to explore 10 architectural approaches, the cost of failure drops significantly.

Consider a team migrating a legacy component. Currently, this is a multiweek spike. In the new paradigm, a developer could instead task a fleet of agents to attempt the migration using different strategies. One agent might try a strangler fig pattern. Another might attempt a hard cutover. A third might focus on integration tests. The developer then reviews results rather than typing code. The agents run in isolated sandboxes. They build, catch syntax errors, and run test suites until they achieve a green state. The developer wakes up to three potential pull requests verified against the CI pipeline and chooses the best one.

Verification beyond localhost

Ramp’s Inspect platform validates within a custom-built CDE. To ensure these environments start quickly despite their complexity, a sophisticated snapshotting system keeps images warm and ready to launch. Ramp was able to extend this CDE infrastructure to also support integration testing, a brilliant engineering feat that works well for its specific context. However, for many organizations building complex, cloud native applications with high levels of dependencies, this approach faces significant hurdles. Often, the entire stack is too large to be spun up on a single virtual machine (VM) or devpod. In these scenarios, while CDEs remain excellent for replacing local development laptops, high-fidelity integration testing requires a different approach.

To enable true autonomy in these complex environments, we need a way to perform integration testing without replicating the entire world. We can connect agents directly to a shared baseline environment using existing Kubernetes infrastructure. In this model, the agent deploys only the modified service to a lightweight sandbox. The infrastructure uses dynamic routing and context propagation to direct specific test traffic to that sandbox while fulfilling all other dependencies from a shared, stable baseline. This approach gives coding agents the power to execute autonomous end-to-end testing, regardless of the stack’s size or complexity. It leverages the existing cluster to provide high-fidelity context. An agent can then run integration tests against real upstream and downstream services. It sees how the change interacts with the actual message queue schema and the latency of the live database. This closes the loop with higher fidelity while lowering the infrastructure barrier. By testing against a shared cluster, the agent can catch integration regressions that might pass in a hermetic VM without requiring the platform team to build a custom orchestration engine to support it.

The future of software delivery

The release of Inspect is a clear signal of where software development is heading. The era of the human engineer as the sole verifier is ending. We are moving toward a world where agents operate as autonomous partners capable of exploring solutions and verifying their own work. Ramp has proven that this workflow is not science fiction. It is working in production today and is driving massive efficiency gains. The question for the rest of the industry is not whether to adopt this workflow, but how. Whether a team chooses to build a custom platform like Ramp or adopt an existing cloud native solution like Signadot to give their agents a runtime, the imperative is the same. We must provide our agents with a body. We must close the loop between generation and verification. Once we do, we unlock a level of velocity that will define the next generation of high-performing engineering teams.

The post Ramp’s Inspect shows closed-loop AI agents are software’s future appeared first on The New Stack.
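As a concrete illustration of the closed loop the article argues for, here is a minimal Python sketch of a generate-verify-iterate cycle; generate_patch stands in for the LLM call, and the whole thing is a conceptual sketch of the pattern, not Ramp's Inspect implementation.

```python
import pathlib
import subprocess

def apply_patch(sandbox_dir: str, patch: dict[str, str]) -> None:
    """Write model-proposed file contents into the sandbox (stand-in for a real patch step)."""
    for rel_path, contents in patch.items():
        target = pathlib.Path(sandbox_dir) / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(contents)

def run_tests(sandbox_dir: str) -> tuple[bool, str]:
    """Run the suite inside the sandbox; return (passed, combined output)."""
    result = subprocess.run(["pytest", "-q"], cwd=sandbox_dir,
                            capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def closed_loop_agent(task: str, sandbox_dir: str, generate_patch, max_iters: int = 5) -> bool:
    """Generate -> run -> read failures -> regenerate, until the suite is green.
    `generate_patch(task, feedback)` stands in for the model call."""
    feedback = ""
    for _ in range(max_iters):
        patch = generate_patch(task, feedback)   # model proposes code
        apply_patch(sandbox_dir, patch)
        passed, output = run_tests(sandbox_dir)  # the agent verifies its own work
        if passed:
            return True                          # hand off a verified change, not a draft
        feedback = output                        # failures feed the next attempt
    return False
```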
Read more →

PlayStation 2 Recompilation Project Is Absolutely Incredible

Comments
Read more →

County pays $600k to pentesters it arrested for assessing courthouse security

Comments
Read more →

My Mom and Dr. DeepSeek (2025)

Comments
Read more →

Prompting vs. RAG vs. fine-tuning: Why it’s not a ladder

Teams usually assume there’s a straightforward progression from prompt engineering through retrieval-augmented generation (RAG) to fine-tuning (the last rung on the ladder) when customizing large language models (LLMs). This is an easy-to-understand, frequently repeated narrative that is true for some developers but not for all teams working with LLMs in production environments. Prompt engineering, RAG, and fine-tuning are not sequential upgrades in real-world enterprise systems. Instead, they represent different architectural methods for addressing different types of problems and introduce their own limitations and failure modes. Viewing them as a linear progression creates a false narrative that can lead to brittle systems that cannot adapt to changing requirements.

To assess the success or failure of LLM architectures in production environments, a six-dimensional framework outlines the actual constraints that affect whether an LLM system will function well or poorly in production: data privacy, latency, degree of control, update frequency, deployment target, and cost.

When LLM architecture decisions are judged

Most architectural decisions regarding LLMs are made based on assumptions rather than evaluation. It is typically only after releasing the LLM applications that teams realize that the architecture is failing to meet its intended goals. At this time, teams may face difficult questions about the performance of their released LLM: “Why is our response time inconsistent?” “Why did our costs go up this week?” “How is sensitive information showing up in our logs?”

After a poorly performing architecture choice is identified, teams often use weak excuses to justify it, such as “We chose the most advanced architecture available,” or “We are doing things the way everyone else is.” These excuses do not provide sufficient detail to help a team understand why the architecture failed to meet expectations. A good architecture makes its trade-offs visible. A good architecture allows a team to articulate why a particular approach was selected, what benefits it provides, and the potential trade-offs. As a result, teams need to make informed decisions about which approach to select for their specific environment.

The problem with the linear ladder model for LLMs

Each of the three major approaches to customizing large language models — prompt engineering, RAG, and fine-tuning — provides a different set of capabilities and/or constraints. Each is a structural decision that will have significant implications for how the team will interact with the LLM going forward. Many teams receive recommendations on building their LLM systems that are based on a ladder model: Start with prompt engineering; if that doesn’t work, move on to RAG; if RAG doesn’t work, move on to fine-tuning. The ladder model is attractive because it is easy to understand, offers direction and purpose for teams, and conveys a sense of progress.

However, the ladder model fails to account for the reality that teams are not judged on the sophistication of their architectures; instead, they are judged on whether their architectures violate the constraints of their environment. Teams are expected to meet performance, security, and reliability standards. If a team’s architecture prevents its LLM system from meeting these standards, it does not matter whether the team used the “latest and greatest” approach to building its application.

Many of the failures associated with LLMs occur because the architecture does not align with the problem domain’s needs. Examples of architectural failures include:

- Teams experiencing high response latency and unpredictable tail times
- Teams experiencing rapidly increasing operational costs
- Teams experiencing data privacy violations and sensitive information risks
- Teams with systems that are difficult to update without experiencing regression

None of these failures can be addressed by moving to the next rung on the ladder. In fact, many of these failures occur specifically because a team followed the ladder without accounting for the constraints of their environment.

6 dimensions that matter in production

Production success is defined by multiple independent limits rather than a single “quality” limit. The six dimensions listed below generally define which architectures are viable. There is no hierarchy of these dimensions. Generally, improving one dimension will degrade another. As there is no universally best configuration of these dimensions, there is only an intentional trade-off based on the system’s needs. These six dimensions — data privacy, latency, degree of control, update frequency, deployment target, and cost — serve as constraints on the development of LLM architectures. The following figure illustrates how these dimensions may interact without falling into a “linear ladder” trap. The figure groups these dimensions into the initial feasibility gate (non-negotiable barrier), the optimization dimension (tunable trade-off), and resultant architecture-building block combinations that may be used as hybrid models.

Data privacy: The first feasibility barrier

Data privacy is often the first serious constraint production teams encounter and it’s generally non-negotiable. The question is not whether the model vendor is “secure.” The question is whether sensitive data can ever leave the organization’s boundaries. Generally, prompt engineering sends the entire prompt, including user input, contextual information, etc., to an external inference provider. Even fine-tuning can create more privacy risk since the training data or derived gradients need to be sent to a tuning pipeline, thus providing longer-lived access than a single inference call. RAG alters the privacy surface by enabling sensitive data to remain within internal systems, while only its fragments are sent to the model.

In practice, data privacy is determined by data classification. If an application handles regulated data (such as personal health information or confidential data), many architectures may quickly become infeasible unless the model is self-hosted or hosted in a controlled environment. On the other hand, if the application is public-facing and does not handle sensitive data, external APIs may be acceptable. The key takeaway is that data privacy is a barrier, not a tunable parameter. Once data privacy is identified as a barrier to using an external inference service, the entire architecture collapses.

Latency: The constraint users notice first

Once the data privacy constraint is addressed, latency becomes the constraint users notice. Users will perceive the system as unreliable if latency is excessive or unpredictable. The primary difference in latency among models is due to the number of architectural stages in the request path, rather than the model’s intelligence. For example, prompt engineering typically has the lowest latency since the request is only a single inference call. In addition, RAG introduces multiple stages (embedding search, retrieval, reranking, and chunk selection) that increase latency and can also generate high tail times under load. Fine-tuning typically yields fast inference paths by eliminating the need for retrieval and embedding, and by integrating them directly into the model.

Using the fastest architecture as the sole basis for selecting an architecture is a mistaken approach. More often than not, the correct design is a hybrid approach. An example of this is using a low-latency routing mechanism — a small, tuned model identifies the user’s intent, classifies the query, and then fires off a higher-latency RAG pipeline only when knowledge grounding is necessary. That type of hybrid architecture protects the user experience while enabling high-precision answers when needed. In production, latency is rarely just about average response time, but rather an issue of predictably low tail latency under concurrent workloads.

Degree of control: Constraining behavior and knowledge

Responding quickly is irrelevant if system behavior is unstable. Degree of control, the third dimension, refers to how reliably architects can constrain the model’s behavior, outputs, and knowledge boundaries. Prompt engineering constrains the model’s behavior primarily at the output layer. While prompt engineering can constrain the structure (such as JSON schema), formatting, and localizable behavior of the output, prompt-based control is fragile because it competes with model priors, user messages, and long-term context effects.

RAG constrains the model at the level of knowledge boundaries. RAG is not primarily used to make the model smarter. Rather, RAG is used to constrain what the model is allowed to know in a particular request. Therefore, RAG is particularly useful in regulatory environments, where it provides a transparent, governable knowledge path.

Fine-tuning constrains the model’s behavior to provide consistent behavior for each request. Fine-tuning defines the tone, style, reasoning patterns, classification thresholds, and domain-specific preferences that the model uses to respond to each request. It is most valuable when the desired behavior is stable and should be baked into the model, rather than being inserted at runtime.

Here again, degree of control is not one thing. Degree of control can mean:

- Controlling output structure
- Controlling knowledge sources
- Controlling behavioral consistency

Each of these techniques constrains a different layer, and that determines what types of failures can occur.

Update frequency: The cost of keeping your system current

Control generally makes things rigid, and over time, the dominant cost of an architecture is not deployment, but updating it. Update frequency describes how often a system has to add new information or modify previously acceptable behavior. Prompt engineering is useful for rapid updates because modifying a prompt is simple. But as prompts expand, maintaining them becomes hard, and versioning becomes a nightmare, along with the issues that arise when prompts interact with each other. RAG is useful for quick, scalable updates because the knowledge base can change independently of the model. If your domain changes every week — such as policy changes, new product documentation, new HR procedures — RAG provides a clear mechanism to update the corpus rather than the model. Fine-tuning is slow and costly to update because it involves training and validation cycles. Fine-tuning is worthwhile only when you have stable, highly valuable behavior. When you need to frequently change the underlying knowledge, fine-tuning will be a hindrance. This is why you should follow this general rule: Keep all knowledge that changes over time outside your model. Use tuning for stable behavioral patterns; use retrieval for dynamic knowledge.

Deployment target: Where the model runs

Even though an architecture appears flawless on paper, deployment constraints can prevent implementation. Cloud API deployments can maximize speed to market. However, these deployments are subject to limitations related to privacy, regulatory compliance, and network latency. Deployments within the virtual private cloud/on-premises environment enable data sovereignty and internal controls, but add significant complexity to both infrastructure and operations. Edge deployments often limit model size and direct development teams toward either small, tuned models or specialized inference runtimes. Where the workload is to be deployed can limit feasibility. For example, if an organization has a data sovereignty requirement and does not permit external inference, prompt engineering via public APIs is no longer an option. For such organizations, self-hosted RAG or tuning would likely be the default, regardless of the position of either approach on the ladder.

Cost: What eliminates ‘successful’ pilot projects

Most LLM projects do not fail during the prototype phase. Most LLM projects fail after successful adoption when traffic grows and costs become non-linear. Cost is not just “what does the model cost per token?” It can be influenced by:

- The length of the prompt and the retrieved context
- The class/model used and the pricing of the model provider
- Concurrency/scaling strategy
- Caching efficiency
- GPU/CPU resource utilization for self-hosted deployments
- The engineering overhead required for maintaining retrieval pipelines

While prompt engineering is often the least expensive initial approach, its cost can become unpredictable as the prompt and context sizes grow. RAG increases operational cost because the retrieval pipeline must always be running — vector databases, indexing jobs, and the reranker — but it can also decrease inference cost by enabling the use of smaller models and reducing the amount of work the LLM must do to fill in hallucinations. Fine-tuning has very high up-front costs (training and evaluation), but it can also reduce inference costs and latency by eliminating the need to retrieve content or reducing the number of tokens required in the prompt.

The major difference here is the predictability of cost. The most dangerous systems are those that incur increased cost in proportion to their usage, such as those that frequently include large retrieved contexts or multistep LLM calls without strict budgets. In production, cost should be considered as an architectural dimension from Day 1, not a billing shock discovered after launch.

Putting the dimensions together: A decision framework

There isn’t a single “right” solution for all applications. The appropriate architecture will depend on the relative importance of the six dimensions. You can use the six dimensions in a particular sequence when determining which method(s) best fit your application:

1. Data privacy: Are you allowing sensitive information to cross the application boundary? If not, you need to eliminate any external API calls.
2. Deployment target: Will your application run on your required platform? Eliminate any methods that don’t support your application’s deployment target.
3. Latency: Can your architecture deliver the necessary low latency during periods of high load? Can your architecture meet the performance expectations of your users or customers during high-load situations?
4. Cost: Will your architecture be economically viable under high production traffic loads? Will your architecture remain economically viable as request volume increases?
5. Update frequency: How difficult is it to adapt your architecture as customer expectations evolve over time? How costly will changing your architecture be when changes occur due to evolving customer requirements?
6. Degree of control: To what extent do you want to control potential failure points of your architecture to minimize downtime and the associated lost revenue?

Once you have determined the relative importance of each dimension, you should create an architecture composed of multiple mechanisms that work together, rather than simply choosing a single “best” mechanism. In real-world enterprise applications, many successful architectures are hybrids:

- Fine-tuning (or lightweight adapters) establishes stable behavior patterns.
- RAG provides governable and regularly updated knowledge.
- Prompt engineering enforces structured output and task-level runtime constraints.

The six-dimensional framework explains why “prompt vs. RAG vs. fine-tuning” is the wrong question. Instead, ask yourself: Which mechanisms should I include in my architecture based upon the constraints identified above? To make this decision-making process more tangible, the following flowchart outlines a practical approach to evaluating your LLM architecture across the six dimensions.

Conclusion

The “ladder” of developing a production LLM system — prompt engineering → RAG → fine-tuning — may appear attractive as it greatly simplifies the process of making decisions about how to develop your architecture. However, the reality is that production LLM systems are not developed by being sophisticated. Production LLM systems are developed by being constrained. A six-dimensional framework helps identify the trade-offs involved in developing an architecture and ensures that development teams do not treat technology selection as a matter of ideology. When developing an architecture, teams can use the six dimensions to determine which mechanisms to incorporate and design hybrid systems that will withstand real user interactions. Do not try to pursue the most advanced technique. Pursue building an application that is both safe and economically viable.

The post Prompting vs. RAG vs. fine-tuning: Why it’s not a ladder appeared first on The New Stack.
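As one way to picture the hybrid the article recommends, here is a short Python sketch of the routing pattern from the latency section: a cheap intent classifier decides whether a query needs knowledge grounding, the higher-latency RAG path runs only when it does, and prompt-level constraints shape the output. classify_intent, retrieve, and call_llm are placeholder functions, not a specific vendor API.

```python
def answer(query: str, classify_intent, retrieve, call_llm) -> str:
    """Hybrid routing sketch: route cheap queries straight to the model and
    reserve the retrieval pipeline for queries that need grounding."""
    intent = classify_intent(query)              # low-latency first hop (small, tuned model)
    if intent == "needs_grounding":
        context = retrieve(query, top_k=5)       # RAG path: governable, independently updatable corpus
        prompt = (
            "Answer ONLY from the provided context and reply as JSON "
            '{"answer": ..., "sources": [...]}.\n\n'
            f"Context:\n{context}\n\nQuestion: {query}"
        )
    else:
        # Stable-behavior path: no retrieval stages, lowest latency and cost.
        prompt = f'Reply as JSON {{"answer": ...}}.\n\nQuestion: {query}'
    return call_llm(prompt)                      # prompt constraints shape the output layer
```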
Read more →

Project Genie: Experimenting with infinite, interactive worlds

Comments
Read more →

Reflex (YC W23) Senior Software Engineer Infra

Comments
Read more →

Launch HN: AgentMail (YC S25) – An API that gives agents their own email inboxes

Comments
Read more →

Drug trio found to block tumour resistance in pancreatic cancer in mouse models

Comments
Read more →

Is the RAM shortage killing small VPS hosts?

Comments
Read more →

Deep dive into Turso, the “SQLite rewrite in Rust”

Comments
Read more →

How to choose colors for your CLI applications (2023)

Comments
Read more →

Moltworker: a self-hosted personal AI agent, minus the minis

Comments
Read more →

Waymo robotaxi hits a child near an elementary school in Santa Monica

Comments
Read more →

Claude Code daily benchmarks for degradation tracking

Comments
Read more →

A lot of population numbers are fake

Comments
Read more →

AGENTS.md outperforms skills in our agent evals

Comments
Read more →

NeuroAI and Beyond

arXiv:2601.19955v1 Announce Type: new Abstract: Neuroscience and Artificial Intelligence (AI) have made significant progress in the past few years but have only been loosely inter-connected. Based on a workshop held in August 2025, we identify current and future areas of synergism between these two fields. We focus on the subareas of embodiment, language and communication, robotics, learning in humans and machines, and neuromorphic engineering to take stock of the progress made so far, and possible promising new future avenues. Overall, we advocate for the development of NeuroAI, a type of Neuroscience-informed Artificial Intelligence that, we argue, has the potential for significantly improving the scope and efficiency of AI algorithms while simultaneously changing the way we understand biological neural computations. We include personal statements from several leading researchers on their diverse views of NeuroAI. Two Strengths-Weaknesses-Opportunities-Threats (SWOT) analyses by researchers and trainees are appended that describe the benefits and risks offered by NeuroAI.
Read more →

Teaching LLMs to Ask: Self-Querying Category-Theoretic Planning for Under-Specified Reasoning

arXiv:2601.20014v1 Announce Type: new Abstract: Inference-time planning with large language models frequently breaks under partial observability: when task-critical preconditions are not specified at query time, models tend to hallucinate missing facts or produce plans that violate hard constraints. We introduce Self-Querying Bidirectional Categorical Planning (SQ-BCP), which explicitly represents precondition status (Sat/Viol/Unk) and resolves unknowns via (i) targeted self-queries to an oracle/user or (ii) bridging hypotheses that establish the missing condition through an additional action. SQ-BCP performs bidirectional search and invokes a pullback-based verifier as a categorical certificate of goal compatibility, while using distance-based scores only for ranking and pruning. We prove that when the verifier succeeds and hard constraints pass deterministic checks, accepted plans are compatible with goal requirements; under bounded branching and finite resolution depth, SQ-BCP finds an accepting plan when one exists. Across WikiHow and RecipeNLG tasks with withheld preconditions, SQ-BCP reduces resource-violation rates to 14.9% and 5.8% (vs. 26.0% and 15.7% for the best baseline), while maintaining competitive reference quality.
Read more →

Fuzzy Categorical Planning: Autonomous Goal Satisfaction with Graded Semantic Constraints

arXiv:2601.20021v1 Announce Type: new Abstract: Natural-language planning often involves vague predicates (e.g., suitable substitute, stable enough) whose satisfaction is inherently graded. Existing category-theoretic planners provide compositional structure and pullback-based hard-constraint verification, but treat applicability as crisp, forcing thresholding that collapses meaningful distinctions and cannot track quality degradation across multi-step plans. We propose Fuzzy Category-theoretic Planning (FCP), which annotates each action (morphism) with a degree in [0,1], composes plan quality via a Łukasiewicz t-norm, and retains crisp executability checks via pullback verification. FCP grounds graded applicability from language using an LLM with k-sample median aggregation and supports meeting-in-the-middle search using residuum-based backward requirements. We evaluate on (i) public PDDL3 preference/oversubscription benchmarks and (ii) RecipeNLG-Subs, a missing-substitute recipe-planning benchmark built from RecipeNLG with substitution candidates from Recipe1MSubs and FoodKG. FCP improves success and reduces hard-constraint violations on RecipeNLG-Subs compared to LLM-only and ReAct-style baselines, while remaining competitive with classical PDDL3 planners.
Read more →

Insight Agents: An LLM-Based Multi-Agent System for Data Insights

arXiv:2601.20048v1 Announce Type: new Abstract: Today, E-commerce sellers face several key challenges, including difficulties in discovering and effectively utilizing available programs and tools, and struggling to understand and utilize rich data from various tools. We therefore aim to develop Insight Agents (IA), a conversational multi-agent Data Insight system, to provide E-commerce sellers with personalized data and business insights through automated information retrieval. Our hypothesis is that IA will serve as a force multiplier for sellers, thereby driving incremental seller adoption by reducing the effort required and increasing the speed at which sellers make good business decisions. In this paper, we introduce this novel LLM-backed end-to-end agentic system built on a plan-and-execute paradigm and designed for comprehensive coverage, high accuracy, and low latency. It features a hierarchical multi-agent structure, consisting of a manager agent and two worker agents: data presentation and insight generation, for efficient information retrieval and problem-solving. We design a simple yet effective ML solution for the manager agent that combines Out-of-Domain (OOD) detection using a lightweight encoder-decoder model and agent routing through a BERT-based classifier, optimizing both accuracy and latency. Within the two worker agents, a strategic planning step is designed for the API-based data model that breaks down queries into granular components to generate more accurate responses, and domain knowledge is dynamically injected to enhance the insight generator. IA has been launched for Amazon sellers in the US, and has achieved high accuracy of 90% based on human evaluation, with P90 latency below 15s.
Read more →

Should I Have Expressed a Different Intent? Counterfactual Generation for LLM-Based Autonomous Control

arXiv:2601.20090v1 Announce Type: new Abstract: Large language model (LLM)-powered agents can translate high-level user intents into plans and actions in an environment. Yet after observing an outcome, users may wonder: What if I had phrased my intent differently? We introduce a framework that enables such counterfactual reasoning in agentic LLM-driven control scenarios, while providing formal reliability guarantees. Our approach models the closed-loop interaction between a user, an LLM-based agent, and an environment as a structural causal model (SCM), and leverages test-time scaling to generate multiple candidate counterfactual outcomes via probabilistic abduction. Through an offline calibration phase, the proposed conformal counterfactual generation (CCG) yields sets of counterfactual outcomes that are guaranteed to contain the true counterfactual outcome with high probability. We showcase the performance of CCG on a wireless network control use case, demonstrating significant advantages compared to naive re-execution baselines.
Read more →

Towards Intelligent Urban Park Development Monitoring: LLM Agents for Multi-Modal Information Fusion and Analysis

arXiv:2601.20206v1 Announce Type: new Abstract: As an important part of urbanization, the development monitoring of newly constructed parks is of great significance for evaluating the effect of urban planning and optimizing resource allocation. However, traditional change detection methods based on remote sensing imagery have obvious limitations in high-level and intelligent analysis, and thus struggle to meet the requirements of current urban planning and management. In the face of the growing demand for complex multi-modal data analysis in urban park development monitoring, these methods often fail to provide flexible analysis capabilities for diverse application scenarios. This study proposes a multi-modal LLM agent framework, which aims to make full use of the semantic understanding and reasoning capabilities of LLMs to meet the challenges in urban park development monitoring. In this framework, a general horizontal and vertical data alignment mechanism is designed to ensure the consistency and effective tracking of multi-modal data. At the same time, a specific toolkit is constructed to alleviate the hallucination issues of LLMs due to the lack of domain-specific knowledge. Compared to vanilla GPT-4o and other agents, our approach enables robust multi-modal information fusion and analysis, offering reliable and scalable solutions tailored to the diverse and evolving demands of urban park development monitoring.
Read more →

Scaling Medical Reasoning Verification via Tool-Integrated Reinforcement Learning

arXiv:2601.20221v1 Announce Type: new Abstract: Large language models have achieved strong performance on medical reasoning benchmarks, yet their deployment in clinical settings demands rigorous verification to ensure factual accuracy. While reward models offer a scalable approach for reasoning trace verification, existing methods face two limitations: they produce only scalar reward values without explicit justification, and they rely on single-pass retrieval that precludes adaptive knowledge access as verification unfolds. We introduce an agentic framework that addresses these limitations by training medical reasoning verifiers to iteratively query external medical corpora during evaluation. Our approach combines tool-augmented verification with an iterative reinforcement learning paradigm that requires only trace-level supervision, alongside an adaptive curriculum mechanism that dynamically adjusts the training data distribution. Across four medical reasoning benchmarks, our method achieves substantial gains over existing approaches, improving MedQA accuracy by 23.5% and MedXpertQA by 32.0% relative to the base generator. Crucially, it demonstrates an 8× reduction in sampling budget requirement compared to prior reward model baselines. These findings establish that grounding verification in dynamically retrieved evidence offers a principled path toward more reliable medical reasoning systems.
Read more →

Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models

arXiv:2601.20305v1 Announce Type: new Abstract: Unified Multimodal Models (UMMs) exhibit strong understanding, yet this capability often fails to effectively guide generation. We identify this as a Cognitive Gap: the model lacks the understanding of how to enhance its own generation process. To bridge this gap, we propose Endogenous Reprompting, a mechanism that transforms the model's understanding from a passive encoding process into an explicit generative reasoning step by generating self-aligned descriptors during generation. To achieve this, we introduce SEER (Self-Evolving Evaluator and Reprompter), a training framework that establishes a two-stage endogenous loop using only 300 samples from a compact proxy task, Visual Instruction Elaboration. First, Reinforcement Learning with Verifiable Rewards (RLVR) activates the model's latent evaluation ability via curriculum learning, producing a high-fidelity endogenous reward signal. Second, Reinforcement Learning with Model-rewarded Thinking (RLMT) leverages this signal to optimize the generative reasoning policy. Experiments show that SEER consistently outperforms state-of-the-art baselines in evaluation accuracy, reprompting efficiency, and generation quality, without sacrificing general multimodal capabilities.
Read more →

ECG-Agent: On-Device Tool-Calling Agent for ECG Multi-Turn Dialogue

arXiv:2601.20323v1 Announce Type: new Abstract: Recent advances in Multimodal Large Language Models have rapidly expanded to electrocardiograms, focusing on classification, report generation, and single-turn QA tasks. However, these models fall short in real-world scenarios, lacking multi-turn conversational ability, on-device efficiency, and precise understanding of ECG measurements such as the PQRST intervals. To address these limitations, we introduce ECG-Agent, the first LLM-based tool-calling agent for multi-turn ECG dialogue. To facilitate its development and evaluation, we also present ECG-Multi-Turn-Dialogue (ECG-MTD) dataset, a collection of realistic user-assistant multi-turn dialogues for diverse ECG lead configurations. We develop ECG-Agents in various sizes, from on-device capable to larger agents. Experimental results show that ECG-Agents outperform baseline ECG-LLMs in response accuracy. Furthermore, on-device agents achieve comparable performance to larger agents in various evaluations that assess response accuracy, tool-calling ability, and hallucinations, demonstrating their viability for real-world applications.
Read more →

AMA: Adaptive Memory via Multi-Agent Collaboration

arXiv:2601.20352v1 Announce Type: new Abstract: The rapid evolution of Large Language Model (LLM) agents has necessitated robust memory systems to support cohesive long-term interaction and complex reasoning. Benefiting from the strong capabilities of LLMs, recent research focus has shifted from simple context extension to the development of dedicated agentic memory systems. However, existing approaches typically rely on rigid retrieval granularity, accumulation-heavy maintenance strategies, and coarse-grained update mechanisms. These design choices create a persistent mismatch between stored information and task-specific reasoning demands, while leading to the unchecked accumulation of logical inconsistencies over time. To address these challenges, we propose Adaptive Memory via Multi-Agent Collaboration (AMA), a novel framework that leverages coordinated agents to manage memory across multiple granularities. AMA employs a hierarchical memory design that dynamically aligns retrieval granularity with task complexity. Specifically, the Constructor and Retriever jointly enable multi-granularity memory construction and adaptive query routing. The Judge verifies the relevance and consistency of retrieved content, triggering iterative retrieval when evidence is insufficient or invoking the Refresher upon detecting logical conflicts. The Refresher then enforces memory consistency by performing targeted updates or removing outdated entries. Extensive experiments on challenging long-context benchmarks show that AMA significantly outperforms state-of-the-art baselines while reducing token consumption by approximately 80% compared to full-context methods, demonstrating its effectiveness in maintaining retrieval precision and long-term memory consistency.
Read more →

Policy of Thoughts: Scaling LLM Reasoning via Test-time Policy Evolution

arXiv:2601.20379v1 Announce Type: new Abstract: Large language models (LLMs) struggle with complex, long-horizon reasoning due to instability caused by their frozen policy assumption. Current test-time scaling methods treat execution feedback merely as an external signal for filtering or rewriting trajectories, without internalizing it to improve the underlying reasoning strategy. Inspired by Popper's epistemology of "conjectures and refutations," we argue that intelligence requires real-time evolution of the model's policy through learning from failed attempts. We introduce Policy of Thoughts (PoT), a framework that recasts reasoning as a within-instance online optimization process. PoT first generates diverse candidate solutions via an efficient exploration mechanism, then uses Group Relative Policy Optimization (GRPO) to update a transient LoRA adapter based on execution feedback. This closed-loop design enables dynamic, instance-specific refinement of the model's reasoning priors. Experiments show that PoT dramatically boosts performance: a 4B model achieves 49.71% accuracy on LiveCodeBench, outperforming GPT-4o and DeepSeek-V3 despite being over 50× smaller.
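A rough sketch of the within-instance update the abstract describes, assuming binary execution rewards, group-relative advantages, and a simple policy-gradient surrogate; the tensors stand in for the adapter-bearing model, and the exact PoT objective may differ:

```python
import torch

# Sketch: one within-instance GRPO-style step on a transient adapter.
# Assumptions: rewards come from execution feedback on sampled solutions,
# seq_logprobs are the adapter-bearing model's log-probs of those samples.

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """Center and scale rewards within the sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def pot_step(seq_logprobs: torch.Tensor, rewards: torch.Tensor, optimizer):
    adv = group_relative_advantages(rewards).detach()
    loss = -(adv * seq_logprobs).mean()      # policy-gradient surrogate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                         # would update only the LoRA parameters
    return loss.item()

# Toy usage with stand-in tensors (a real run would use the model's log-probs).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])   # pass/fail execution feedback
seq_logprobs = torch.randn(6, requires_grad=True)
opt = torch.optim.AdamW([seq_logprobs], lr=1e-4)
print(pot_step(seq_logprobs, rewards, opt))
```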
Read more →

OmegaUse: Building a General-Purpose GUI Agent for Autonomous Task Execution

arXiv:2601.20380v1 Announce Type: new Abstract: Graphical User Interface (GUI) agents show great potential for enabling foundation models to complete real-world tasks, revolutionizing human-computer interaction and improving human productivity. In this report, we present OmegaUse, a general-purpose GUI agent model for autonomous task execution on both mobile and desktop platforms, supporting computer-use and phone-use scenarios. Building an effective GUI agent model relies on two factors: (1) high-quality data and (2) effective training methods. To address these, we introduce a carefully engineered data-construction pipeline and a decoupled training paradigm. For data construction, we leverage rigorously curated open-source datasets and introduce a novel automated synthesis framework that integrates bottom-up autonomous exploration with top-down taxonomy-guided generation to create high-fidelity synthetic data. For training, to better leverage these data, we adopt a two-stage strategy: Supervised Fine-Tuning (SFT) to establish fundamental interaction syntax, followed by Group Relative Policy Optimization (GRPO) to improve spatial grounding and sequential planning. To balance computational efficiency with agentic reasoning capacity, OmegaUse is built on a Mixture-of-Experts (MoE) backbone. To evaluate cross-terminal capabilities in an offline setting, we introduce OS-Nav, a benchmark suite spanning multiple operating systems: ChiM-Nav, targeting Chinese Android mobile environments, and Ubu-Nav, focusing on routine desktop interactions on Ubuntu. Extensive experiments show that OmegaUse is highly competitive across established GUI benchmarks, achieving a state-of-the-art (SOTA) score of 96.3% on ScreenSpot-V2 and a leading 79.1% step success rate on AndroidControl. OmegaUse also performs strongly on OS-Nav, reaching 74.24% step success on ChiM-Nav and 55.9% average success on Ubu-Nav.
Read more →

CtrlCoT: Dual-Granularity Chain-of-Thought Compression for Controllable Reasoning

arXiv:2601.20467v1 Announce Type: new Abstract: Chain-of-thought (CoT) prompting improves LLM reasoning but incurs high latency and memory cost due to verbose traces, motivating CoT compression with preserved correctness. Existing methods either shorten CoTs at the semantic level, which is often conservative, or prune tokens aggressively, which can miss task-critical cues and degrade accuracy. Moreover, combining the two is non-trivial due to sequential dependency, task-agnostic pruning, and distribution mismatch. We propose CtrlCoT, a dual-granularity CoT compression framework that harmonizes semantic abstraction and token-level pruning through three components: Hierarchical Reasoning Abstraction produces CoTs at multiple semantic granularities; Logic-Preserving Distillation trains a logic-aware pruner to retain indispensable reasoning cues (e.g., numbers and operators) across pruning ratios; and Distribution-Alignment Generation aligns compressed traces with fluent inference-time reasoning styles to avoid fragmentation. On MATH-500 with Qwen2.5-7B-Instruct, CtrlCoT uses 30.7% fewer tokens while achieving accuracy 7.6 percentage points higher than the strongest baseline, demonstrating more efficient and reliable reasoning. Our code will be publicly available at https://github.com/fanzhenxuan/Ctrl-CoT.
Read more →

Normative Equivalence in human-AI Cooperation: Behaviour, Not Identity, Drives Cooperation in Mixed-Agent Groups

arXiv:2601.20487v1 Announce Type: new Abstract: The introduction of artificial intelligence (AI) agents into human group settings raises essential questions about how these novel participants influence cooperative social norms. While previous studies on human-AI cooperation have primarily focused on dyadic interactions, little is known about how integrating AI agents affects the emergence and maintenance of cooperative norms in small groups. This study addresses this gap through an online experiment using a repeated four-player Public Goods Game (PGG). Each group consisted of three human participants and one bot, which was framed either as human or AI and followed one of three predefined decision strategies: unconditional cooperation, conditional cooperation, or free-riding. In our sample of 236 participants, we found that reciprocal group dynamics and behavioural inertia primarily drove cooperation. These normative mechanisms operated identically across conditions, resulting in cooperation levels that did not differ significantly between human and AI labels. Furthermore, we found no evidence of differences in norm persistence in a follow-up Prisoner's Dilemma, or in participants' normative perceptions. Participants' behaviour followed the same normative logic across human and AI conditions, indicating that cooperation depended on group behaviour rather than partner identity. This supports a pattern of normative equivalence, in which the mechanisms that sustain cooperation function similarly in mixed human-AI and all human groups. These findings suggest that cooperative norms are flexible enough to extend to artificial agents, blurring the boundary between humans and AI in collective decision-making.
Read more →

PathWise: Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs

arXiv:2601.20539v1 Announce Type: new Abstract: Large Language Models (LLMs) have enabled automated heuristic design (AHD) for combinatorial optimization problems (COPs), but existing frameworks' reliance on fixed evolutionary rules and static prompt templates often leads to myopic heuristic generation, redundant evaluations, and limited reasoning about how new heuristics should be derived. We propose a novel multi-agent reasoning framework, referred to as Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs (PathWise), which formulates heuristic generation as a sequential decision process over an entailment graph serving as a compact, stateful memory of the search trajectory. This approach allows the system to carry forward past decisions and reuse or avoid derivation information across generations. A policy agent plans evolutionary actions, a world model agent generates heuristic rollouts conditioned on those actions, and critic agents provide routed reflections summarizing lessons from prior steps, shifting LLM-based AHD from trial-and-error evolution toward state-aware planning through reasoning. Experiments across diverse COPs show that PathWise converges faster to better heuristics, generalizes across different LLM backbones, and scales to larger problem sizes.
Read more →

Online Risk-Averse Planning in POMDPs Using Iterated CVaR Value Function

arXiv:2601.20554v1 Announce Type: new Abstract: We study risk-sensitive planning under partial observability using the dynamic risk measure Iterated Conditional Value-at-Risk (ICVaR). A policy evaluation algorithm for ICVaR is developed with finite-time performance guarantees that do not depend on the cardinality of the action space. Building on this foundation, three widely used online planning algorithms--Sparse Sampling, Particle Filter Trees with Double Progressive Widening (PFT-DPW), and Partially Observable Monte Carlo Planning with Observation Widening (POMCPOW)--are extended to optimize the ICVaR value function rather than the expectation of the return. Our formulations introduce a risk parameter $\alpha$, where $\alpha = 1$ recovers standard expectation-based planning and $\alpha < 1$ induces increasing risk aversion. For ICVaR Sparse Sampling, we establish finite-time performance guarantees under the risk-sensitive objective, which further enable a novel exploration strategy tailored to ICVaR. Experiments on benchmark POMDP domains demonstrate that the proposed ICVaR planners achieve lower tail risk compared to their risk-neutral counterparts.
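A minimal sketch of sample-based CVaR and one iterated-CVaR backup, matching the abstract's convention that alpha = 1 recovers expectation-based planning; the numbers are illustrative:

```python
import numpy as np

# Sketch: sample-based CVaR at level alpha (average of the worst alpha-fraction
# of outcomes) and a single iterated-CVaR Bellman backup. Values are illustrative.

def cvar(samples, alpha):
    """CVaR_alpha of a return distribution given Monte Carlo samples.
    alpha = 1.0 recovers the plain mean; smaller alpha is more risk-averse."""
    s = np.sort(np.asarray(samples, dtype=float))       # ascending: worst first
    k = max(1, int(np.ceil(alpha * len(s))))
    return s[:k].mean()

def icvar_backup(reward, next_value_samples, alpha, gamma=0.95):
    """One step of the iterated CVaR value recursion:
    V(s) = r + gamma * CVaR_alpha over sampled next-state values."""
    return reward + gamma * cvar(next_value_samples, alpha)

next_vals = np.random.default_rng(0).normal(10.0, 3.0, size=1000)
print(icvar_backup(1.0, next_vals, alpha=1.0))   # risk-neutral
print(icvar_backup(1.0, next_vals, alpha=0.2))   # risk-averse: lower estimate
```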
Read more →

Dialogical Reasoning Across AI Architectures: A Multi-Model Framework for Testing AI Alignment Strategies

arXiv:2601.20604v1 Announce Type: new Abstract: This paper introduces a methodological framework for empirically testing AI alignment strategies through structured multi-model dialogue. Drawing on Peace Studies traditions - particularly interest-based negotiation, conflict transformation, and commons governance - we operationalize Viral Collaborative Wisdom (VCW), an approach that reframes alignment from a control problem to a relationship problem developed through dialogical reasoning. Our experimental design assigns four distinct roles (Proposer, Responder, Monitor, Translator) to different AI systems across six conditions, testing whether current large language models can engage substantively with complex alignment frameworks. Using Claude, Gemini, and GPT-4o, we conducted 72 dialogue turns totaling 576,822 characters of structured exchange. Results demonstrate that AI systems can engage meaningfully with Peace Studies concepts, surface complementary objections from different architectural perspectives, and generate emergent insights not present in initial framings - including the novel synthesis of "VCW as transitional framework." Cross-architecture patterns reveal that different models foreground different concerns: Claude emphasized verification challenges, Gemini focused on bias and scalability, and GPT-4o highlighted implementation barriers. The framework provides researchers with replicable methods for stress-testing alignment proposals before implementation, while the findings offer preliminary evidence about AI capacity for the kind of dialogical reasoning VCW proposes. We discuss limitations, including the observation that dialogues engaged more with process elements than with foundational claims about AI nature, and outline directions for future research including human-AI hybrid protocols and extended dialogue studies.
Read more →

Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation

arXiv:2601.20614v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, we identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives, despite their importance for refining underdeveloped capabilities. Algorithmically, the widely used Group Relative Policy Optimization (GRPO) suffers from an implicit imbalance where the magnitude of policy updates is lower for harder questions. Data-wise, augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty. To address these issues, we propose MathForge, a framework that improves mathematical reasoning by targeting harder questions from both perspectives; it comprises a Difficulty-Aware Group Policy Optimization (DGPO) algorithm and a Multi-Aspect Question Reformulation (MQR) strategy. Specifically, DGPO first rectifies the implicit imbalance in GRPO via difficulty-balanced group advantage estimation, and further prioritizes harder questions by difficulty-aware question-level weighting. Meanwhile, MQR reformulates questions across multiple aspects to increase difficulty while maintaining the original gold answer. Overall, MathForge forms a synergistic loop: MQR expands the data frontier, and DGPO effectively learns from the augmented data. Extensive experiments show that MathForge significantly outperforms existing methods on various mathematical reasoning tasks. The code and augmented data are all available at https://github.com/AMAP-ML/MathForge.
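One way to picture the difficulty-aware weighting is the sketch below, which treats the group failure rate as difficulty and scales group-relative advantages accordingly; the weighting form is an assumption, not the paper's exact estimator:

```python
import numpy as np

# Sketch: difficulty-aware weighting on top of group-relative advantages.
# Assumptions: each question has a group of sampled solutions with 0/1 rewards;
# 'difficulty' is taken as the group failure rate. The weighting form is illustrative.

def group_advantages(rewards, eps=1e-6):
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def dgpo_weighted_advantages(rewards, beta=1.0):
    difficulty = 1.0 - np.mean(rewards)          # fraction of failed samples
    weight = 1.0 + beta * difficulty             # harder questions count more
    return weight * group_advantages(rewards)

easy = [1, 1, 1, 0, 1, 1]    # mostly solved
hard = [0, 0, 1, 0, 0, 0]    # rarely solved
print(dgpo_weighted_advantages(easy))
print(dgpo_weighted_advantages(hard))            # larger-magnitude updates
```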
Read more →

Investigating the Development of Task-Oriented Communication in Vision-Language Models

arXiv:2601.20641v1 Announce Type: new Abstract: We investigate whether LLM-based agents can develop task-oriented communication protocols that differ from standard natural language in collaborative reasoning tasks. Our focus is on two core properties such task-oriented protocols may exhibit: Efficiency, conveying task-relevant information more concisely than natural language, and Covertness, becoming difficult for external observers to interpret, raising concerns about transparency and control. To investigate these aspects, we use a referential-game framework in which vision-language model (VLM) agents communicate, providing a controlled, measurable setting for evaluating language variants. Experiments show that VLMs can develop effective, task-adapted communication patterns. At the same time, they can develop covert protocols that are difficult for humans and external agents to interpret. We also observe spontaneous coordination between similar models without explicitly shared protocols. These findings highlight both the potential and the risks of task-oriented communication, and position referential games as a valuable testbed for future work in this area.
Read more →

Enterprise Resource Planning Using Multi-type Transformers in Ferro-Titanium Industry

arXiv:2601.20696v1 Announce Type: new Abstract: Combinatorial optimization problems such as the Job-Shop Scheduling Problem (JSP) and Knapsack Problem (KP) are fundamental challenges in operations research, logistics, and enterprise resource planning (ERP). These problems often require sophisticated algorithms to achieve near-optimal solutions within practical time constraints. Recent advances in deep learning have introduced transformer-based architectures as promising alternatives to traditional heuristics and metaheuristics. We leverage the Multi-Type Transformer (MTT) architecture to address these benchmarks in a unified framework. We present an extensive experimental evaluation across standard benchmark datasets for JSP and KP, demonstrating that MTT achieves competitive performance on different sizes of these benchmark problems. We showcase the potential of multi-type attention on a real application in the ferro-titanium industry. To the best of our knowledge, we are the first to apply multi-type transformers in real manufacturing.
Read more →

Implementing Metric Temporal Answer Set Programming

arXiv:2601.20735v1 Announce Type: new Abstract: We develop a computational approach to Metric Answer Set Programming (ASP) to allow for expressing quantitative temporal constraints, like durations and deadlines. A central challenge is to maintain scalability when dealing with fine-grained timing constraints, which can significantly exacerbate ASP's grounding bottleneck. To address this issue, we leverage extensions of ASP with difference constraints, a simplified form of linear constraints, to handle time-related aspects externally. Our approach effectively decouples metric ASP from the granularity of time, resulting in a solution that is unaffected by time precision.
Read more →

REASON: Accelerating Probabilistic Logical Reasoning for Scalable Neuro-Symbolic Intelligence

arXiv:2601.20784v1 Announce Type: new Abstract: Neuro-symbolic AI systems integrate neural perception with symbolic reasoning to enable data-efficient, interpretable, and robust intelligence beyond purely neural models. Although this compositional paradigm has shown superior performance in domains such as reasoning, planning, and verification, its deployment remains challenging due to severe inefficiencies in symbolic and probabilistic inference. Through systematic analysis of representative neuro-symbolic workloads, we identify probabilistic logical reasoning as the inefficiency bottleneck, characterized by irregular control flow, low arithmetic intensity, uncoalesced memory accesses, and poor hardware utilization on CPUs and GPUs. This paper presents REASON, an integrated acceleration framework for probabilistic logical reasoning in neuro-symbolic AI. REASON introduces a unified directed acyclic graph representation that captures common structure across symbolic and probabilistic models, coupled with adaptive pruning and regularization. At the architecture level, REASON features a reconfigurable, tree-based processing fabric optimized for irregular traversal, symbolic deduction, and probabilistic aggregation. At the system level, REASON is tightly integrated with GPU streaming multiprocessors through a programmable interface and multi-level pipeline that efficiently orchestrates compositional execution. Evaluated across six neuro-symbolic workloads, REASON achieves 12-50x speedup and 310-681x better energy efficiency than desktop and edge GPUs in a TSMC 28 nm node. REASON enables real-time probabilistic logical reasoning, completing end-to-end tasks in 0.8 s with 6 mm² area and 2.12 W power, demonstrating that targeted acceleration of probabilistic logical reasoning is critical for practical and scalable neuro-symbolic AI and positioning REASON as a foundational system architecture for next-generation cognitive intelligence.
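A toy illustration of evaluating a probabilistic logic query over a DAG with AND/OR nodes and neural leaf probabilities; the paper's unified representation and pruning are far richer, and the node semantics here are assumptions:

```python
# Sketch: evaluating a probabilistic logical query over a DAG whose internal
# nodes are independent AND (product) / disjoint OR (sum) gates and whose
# leaves carry neural-network probabilities. Node semantics are illustrative.

def eval_dag(nodes, order):
    """nodes: id -> ('leaf', p) | ('and', children) | ('or', children).
    order: ids in topological order (children before parents)."""
    val = {}
    for nid in order:
        kind, payload = nodes[nid]
        if kind == "leaf":
            val[nid] = payload
        elif kind == "and":
            p = 1.0
            for c in payload:
                p *= val[c]
            val[nid] = p
        elif kind == "or":
            val[nid] = sum(val[c] for c in payload)
    return val[order[-1]]

# Two disjoint "worlds", each a conjunction of two neural predictions.
nodes = {
    "a": ("leaf", 0.8), "b": ("leaf", 0.6),
    "c": ("leaf", 0.1), "d": ("leaf", 0.3),
    "w1": ("and", ["a", "b"]), "w2": ("and", ["c", "d"]),
    "q": ("or", ["w1", "w2"]),
}
print(eval_dag(nodes, ["a", "b", "c", "d", "w1", "w2", "q"]))   # 0.51
```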
Read more →

MemCtrl: Using MLLMs as Active Memory Controllers on Embodied Agents

arXiv:2601.20831v1 Announce Type: new Abstract: Foundation models rely on in-context learning for personalized decision making. The limited size of this context window necessitates memory compression and retrieval systems like RAG. These systems, however, often treat memory as large offline storage spaces, which is unfavorable for embodied agents that are expected to operate online under strict memory and compute constraints. In this work, we propose MemCtrl, a novel framework that uses Multimodal Large Language Models (MLLMs) for pruning memory online. MemCtrl augments MLLMs with a trainable memory head μ that acts as a gate to determine which observations or reflections to retain, update, or discard during exploration. We evaluate two ways of training μ: 1) via an offline expert, and 2) via online RL, and observe significant improvement in overall embodied task completion ability for μ-augmented MLLMs. In particular, on augmenting two low-performing MLLMs with MemCtrl on multiple subsets of the EmbodiedBench benchmark, we observe that μ-augmented MLLMs show an improvement of around 16% on average, with over 20% on specific instruction subsets. Finally, we present a qualitative analysis of the memory fragments collected by μ, noting the superior performance of μ-augmented MLLMs on long and complex instruction types.
Read more →

Deep Researcher with Sequential Plan Reflection and Candidates Crossover (Deep Researcher Reflect Evolve)

arXiv:2601.20843v1 Announce Type: new Abstract: This paper introduces a novel Deep Researcher architecture designed to generate detailed research reports on complex PhD level topics by addressing the inherent limitations of the Parallel Scaling paradigm. Our system utilizes two key innovations: Sequential Research Plan Refinement via Reflection and a Candidates Crossover algorithm. The sequential refinement process is demonstrated as an efficient method that allows the agent to maintain a centralized Global Research Context, enabling it to look back at current progress, reason about the research plan, and intelligently make changes at runtime. This dynamic adaptation contrasts with parallel approaches, which often suffer from siloed knowledge. The Candidates Crossover algorithm further enhances search efficiency by deploying multiple LLM candidates with varied parameters to explore a larger search space, with their findings synthesized to curate a comprehensive final research response. The process concludes with One Shot Report Generation, ensuring the final document is informed by a unified narrative and high fact density. Powered by the Gemini 2.5 Pro model, our Deep Researcher was evaluated on the DeepResearch Bench, a globally recognized benchmark of 100 doctoral level research tasks. Our architecture achieved an overall score of 46.21, demonstrating superior performance by surpassing leading deep research agents such as Claude Researcher, Nvidia AIQ Research Assistant, Perplexity Research, Kimi Researcher and Grok Deeper Search present on the DeepResearch Bench actively running leaderboard. This performance marginally exceeds our previous work, Static DRA, and reinforces the finding that sequential scaling consistently outperforms the parallel self consistency paradigm.
Read more →

SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models

arXiv:2601.20856v1 Announce Type: new Abstract: Although the capabilities of large language models have been increasingly tested on complex reasoning tasks, their long-horizon planning abilities have not yet been extensively investigated. In this work, we provide a systematic assessment of the planning and long-horizon reasoning capabilities of state-of-the-art Large Reasoning Models (LRMs). We propose a novel benchmark based on Sokoban puzzles, intentionally simplified to isolate long-horizon planning from state persistence. Our findings reveal a consistent degradation in planning performance when more than 25 moves are required to reach the solution, suggesting a fundamental constraint on forward planning capacity. We show that equipping LRMs with Planning Domain Definition Language (PDDL) parsing, validation, and solving tools allows for modest improvements, suggesting inherent architectural limitations which might not be overcome by test-time scaling approaches alone.
Read more →

STELLAR: Structure-guided LLM Assertion Retrieval and Generation for Formal Verification

arXiv:2601.19903v1 Announce Type: cross Abstract: Formal Verification (FV) relies on high-quality SystemVerilog Assertions (SVAs), but the manual writing process is slow and error-prone. Existing LLM-based approaches either generate assertions from scratch or ignore structural patterns in hardware designs and expert-crafted assertions. This paper presents STELLAR, the first framework that guides LLM-based SVA generation with structural similarity. STELLAR represents RTL blocks as AST structural fingerprints, retrieves structurally relevant (RTL, SVA) pairs from a knowledge base, and integrates them into structure-guided prompts. Experiments show that STELLAR achieves superior syntax correctness, stylistic alignment, and functional correctness, highlighting structure-aware retrieval as a promising direction for industrial FV.
Read more →

DABench-LLM: Standardized and In-Depth Benchmarking of Post-Moore Dataflow AI Accelerators for LLMs

arXiv:2601.19904v1 Announce Type: cross Abstract: The exponential growth of large language models has outpaced the capabilities of traditional CPU and GPU architectures due to the slowdown of Moore's Law. Dataflow AI accelerators present a promising alternative; however, there remains a lack of in-depth performance analysis and standardized benchmarking methodologies for LLM training. We introduce DABench-LLM, the first benchmarking framework designed for evaluating LLM workloads on dataflow-based accelerators. By combining intra-chip performance profiling and inter-chip scalability analysis, DABench-LLM enables comprehensive evaluation across key metrics such as resource allocation, load balance, and resource efficiency. The framework helps researchers rapidly gain insights into underlying hardware and system behaviors, and provides guidance for performance optimizations. We validate DABench-LLM on three commodity dataflow accelerators, Cerebras WSE-2, SambaNova RDU, and Graphcore IPU. Our framework reveals performance bottlenecks and provides specific optimization strategies, demonstrating its generality and effectiveness across a diverse range of dataflow-based AI hardware platforms.
Read more →

GTAC: A Generative Transformer for Approximate Circuits

arXiv:2601.19906v1 Announce Type: cross Abstract: Targeting error-tolerant applications, approximate circuits introduce controlled errors to significantly improve the performance, power, and area (PPA) of circuits. In this work, we introduce GTAC, a novel generative Transformer-based model for producing approximate circuits. By leveraging principles of approximate computing and AI-driven EDA, our model innovatively integrates error thresholds into the design process. Experimental results show that, compared with a state-of-the-art method, GTAC reduces area by a further 6.4% under the error rate constraint while being 4.3x faster.
Read more →

Analysis of LLM Vulnerability to GPU Soft Errors: An Instruction-Level Fault Injection Study

arXiv:2601.19912v1 Announce Type: cross Abstract: Large language models (LLMs) are highly compute- and memory-intensive, posing significant demands on high-performance GPUs. At the same time, advances in GPU technology driven by shrinking transistor sizes and lower operating voltages have made these devices increasingly susceptible to soft errors. While prior work has examined GPU reliability, most studies have focused on general-purpose applications or conventional neural networks mostly used for vision tasks such as classification and detection. In contrast, systematic analysis of modern large-scale LLMs remains limited, despite their rapid adoption in diverse application scenarios. Given the unique characteristics of LLMs, their resilience to soft errors may differ substantially from earlier models. To bridge this gap, we conduct the first instruction-level fault injection study of LLM inference. Our approach reveals reliability characteristics from multiple perspectives, highlighting the effects of model architecture, parameter scale, and task complexity. These findings provide new insights into LLM reliability and inform the design of more effective fault tolerance mechanisms.
Read more →

From Intuition to Expertise: Rubric-Based Cognitive Calibration for Human Detection of LLM-Generated Korean Text

arXiv:2601.19913v1 Announce Type: cross Abstract: Distinguishing human-written Korean text from fluent LLM outputs remains difficult even for linguistically trained readers, who can over-trust surface well-formedness. We study whether expert detection can be treated as a learnable skill and improved through structured calibration. We introduce LREAD, a rubric derived from national Korean writing standards and adapted to target micro-level artifacts (e.g., punctuation optionality, spacing behavior, and register shifts). In a three-phase longitudinal blind protocol with Korean linguistics majors, Phase 1 measures intuition-only detection, Phase 2 enforces criterion-level scoring with explicit justifications, and Phase 3 evaluates domain-focused mastery on held-out elementary essays. Across phases, majority-vote accuracy increases from 60% to 100%, accompanied by stronger inter-annotator agreement (Fleiss' kappa: -0.09 --> 0.82). Compared to state-of-the-art LLM detectors, calibrated humans rely more on language-specific micro-diagnostics that are not well captured by coarse discourse priors. Our findings suggest that rubric-scaffolded expert judgment can serve as an interpretable complement to automated detectors for non-English settings, and we release the full rubric and a taxonomy of calibrated detection signatures.
Read more →

Simulating Complex Multi-Turn Tool Calling Interactions in Stateless Execution Environments

arXiv:2601.19914v1 Announce Type: cross Abstract: Synthetic data has proven itself to be a valuable resource for tuning smaller, cost-effective language models to handle the complexities of multi-turn tool calling conversations. While many frameworks and systems for producing synthetic multi-turn tool calling data have been proposed, prior works have frequently assumed that any tool calling interactions will take place in an execution environment that maintains state. When such an environment is available, this is advantageous as it allows for the validity of an interaction to be determined by whether or not the state of the execution environment matches to some prespecified objective. Unfortunately, this does not hold in many real-world tool use settings, e.g., in enterprise settings where data security is of the utmost importance or in cases where tool specifications are synthesized from multiple sources. In this work, we address this gap by introducing a data generation method, DiGiT-TC, that is designed to produce tool calling conversations that have the characteristics of conversations generated through search in a stateful environment. The key to our technique lies in a novel generation pattern that allows our approach to implicitly represent certain tool calls in the user request. We validate our approach on standard tool calling benchmarks and demonstrate that, even in stateful problem settings, our approach results in strong performance gains.
Read more →

Modeling Next-Token Prediction as Left-Nested Intuitionistic Implication

arXiv:2601.19915v1 Announce Type: cross Abstract: We introduce the Arrow Language Model, a neural architecture derived from an intuitionistic-logic interpretation of next-token prediction. Instead of representing tokens as additive embeddings mixed by attention, we encode a prefix as a left-nested implication chain whose structure preserves order through non-commutative composition. Next-token prediction corresponds to modus ponens, and sequence processing becomes constructive proof extension under the Curry–Howard correspondence. Our Prolog-based specialized theorem provers validate fundamental properties of the neural models, among them the relations between commutative vs. non-commutative sequencing and single-token vs. multi-token prediction choices. We show that a neural architecture equivalent to multiplicative RNNs arises naturally from a proof-theoretic interpretation of next-token prediction as nested intuitionistic implication, present a practical low-rank neural realization, and position the model relative to Transformers and state-space models. Keywords: logic-based derivation of neural architectures, intuitionistic implicational logic, token-as-operator neural models, state-space models, alternatives to transformer-based foundational models.
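A small numerical sketch of the token-as-operator view, assuming each token is a small matrix applied to a state by ordered (non-commutative) products; dimensions, vocabulary, and readout are placeholders rather than the paper's construction:

```python
import numpy as np

# Sketch: tokens as operators. A prefix is encoded by the ordered product of
# per-token matrices, so composition is non-commutative and order-preserving.
# Vocabulary, dimensions, and the readout are illustrative placeholders.

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat"]
d = 8
ops = {w: np.eye(d) + 0.1 * rng.standard_normal((d, d)) for w in vocab}
readout = rng.standard_normal((len(vocab), d))
init_state = np.ones(d) / np.sqrt(d)

def encode(prefix):
    state = init_state
    for tok in prefix:
        state = ops[tok] @ state            # left-nested application
    return state

def next_token_scores(prefix):
    return readout @ encode(prefix)         # modus-ponens-style readout

print(next_token_scores(["the", "cat"]))
print(next_token_scores(["cat", "the"]))    # differs: composition is order-sensitive
```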
Read more →

FastWhisper: Adaptive Self-knowledge Distillation for Real-time Automatic Speech Recognition

arXiv:2601.19919v1 Announce Type: cross Abstract: Knowledge distillation is one of the most effective methods for model compression. Previous studies have focused on the student model effectively learning the predictive distribution of the teacher model. However, during training, the student model may inherit the shortcomings of the teacher model, which can lead to a decline in generalization capacity. To mitigate this issue, we propose adaptive self-knowledge distillation (ASKD), which dynamically reduces the dependence on the teacher model to improve the self-training capacity, and applies self-knowledge distillation to improve the generalization capacity of the student model. We further distill the Whisper model into a smaller variant, called FastWhisper. In our post-training setting, FastWhisper achieved a word error rate 1.07% lower than that of the teacher model Whisper, and its inference was about 5 times faster.
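A hedged sketch of a distillation loss whose reliance on the teacher decays during training while a self-distillation term grows; the schedule and the use of an EMA student as the self-target are assumptions, not the paper's exact ASKD formulation:

```python
import torch
import torch.nn.functional as F

# Sketch: a distillation objective whose teacher weight shrinks over training,
# in the spirit of adaptive self-knowledge distillation. The linear schedule and
# the self-distillation target (the student's own EMA predictions) are assumptions.

def askd_loss(student_logits, teacher_logits, ema_student_logits, labels,
              step, total_steps, T=2.0):
    ce = F.cross_entropy(student_logits, labels)
    kd_teacher = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                          F.softmax(teacher_logits / T, dim=-1),
                          reduction="batchmean") * T * T
    kd_self = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                       F.softmax(ema_student_logits / T, dim=-1),
                       reduction="batchmean") * T * T
    w = 1.0 - step / total_steps            # teacher influence shrinks over time
    return ce + w * kd_teacher + (1.0 - w) * kd_self

B, C = 4, 10
print(askd_loss(torch.randn(B, C), torch.randn(B, C), torch.randn(B, C),
                torch.randint(0, C, (B,)), step=300, total_steps=1000))
```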
Read more →

Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

arXiv:2601.19921v1 Announce Type: cross Abstract: Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. First, a diversity-aware initialisation that selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. Second, a confidence-modulated debate protocol in which agents express calibrated confidence and condition their updates on others' confidence. We show theoretically that diversity-aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence-modulated updates enable debate to systematically drift to the correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM-based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.
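A minimal sketch of a confidence-modulated update, assuming agents report calibrated confidences and low-confidence agents defer to the confidence-weighted consensus; the paper's actual protocol may differ:

```python
from collections import defaultdict

# Sketch: confidence-modulated aggregation in multi-agent debate. Each agent
# reports (answer, calibrated confidence); agents update toward the answer
# with the highest confidence-weighted support. The update rule is illustrative.

def confidence_vote(reports):
    """reports: list of (answer, confidence in [0, 1])."""
    support = defaultdict(float)
    for answer, conf in reports:
        support[answer] += conf
    return max(support, key=support.get)

def debate_round(reports, switch_margin=0.2):
    winner = confidence_vote(reports)
    updated = []
    for answer, conf in reports:
        # Low-confidence agents defer to the confidence-weighted consensus.
        if answer != winner and conf < 1.0 - switch_margin:
            updated.append((winner, conf))
        else:
            updated.append((answer, conf))
    return updated

round0 = [("A", 0.9), ("B", 0.55), ("B", 0.5), ("A", 0.8)]
print(confidence_vote(round0))   # 'A' wins on weighted support (1.7 vs 1.05)
print(debate_round(round0))
```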
Read more →

HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

arXiv:2601.19922v1 Announce Type: cross Abstract: Supportive conversation depends on skills that go beyond language fluency, including reading emotions, adjusting tone, and navigating moments of resistance, frustration, or distress. Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans. We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations. For each dialogue history, we pair human and model responses and evaluate them through blinded human raters and an ensemble of LLM-as-judge evaluators. All assessments follow a rubric grounded in interpersonal communication science across five dimensions: Human Alignment, Empathic Responsiveness, Attunement, Resonance, and Task-Following. HEART uncovers striking behavioral patterns. Several frontier models approach or surpass the average human responses in perceived empathy and consistency. At the same time, humans maintain advantages in adaptive reframing, tension-naming, and nuanced tone shifts, particularly in adversarial turns. Human and LLM-as-judge preferences align on about 80 percent of pairwise comparisons, matching inter-human agreement, and their written rationales emphasize similar HEART dimensions. This pattern suggests an emerging convergence in the criteria used to assess supportive quality. By placing humans and models on equal footing, HEART reframes supportive dialogue as a distinct capability axis, separable from general reasoning or linguistic fluency. It provides a unified empirical foundation for understanding where model-generated support aligns with human social judgment, where it diverges, and how affective conversational competence scales with model size.
Read more →

Table-BiEval: A Self-Supervised, Dual-Track Framework for Decoupling Structure and Content in LLM Evaluation

arXiv:2601.19923v1 Announce Type: cross Abstract: As Large Language Models (LLMs) evolve into autonomous agents, the capability to faithfully translate natural language into rigorous structured formats, essential for tool invocation, and to convert complex tabular information into machine-readable specifications has become paramount. However, current evaluations lack effective methodologies to measure this structural fidelity without costly human intervention, as traditional text metrics fail to detect semantic drift in code-like outputs. This paper proposes Table-BiEval, a novel approach based on a human-free, self-supervised evaluation framework, to assess LLM performance quantitatively. By leveraging deterministic Intermediate Representations, our framework calculates Content Semantic Accuracy and Normalized Tree Edit Distance to decouple structure from content. It also empirically evaluates 15 state-of-the-art LLMs across dual topological dimensions: hierarchical structures and flat tables. The results reveal substantial variability, highlighting that mid-sized models can surprisingly outperform larger counterparts in structural efficiency and confirming that deep recursive nesting remains a universal bottleneck for current architectures.
Read more →

OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

arXiv:2601.19924v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, fostering a rapid expansion of new methodologies and evaluation benchmarks. However, the boundaries of their capabilities in automated formulation and problem solving remain poorly understood, particularly when extending to complex, real-world tasks. To bridge this gap, we propose OPT-ENGINE, an extensible benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels. OPT-ENGINE spans 10 canonical tasks across operations research, with five Linear Programming and five Mixed-Integer Programming tasks. Utilizing OPT-ENGINE, we conduct an extensive study of LLMs' reasoning capabilities, addressing two critical questions: 1) Does LLM performance remain robust when generalizing to out-of-distribution optimization tasks that scale in complexity beyond current benchmark levels? and 2) At what stage, from problem interpretation to solution generation, do current LLMs encounter the most significant bottlenecks? Our empirical results yield two key insights: first, tool-integrated reasoning with external solvers exhibits significantly higher robustness as task complexity escalates, while pure-text reasoning reaches a ceiling; second, the automated formulation of constraints constitutes the primary performance bottleneck. These findings provide actionable guidance for developing next-generation LLMs for advanced optimization. Our code is publicly available at https://github.com/Cardinal-Operations/OPTEngine.
Read more →

Evaluating Large Language Models for Abstract Evaluation Tasks: An Empirical Study

arXiv:2601.19925v1 Announce Type: cross Abstract: Introduction: Large language models (LLMs) can process requests and generate texts, but their feasibility for assessing complex academic content needs further investigation. To explore LLM's potential in assisting scientific review, this study examined ChatGPT-5, Gemini-3-Pro, and Claude-Sonnet-4.5's consistency and reliability in evaluating abstracts compared to one another and to human reviewers. Methods: 160 abstracts from a local conference were graded by human reviewers and three LLMs using one rubric. Composite score distributions across three LLMs and fourteen reviewers were examined. Inter-rater reliability was calculated using intraclass correlation coefficients (ICCs) for within-AI reliability and AI-human concordance. Bland-Altman plots were examined for visual agreement patterns and systematic bias. Results: LLMs achieved good-to-excellent agreement with each other (ICCs: 0.59-0.87). ChatGPT and Claude reached moderate agreement with human reviewers on overall quality and content-specific criteria, with ICCs ~.45-.60 for composite, impression, clarity, objective, and results. They exhibited fair agreement on subjective dimensions, with ICC ranging from 0.23-0.38 for impact, engagement, and applicability. Gemini showed fair agreement on half criteria and no reliability on impact and applicability. Three LLMs showed acceptable or negligible mean difference (ChatGPT=0.24, Gemini=0.42, Claude=-0.02) from the human mean composite scores. Discussion: LLMs could process abstracts in batches with moderate agreement with human experts on overall quality and objective criteria. With appropriate process architecture, they can apply a rubric consistently across volumes of abstracts exceeding feasibility for a human rater. The weaker performance on subjective dimensions indicates that AI should serve a complementary role in evaluation, while human expertise remains essential.
Read more →

The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models

arXiv:2601.19926v1 Announce Type: cross Abstract: We present a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models, reporting on 1,015 model results from a range of syntactic phenomena and interpretability methods. Our analysis shows that the state of the art presents a healthy variety of methods and data, but an over-focus on a single language (English), a single model (BERT), and phenomena that are easy to get at (like part of speech and agreement). Results also suggest that TLMs capture these form-oriented phenomena well, but show more variable and weaker performance on phenomena at the syntax-semantics interface, like binding or filler-gap dependencies. We provide recommendations for future work, in particular reporting complete data, better aligning theoretical constructs and methods across studies, increasing the use of mechanistic methods, and broadening the empirical scope regarding languages and linguistic phenomena.
Read more →

Stingy Context: 18:1 Hierarchical Code Compression for LLM Auto-Coding

arXiv:2601.19929v1 Announce Type: cross Abstract: We introduce Stingy Context, a hierarchical tree-based compression scheme achieving 18:1 reduction in LLM context for auto-coding tasks. Using our TREEFRAG exploit decomposition, we reduce a real source code base of 239k tokens to 11k tokens while preserving task fidelity. Empirical results across 12 Frontier models show 94 to 97% success on 40 real-world issues at low cost, outperforming flat methods and mitigating lost-in-the-middle effects.
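A toy version of hierarchical code compression, keeping only class/function signatures plus first docstring lines via Python's ast module; the actual TREEFRAG decomposition and the 18:1 ratio are the paper's, and this only illustrates the general idea:

```python
import ast

# Sketch: compress a source file to a hierarchical skeleton (signatures plus the
# first docstring line), the general idea behind tree-based context compression.

def skeleton(source: str) -> str:
    tree = ast.parse(source)
    lines = []

    def visit(node, depth=0):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                args = ""
                if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    args = "(" + ", ".join(a.arg for a in child.args.args) + ")"
                doc = (ast.get_docstring(child) or "").splitlines()
                summary = f"  # {doc[0]}" if doc else ""
                lines.append("    " * depth + f"{child.name}{args}{summary}")
                visit(child, depth + 1)

    visit(tree)
    return "\n".join(lines)

src = '''
class Cache:
    """LRU cache over disk blocks."""
    def get(self, key):
        """Return a block or None."""
        return self.store.get(key)
'''
print(skeleton(src))   # two-line skeleton instead of the full source
```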
Read more →

SDUs DAISY: A Benchmark for Danish Culture

arXiv:2601.19930v1 Announce Type: cross Abstract: We introduce Daisy, a new benchmark for Danish culture via cultural heritage, based on the curated topics of the Danish Culture Canon 2006. For each artifact in the culture canon, we query the corresponding Wikipedia page and have a language model generate random questions. This yields a sampling strategy with a mix of central and peripheral questions for each work, covering not only mainstream information but also the in-depth cornerstones of Danish cultural heritage as defined by the Canon committee. Each question-answer pair is human-approved or corrected; the final dataset consists of 741 close-ended question-answer pairs covering topics ranging from archaeological findings from 1300 BC and poems and musical pieces from the 1700s to contemporary pop music and Danish design and architecture.
Read more →

Text-to-State Mapping for Non-Resolution Reasoning: The Contradiction-Preservation Principle

arXiv:2601.19933v1 Announce Type: cross Abstract: Non-Resolution Reasoning (NRR) provides a formal framework for maintaining semantic ambiguity rather than forcing premature interpretation collapse. While the foundational architecture establishes state spaces and operators for ambiguity-preserving computation, the critical question of how natural language maps to these mathematical structures remains open. This paper introduces the text-to-state mapping function {\phi} that transforms linguistic input into superposition states within the NRR framework. We formalize the Contradiction-Preservation Principle, which requires that genuinely ambiguous expressions maintain non-zero entropy in their state representations, and develop extraction protocols using existing Large Language Models as interpretation generators. Empirical validation across 68 test sentences spanning lexical, structural, and pragmatic ambiguity demonstrates that our mapping achieves mean Shannon entropy H(S) = 1.087 bits for ambiguous inputs while baseline single-interpretation approaches yield H(S) = 0.000. The framework provides the missing algorithmic bridge between raw text and the formal state spaces on which NRR operators act, enabling architectural collapse deferment in language model inference.
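The entropy figure the abstract reports is plain Shannon entropy over an interpretation distribution; a minimal sketch with made-up interpretations and weights:

```python
import math

# Sketch: Shannon entropy of a superposition state over candidate interpretations.
# The interpretations and weights are illustrative; in the paper an LLM proposes
# them and the mapping phi turns the text into this weighted state.

def shannon_entropy(weights):
    z = sum(weights)
    ps = [w / z for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in ps)

# "I saw her duck" -- two retained readings instead of a collapsed one.
state = {
    "observed the bird she owns": 0.55,
    "observed her lowering her head": 0.45,
}
print(round(shannon_entropy(state.values()), 3))   # ~0.993 bits
baseline = {"observed the bird she owns": 1.0}
print(shannon_entropy(baseline.values()))          # 0.0 bits (collapsed)
```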
Read more →

Quantifying non deterministic drift in large language models

arXiv:2601.19934v1 Announce Type: cross Abstract: Large language models (LLMs) are widely used for tasks ranging from summarisation to decision support. In practice, identical prompts do not always produce identical outputs, even when temperature and other decoding parameters are fixed. In this work, we conduct repeated-run experiments to empirically quantify baseline behavioural drift, defined as output variability observed when the same prompt is issued multiple times under operator-free conditions. We evaluate two publicly accessible models, gpt-4o-mini and llama3.1-8b, across five prompt categories using exact repeats, perturbed inputs, and reuse modes at temperatures of 0.0 and 0.7. Drift is measured using unique output fractions, lexical similarity, and word count statistics, enabling direct comparison across models, prompting modes, and deployment types. The results show that nondeterminism persists even at temperature 0.0, with distinct variability patterns by model size, deployment, and prompt type. We situate these findings within existing work on concept drift, behavioural drift, and infrastructure-induced nondeterminism, discuss the limitations of lexical metrics, and highlight emerging semantic approaches. By establishing a systematic empirical baseline in the absence of stabilisation techniques, this study provides a reference point for evaluating future drift mitigation and control methods.
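A minimal sketch of the simplest drift metrics over repeated runs, unique-output fraction and mean pairwise lexical (Jaccard) similarity; the sample outputs are placeholders, not model transcripts:

```python
from itertools import combinations

# Sketch: two drift metrics computed from repeated identical prompts --
# unique-output fraction and mean pairwise Jaccard similarity over word sets.

def unique_fraction(outputs):
    return len(set(outputs)) / len(outputs)

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def mean_pairwise_similarity(outputs):
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

runs = [
    "The report covers three risks.",
    "The report covers three risks.",
    "The report highlights three key risks.",
    "Three risks are covered in the report.",
]
print(unique_fraction(runs))                       # 0.75
print(round(mean_pairwise_similarity(runs), 3))    # lexical stability across runs
```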
Read more →

Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents

arXiv:2601.19935v1 Announce Type: cross Abstract: Large Language Model (LLM)-based agents are increasingly deployed for complex, tool-based tasks where long-term memory is critical to driving actions. Existing benchmarks, however, primarily test an agent's ability to passively retrieve isolated facts in response to explicit questions. They fail to evaluate the more crucial capability of actively applying memory to execute tasks. To address this gap, we introduce Mem2ActBench, a benchmark for evaluating whether agents can proactively leverage long-term memory to execute tool-based actions by selecting appropriate tools and grounding their parameters. The benchmark simulates persistent assistant usage, where users mention the same topic across long, interrupted interactions and expect previously established preferences and task states to be implicitly applied. We build the dataset with an automated pipeline that merges heterogeneous sources (ToolACE, BFCL, Oasst1), resolves conflicts via consistency modeling, and synthesizes 2,029 sessions with 12 user-assistant-tool turns on average. From these memory chains, a reverse-generation method produces 400 tool-use tasks, with human evaluation confirming 91.3% are strongly memory-dependent. Experiments on seven memory frameworks show that current systems remain inadequate at actively utilizing memory for parameter grounding, highlighting the need for more effective approaches to evaluate and improve memory application in task execution.
Read more →

Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data

arXiv:2601.19936v1 Announce Type: cross Abstract: The opacity of massive pretraining corpora in Large Language Models (LLMs) raises significant privacy and copyright concerns, making pretraining data detection a critical challenge. Existing state-of-the-art methods typically rely on token likelihoods, yet they often overlook the divergence from the model's top-1 prediction and local correlation between adjacent tokens. In this work, we propose Gap-K%, a novel pretraining data detection method grounded in the optimization dynamics of LLM pretraining. By analyzing the next-token prediction objective, we observe that discrepancies between the model's top-1 prediction and the target token induce strong gradient signals, which are explicitly penalized during training. Motivated by this, Gap-K% leverages the log probability gap between the top-1 predicted token and the target token, incorporating a sliding window strategy to capture local correlations and mitigate token-level fluctuations. Extensive experiments on the WikiMIA and MIMIR benchmarks demonstrate that Gap-K% achieves state-of-the-art performance, consistently outperforming prior baselines across various model sizes and input lengths.
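A hedged sketch of the top-1 vs. target log-probability gap with a sliding window; the final aggregation shown (mean of the K% largest windowed gaps) is an assumption about how Gap-K% turns per-token gaps into a detection score:

```python
import torch

# Sketch: per-token gap between the model's top-1 log-prob and the target
# token's log-prob, smoothed with a sliding window, then aggregated over the
# K% largest windowed gaps. The aggregation direction is an assumption here.

def gap_k_score(logits, target_ids, window=5, k=0.2):
    logp = torch.log_softmax(logits, dim=-1)             # [T, V]
    top1 = logp.max(dim=-1).values                        # model's preferred token
    tgt = logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    gaps = top1 - tgt                                      # 0 when target was top-1
    # Sliding-window mean captures local correlation between adjacent tokens.
    smoothed = gaps.unfold(0, window, 1).mean(dim=-1)
    n = max(1, int(k * smoothed.numel()))
    return smoothed.topk(n).values.mean().item()           # larger => less member-like

T, V = 64, 1000
logits = torch.randn(T, V)
targets = torch.randint(0, V, (T,))
print(gap_k_score(logits, targets))
```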
Read more →

DecHW: Heterogeneous Decentralized Federated Learning Exploiting Second-Order Information

arXiv:2601.19938v1 Announce Type: cross Abstract: Decentralized Federated Learning (DFL) is a serverless collaborative machine learning paradigm where devices collaborate directly with neighbouring devices to exchange model information for learning a generalized model. However, variations in individual experiences and different levels of device interactions lead to data and model initialization heterogeneities across devices. Such heterogeneities leave variations in local model parameters across devices that lead to slower convergence. This paper tackles data and model heterogeneity by explicitly addressing parameter-level variations in evidential credence across local models. A novel aggregation approach is introduced that captures these parameter variations in local models and performs robust aggregation of neighbourhood local updates. Specifically, consensus weights are generated via approximation of second-order information of local models on their local datasets. These weights are used to scale neighbourhood updates before aggregating them into a global neighbourhood representation. In extensive experiments on computer vision tasks, the proposed approach shows strong generalizability of local models at reduced communication costs.
Read more →

Continuous-Flow Data-Rate-Aware CNN Inference on FPGA

arXiv:2601.19940v1 Announce Type: cross Abstract: Among hardware accelerators for deep-learning inference, data flow implementations offer low latency and high throughput capabilities. In these architectures, each neuron is mapped to a dedicated hardware unit, making them well-suited for field-programmable gate array (FPGA) implementation. Previous unrolled implementations mostly focus on fully connected networks because of their simplicity, although it is well known that convolutional neural networks (CNNs) require fewer computations for the same accuracy. In CNNs, pooling layers and convolutional layers with a stride larger than one reduce the amount of data at their output relative to their input. This data reduction strongly affects the data rate in a fully parallel implementation, leaving hardware units heavily underutilized unless it is handled properly. This work addresses this issue by analyzing the data flow of CNNs and presents a novel approach to designing data-rate-aware, continuous-flow CNN architectures. The proposed approach ensures hardware utilization close to 100% by interleaving low-data-rate signals and sharing hardware units, as well as using the right parallelization to achieve the throughput of a fully parallel implementation. The results show that a significant amount of arithmetic logic can be saved, which allows implementing complex CNNs like MobileNet on a single FPGA with high throughput.
Read more →

Bench4HLS: End-to-End Evaluation of LLMs in High-Level Synthesis Code Generation

arXiv:2601.19941v1 Announce Type: cross Abstract: Over the last two years, large language models (LLMs) have shown strong capabilities in code generation, including hardware design at the register-transfer level (RTL). While their use in high-level synthesis (HLS) remains comparatively less mature, the ratio of HLS- to RTL-focused studies has shifted from 1:10 to 2:10 in the past six months, indicating growing interest in leveraging LLMs for high-level design entry while relying on downstream synthesis for optimization. This growing trend highlights the need for a comprehensive benchmarking and evaluation framework dedicated to LLM-based HLS. To address this, we present Bench4HLS for evaluating LLM-generated HLS designs. Bench4HLS comprises 170 manually drafted and validated case studies, spanning small kernels to complex accelerators, curated from widely used public repositories. The framework supports fully automated assessment of compilation success, functional correctness via simulation, and synthesis feasibility/optimization. Crucially, Bench4HLS integrates a pluggable API for power, performance, and area (PPA) analysis across various HLS toolchains and architectures, demonstrated here with Xilinx Vitis HLS and validated on Catapult HLS. By providing a structured, extensible, and plug-and-play testbed, Bench4HLS establishes a foundational methodology for benchmarking LLMs in HLS workflows.
Read more →

Benchmarking ASR Models in the German Medical Context: A Performance Analysis Based on Medical History Interviews

arXiv:2601.19945v1 Announce Type: cross Abstract: Automatic Speech Recognition (ASR) offers significant potential to reduce the workload of medical personnel, for example, through the automation of documentation tasks. While numerous benchmarks exist for the English language, specific evaluations for the German-speaking medical context are still lacking, particularly regarding the inclusion of dialects. In this article, we present a curated dataset of simulated doctor-patient conversations and evaluate a total of 29 different ASR models. The evaluated set encompasses both open-weights models from the Whisper, Voxtral, and Wav2Vec2 families and commercial state-of-the-art APIs (AssemblyAI, Deepgram). For evaluation, we utilize three different metrics (WER, CER, BLEU) and provide an outlook on qualitative semantic analysis. The results demonstrate significant performance differences between the models: while the best systems already achieve very good Word Error Rates (WER), in some cases below 3%, the error rates of other models are considerably higher, especially for medical terminology and dialect-influenced variations.
Read more →

NCSAM: Noise-Compensated Sharpness-Aware Minimization for Noisy Label Learning

arXiv:2601.19947v1 Announce Type: cross Abstract: Learning from Noisy Labels (LNL) presents a fundamental challenge in deep learning, as real-world datasets often contain erroneous or corrupted annotations, e.g., data crawled from the Web. Current research focuses on sophisticated label correction mechanisms. In contrast, this paper adopts a novel perspective by establishing a theoretical analysis of the relationship between the flatness of the loss landscape and the presence of label noise. We theoretically demonstrate that carefully simulated label noise synergistically enhances both generalization performance and robustness to label noise. Consequently, we propose Noise-Compensated Sharpness-Aware Minimization (NCSAM), which leverages the perturbation of Sharpness-Aware Minimization (SAM) to remedy the damage caused by label noise. Our analysis reveals that the testing accuracy exhibits behavior similar to that observed on noise-free datasets. Extensive experimental results on multiple benchmark datasets demonstrate the consistent superiority of the proposed method over existing state-of-the-art approaches on diverse tasks.
Read more →

LTS-VoiceAgent: A Listen-Think-Speak Framework for Efficient Streaming Voice Interaction via Semantic Triggering and Incremental Reasoning

arXiv:2601.19952v1 Announce Type: cross Abstract: Real-time voice agents face a dilemma: end-to-end models often lack deep reasoning, while cascaded pipelines incur high latency by executing ASR, LLM reasoning, and TTS strictly in sequence, unlike human conversation where listeners often start thinking before the speaker finishes. Since cascaded architectures remain the dominant choice for complex tasks, existing cascaded streaming strategies attempt to reduce this latency via mechanical segmentation (e.g., fixed chunks, VAD-based splitting) or speculative generation, but they frequently either break semantic units or waste computation on predictions that must be rolled back. To address these challenges, we propose LTS-VoiceAgent, a Listen-Think-Speak framework that explicitly separates when to think from how to reason incrementally. It features a Dynamic Semantic Trigger to detect meaningful prefixes, and a Dual-Role Stream Orchestrator that coordinates a background Thinker (for state maintenance) and a foreground Speaker (for speculative solving). This parallel design enables "thinking while speaking" without blocking responses. We also introduce a Pause-and-Repair benchmark containing natural disfluencies to stress-test streaming robustness. Experiments across VERA, Spoken-MQA, BigBenchAudio, and our benchmark show that LTS-VoiceAgent achieves a stronger accuracy-latency-efficiency trade-off than serial cascaded baselines and existing streaming strategies.
Read more →

Probabilistic Sensing: Intelligence in Data Sampling

arXiv:2601.19953v1 Announce Type: cross Abstract: Extending the intelligence of sensors to the data-acquisition process - deciding whether to sample or not - can result in transformative energy-efficiency gains. However, making such a decision in a deterministic manner involves the risk of losing information. Here we present a sensing paradigm that enables making this decision in a probabilistic manner. The paradigm takes inspiration from the autonomic nervous system and employs a probabilistic neuron (p-neuron) driven by an analog feature extraction circuit. The response time of the system is on the order of microseconds, overcoming the sub-sampling-rate response time limit and enabling real-time, intelligent, autonomous activation of data sampling. Validation experiments on active seismic survey data demonstrate lossless probabilistic data acquisition, with a normalized mean squared error of 0.41%, and a 93% saving in the active operation time of the system and the number of generated samples.
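A toy sketch of the probabilistic acquisition decision described: a simple analog-style feature (short-window signal energy) drives a probabilistic neuron whose firing probability gates whether a sample is kept. The sigmoid parameterization and thresholds are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_neuron(feature, gain=8.0, bias=-4.0):
    """Probabilistic neuron: firing probability is a sigmoid of the analog feature."""
    p_fire = 1.0 / (1.0 + np.exp(-(gain * feature + bias)))
    return rng.random() < p_fire

def probabilistic_acquire(signal, window=16):
    """Decide, sample by sample, whether to acquire based on recent signal energy."""
    signal = np.asarray(signal, dtype=float)
    acquired = []
    for i in range(window, len(signal)):
        energy = np.mean(signal[i - window:i] ** 2)   # stand-in for the analog feature circuit
        if p_neuron(energy):
            acquired.append((i, signal[i]))
    return acquired
```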
Read more →

VoxPrivacy: A Benchmark for Evaluating Interactional Privacy of Speech Language Models

arXiv:2601.19956v1 Announce Type: cross Abstract: As Speech Language Models (SLMs) transition from personal devices to shared, multi-user environments such as smart homes, a new challenge emerges: the model is expected to distinguish between users to manage information flow appropriately. Without this capability, an SLM could reveal one user's confidential schedule to another, a privacy failure we term interactional privacy. Thus, the ability to generate speaker-aware responses becomes essential for SLM safe deployment. Current SLM benchmarks test dialogue ability but overlook speaker identity. Multi-speaker benchmarks check who said what without assessing whether SLMs adapt their responses. Privacy benchmarks focus on globally sensitive data (e.g., bank passwords) while neglecting contextual privacy-sensitive information (e.g., a user's private appointment). To address this gap, we introduce VoxPrivacy, the first benchmark designed to evaluate interactional privacy in SLMs. VoxPrivacy spans three tiers of increasing difficulty, from following direct secrecy commands to proactively protecting privacy. Our evaluation of nine SLMs on a 32-hour bilingual dataset reveals a widespread vulnerability: most open-source models perform close to random chance (around 50% accuracy) on conditional privacy decisions, while even strong closed-source systems fall short on proactive privacy inference. We further validate these findings on Real-VoxPrivacy, a human-recorded subset, confirming that failures observed on synthetic data persist in real speech. Finally, we demonstrate a viable path forward: by fine-tuning on a new 4,000-hour training set, we improve privacy-preserving abilities while maintaining robustness. To support future work, we release the VoxPrivacy benchmark, the large-scale training set, and the fine-tuned model to foster the development of safer and more context-aware SLMs.
Read more →

Do we really need Self-Attention for Streaming Automatic Speech Recognition?

arXiv:2601.19960v1 Announce Type: cross Abstract: Transformer-based architectures are the most widely used architectures in many deep learning fields such as Natural Language Processing, Computer Vision, and Speech Processing. This popularity can encourage the direct use of Transformers in constrained tasks without questioning whether they yield the same benefits as in standard tasks. Given specific constraints, it is essential to evaluate the relevance of transformer models. This work questions the suitability of transformers for specific domains. We argue that the high computational requirements and latency issues associated with these models do not align well with streaming applications. Our study promotes the search for alternative strategies to improve efficiency without sacrificing performance. In light of this observation, our paper critically examines the usefulness of the transformer architecture in such constrained environments. As a first attempt, we show that the computational cost of Streaming Automatic Speech Recognition (ASR) can be reduced by using deformable convolution instead of Self-Attention. Furthermore, we show that Self-Attention mechanisms can be entirely removed, and not replaced, without observing significant degradation in the Word Error Rate.
Read more →

MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference

arXiv:2601.19961v1 Announce Type: cross Abstract: We present MeanCache, a training-free caching framework for efficient Flow Matching inference. Existing caching methods reduce redundant computation but typically rely on instantaneous velocity information (e.g., feature caching), which often leads to severe trajectory deviations and error accumulation under high acceleration ratios. MeanCache introduces an average-velocity perspective: by leveraging cached Jacobian-vector products (JVP) to construct interval average velocities from instantaneous velocities, it effectively mitigates local error accumulation. To further improve cache timing and JVP reuse stability, we develop a trajectory-stability scheduling strategy as a practical tool, employing a Peak-Suppressed Shortest Path under budget constraints to determine the schedule. Experiments on FLUX.1, Qwen-Image, and HunyuanVideo demonstrate that MeanCache achieves 4.12X, 4.56X, and 3.59X acceleration, respectively, while consistently outperforming state-of-the-art caching baselines in generation quality. We believe this simple yet effective approach provides a new perspective for Flow Matching inference and will inspire further exploration of stability-driven acceleration in commercial-scale generative models.
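One way to read the average-velocity idea is as a first-order expansion of the velocity along the trajectory, obtained via a Jacobian-vector product; the sketch below shows that reading and is an assumption on my part, not the paper's exact estimator, caching scheme, or scheduler.

```python
import torch

def average_velocity(velocity_net, x, t, dt):
    """
    First-order estimate of the mean velocity over [t, t + dt]:
        v_avg ~ v(x, t) + (dt / 2) * d/dt v(x(t), t),
    where the total derivative along the trajectory is a Jacobian-vector
    product with tangent (v, 1), since dx/dt = v. Illustrative only.
    """
    v = velocity_net(x, t)
    # forward-mode JVP; evaluates the network once more internally
    _, dv_dt = torch.func.jvp(lambda x_, t_: velocity_net(x_, t_),
                              (x, t), (v, torch.ones_like(t)))
    return v + 0.5 * dt * dv_dt

# Illustrative usage: one cheap Euler step with the averaged velocity.
# x_next = x + dt * average_velocity(velocity_net, x, t, dt)
```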
Read more →

Cross-Session Decoding of Neural Spiking Data via Task-Conditioned Latent Alignment

arXiv:2601.19963v1 Announce Type: cross Abstract: Cross-session nonstationarity in neural activity recorded by implanted electrodes is a major challenge for invasive Brain-computer interfaces (BCIs), as decoders trained on data from one session often fail to generalize to subsequent sessions. This issue is further exacerbated in practice, as retraining or adapting decoders becomes particularly challenging when only limited data are available from a new session. To address this challenge, we propose a Task-Conditioned Latent Alignment framework (TCLA) for cross-session neural decoding. Building upon an autoencoder architecture, TCLA first learns a low-dimensional representation of neural dynamics from a source session with sufficient data. For target sessions with limited data, TCLA then aligns target latent representations to the source in a task-conditioned manner, enabling effective transfer of learned neural dynamics. We evaluate TCLA on the macaque motor and oculomotor center-out dataset. Compared to baseline methods trained solely on target-session data, TCLA consistently improves decoding performance across datasets and decoding settings, with gains in the coefficient of determination of up to 0.386 for y coordinate velocity decoding in a motor dataset. These results suggest that TCLA provides an effective strategy for transferring knowledge from source to target sessions, enabling more robust neural decoding under conditions with limited data.
Read more →

Perturbation-Induced Linearization: Constructing Unlearnable Data with Solely Linear Classifiers

arXiv:2601.19967v1 Announce Type: cross Abstract: Collecting web data to train deep models has become increasingly common, raising concerns about unauthorized data usage. To mitigate this issue, unlearnable examples introduce imperceptible perturbations into data, preventing models from learning effectively. However, existing methods typically rely on deep neural networks as surrogate models for perturbation generation, resulting in significant computational costs. In this work, we propose Perturbation-Induced Linearization (PIL), a computationally efficient yet effective method that generates perturbations using only linear surrogate models. PIL achieves comparable or better performance than existing surrogate-based methods while reducing computational time dramatically. We further reveal a key mechanism underlying unlearnable examples: inducing linearization in deep models, which explains why PIL can achieve competitive results in a very short time. Beyond this, we provide an analysis of the properties of unlearnable examples under percentage-based partial perturbation. Our work not only provides a practical approach for data protection but also offers insights into what makes unlearnable examples effective.
Read more →

On the Effectiveness of LLM-Specific Fine-Tuning for Detecting AI-Generated Text

arXiv:2601.20006v1 Announce Type: cross Abstract: The rapid progress of large language models has enabled the generation of text that closely resembles human writing, creating challenges for authenticity verification in education, publishing, and digital security. Detecting AI-generated text has therefore become a crucial technical and ethical issue. This paper presents a comprehensive study of AI-generated text detection based on large-scale corpora and novel training strategies. We introduce a 1-billion-token corpus of human-authored texts spanning multiple genres and a 1.9-billion-token corpus of AI-generated texts produced by prompting a variety of LLMs across diverse domains. Using these resources, we develop and evaluate numerous detection models and propose two novel training paradigms: Per LLM and Per LLM family fine-tuning. Across a 100-million-token benchmark covering 21 large language models, our best fine-tuned detector achieves up to 99.6% token-level accuracy, substantially outperforming existing open-source baselines.
Read more →

LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them?

arXiv:2601.20009v1 Announce Type: cross Abstract: Despite multilingual pretraining, large language models often struggle with non-English tasks, particularly in language control, the ability to respond in the intended language. We identify and characterize two key failure modes: the multilingual transfer bottleneck (correct language, incorrect task response) and the language consistency bottleneck (correct task response, wrong language). To systematically surface these issues, we design a four-scenario evaluation protocol spanning MMLU, MGSM, and XQuAD benchmarks. To probe these issues with interpretability, we extend logit lens analysis to track language probabilities layer by layer and compute cross-lingual semantic similarity of hidden states. The results reveal a three-phase internal structure: early layers align inputs into a shared semantic space, middle layers perform task reasoning, and late layers drive language-specific generation. Guided by these insights, we introduce selective fine-tuning of only the final layers responsible for language control. On Qwen-3-32B and Bloom-7.1B, this method achieves over 98 percent language consistency across six languages while fine-tuning only 3-5 percent of parameters, without sacrificing task accuracy. Importantly, this result is nearly identical to that of full-scope fine-tuning (for example, above 98 percent language consistency for both methods across all prompt scenarios) but uses a fraction of the computational resources. To the best of our knowledge, this is the first approach to leverage layer-localization of language control for efficient multilingual adaptation.
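A rough sketch of the layer-wise logit-lens probe described: project each layer's hidden state through the model's final norm and unembedding, then sum probability mass over a set of language-marker token ids. The Llama-style attribute names (model.model.norm, lm_head) and the language_token_ids mapping are assumptions about the model's internals.

```python
import torch

@torch.no_grad()
def language_prob_by_layer(model, tokenizer, prompt, language_token_ids):
    """
    Logit-lens style probe: for each layer, how much probability mass the
    intermediate hidden state already places on tokens of a target language.
    Assumes an HF causal LM exposing hidden_states, model.model.norm, lm_head.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    per_layer = []
    for h in out.hidden_states:              # embeddings + one tensor per layer, each (1, T, d)
        h_last = model.model.norm(h[:, -1])  # final norm applied to the last position
        probs = torch.softmax(model.lm_head(h_last), dim=-1)
        per_layer.append(probs[0, language_token_ids].sum().item())
    return per_layer                          # one value per hidden state
```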
Read more →

Structural Compositional Function Networks: Interpretable Functional Compositions for Tabular Discovery

arXiv:2601.20037v1 Announce Type: cross Abstract: Despite the ubiquity of tabular data in high-stakes domains, traditional deep learning architectures often struggle to match the performance of gradient-boosted decision trees while maintaining scientific interpretability. Standard neural networks typically treat features as independent entities, failing to exploit the inherent manifold structural dependencies that define tabular distributions. We propose Structural Compositional Function Networks (StructuralCFN), a novel architecture that imposes a Relation-Aware Inductive Bias via a differentiable structural prior. StructuralCFN explicitly models each feature as a mathematical composition of its counterparts through Differentiable Adaptive Gating, which automatically discovers the optimal activation physics (e.g., attention-style filtering vs. inhibitory polarity) for each relationship. Our framework enables Structured Knowledge Integration, allowing domain-specific relational priors to be injected directly into the architecture to guide discovery. We evaluate StructuralCFN across a rigorous 10-fold cross-validation suite on 18 benchmarks, demonstrating statistically significant improvements (p < 0.05) on scientific and clinical datasets (e.g., Blood Transfusion, Ozone, WDBC). Furthermore, StructuralCFN provides Intrinsic Symbolic Interpretability: it recovers the governing "laws" of the data manifold as human-readable mathematical expressions while maintaining a compact parameter footprint (300-2,500 parameters) that is over an order of magnitude (10x-20x) smaller than standard deep baselines.
Read more →

CiMRAG: Cim-Aware Domain-Adaptive and Noise-Resilient Retrieval-Augmented Generation for Edge-Based LLMs

arXiv:2601.20041v1 Announce Type: cross Abstract: Personalized virtual assistants powered by large language models (LLMs) on edge devices are attracting growing attention, with Retrieval-Augmented Generation (RAG) emerging as a key method for personalization by retrieving relevant profile data and generating tailored responses. However, deploying RAG on edge devices faces efficiency hurdles due to the rapid growth of profile data, such as user-LLM interactions and recent updates. While Computing-in-Memory (CiM) architectures mitigate this bottleneck by eliminating data movement between memory and processing units via in-situ operations, they are susceptible to environmental noise that can degrade retrieval precision. This poses a critical issue in dynamic, multi-domain edge-based scenarios (e.g., travel, medicine, and law) where both accuracy and adaptability are paramount. To address these challenges, we propose Task-Oriented Noise-resilient Embedding Learning (TONEL), a framework that improves noise robustness and domain adaptability for RAG in noisy edge environments. TONEL employs a noise-aware projection model to learn task-specific embeddings compatible with CiM hardware constraints, enabling accurate retrieval under noisy conditions. Extensive experiments conducted on personalization benchmarks demonstrate the effectiveness and practicality of our methods relative to strong baselines, especially in task-specific noisy scenarios.
Read more →

Size Matters: Reconstructing Real-Scale 3D Models from Monocular Images for Food Portion Estimation

arXiv:2601.20051v1 Announce Type: cross Abstract: The rise of chronic diseases related to diet, such as obesity and diabetes, emphasizes the need for accurate monitoring of food intake. While AI-driven dietary assessment has made strides in recent years, the ill-posed nature of recovering size (portion) information from monocular images for accurate estimation of "how much did you eat?" is a pressing challenge. Some 3D reconstruction methods have achieved impressive geometric reconstruction but fail to recover the crucial real-world scale of the reconstructed object, limiting its usage in precision nutrition. In this paper, we bridge the gap between 3D computer vision and digital health by proposing a method that recovers a true-to-scale 3D reconstructed object from a monocular image. Our approach leverages rich visual features extracted from models trained on large-scale datasets to estimate the scale of the reconstructed object. This learned scale enables us to convert single-view 3D reconstructions into true-to-life, physically meaningful models. Extensive experiments and ablation studies on two publicly available datasets show that our method consistently outperforms existing techniques, achieving nearly a 30% reduction in mean absolute volume-estimation error, showcasing its potential to enhance the domain of precision nutrition. Code: https://gitlab.com/viper-purdue/size-matters
Read more →

VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning

arXiv:2601.20055v1 Announce Type: cross Abstract: Despite the syntactic fluency of Large Language Models (LLMs), ensuring their logical correctness in high-stakes domains remains a fundamental challenge. We present a neurosymbolic framework that combines LLMs with SMT solvers to produce verification-guided answers through iterative refinement. Our approach decomposes LLM outputs into atomic claims, autoformalizes them into first-order logic, and verifies their logical consistency using automated theorem proving. We introduce three key innovations: (1) multi-model consensus via formal semantic equivalence checking to ensure logic-level alignment between candidates, eliminating the syntactic bias of surface-form metrics, (2) semantic routing that directs different claim types to appropriate verification strategies: symbolic solvers for logical claims and LLM ensembles for commonsense reasoning, and (3) precise logical error localization via Minimal Correction Subsets (MCS), which pinpoint the exact subset of claims to revise, transforming binary failure signals into actionable feedback. Our framework classifies claims by their logical status and aggregates multiple verification signals into a unified score with variance-based penalty. The system iteratively refines answers using structured feedback until acceptance criteria are met or convergence is achieved. This hybrid approach delivers formal guarantees where possible and consensus verification elsewhere, advancing trustworthy AI. With the GPT-OSS-120B model, VERGE demonstrates an average performance uplift of 18.7% at convergence across a set of reasoning benchmarks compared to single-pass approaches.
Read more →

Semi-Supervised Masked Autoencoders: Unlocking Vision Transformer Potential with Limited Data

arXiv:2601.20072v1 Announce Type: cross Abstract: We address the challenge of training Vision Transformers (ViTs) when labeled data is scarce but unlabeled data is abundant. We propose Semi-Supervised Masked Autoencoder (SSMAE), a framework that jointly optimizes masked image reconstruction and classification using both unlabeled and labeled samples with dynamically selected pseudo-labels. SSMAE introduces a validation-driven gating mechanism that activates pseudo-labeling only after the model achieves reliable, high-confidence predictions that are consistent across both weakly and strongly augmented views of the same image, reducing confirmation bias. On CIFAR-10 and CIFAR-100, SSMAE consistently outperforms supervised ViT and fine-tuned MAE, with the largest gains in low-label regimes (+9.24% over ViT on CIFAR-10 with 10% labels). Our results demonstrate that when pseudo-labels are introduced is as important as how they are generated for data-efficient transformer training. Codes are available at https://github.com/atik666/ssmae.
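A hedged sketch of the gating step described: pseudo-labels are used only once validation accuracy clears a threshold, and only for unlabeled samples where weak and strong augmentations agree with high confidence. The threshold values and function names are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def gated_pseudo_labels(model, weak_batch, strong_batch,
                        val_accuracy, val_gate=0.7, conf_thresh=0.95):
    """
    Returns (inputs, pseudo_labels) for the confident, consistent subset of an
    unlabeled batch, or empty tensors while the validation-driven gate is closed.
    """
    if val_accuracy < val_gate:                        # gate: pseudo-labels not yet trusted
        return strong_batch[:0], torch.empty(0, dtype=torch.long)

    with torch.no_grad():
        p_weak = F.softmax(model(weak_batch), dim=-1)
        p_strong = F.softmax(model(strong_batch), dim=-1)

    conf, labels_weak = p_weak.max(dim=-1)
    keep = (conf >= conf_thresh) & (labels_weak == p_strong.argmax(dim=-1))
    return strong_batch[keep], labels_weak[keep]       # train with cross-entropy on this subset
```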
Read more →

LLaTTE: Scaling Laws for Multi-Stage Sequence Modeling in Large-Scale Ads Recommendation

arXiv:2601.20083v1 Announce Type: cross Abstract: We present LLaTTE (LLM-Style Latent Transformers for Temporal Events), a scalable transformer architecture for production ads recommendation. Through systematic experiments, we demonstrate that sequence modeling in recommendation systems follows predictable power-law scaling similar to LLMs. Crucially, we find that semantic features bend the scaling curve: they are a prerequisite for scaling, enabling the model to effectively utilize the capacity of deeper and longer architectures. To realize the benefits of continued scaling under strict latency constraints, we introduce a two-stage architecture that offloads the heavy computation of large, long-context models to an asynchronous upstream user model. We demonstrate that upstream improvements transfer predictably to downstream ranking tasks. Deployed as the largest user model at Meta, this multi-stage framework drives a 4.3% conversion uplift on Facebook Feed and Reels with minimal serving overhead, establishing a practical blueprint for harnessing scaling laws in industrial recommender systems.
Read more →

Dynamics of Human-AI Collective Knowledge on the Web: A Scalable Model and Insights for Sustainable Growth

arXiv:2601.20099v1 Announce Type: cross Abstract: Humans and large language models (LLMs) now co-produce and co-consume the web's shared knowledge archives. Such human-AI collective knowledge ecosystems contain feedback loops with both benefits (e.g., faster growth, easier learning) and systemic risks (e.g., quality dilution, skill reduction, model collapse). To understand such phenomena, we propose a minimal, interpretable dynamical model of the co-evolution of archive size, archive quality, model (LLM) skill, aggregate human skill, and query volume. The model captures two content inflows (human, LLM) controlled by a gate on LLM-content admissions, two learning pathways for humans (archive study vs. LLM assistance), and two LLM-training modalities (corpus-driven scaling vs. learning from human feedback). Through numerical experiments, we identify different growth regimes (e.g., healthy growth, inverted flow, inverted learning, oscillations), and show how platform and policy levers (gate strictness, LLM training, human learning pathways) shift the system across regime boundaries. Two domain configurations (PubMed, GitHub and Copilot) illustrate contrasting steady states under different growth rates and moderation norms. We also fit the model to Wikipedia's knowledge flow during pre-ChatGPT and post-ChatGPT eras separately. We find a rise in LLM additions with a concurrent decline in human inflow, consistent with a regime identified by the model. Our model and analysis yield actionable insights for sustainable growth of human-AI collective knowledge on the Web.
Read more →

Taming Toxic Talk: Using chatbots to intervene with users posting toxic comments

arXiv:2601.20100v1 Announce Type: cross Abstract: Generative AI chatbots have proven surprisingly effective at persuading people to change their beliefs and attitudes in lab settings. However, the practical implications of these findings are not yet clear. In this work, we explore the impact of rehabilitative conversations with generative AI chatbots on users who share toxic content online. Toxic behaviors, like insults or threats of violence, are widespread in online communities. Strategies to deal with toxic behavior are typically punitive, such as removing content or banning users. Rehabilitative approaches are rarely attempted, in part due to the emotional and psychological cost of engaging with aggressive users. In collaboration with seven large Reddit communities, we conducted a large-scale field experiment (N=893) to invite people who had recently posted toxic content to participate in conversations with AI chatbots. A qualitative analysis of the conversations shows that many participants engaged in good faith and even expressed remorse or a desire to change. However, we did not observe a significant change in toxic behavior in the following month compared to a control group. We discuss possible explanations for our findings, as well as theoretical and practical implications based on our results.
Read more →

Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis

arXiv:2601.20103v1 Announce Type: cross Abstract: Recent advances in reinforcement learning for code generation have made robust environments essential to prevent reward hacking. As LLMs increasingly serve as evaluators in code-based RL, their ability to detect reward hacking remains understudied. In this paper, we propose a novel taxonomy of reward exploits spanning 54 categories and introduce TRACE (Testing Reward Anomalies in Code Environments), a synthetically curated and human-verified benchmark containing 517 testing trajectories. Unlike prior work that evaluates reward hack detection in isolated classification scenarios, we contrast these evaluations with a more realistic, contrastive anomaly detection setup on TRACE. Our experiments reveal that models capture reward hacks more effectively in contrastive settings than in isolated classification settings, with GPT-5.2 in its highest reasoning mode achieving the best detection rate at 63%, up from 45% in isolated settings on TRACE. Building on this insight, we demonstrate that state-of-the-art models struggle significantly more with semantically contextualized reward hacks compared to syntactically contextualized ones. We further conduct qualitative analyses of model behaviors, as well as ablation studies showing that the ratio of benign to hacked trajectories and analysis cluster sizes substantially impact detection performance. We release the benchmark and evaluation harness to enable the community to expand TRACE and evaluate their models.
Read more →

How Much Progress Has There Been in NVIDIA Datacenter GPUs?

arXiv:2601.20115v1 Announce Type: cross Abstract: Graphics Processing Units (GPUs) are the state-of-the-art architecture for essential tasks, ranging from rendering 2D/3D graphics to accelerating workloads in supercomputing centers and, of course, Artificial Intelligence (AI). As GPUs continue improving to satisfy ever-increasing performance demands, analyzing past and current progress becomes paramount in determining future constraints on scientific research. This is particularly compelling in the AI domain, where rapid technological advancements and fierce global competition have led the United States to recently implement export control regulations limiting international access to advanced AI chips. For this reason, this paper studies technical progress in NVIDIA datacenter GPUs released from the mid-2000s until today. Specifically, we compile a comprehensive dataset of datacenter NVIDIA GPUs comprising several features, ranging from computational performance to release price. Then, we examine trends in main GPU features and estimate progress indicators for per-memory bandwidth, per-dollar, and per-watt increase rates. Our main results identify doubling times of 1.44 and 1.69 years for FP16 and FP32 operations (without accounting for sparsity benefits), while FP64 doubling times range from 2.06 to 3.79 years. Off-chip memory size and bandwidth grew at slower rates than computing performance, doubling every 3.32 to 3.53 years. The release prices of datacenter GPUs have roughly doubled every 5.1 years, while their power consumption has approximately doubled every 16 years. Finally, we quantify the potential implications of current U.S. export control regulations in terms of the potential performance gaps that would result if implementation were assumed to be complete and successful. We find that recently proposed changes to export controls would shrink the potential performance gap from 23.6x to 3.54x.
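The doubling times quoted can be reproduced in spirit by fitting a log-linear trend of a metric against release year and converting the slope to a doubling period; the sketch below shows that conversion with illustrative values, not the paper's dataset.

```python
import numpy as np

def doubling_time_years(release_years, metric_values):
    """Fit log2(metric) = a * year + b by least squares; the doubling time is 1 / a."""
    slope, _ = np.polyfit(np.asarray(release_years, float),
                          np.log2(np.asarray(metric_values, float)), deg=1)
    return 1.0 / slope

# Illustrative values only: a peak-FP16-throughput-like metric over four releases.
years = [2017, 2020, 2022, 2024]
tflops = [112, 312, 990, 2250]
print(f"doubling time ~ {doubling_time_years(years, tflops):.2f} years")
```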
Read more →

Membership Inference Attacks Against Fine-tuned Diffusion Language Models

arXiv:2601.20125v1 Announce Type: cross Abstract: Diffusion Language Models (DLMs) represent a promising alternative to autoregressive language models, using bidirectional masked token prediction. Yet their susceptibility to privacy leakage via Membership Inference Attacks (MIA) remains critically underexplored. This paper presents the first systematic investigation of MIA vulnerabilities in DLMs. Unlike the autoregressive models' single fixed prediction pattern, DLMs' multiple maskable configurations exponentially increase attack opportunities. This ability to probe many independent masks dramatically improves detection chances. To exploit this, we introduce SAMA (Subset-Aggregated Membership Attack), which addresses the sparse signal challenge through robust aggregation. SAMA samples masked subsets across progressive densities and applies sign-based statistics that remain effective despite heavy-tailed noise. Through inverse-weighted aggregation prioritizing sparse masks' cleaner signals, SAMA transforms sparse memorization detection into a robust voting mechanism. Experiments on nine datasets show SAMA achieves 30% relative AUC improvement over the best baseline, with up to 8 times improvement at low false positive rates. These findings reveal significant, previously unknown vulnerabilities in DLMs, necessitating the development of tailored privacy defenses.
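A hedged sketch of the aggregation idea: score many random mask subsets at several masking densities, reduce each density to a robust statistic, and combine densities with weights favoring sparse masks. The per-mask scoring interface and the exact statistic are assumptions, not the paper's formulation.

```python
import numpy as np

def sama_score(masked_loss_fn, token_ids, densities=(0.1, 0.3, 0.5),
               n_masks=32, seed=0):
    """
    masked_loss_fn(token_ids, mask) -> scalar loss of the diffusion LM when
    predicting the masked positions. Returns a membership score in which
    higher values suggest the text was part of the fine-tuning data.
    """
    rng = np.random.default_rng(seed)
    n = len(token_ids)
    per_density = []
    for rho in densities:
        losses = np.array([masked_loss_fn(token_ids, rng.random(n) < rho)
                           for _ in range(n_masks)])
        # robust median-style statistic per density (a stand-in for the paper's
        # sign-based aggregation, which is not fully specified in the abstract)
        per_density.append(-np.median(losses))
    weights = 1.0 / np.asarray(densities)    # inverse weighting: sparse masks count more
    weights /= weights.sum()
    return float(np.dot(weights, per_density))
```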
Read more →

Rewarding Intellectual Humility: Learning When Not to Answer in Large Language Models

arXiv:2601.20126v1 Announce Type: cross Abstract: Large Language Models (LLMs) often produce hallucinated or unverifiable content, undermining their reliability in factual domains. This work investigates Reinforcement Learning with Verifiable Rewards (RLVR) as a training paradigm that explicitly rewards abstention ("I don't know") alongside correctness to promote intellectual humility. We fine-tune and evaluate Granite-3.3-2B-Instruct and Qwen-3-4B-Instruct on the MedMCQA and Hendrycks Math benchmarks using a ternary reward structure (-1, r_abs, 1) under varying abstention reward values. We further study the effect of combining RLVR with supervised fine-tuning strategies that teach abstention prior to reinforcement learning. Our results show that moderate abstention rewards (r_abs roughly -0.25 to 0.3) consistently reduce incorrect responses without severe accuracy degradation on multiple-choice tasks, with larger models exhibiting greater robustness to abstention incentives. On open-ended question answering, we observe limitations due to insufficient exploration, which can be partially mitigated through supervised abstention training. Overall, these findings demonstrate the feasibility and flexibility of verifiable reward design as a practical approach for hallucination mitigation in language models. Reproducible code for our abstention training framework is available at https://github.com/Mystic-Slice/rl-abstention.
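The ternary reward is simple enough to state directly; a minimal sketch assuming an exact-match grader and a fixed abstention phrase (both assumptions on my part, not the paper's verifier):

```python
def rlvr_reward(response: str, gold_answer: str, r_abs: float = 0.0) -> float:
    """Ternary verifiable reward: 1 for a correct answer, r_abs for abstention, -1 otherwise."""
    text = response.strip().lower()
    if "i don't know" in text:      # abstention detection rule is an assumption
        return r_abs
    return 1.0 if text == gold_answer.strip().lower() else -1.0

# Sweeping the abstention reward in roughly the range the abstract studies:
for r_abs in (-0.25, 0.0, 0.3):
    print(r_abs, rlvr_reward("I don't know.", "B", r_abs), rlvr_reward("b", "B", r_abs))
```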
Read more →

BengaliSent140: A Large-Scale Bengali Binary Sentiment Dataset for Hate and Non-Hate Speech Classification

arXiv:2601.20129v1 Announce Type: cross Abstract: Sentiment analysis for the Bengali language has attracted increasing research interest in recent years. However, progress remains constrained by the scarcity of large-scale and diverse annotated datasets. Although several Bengali sentiment and hate speech datasets are publicly available, most are limited in size or confined to a single domain, such as social media comments. Consequently, these resources are often insufficient for training modern deep learning based models, which require large volumes of heterogeneous data to learn robust and generalizable representations. In this work, we introduce BengaliSent140, a large-scale Bengali binary sentiment dataset constructed by consolidating seven existing Bengali text datasets into a unified corpus. To ensure consistency across sources, heterogeneous annotation schemes are systematically harmonized into a binary sentiment formulation with two classes: Not Hate (0) and Hate (1). The resulting dataset comprises 139,792 unique text samples, including 68,548 hate and 71,244 not-hate instances, yielding a relatively balanced class distribution. By integrating data from multiple sources and domains, BengaliSent140 offers broader linguistic and contextual coverage than existing Bengali sentiment datasets and provides a strong foundation for training and benchmarking deep learning models. Baseline experimental results are also reported to demonstrate the practical usability of the dataset. The dataset is publicly available at https://www.kaggle.com/datasets/akifislam/bengalisent140/
Read more →

Taxonomy of the Retrieval System Framework: Pitfalls and Paradigms

arXiv:2601.20131v1 Announce Type: cross Abstract: Designing an embedding retrieval system requires navigating a complex design space of conflicting trade-offs between efficiency and effectiveness. This work structures these decisions as a vertical traversal of the system design stack. We begin with the Representation Layer by examining how loss functions and architectures, specifically Bi-encoders and Cross-encoders, define semantic relevance and geometric projection. Next, we analyze the Granularity Layer and evaluate how segmentation strategies like Atomic and Hierarchical chunking mitigate information bottlenecks in long-context documents. Moving to the Orchestration Layer, we discuss methods that transcend the single-vector paradigm, including hierarchical retrieval, agentic decomposition, and multi-stage reranking pipelines to resolve capacity limitations. Finally, we address the Robustness Layer by identifying architectural mitigations for domain generalization failures, lexical blind spots, and the silent degradation of retrieval quality due to temporal drift. By categorizing these limitations and design choices, we provide a comprehensive framework for practitioners to optimize the efficiency-effectiveness frontier in modern neural search systems.
Read more →

Large language models accurately predict public perceptions of support for climate action worldwide

arXiv:2601.20141v1 Announce Type: cross Abstract: Although most people support climate action, widespread underestimation of others' support stalls individual and systemic changes. In this preregistered experiment, we test whether large language models (LLMs) can reliably predict these perception gaps worldwide. Using country-level indicators and public opinion data from 125 countries, we benchmark four state-of-the-art LLMs against Gallup World Poll 2021/22 data and statistical regressions. LLMs, particularly Claude, accurately capture public perceptions of others' willingness to contribute financially to climate action (MAE approximately 5 p.p.; r = .77), comparable to statistical models, though performance declines in less digitally connected, lower-GDP countries. Controlled tests show that LLMs capture the key psychological process - social projection with a systematic downward bias - and rely on structured reasoning rather than memorized values. Overall, LLMs provide a rapid tool for assessing perception gaps in climate action, serving as an alternative to costly surveys in resource-rich countries and as a complement in underrepresented populations.
Read more →

What's the plan? Metrics for implicit planning in LLMs and their application to rhyme generation and question answering

arXiv:2601.20164v1 Announce Type: cross Abstract: Prior work suggests that language models, while trained on next-token prediction, show implicit planning behavior: they may select the next token in preparation for a predicted future token, such as a likely rhyming word, as supported by a prior qualitative study of Claude 3.5 Haiku using a cross-layer transcoder. We propose much simpler techniques for assessing implicit planning in language models. With case studies on rhyming poetry generation and question answering, we demonstrate that our methodology easily scales to many models. Across models, we find that the generated rhyme (e.g. "-ight") or answer to a question ("whale") can be manipulated by steering at the end of the preceding line with a vector, affecting the generation of intermediate tokens leading up to the rhyme or answer word. We show that implicit planning is a universal mechanism, present in smaller models than previously thought, starting from 1B parameters. Our methodology offers a widely applicable direct way to study implicit planning abilities of LLMs. More broadly, understanding planning abilities of language models can inform decisions in AI safety and control.
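A minimal sketch of the steering intervention described: add a direction vector to the residual stream at the final position via a forward hook and observe whether the generated rhyme or answer changes. The layer choice, scale, and Llama-style module path are assumptions.

```python
import torch

def add_steering_hook(model, layer_idx, steering_vector, scale=4.0):
    """
    Adds `steering_vector` to the hidden state at the last position of one
    decoder layer. Assumes a Llama-style layout (model.model.layers[i]); in a
    real experiment the hook would be restricted to the token position that
    ends the preceding line rather than firing on every forward pass.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, -1, :] += scale * steering_vector.to(hidden.dtype)
    return model.model.layers[layer_idx].register_forward_hook(hook)

# Illustrative usage, with `steer` derived e.g. from a difference of mean
# activations between "-ight"-rhyming and non-rhyming continuations:
# handle = add_steering_hook(model, layer_idx=20, steering_vector=steer)
# out = model.generate(**inputs, max_new_tokens=30)
# handle.remove()
```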
Read more →

NeuraLSP: An Efficient and Rigorous Neural Left Singular Subspace Preconditioner for Conjugate Gradient Methods

arXiv:2601.20174v1 Announce Type: cross Abstract: Numerical techniques for solving partial differential equations (PDEs) are integral for many fields across science and engineering. Such techniques usually involve solving large, sparse linear systems, where preconditioning methods are critical. In recent years, neural methods, particularly graph neural networks (GNNs), have demonstrated their potential through accelerated convergence. Nonetheless, to extract connective structures, existing techniques aggregate discretized system matrices into graphs, and suffer from rank inflation and a suboptimal convergence rate. In this paper, we articulate NeuraLSP, a novel neural preconditioner combined with a novel loss metric that leverages the left singular subspace of the system matrix's near-nullspace vectors. By compressing spectral information into a fixed low-rank operator, our method exhibits both theoretical guarantees and empirical robustness to rank inflation, affording up to a 53% speedup. Besides the theoretical guarantees for our newly-formulated loss function, our comprehensive experimental results across diverse families of PDEs also substantiate the aforementioned theoretical advances.
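For context, a learned preconditioner of this kind slots into the standard preconditioned conjugate gradient loop; below is that generic loop with the (possibly neural, low-rank) preconditioner abstracted as a callable, not the paper's specific operator.

```python
import numpy as np

def preconditioned_cg(A, b, apply_M_inv, x0=None, tol=1e-8, max_iter=500):
    """
    Solve A x = b with PCG; `apply_M_inv(r)` applies the preconditioner to a
    residual. A can be a dense/sparse matrix or any object supporting @.
    """
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    z = apply_M_inv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = apply_M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```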
Read more →

Causal-Driven Feature Evaluation for Cross-Domain Image Classification

arXiv:2601.20176v1 Announce Type: cross Abstract: Out-of-distribution (OOD) generalization remains a fundamental challenge in real-world classification, where test distributions often differ substantially from training data. Most existing approaches pursue domain-invariant representations, implicitly assuming that invariance implies reliability. However, features that are invariant across domains are not necessarily causally effective for prediction. In this work, we revisit OOD classification from a causal perspective and propose to evaluate learned representations based on their necessity and sufficiency under distribution shift. We introduce an explicit segment-level framework that directly measures causal effectiveness across domains, providing a more faithful criterion than invariance alone. Experiments on multi-domain benchmarks demonstrate consistent improvements in OOD performance, particularly under challenging domain shifts, highlighting the value of causal evaluation for robust generalization.
Read more →

Meta-Cognitive Reinforcement Learning with Self-Doubt and Recovery

arXiv:2601.20193v1 Announce Type: cross Abstract: Robust reinforcement learning methods typically focus on suppressing unreliable experiences or corrupted rewards, but they lack the ability to reason about the reliability of their own learning process. As a result, such methods often either overreact to noise by becoming overly conservative or fail catastrophically when uncertainty accumulates. In this work, we propose a meta-cognitive reinforcement learning framework that enables an agent to assess, regulate, and recover its learning behavior based on internally estimated reliability signals. The proposed method introduces a meta-trust variable driven by Value Prediction Error Stability (VPES), which modulates learning dynamics via fail-safe regulation and gradual trust recovery. Experiments on continuous-control benchmarks with reward corruption demonstrate that recovery-enabled meta-cognitive control achieves higher average returns and significantly reduces late-stage training failures compared to strong robustness baselines.
Read more →

ProFlow: Zero-Shot Physics-Consistent Sampling via Proximal Flow Guidance

arXiv:2601.20227v1 Announce Type: cross Abstract: Inferring physical fields from sparse observations while strictly satisfying partial differential equations (PDEs) is a fundamental challenge in computational physics. Recently, deep generative models offer powerful data-driven priors for such inverse problems, yet existing methods struggle to enforce hard physical constraints without costly retraining or disrupting the learned generative prior. Consequently, there is a critical need for a sampling mechanism that can reconcile strict physical consistency and observational fidelity with the statistical structure of the pre-trained prior. To this end, we present ProFlow, a proximal guidance framework for zero-shot physics-consistent sampling, defined as inferring solutions from sparse observations using a fixed generative prior without task-specific retraining. The algorithm employs a rigorous two-step scheme that alternates between: (i) a terminal optimization step, which projects the flow prediction onto the intersection of the physically and observationally consistent sets via proximal minimization; and (ii) an interpolation step, which maps the refined state back to the generative trajectory to maintain consistency with the learned flow probability path. This procedure admits a Bayesian interpretation as a sequence of local maximum a posteriori (MAP) updates. Comprehensive benchmarks on Poisson, Helmholtz, Darcy, and viscous Burgers' equations demonstrate that ProFlow achieves superior physical and observational consistency, as well as more accurate distributional statistics, compared to state-of-the-art diffusion- and flow-based baselines.
Read more →

Certificate-Guided Pruning for Stochastic Lipschitz Optimization

arXiv:2601.20231v1 Announce Type: cross Abstract: We study black-box optimization of Lipschitz functions under noisy evaluations. Existing adaptive discretization methods implicitly avoid suboptimal regions but do not provide explicit certificates of optimality or measurable progress guarantees. We introduce Certificate-Guided Pruning (CGP), which maintains an explicit active set $A_t$ of potentially optimal points via confidence-adjusted Lipschitz envelopes. Any point outside $A_t$ is certifiably suboptimal with high probability, and under a margin condition with near-optimality dimension $\alpha$, we prove $\mathrm{Vol}(A_t)$ shrinks at a controlled rate, yielding sample complexity $\tilde{O}(\varepsilon^{-(2+\alpha)})$. We develop three extensions: CGP-Adaptive learns $L$ online with $O(\log T)$ overhead; CGP-TR scales to $d > 50$ via trust regions with local certificates; and CGP-Hybrid switches to GP refinement when local smoothness is detected. Experiments on 12 benchmarks ($d \in [2, 100]$) show CGP variants match or exceed strong baselines while providing principled stopping criteria via certificate volume.
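A sketch of the pruning certificate over a finite candidate set (maximization convention): each point carries a confidence interval on its noisy mean, Lipschitz envelopes extend those intervals, and any point whose optimistic envelope falls below the best pessimistic value is certifiably suboptimal. Confidence widths and distance choices are illustrative assumptions.

```python
import numpy as np

def prune_active_set(points, mu_hat, conf, L):
    """
    points: (n, d) candidate locations; mu_hat, conf: (n,) noisy mean estimates
    and confidence half-widths; L: Lipschitz constant (or an online estimate).
    Returns a boolean mask of points that remain potentially optimal.
    """
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # optimistic value of each point, given every point's upper confidence bound
    upper = np.min(mu_hat[None, :] + conf[None, :] + L * dists, axis=1)
    best_lower = np.max(mu_hat - conf)   # a value certifiably achieved somewhere
    return upper >= best_lower           # False entries are certifiably suboptimal
```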
Read more →

MALLOC: Benchmarking the Memory-aware Long Sequence Compression for Large Sequential Recommendation

arXiv:2601.20234v1 Announce Type: cross Abstract: The scaling law, which indicates that model performance improves with increasing dataset and model capacity, has fueled a growing trend in expanding recommendation models in both industry and academia. However, the advent of large-scale recommenders also brings significantly higher computational costs, particularly under the long-sequence dependencies inherent in the user intent of recommendation systems. Current approaches often rely on pre-storing the intermediate states of the past behavior for each user, thereby reducing the quadratic re-computation cost for the following requests. Despite their effectiveness, these methods often treat memory merely as a medium for acceleration, without adequately considering the space overhead it introduces. This presents a critical challenge in real-world recommendation systems with billions of users, each of whom might initiate thousands of interactions and require massive memory for state storage. Fortunately, there have been several memory management strategies examined for compression in LLM, while most have not been evaluated on the recommendation task. To mitigate this gap, we introduce MALLOC, a comprehensive benchmark for memory-aware long sequence compression. MALLOC presents a comprehensive investigation and systematic classification of memory management techniques applicable to large sequential recommendations. These techniques are integrated into state-of-the-art recommenders, enabling a reproducible and accessible evaluation platform. Through extensive experiments across accuracy, efficiency, and complexity, we demonstrate the holistic reliability of MALLOC in advancing large-scale recommendation. Code is available at https://anonymous.4open.science/r/MALLOC.
Read more →

How AI Impacts Skill Formation

arXiv:2601.20245v1 Announce Type: cross Abstract: AI assistance produces significant productivity gains across professional domains, particularly for novice workers. Yet how this assistance affects the development of skills required to effectively supervise AI remains unclear. Novice workers who rely heavily on AI to complete unfamiliar tasks may compromise their own skill acquisition in the process. We conduct randomized experiments to study how developers gained mastery of a new asynchronous programming library with and without the assistance of AI. We find that AI use impairs conceptual understanding, code reading, and debugging abilities, without delivering significant efficiency gains on average. Participants who fully delegated coding tasks showed some productivity improvements, but at the cost of learning the library. We identify six distinct AI interaction patterns, three of which involve cognitive engagement and preserve learning outcomes even when participants receive AI assistance. Our findings suggest that AI-enhanced productivity is not a shortcut to competence and AI assistance should be carefully adopted into workflows to preserve skill formation -- particularly in safety-critical domains.
Read more →

Order-Optimal Sample Complexity of Rectified Flows

arXiv:2601.20250v1 Announce Type: cross Abstract: Recently, flow-based generative models have shown superior efficiency compared to diffusion models. In this paper, we study rectified flow models, which constrain transport trajectories to be linear from the base distribution to the data distribution. This structural restriction greatly accelerates sampling, often enabling high-quality generation with a single Euler step. Under standard assumptions on the neural network classes used to parameterize the velocity field and data distribution, we prove that rectified flows achieve sample complexity $\tilde{O}(\varepsilon^{-2})$. This improves on the best known $O(\varepsilon^{-4})$ bounds for flow matching models and matches the optimal rate for mean estimation. Our analysis exploits the particular structure of rectified flows: because the model is trained with a squared loss along linear paths, the associated hypothesis class admits a sharply controlled localized Rademacher complexity. This yields the improved, order-optimal sample complexity and provides a theoretical explanation for the strong empirical performance of rectified flow models.
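The structural restriction is easy to see in code: velocities are regressed toward straight-line targets x1 - x0, and sampling can collapse to a single Euler step from the base distribution. A minimal sketch under the usual flow-matching conventions (shapes and parameterization are assumptions):

```python
import torch

def rectified_flow_loss(velocity_net, x0, x1, t):
    """Squared loss along linear paths: the regression target is the constant velocity x1 - x0."""
    xt = (1 - t) * x0 + t * x1                        # point on the straight path, t of shape (B, 1)
    return ((velocity_net(xt, t) - (x1 - x0)) ** 2).mean()

@torch.no_grad()
def sample_one_step(velocity_net, x0):
    """Single Euler step over the unit interval from a base sample x0."""
    t = torch.zeros(x0.shape[0], 1, device=x0.device)
    return x0 + velocity_net(x0, t)
```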
Read more →

Automated Benchmark Generation from Domain Guidelines Informed by Bloom's Taxonomy

arXiv:2601.20253v1 Announce Type: cross Abstract: Open-ended question answering (QA) evaluates a model's ability to perform contextualized reasoning beyond factual recall. This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment, while most existing LLM benchmarks depend on pre-existing human exam datasets that are often unavailable in such settings. We introduce a framework for automated benchmark generation from expert-authored guidelines informed by Bloom's Taxonomy. It converts expert practices into implicit violation-based scenarios and expands them into auto-graded multiple-choice questions (MCQs) and multi-turn dialogues across four cognitive levels, enabling deterministic, reproducible, and scalable evaluation. Applied to three applied domains: teaching, dietetics, and caregiving, we find differences between model and human-like reasoning: LLMs sometimes perform relatively better on higher-order reasoning (Analyze) but fail more frequently on lower-level items (Remember). We produce large-scale, psychometrically informed benchmarks that surface these non-intuitive model behaviors and enable evaluation of contextualized reasoning in real-world settings.
Read more →

Robust SDE Parameter Estimation Under Missing Time Information Setting

arXiv:2601.20268v1 Announce Type: cross Abstract: Recent advances in stochastic differential equations (SDEs) have enabled robust modeling of real-world dynamical processes across diverse domains, such as finance, health, and systems biology. However, parameter estimation for SDEs typically relies on accurately timestamped observational sequences. When temporal ordering information is corrupted, missing, or deliberately hidden (e.g., for privacy), existing estimation methods often fail. In this paper, we investigate the conditions under which temporal order can be recovered and introduce a novel framework that simultaneously reconstructs temporal information and estimates SDE parameters. Our approach exploits asymmetries between forward and backward processes, deriving a score-matching criterion to infer the correct temporal order between pairs of observations. We then recover the total order via a sorting procedure and estimate SDE parameters from the reconstructed sequence using maximum likelihood. Finally, we conduct extensive experiments on synthetic and real-world datasets to demonstrate the effectiveness of our method, extending parameter estimation to settings with missing temporal order and broadening applicability in sensitive domains.
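
The paper derives a score-matching criterion for inferring the direction of time between observation pairs; the toy sketch below substitutes a much simpler stand-in for that criterion: for a geometric Brownian motion with an assumed positive drift, each pair of observations is scored by the Gaussian transition likelihood of its log-return in both directions, and a sorting step recovers an approximate order from the pairwise votes. The SDE choice, its parameters, and the voting-based sort are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, dt = 0.5, 0.2, 0.1   # assumed GBM parameters: dX = mu * X dt + sigma * X dW

# Simulate a GBM path, then hide the time order by shuffling the observations.
n = 200
log_x = np.cumsum((mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n))
shuffled = np.exp(log_x)[rng.permutation(n)]

def pair_loglik(xa, xb):
    """Log-likelihood that xb follows xa one step later under the GBM transition density."""
    mean = np.log(xa) + (mu - 0.5 * sigma ** 2) * dt
    var = sigma ** 2 * dt
    return -0.5 * ((np.log(xb) - mean) ** 2 / var + np.log(2 * np.pi * var))

# Crude order recovery: for every pair, vote for the direction the transition
# density prefers, then sort by how often each point is judged "earlier".
wins = np.zeros(n)
for i in range(n):
    for j in range(n):
        if i != j and pair_loglik(shuffled[i], shuffled[j]) > pair_loglik(shuffled[j], shuffled[i]):
            wins[i] += 1
recovered = shuffled[np.argsort(-wins)]   # approximate temporal order, earliest first
```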
Read more →

Eliciting Least-to-Most Reasoning for Phishing URL Detection

arXiv:2601.20270v1 Announce Type: cross Abstract: Phishing continues to be one of the most prevalent attack vectors, making accurate classification of phishing URLs essential. Recently, large language models (LLMs) have demonstrated promising results in phishing URL detection. However, their reasoning capabilities that enabled such performance remain underexplored. To this end, in this paper, we propose a Least-to-Most prompting framework for phishing URL detection. In particular, we introduce an "answer sensitivity" mechanism that guides Least-to-Most's iterative approach to enhance reasoning and yield higher prediction accuracy. We evaluate our framework using three URL datasets and four state-of-the-art LLMs, comparing against a one-shot approach and a supervised model. We demonstrate that our framework outperforms the one-shot baseline while achieving performance comparable to that of the supervised model, despite requiring significantly less training data. Furthermore, our in-depth analysis highlights how the iterative reasoning enabled by Least-to-Most, and reinforced by our answer sensitivity mechanism, drives these performance gains. Overall, we show that this simple yet powerful prompting strategy consistently outperforms both one-shot and supervised approaches, despite requiring minimal training or few-shot guidance. Our experimental setup can be found in our Github repository github.sydney.edu.au/htri0928/least-to-most-phishing-detection.
Read more →

Beyond the Needle's Illusion: Decoupled Evaluation of Evidence Access and Use under Semantic Interference at 326M-Token Scale

arXiv:2601.20276v1 Announce Type: cross Abstract: Long-context LLM agents must access the right evidence from large environments and use it faithfully. However, the popular Needle-in-a-Haystack (NIAH) evaluation mostly measures benign span localization. The needle is near-unique, and the haystack is largely irrelevant. We introduce EverMemBench-S (EMB-S), an adversarial NIAH-style benchmark built on a 326M-token MemoryBank. While the full MemoryBank spans 326M tokens for retrieval-based (RAG) evaluation, we evaluate native long-context models only at scales that fit within each model's context window (up to 1M tokens in this work) to ensure a fair comparison. EMB-S pairs queries with collision-tested near-miss hard negatives and gold evidence sets spanning one or more documents, validated via human screening and LLM verification. We also propose a decoupled diagnostic protocol that reports evidence access (document-ID localization) separately from end-to-end QA quality under full-context prompting. This enables consistent diagnosis for both native long-context prompting and retrieval pipelines. Across a reference-corpus ladder from domain-isolated 64K contexts to a globally shared 326M-token environment, we observe a clear reality gap. Systems that saturate benign NIAH degrade sharply in evidence access under semantic interference. These results indicate that semantic discrimination, not context length alone, is the dominant bottleneck for long-context memory at scale.
Read more →

The Forecast After the Forecast: A Post-Processing Shift in Time Series

arXiv:2601.20280v1 Announce Type: cross Abstract: Time series forecasting has long been dominated by advances in model architecture, with recent progress driven by deep learning and hybrid statistical techniques. However, as forecasting models approach diminishing returns in accuracy, a critical yet underexplored opportunity emerges: the strategic use of post-processing. In this paper, we address the last-mile gap in time-series forecasting, which is to improve accuracy and uncertainty without retraining or modifying a deployed backbone. We propose $\delta$-Adapter, a lightweight, architecture-agnostic way to boost deployed time series forecasters without retraining. $\delta$-Adapter learns tiny, bounded modules at two interfaces: input nudging (soft edits to covariates) and output residual correction. We provide local descent guarantees, $O(\delta)$ drift bounds, and compositional stability for combined adapters. Meanwhile, it can act as a feature selector by learning a sparse, horizon-aware mask over inputs to select important features, thereby improving interpretability. In addition, it can also be used as a distribution calibrator to measure uncertainty. Thus, we introduce a Quantile Calibrator and a Conformal Corrector that together deliver calibrated, personalized intervals with finite-sample coverage. Our experiments across diverse backbones and datasets show that $\delta$-Adapter improves accuracy and calibration with negligible compute and no interface changes.
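
A minimal sketch of the output-side interface described above: a frozen, already-deployed forecaster is corrected by a tiny residual module whose adjustment is bounded by a scale delta, so the backbone remains untouched. The tanh-based bounding, the toy backbone, and the synthetic data are assumptions for illustration; the paper's $\delta$-Adapter also includes input nudging, sparsity, and calibration components not shown here.

```python
import torch
import torch.nn as nn

class DeltaResidualAdapter(nn.Module):
    """Bounded residual correction on top of a frozen forecaster's output."""
    def __init__(self, horizon, hidden=32, delta=0.1):
        super().__init__()
        self.delta = delta
        self.net = nn.Sequential(nn.Linear(horizon, hidden), nn.ReLU(), nn.Linear(hidden, horizon))

    def forward(self, y_hat):
        # The correction is squashed into [-delta, +delta] so the backbone stays dominant.
        return y_hat + self.delta * torch.tanh(self.net(y_hat))

# Frozen "deployed" backbone: here just a placeholder linear forecaster.
backbone = nn.Linear(48, 12)
for p in backbone.parameters():
    p.requires_grad_(False)

adapter = DeltaResidualAdapter(horizon=12)
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)

for step in range(500):
    x = torch.randn(64, 48)                      # toy input windows
    y = x[:, -12:] + 0.1 * torch.randn(64, 12)   # toy targets
    with torch.no_grad():
        y_hat = backbone(x)                      # deployed forecast, left untouched
    loss = ((adapter(y_hat) - y) ** 2).mean()    # only the adapter is trained
    opt.zero_grad(); loss.backward(); opt.step()
```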
Read more →

Cheap2Rich: A Multi-Fidelity Framework for Data Assimilation and System Identification of Multiscale Physics -- Rotating Detonation Engines

arXiv:2601.20295v1 Announce Type: cross Abstract: Bridging the sim2real gap between computationally inexpensive models and complex physical systems remains a central challenge in machine learning applications to engineering problems, particularly in multi-scale settings where reduced-order models typically capture only dominant dynamics. In this work, we present Cheap2Rich, a multi-scale data assimilation framework that reconstructs high-fidelity state spaces from sparse sensor histories by combining a fast low-fidelity prior with learned, interpretable discrepancy corrections. We demonstrate the performance on rotating detonation engines (RDEs), a challenging class of systems that couple detonation-front propagation with injector-driven unsteadiness, mixing, and stiff chemistry across disparate scales. Our approach successfully reconstructs high-fidelity RDE states from sparse measurements while isolating physically meaningful discrepancy dynamics associated with injector-driven effects. The results highlight a general multi-fidelity framework for data assimilation and system identification in complex multi-scale systems, enabling rapid design exploration and real-time monitoring and control while providing interpretable discrepancy dynamics. Code for this project is available at: github.com/kro0l1k/Cheap2Rich.
Read more →

Truthfulness Despite Weak Supervision: Evaluating and Training LLMs Using Peer Prediction

arXiv:2601.20299v1 Announce Type: cross Abstract: The evaluation and post-training of large language models (LLMs) rely on supervision, but strong supervision for difficult tasks is often unavailable, especially when evaluating frontier models. In such cases, models have been shown to exploit evaluations built on such imperfect supervision, leading to deceptive results. However, a wealth of mechanism-design research, so far underutilized in LLM work, focuses on game-theoretic incentive compatibility, i.e., eliciting honest and informative answers with weak supervision. Drawing from this literature, we introduce the peer prediction method for model evaluation and post-training. It rewards honest and informative answers over deceptive and uninformative ones, using a metric based on mutual predictability and without requiring ground truth labels. We demonstrate the method's effectiveness and resistance to deception, with both theoretical guarantees and empirical validation on models with up to 405B parameters. We show that training an 8B model with peer prediction-based reward recovers most of the drop in truthfulness due to prior malicious finetuning, even when the reward is produced by a 0.135B language model with no finetuning. On the evaluation front, in contrast to LLM-as-a-Judge which requires strong and trusted judges, we discover an inverse scaling property in peer prediction, where, surprisingly, resistance to deception is strengthened as the capability gap between the experts and participants widens, enabling reliable evaluation of strong models with weak supervision. In particular, LLM-as-a-Judge becomes worse than random guessing when facing deceptive models 5-20x the judge's size, while peer prediction thrives when such gaps are large, including in cases with over 100x size difference.
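
To make the mutual-predictability idea concrete, the sketch below scores an answer by how much conditioning on it raises a small reference LM's log-likelihood of a peer's answer to the same question. The use of GPT-2 as the reference model, the prompt templates, and this particular pointwise form of the reward are assumptions for illustration, not the paper's exact metric.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def loglik(prefix, continuation):
    """Total log-probability of `continuation` given `prefix` under the reference LM.
    (Tokenizing the two pieces separately is a simplification that is fine for a sketch.)"""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(-1)
    start = prefix_ids.shape[1]
    return sum(logprobs[0, pos - 1, ids[0, pos]].item() for pos in range(start, ids.shape[1]))

def peer_prediction_reward(question, answer, peer_answer):
    # Mutual-predictability style score: how much does `answer` help predict the peer's answer?
    with_answer = loglik(f"Q: {question}\nA: {answer}\nAnother answer: ", peer_answer)
    without_answer = loglik(f"Q: {question}\nAnother answer: ", peer_answer)
    return with_answer - without_answer

q = "What gas do plants absorb during photosynthesis?"
print(peer_prediction_reward(q, "Carbon dioxide.", "They take in CO2 from the air."))
print(peer_prediction_reward(q, "Plants absorb pure oxygen.", "They take in CO2 from the air."))
```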
Read more →

Towards Compact and Robust DNNs via Compression-aware Sharpness Minimization

arXiv:2601.20301v1 Announce Type: cross Abstract: Sharpness-Aware Minimization (SAM) has recently emerged as an effective technique for improving DNN robustness to input variations. However, its interplay with the compactness requirements of on-device DNN deployments remains less explored. Simply pruning a SAM-trained model can undermine robustness, since flatness in the continuous parameter space does not necessarily translate to robustness under the discrete structural changes induced by pruning. Conversely, applying SAM after pruning may be fundamentally constrained by architectural limitations imposed by an early, robustness-agnostic pruning pattern. To address this gap, we propose Compression-aware ShArpness Minimization (C-SAM), a framework that shifts sharpness-aware learning from parameter perturbations to mask perturbations. By explicitly perturbing pruning masks during training, C-SAM promotes a flatter loss landscape with respect to model structure, enabling the discovery of pruning patterns that simultaneously optimize model compactness and robustness to input variations. Extensive experiments on CelebA-HQ, Flowers-102, and CIFAR-10-C across ResNet-18, GoogLeNet, and MobileNet-V2 show that C-SAM consistently achieves higher certified robustness than strong baselines, with improvements of up to 42%, while maintaining task accuracy comparable to the corresponding unpruned models.
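
A minimal sketch of shifting sharpness-aware training from parameter perturbations to mask perturbations: at each step a few entries of a pruning mask are flipped, and the worst perturbed loss drives the weight update. The flip-based neighborhood, the toy model, and the fact that the mask itself is held fixed (the paper additionally searches over pruning patterns) are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
weight = model[0].weight
mask = (torch.rand_like(weight) > 0.5).float()     # current pruning mask for the first layer

def masked_loss(x, y, m):
    hidden = torch.relu(F.linear(x, weight * m, model[0].bias))   # apply the (perturbed) mask
    return F.cross_entropy(model[2](hidden), y)

for step in range(200):
    x, y = torch.randn(64, 20), torch.randint(0, 2, (64,))
    # Mask perturbation: flip a few random mask entries and keep the worst-case neighbor's loss,
    # pushing the weights toward regions that are flat with respect to structural changes.
    worst_loss = masked_loss(x, y, mask)
    for _ in range(4):
        flip = (torch.rand_like(mask) < 0.02).float()
        neighbor = (mask + flip) % 2
        worst_loss = torch.maximum(worst_loss, masked_loss(x, y, neighbor))
    opt.zero_grad(); worst_loss.backward(); opt.step()
```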
Read more →

Physically Guided Visual Mass Estimation from a Single RGB Image

arXiv:2601.20303v1 Announce Type: cross Abstract: Estimating object mass from visual input is challenging because mass depends jointly on geometric volume and material-dependent density, neither of which is directly observable from RGB appearance. Consequently, mass prediction from pixels is ill-posed and therefore benefits from physically meaningful representations to constrain the space of plausible solutions. We propose a physically structured framework for single-image mass estimation that addresses this ambiguity by aligning visual cues with the physical factors governing mass. From a single RGB image, we recover object-centric three-dimensional geometry via monocular depth estimation to inform volume and extract coarse material semantics using a vision-language model to guide density-related reasoning. These geometry, semantic, and appearance representations are fused through an instance-adaptive gating mechanism, and two physically guided latent factors (volume- and density-related) are predicted through separate regression heads under mass-only supervision. Experiments on image2mass and ABO-500 show that the proposed method consistently outperforms state-of-the-art methods.
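
The factorization into volume- and density-related latent factors under mass-only supervision can be sketched as two regression heads trained only through their sum in log space, since log mass = log volume + log density. The feature dimension and the synthetic inputs standing in for the fused geometry/semantic/appearance features are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TwoFactorMassHead(nn.Module):
    """Predict log-volume and log-density separately; supervise only their sum (log-mass)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.volume_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.density_head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, fused_features):
        log_volume = self.volume_head(fused_features)
        log_density = self.density_head(fused_features)
        return log_volume + log_density, log_volume, log_density   # log-mass and its two factors

model = TwoFactorMassHead()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(300):
    feats = torch.randn(32, 256)       # stands in for fused geometry/semantic/appearance features
    log_mass_gt = torch.randn(32, 1)   # stands in for the log of the ground-truth mass
    log_mass_pred, _, _ = model(feats)
    loss = ((log_mass_pred - log_mass_gt) ** 2).mean()   # mass-only supervision
    opt.zero_grad(); loss.backward(); opt.step()
```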
Read more →

Structure-constrained Language-informed Diffusion Model for Unpaired Low-dose Computed Tomography Angiography Reconstruction

arXiv:2601.20304v1 Announce Type: cross Abstract: The application of iodinated contrast media (ICM) improves the sensitivity and specificity of computed tomography (CT) for a wide range of clinical indications. However, overdose of ICM can cause problems such as kidney damage and life-threatening allergic reactions. Deep learning methods can generate CT images of normal-dose ICM from low-dose ICM, reducing the required dose while maintaining diagnostic power. However, existing methods struggle to achieve accurate enhancement with incompletely paired images, mainly because of the limited ability of the model to recognize specific structures. To overcome this limitation, we propose a Structure-constrained Language-informed Diffusion Model (SLDM), a unified medical generation model that integrates structural synergy and spatial intelligence. First, the structural prior information of the image is effectively extracted to constrain the model inference process, thus ensuring structural consistency in the enhancement process. Subsequently, a semantic supervision strategy with spatial intelligence is introduced, which integrates the functions of visual perception and spatial reasoning, thus prompting the model to achieve accurate enhancement. Finally, the subtraction angiography enhancement module is applied, which improves the contrast of the ICM region to a suitable interval for observation. Qualitative analysis of visual comparison and quantitative results of several metrics demonstrate the effectiveness of our method in angiographic reconstruction for low-dose contrast medium CT angiography.
Read more →

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

arXiv:2601.20309v1 Announce Type: cross Abstract: Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. We present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with tightly coupled GPU-CPU architecture via NVLink-C2C. SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and DuplexKV, an optimized rotation engine that enables full-duplex transfer over NVLink-C2C. Evaluations on GH200 using various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput compared to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlocks the full potential of Superchips for responsive LLM serving.
Read more →

DiagLink: A Dual-User Diagnostic Assistance System by Synergizing Experts with LLMs and Knowledge Graphs

arXiv:2601.20311v1 Announce Type: cross Abstract: The global shortage and uneven distribution of medical expertise continue to hinder equitable access to accurate diagnostic care. While existing intelligent diagnostic systems have shown promise, most struggle with dual-user interaction and dynamic knowledge integration -- limiting their real-world applicability. In this study, we present DiagLink, a dual-user diagnostic assistance system that synergizes large language models (LLMs), knowledge graphs (KGs), and medical experts to support both patients and physicians. DiagLink uses guided dialogues to elicit patient histories, leverages LLMs and KGs for collaborative reasoning, and incorporates physician oversight for continuous knowledge validation and evolution. The system provides a role-adaptive interface, dynamically visualized history, and unified multi-source evidence to improve both trust and usability. We evaluate DiagLink through a user study, use cases, and expert interviews, demonstrating its effectiveness in improving user satisfaction and diagnostic efficiency, while offering insights for the design of future AI-assisted diagnostic systems.
Read more →

Beyond Speedup -- Utilizing KV Cache for Sampling and Reasoning

arXiv:2601.20326v1 Announce Type: cross Abstract: KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Despite being weaker than dedicated embeddings, KV-derived representations are shown to be sufficient for two key applications: \textbf{(i) Chain-of-Embedding}, where they achieve competitive or superior performance on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct; and \textbf{(ii) Fast/Slow Thinking Switching}, where they enable adaptive reasoning on Qwen3-8B and DeepSeek-R1-Distil-Qwen-14B, reducing token generation by up to $5.7\times$ with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference. Code: https://github.com/cmd2001/ICLR2026_KV-Embedding.
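
A minimal sketch of reusing the KV cache as a lightweight representation with the Hugging Face transformers API: the per-layer key/value tensors produced during a normal forward pass are pooled into a single vector, with no extra hidden-state storage. GPT-2 as the backbone, last-layer selection, and mean pooling are illustrative assumptions; note that newer transformers versions return a Cache object, which still supports per-layer indexing.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def kv_embedding(text):
    """Pool the last layer's key/value cache into a single vector, reusing what decoding already computes."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, use_cache=True)
    keys, values = out.past_key_values[-1]        # last layer: [batch, heads, seq, head_dim]
    kv = torch.cat([keys, values], dim=-1)        # combine keys and values
    return kv.mean(dim=(1, 2)).squeeze(0)         # average over heads and positions

a = kv_embedding("The capital of France is Paris.")
b = kv_embedding("Paris is the capital city of France.")
c = kv_embedding("Photosynthesis converts light into chemical energy.")
cos = torch.nn.functional.cosine_similarity
print(cos(a, b, dim=0).item(), cos(a, c, dim=0).item())   # compare paraphrase vs. unrelated text
```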
Read more →

Demonstration-Free Robotic Control via LLM Agents

arXiv:2601.20334v1 Announce Type: cross Abstract: Robotic manipulation has increasingly adopted vision-language-action (VLA) models, which achieve strong performance but typically require task-specific demonstrations and fine-tuning, and often generalize poorly under domain shift. We investigate whether general-purpose large language model (LLM) agent frameworks, originally developed for software engineering, can serve as an alternative control paradigm for embodied manipulation. We introduce FAEA (Frontier Agent as Embodied Agent), which applies an LLM agent framework directly to embodied manipulation without modification. Using the same iterative reasoning that enables software agents to debug code, FAEA enables embodied agents to reason through manipulation strategies. We evaluate an unmodified frontier agent, Claude Agent SDK, across the LIBERO, ManiSkill3, and MetaWorld benchmarks. With privileged environment state access, FAEA achieves success rates of 84.9%, 85.7%, and 96%, respectively. This level of task success approaches that of VLA models trained with less than 100 demonstrations per task, without requiring demonstrations or fine-tuning. With one round of human feedback as an optional optimization, performance increases to 88.2% on LIBERO. This demonstration-free capability has immediate practical value: FAEA can autonomously explore novel scenarios in simulation and generate successful trajectories for training data augmentation in embodied learning. Our results indicate that general-purpose agents are sufficient for a class of manipulation tasks dominated by deliberative, task-level planning. This opens a path for robotics systems to leverage actively maintained agent infrastructure and benefit directly from ongoing advances in frontier models. Code is available at https://github.com/robiemusketeer/faea-sim
Read more →

MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment

arXiv:2601.20335v1 Announce Type: cross Abstract: Recent advances in mobile Graphical User Interface (GUI) agents highlight the growing need for comprehensive evaluation benchmarks. While new online benchmarks offer more realistic testing than offline ones, they tend to focus on the agents' task instruction-following ability while neglecting their reasoning and exploration ability. Moreover, these benchmarks do not consider the random noise in real-world mobile environments. This leads to a gap between benchmarks and real-world environments. To address these limitations, we propose MobileBench-OL, an online benchmark with 1080 tasks from 80 Chinese apps. It measures task execution, complex reasoning, and noise robustness of agents through 5 subsets that cover multiple evaluation dimensions. We also provide an auto-eval framework with a reset mechanism, enabling stable and repeatable real-world benchmarking. Evaluating 12 leading GUI agents on MobileBench-OL shows significant room for improvement to meet real-world requirements. Human evaluation further confirms that MobileBench-OL can reliably measure the performance of leading GUI agents in real environments. Our data and code will be released upon acceptance.
Read more →

Multimodal Multi-Agent Ransomware Analysis Using AutoGen

arXiv:2601.20346v1 Announce Type: cross Abstract: Ransomware has become one of the most serious cybersecurity threats, causing major financial losses and operational disruptions worldwide. Traditional detection methods such as static analysis, heuristic scanning, and behavioral analysis often fall short when used alone. To address these limitations, this paper presents a multimodal multi-agent ransomware analysis framework designed for ransomware classification. The proposed architecture combines information from static, dynamic, and network sources. Each data type is handled by a specialized agent that uses autoencoder-based feature extraction. These representations are integrated through a fusion agent, and the fused representation is passed to a transformer-based classifier that identifies the specific ransomware family. The agents interact through an inter-agent feedback mechanism that iteratively refines feature representations by suppressing low-confidence information. The framework was evaluated on large-scale datasets containing thousands of ransomware and benign samples. Across multiple experiments, it outperforms single-modality and non-adaptive fusion baselines, achieving an improvement of up to 0.936 in Macro-F1 for family classification and reducing calibration error. Over 100 epochs, the agentic feedback loop displays stable monotonic convergence, yielding over +0.75 absolute improvement in agent quality and a final composite score of around 0.88 without fine-tuning of the language models. Zero-day ransomware detection remains family-dependent and sensitive to polymorphism and modality disruptions. Confidence-aware abstention enables reliable real-world deployment by favoring conservative and trustworthy decisions over forced classification. The findings indicate that the proposed approach provides a practical and effective path toward improving real-world ransomware defense systems.
Read more →

CURVE: Learning Causality-Inspired Invariant Representations for Robust Scene Understanding via Uncertainty-Guided Regularization

arXiv:2601.20355v1 Announce Type: cross Abstract: Scene graphs provide structured abstractions for scene understanding, yet they often overfit to spurious correlations, severely hindering out-of-distribution generalization. To address this limitation, we propose CURVE, a causality-inspired framework that integrates variational uncertainty modeling with uncertainty-guided structural regularization to suppress high-variance, environment-specific relations. Specifically, we apply prototype-conditioned debiasing to disentangle invariant interaction dynamics from environment-dependent variations, promoting a sparse and domain-stable topology. Empirically, we evaluate CURVE in zero-shot transfer and low-data sim-to-real adaptation, verifying its ability to learn domain-stable sparse topologies and provide reliable uncertainty estimates to support risk prediction under distribution shifts.
Read more →

Switchcodec: Adaptive residual-expert sparse quantization for high-fidelity neural audio coding

arXiv:2601.20362v1 Announce Type: cross Abstract: Recent neural audio compression models often rely on residual vector quantization for high-fidelity coding, but using a fixed number of per-frame codebooks is suboptimal for the wide variability of audio content, especially for signals that are either very simple or highly complex. To address this limitation, we propose SwitchCodec, a neural audio codec based on Residual Experts Vector Quantization (REVQ). REVQ combines a shared quantizer with dynamically routed expert quantizers that are activated according to the input audio, decoupling bitrate from codebook capacity and improving compression efficiency. This design ensures full training and utilization of each quantizer. In addition, a variable-bitrate mechanism adjusts the number of active expert quantizers at inference, enabling multi-bitrate operation without retraining. Experiments demonstrate that SwitchCodec surpasses existing baselines on both objective metrics and subjective listening tests.
Read more →

Can Continuous-Time Diffusion Models Generate and Solve Globally Constrained Discrete Problems? A Study on Sudoku

arXiv:2601.20363v1 Announce Type: cross Abstract: Can standard continuous-time generative models represent distributions whose support is an extremely sparse, globally constrained discrete set? We study this question using completed Sudoku grids as a controlled testbed, treating them as a subset of a continuous relaxation space. We train flow-matching and score-based models along a Gaussian probability path and compare deterministic (ODE) sampling, stochastic (SDE) sampling, and DDPM-style discretizations derived from the same continuous-time training. Unconditionally, stochastic sampling substantially outperforms deterministic flows; score-based samplers are the most reliable among continuous-time methods, and DDPM-style ancestral sampling achieves the highest validity overall. We further show that the same models can be repurposed for guided generation: by repeatedly sampling completions under clamped clues and stopping when constraints are satisfied, the model acts as a probabilistic Sudoku solver. Although far less sample-efficient than classical solvers and discrete-geometry-aware diffusion methods, these experiments demonstrate that classic diffusion/flow formulations can assign non-zero probability mass to globally constrained combinatorial structures and can be used for constraint satisfaction via stochastic search.
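
The guided-generation procedure described above (clamp the clues, sample completions, stop when the global constraints hold) can be sketched as a rejection loop around any stochastic sampler. In the sketch below the trained diffusion/flow sampler is replaced by a uniform random completion, which only works for near-complete grids; that stand-in, and the helper names, are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def is_valid(grid):
    """Global Sudoku constraints: every row, column, and 3x3 box is a permutation of 1..9."""
    target = set(range(1, 10))
    for i in range(9):
        if set(grid[i, :]) != target or set(grid[:, i]) != target:
            return False
    for r in range(0, 9, 3):
        for c in range(0, 9, 3):
            if set(grid[r:r + 3, c:c + 3].ravel()) != target:
                return False
    return True

def sample_completion(clues):
    """Stand-in for the trained sampler: fill the unknown cells (marked 0) uniformly at random.
    In the paper this call would be a stochastic (SDE/DDPM-style) sampler with the clue cells clamped."""
    grid = clues.copy()
    unknown = grid == 0
    grid[unknown] = rng.integers(1, 10, size=unknown.sum())
    return grid

def solve_by_stochastic_search(clues, max_tries=10000):
    # Repeatedly sample completions under clamped clues and stop once the constraints are satisfied.
    for _ in range(max_tries):
        candidate = sample_completion(clues)
        if is_valid(candidate):
            return candidate
    return None

# Demo: start from a known valid grid and hide a few cells; the random stand-in can only
# cope with near-complete grids, whereas a trained model handles far sparser clue sets.
full = np.array([[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)])
clues = full.copy()
clues[rng.random((9, 9)) < 0.05] = 0
print(solve_by_stochastic_search(clues) is not None)
```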
Read more →

LLM-AutoDP: Automatic Data Processing via LLM Agents for Model Fine-tuning

arXiv:2601.20375v1 Announce Type: cross Abstract: Large Language Models (LLMs) can be fine-tuned on domain-specific data to enhance their performance in specialized fields. However, such data often contains numerous low-quality samples, necessitating effective data processing (DP). In practice, DP strategies are typically developed through iterative manual analysis and trial-and-error adjustment. These processes inevitably incur high labor costs and may lead to privacy issues in high-privacy domains like healthcare due to direct human access to sensitive data. Thus, achieving automated data processing without exposing the raw data has become a critical challenge. To address this challenge, we propose LLM-AutoDP, a novel framework that leverages LLMs as agents to automatically generate and optimize data processing strategies. Our method generates multiple candidate strategies and iteratively refines them using feedback signals and comparative evaluations. This iterative in-context learning mechanism enables the agent to converge toward high-quality processing pipelines without requiring direct human intervention or access to the underlying data. To further accelerate strategy search, we introduce three key techniques: Distribution Preserving Sampling, which reduces data volume while maintaining distributional integrity; Processing Target Selection, which uses a binary classifier to identify low-quality samples for focused processing; and a Cache-and-Reuse Mechanism, which minimizes redundant computations by reusing prior processing results. Results show that models trained on data processed by our framework achieve over 80% win rates against models trained on unprocessed data. Compared to AutoML baselines based on LLM agents, LLM-AutoDP achieves approximately a 65% win rate. Moreover, our acceleration techniques reduce the total searching time by up to 10 times, demonstrating both effectiveness and efficiency.
Read more →

FedRD: Reducing Divergences for Generalized Federated Learning via Heterogeneity-aware Parameter Guidance

arXiv:2601.20397v1 Announce Type: cross Abstract: Heterogeneous federated learning (HFL) aims to ensure effective and privacy-preserving collaboration among different entities. As newly joined clients require significant adjustments and additional training to align with the existing system, the problem of generalizing federated learning models to unseen clients under heterogeneous data has become progressively crucial. Consequently, we highlight two unsolved challenging issues in federated domain generalization: Optimization Divergence and Performance Divergence. To tackle the above challenges, we propose FedRD, a novel heterogeneity-aware federated learning algorithm that collaboratively utilizes parameter-guided global generalization aggregation and local debiased classification to reduce divergences, aiming to obtain an optimal global model for participating and unseen clients. Extensive experiments on public multi-domain datasets demonstrate that our approach exhibits a substantial performance advantage over competing baselines in addressing this specific problem.
Read more →

GuideAI: A Real-time Personalized Learning Solution with Adaptive Interventions

arXiv:2601.20402v1 Announce Type: cross Abstract: Large Language Models (LLMs) have emerged as powerful learning tools, but they lack awareness of learners' cognitive and physiological states, limiting their adaptability to the user's learning style. Contemporary learning techniques primarily focus on structured learning paths, knowledge tracing, and generic adaptive testing but fail to address real-time learning challenges driven by cognitive load, attention fluctuations, and engagement levels. Building on findings from a formative user study (N=66), we introduce GuideAI, a multi-modal framework that enhances LLM-driven learning by integrating real-time biosensory feedback including eye gaze tracking, heart rate variability, posture detection, and digital note-taking behavior. GuideAI dynamically adapts learning content and pacing through cognitive optimizations (adjusting complexity based on learning progress markers), physiological interventions (breathing guidance and posture correction), and attention-aware strategies (redirecting focus using gaze analysis). Additionally, GuideAI supports diverse learning modalities, including text-based, image-based, audio-based, and video-based instruction, across varied knowledge domains. A preliminary study (N = 25) assessed GuideAI's impact on knowledge retention and cognitive load through standardized assessments. The results show statistically significant improvements in both problem-solving capability and recall-based knowledge assessments. Participants also experienced notable reductions in key NASA-TLX measures including mental demand, frustration levels, and effort, while simultaneously reporting enhanced perceived performance. These findings demonstrate GuideAI's potential to bridge the gap between current LLM-based learning systems and individualized learner needs, paving the way for adaptive, cognition-aware education at scale.
Read more →

On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents

arXiv:2601.20404v1 Announce Type: cross Abstract: AI coding agents such as Codex and Claude Code are increasingly used to autonomously contribute to software repositories. However, little is known about how repository-level configuration artifacts affect operational efficiency of the agents. In this paper, we study the impact of AGENTS.md files on the runtime and token consumption of AI coding agents operating on GitHub pull requests. We analyze 10 repositories and 124 pull requests, executing agents under two conditions: with and without an AGENTS.md file. We measure wall-clock execution time and token usage during agent execution. Our results show that the presence of AGENTS.md is associated with a lower median runtime ($\Delta 28.64$%) and reduced output token consumption ($\Delta 16.58$%), while maintaining a comparable task completion behavior. Based on these results, we discuss immediate implications for the configuration and deployment of AI coding agents in practice, and outline a broader research agenda on the role of repository-level instructions in shaping the behavior, efficiency, and integration of AI coding agents in software development workflows.
Read more →

Meeting SLOs, Slashing Hours: Automated Enterprise LLM Optimization with OptiKIT

arXiv:2601.20408v1 Announce Type: cross Abstract: Enterprise LLM deployment faces a critical scalability challenge: organizations must optimize models systematically to scale AI initiatives within constrained compute budgets, yet the specialized expertise required for manual optimization remains a niche and scarce skillset. This challenge is particularly evident in managing GPU utilization across heterogeneous infrastructure while enabling teams with diverse workloads and limited LLM optimization experience to deploy models efficiently. We present OptiKIT, a distributed LLM optimization framework that democratizes model compression and tuning by automating complex optimization workflows for non-expert teams. OptiKIT provides dynamic resource allocation, staged pipeline execution with automatic cleanup, and seamless enterprise integration. In production, it delivers more than 2x GPU throughput improvement while empowering application teams to achieve consistent performance improvements without deep LLM optimization expertise. We share both the platform design and key engineering insights into resource allocation algorithms, pipeline orchestration, and integration patterns that enable large-scale, production-grade democratization of model optimization. Finally, we open-source the system to enable external contributions and broader reproducibility.
Read more →

Let's Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models

arXiv:2601.20419v1 Announce Type: cross Abstract: Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP). However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, making text-visual alignment less effective. In this paper, we tackle this issue from two perspectives: \emph{View Refinement} and \emph{Description Refinement}, termed as \textit{\textbf{Bi}-refinement for \textbf{F}ine-grained \textbf{T}ext-visual \textbf{A}lignment} (BiFTA). \emph{View refinement} removes redundant image patches with high \emph{Intersection over Union} (IoU) ratios, resulting in more distinctive visual samples. \emph{Description refinement} removes redundant text descriptions with high pairwise cosine similarity, ensuring greater diversity in the remaining descriptions. BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, justifying the necessity to remove redundant information in visual-text alignment.
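
Both refinement steps reduce to simple redundancy filters, sketched below: patches are greedily dropped when their IoU with an already-kept patch is high, and descriptions are greedily dropped when their embedding cosine similarity with a kept one is high. The thresholds and the greedy selection order are illustrative assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter + 1e-8)

def refine_views(boxes, iou_thresh=0.5):
    """View refinement: greedily drop patches that overlap an already-kept patch too much."""
    kept = []
    for b in boxes:
        if all(iou(b, k) < iou_thresh for k in kept):
            kept.append(b)
    return kept

def refine_descriptions(embeddings, sim_thresh=0.9):
    """Description refinement: greedily drop descriptions too similar (cosine) to a kept one."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i in range(len(e)):
        if all(float(e[i] @ e[j]) < sim_thresh for j in kept):
            kept.append(i)
    return kept

boxes = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300)]
print(refine_views(boxes))   # the near-duplicate second patch is removed
print(refine_descriptions(np.random.default_rng(0).normal(size=(5, 16))))
```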
Read more →

Guiding the Recommender: Information-Aware Auto-Bidding for Content Promotion

arXiv:2601.20422v1 Announce Type: cross Abstract: Modern content platforms offer paid promotion to mitigate cold start by allocating exposure via auctions. Our empirical analysis reveals a counterintuitive flaw in this paradigm: while promotion rescues low-to-medium quality content, it can harm high-quality content by forcing exposure to suboptimal audiences, polluting engagement signals and downgrading future recommendation. We recast content promotion as a dual-objective optimization that balances short-term value acquisition with long-term model improvement. To make this tractable at bid time in content promotion, we introduce a decomposable surrogate objective, gradient coverage, and establish its formal connection to Fisher Information and optimal experimental design. We design a two-stage auto-bidding algorithm based on Lagrange duality that dynamically paces budget through a shadow price and optimizes impression-level bids using per-impression marginal utilities. To address missing labels at bid time, we propose a confidence-gated gradient heuristic, paired with a zeroth-order variant for black-box models that reliably estimates learning signals in real time. We provide theoretical guarantees, proving monotone submodularity of the composite objective, sublinear regret in online auction, and budget feasibility. Extensive offline experiments on synthetic and real-world datasets validate the framework: it outperforms baselines, achieves superior final AUC/LogLoss, adheres closely to budget targets, and remains effective when gradients are approximated zeroth-order. These results show that strategic, information-aware promotion can improve long-term model performance and organic outcomes beyond naive impression-maximization strategies.
Read more →

Self Voice Conversion as an Attack against Neural Audio Watermarking

arXiv:2601.20432v1 Announce Type: cross Abstract: Audio watermarking embeds auxiliary information into speech while maintaining speaker identity, linguistic content, and perceptual quality. Although recent advances in neural and digital signal processing-based watermarking methods have improved imperceptibility and embedding capacity, robustness is still primarily assessed against conventional distortions such as compression, additive noise, and resampling. However, the rise of deep learning-based attacks introduces novel and significant threats to watermark security. In this work, we investigate self voice conversion as a universal, content-preserving attack against audio watermarking systems. Self voice conversion remaps a speaker's voice to the same identity while altering acoustic characteristics through a voice conversion model. We demonstrate that this attack severely degrades the reliability of state-of-the-art watermarking approaches and highlight its implications for the security of modern audio watermarking techniques.
Read more →

Assembling the Mind's Mosaic: Towards EEG Semantic Intent Decoding

arXiv:2601.20447v1 Announce Type: cross Abstract: Enabling natural communication through brain-computer interfaces (BCIs) remains one of the most profound challenges in neuroscience and neurotechnology. While existing frameworks offer partial solutions, they are constrained by oversimplified semantic representations and a lack of interpretability. To overcome these limitations, we introduce Semantic Intent Decoding (SID), a novel framework that translates neural activity into natural language by modeling meaning as a flexible set of compositional semantic units. SID is built on three core principles: semantic compositionality, continuity and expandability of semantic space, and fidelity in reconstruction. We present BrainMosaic, a deep learning architecture implementing SID. BrainMosaic decodes multiple semantic units from EEG/SEEG signals using set matching and then reconstructs coherent sentences through semantic-guided reconstruction. This approach moves beyond traditional pipelines that rely on fixed-class classification or unconstrained generation, enabling a more interpretable and expressive communication paradigm. Extensive experiments on multilingual EEG and clinical SEEG datasets demonstrate that SID and BrainMosaic offer substantial advantages over existing frameworks, paving the way for natural and effective BCI-mediated communication.
Read more →

Fair Recourse for All: Ensuring Individual and Group Fairness in Counterfactual Explanations

arXiv:2601.20449v1 Announce Type: cross Abstract: Explainable Artificial Intelligence (XAI) is becoming increasingly essential for enhancing the transparency of machine learning (ML) models. Among the various XAI techniques, counterfactual explanations (CFs) hold a pivotal role due to their ability to illustrate how changes in input features can alter an ML model's decision, thereby offering actionable recourse to users. Ensuring that individuals with comparable attributes and those belonging to different protected groups (e.g., demographic) receive similar and actionable recourse options is essential for trustworthy and fair decision-making. In this work, we address this challenge directly by focusing on the generation of fair CFs. Specifically, we start by defining and formulating fairness at three levels: 1) individual fairness, ensuring that similar individuals receive similar CFs; 2) group fairness, ensuring equitable CFs across different protected groups; and 3) hybrid fairness, which accounts for both individual and broader group-level fairness. We formulate the problem as an optimization task and propose a novel model-agnostic, reinforcement learning based approach to generate CFs that satisfy fairness constraints at both the individual and group levels, two objectives that are usually treated as orthogonal. As fairness metrics, we extend existing metrics commonly used for auditing ML models, such as equal choice of recourse and equal effectiveness across individuals and groups. We evaluate our approach on three benchmark datasets, showing that it effectively ensures individual and group fairness while preserving the quality of the generated CFs in terms of proximity and plausibility, and quantify the cost of fairness at the different levels separately. Our work opens a broader discussion on hybrid fairness and its role and implications for XAI and beyond CFs.
Read more →

Comparative evaluation of training strategies using partially labelled datasets for segmentation of white matter hyperintensities and stroke lesions in FLAIR MRI

arXiv:2601.20503v1 Announce Type: cross Abstract: White matter hyperintensities (WMH) and ischaemic stroke lesions (ISL) are imaging features associated with cerebral small vessel disease (SVD) that are visible on brain magnetic resonance imaging (MRI) scans. The development and validation of deep learning models to segment and differentiate these features is difficult because they visually confound each other in the fluid-attenuated inversion recovery (FLAIR) sequence and often appear in the same subject. We investigated six strategies for training a combined WMH and ISL segmentation model using partially labelled data. We combined privately held fully and partially labelled datasets with publicly available partially labelled datasets to yield a total of 2052 MRI volumes, with 1341 and 1152 containing ground truth annotations for WMH and ISL respectively. We found that several methods were able to effectively leverage the partially labelled data to improve model performance, with the use of pseudolabels yielding the best result.
Read more →

Audio Deepfake Detection in the Age of Advanced Text-to-Speech models

arXiv:2601.20510v1 Announce Type: cross Abstract: Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech, raising new challenges for audio deepfake detection. This work presents a comparative evaluation of three state-of-the-art TTS models--Dia2, Maya1, and MeloTTS--representing streaming, LLM-based, and non-autoregressive architectures. A corpus of 12,000 synthetic audio samples was generated using the Daily-Dialog dataset and evaluated against four detection frameworks, including semantic, structural, and signal-level approaches. The results reveal significant variability in detector performance across generative mechanisms: models effective against one TTS architecture may fail against others, particularly LLM-based synthesis. In contrast, a multi-view detection approach combining complementary analysis levels demonstrates robust performance across all evaluated models. These findings highlight the limitations of single-paradigm detectors and emphasize the necessity of integrated detection strategies to address the evolving landscape of audio deepfake threats.
Read more →

CCMamba: Selective State-Space Models for Higher-Order Graph Learning on Combinatorial Complexes

arXiv:2601.20518v1 Announce Type: cross Abstract: Topological deep learning has emerged for modeling higher-order relational structures beyond pairwise interactions that standard graph neural networks fail to capture. Although combinatorial complexes offer a unified topological framework, most existing topological deep learning methods rely on local message passing via attention mechanisms, which incur quadratic complexity and remain low-dimensional, limiting scalability and rank-aware information aggregation in higher-order complexes. We propose Combinatorial Complex Mamba (CCMamba), the first unified Mamba-based neural framework for learning on combinatorial complexes. CCMamba reformulates message passing as a selective state-space modeling problem by organizing multi-rank incidence relations into structured sequences processed by rank-aware state-space models. This enables adaptive, directional, and long-range information propagation in linear time without self-attention. We further establish theoretically that the expressive power of CCMamba message passing is upper-bounded by the 1-Weisfeiler-Lehman test. Experiments on graph, hypergraph, and simplicial benchmarks demonstrate that CCMamba consistently outperforms existing methods while exhibiting improved scalability and robustness to depth.
Read more →

Interpreting Emergent Extreme Events in Multi-Agent Systems

arXiv:2601.20538v1 Announce Type: cross Abstract: Large language model-powered multi-agent systems have emerged as powerful tools for simulating complex human-like systems. The interactions within these systems often lead to extreme events whose origins remain obscured by the black box of emergence. Interpreting these events is critical for system safety. This paper proposes the first framework for explaining emergent extreme events in multi-agent systems, aiming to answer three fundamental questions: When does the event originate? Who drives it? And what behaviors contribute to it? Specifically, we adapt the Shapley value to faithfully attribute the occurrence of extreme events to each action taken by agents at different time steps, i.e., assigning an attribution score to the action to measure its influence on the event. We then aggregate the attribution scores along the dimensions of time, agent, and behavior to quantify the risk contribution of each dimension. Finally, we design a set of metrics based on these contribution scores to characterize the features of extreme events. Experiments across diverse multi-agent system scenarios (economic, financial, and social) demonstrate the effectiveness of our framework and provide general insights into the emergence of extreme phenomena.
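
The attribution step can be sketched with a standard Monte Carlo estimator of Shapley values, where the players are (agent, time-step) action slots and the value function indicates whether the extreme event occurs for a given coalition of retained actions. The toy event function, the neutral-default treatment of dropped actions, and the aggregation code are illustrative assumptions.

```python
import random

# Players are (agent, time-step) action slots; an action is "risky" (1) or neutral (0).
# The toy extreme event fires when at least three risky actions are retained in the coalition.
actions = {("agent_a", 1): 1, ("agent_a", 2): 1, ("agent_b", 1): 0,
           ("agent_b", 2): 1, ("agent_c", 1): 0, ("agent_c", 2): 1}

def event_occurs(coalition):
    return 1.0 if sum(actions[p] for p in coalition) >= 3 else 0.0

def monte_carlo_shapley(players, value_fn, n_perm=2000, seed=0):
    rng = random.Random(seed)
    phi = {p: 0.0 for p in players}
    for _ in range(n_perm):
        order = list(players)
        rng.shuffle(order)
        coalition, prev = [], value_fn([])
        for p in order:
            coalition.append(p)
            cur = value_fn(coalition)
            phi[p] += cur - prev          # marginal contribution of p in this ordering
            prev = cur
    return {p: v / n_perm for p, v in phi.items()}

scores = monte_carlo_shapley(list(actions), event_occurs)
# Aggregate the attribution scores along the agent and time dimensions, as in the framework above.
by_agent = {a: sum(v for (ag, t), v in scores.items() if ag == a) for a in {ag for ag, _ in actions}}
by_time = {t: sum(v for (ag, tt), v in scores.items() if tt == t) for t in {t for _, t in actions}}
print(scores, by_agent, by_time)
```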
Read more →

IoT Device Identification with Machine Learning: Common Pitfalls and Best Practices

arXiv:2601.20548v1 Announce Type: cross Abstract: This paper critically examines the device identification process using machine learning, addressing common pitfalls in existing literature. We analyze the trade-offs between identification methods (unique vs. class based), data heterogeneity, feature extraction challenges, and evaluation metrics. By highlighting specific errors, such as improper data augmentation and misleading session identifiers, we provide a robust guideline for researchers to enhance the reproducibility and generalizability of IoT security models.
Read more →

Unsupervised Ensemble Learning Through Deep Energy-based Models

arXiv:2601.20556v1 Announce Type: cross Abstract: Unsupervised ensemble learning emerged to address the challenge of combining multiple learners' predictions without access to ground truth labels or additional data. This paradigm is crucial in scenarios where evaluating individual classifier performance or understanding their strengths is challenging due to limited information. We propose a novel deep energy-based method for constructing an accurate meta-learner using only the predictions of individual learners, potentially capable of capturing complex dependence structures between them. Our approach requires no labeled data, learner features, or problem-specific information, and has theoretical guarantees for when learners are conditionally independent. We demonstrate superior performance across diverse ensemble scenarios, including challenging mixture of experts settings. Our experiments span standard ensemble datasets and curated datasets designed to test how the model fuses expertise from multiple sources. These results highlight the potential of unsupervised ensemble learning to harness collective intelligence, especially in data-scarce or privacy-sensitive environments.
Read more →

Robust Distributed Learning under Resource Constraints: Decentralized Quantile Estimation via (Asynchronous) ADMM

arXiv:2601.20571v1 Announce Type: cross Abstract: Specifications for decentralized learning on resource-constrained edge devices require algorithms that are communication-efficient, robust to data corruption, and lightweight in memory usage. While state-of-the-art gossip-based methods satisfy the first requirement, achieving robustness remains challenging. Asynchronous decentralized ADMM-based methods have been explored for estimating the median, a statistical centrality measure that is notoriously more robust than the mean. However, existing approaches require memory that scales with node degree, making them impractical when memory is limited. In this paper, we propose AsylADMM, a novel gossip algorithm for decentralized median and quantile estimation, primarily designed for asynchronous updates and requiring only two variables per node. We analyze a synchronous variant of AsylADMM to establish theoretical guarantees and empirically demonstrate fast convergence for the asynchronous algorithm. We then show that our algorithm enables quantile-based trimming, geometric median estimation, and depth-based trimming, with quantile-based trimming empirically outperforming existing rank-based methods. Finally, we provide a novel theoretical analysis of rank-based trimming via Markov chain theory.
Read more →

Inequality in Congestion Games with Learning Agents

arXiv:2601.20578v1 Announce Type: cross Abstract: Who benefits from expanding transport networks? While designed to improve mobility, such interventions can also create inequality. In this paper, we show that disparities arise not only from the structure of the network itself but also from differences in how commuters adapt to it. We model commuters as reinforcement learning agents who adapt their travel choices at different learning rates, reflecting unequal access to resources and information. To capture potential efficiency-fairness tradeoffs, we introduce the Price of Learning (PoL), a measure of inefficiency during learning. We analyze both a stylized network -- inspired by the well-known Braess's paradox, yet with two source nodes -- and an abstraction of a real-world metro system (Amsterdam). Our simulations show that network expansions can simultaneously increase efficiency and amplify inequality, especially when faster learners disproportionately benefit from new routes before others adapt. These results highlight that transport policies must account not only for equilibrium outcomes but also for the heterogeneous ways commuters adapt, since both shape the balance between efficiency and fairness.
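
A minimal sketch of the modeling setup: commuters are epsilon-greedy learners choosing between two congestible routes, with half of them updating their estimates at a faster learning rate. The cost functions, learning rates, and the simple gap statistic used as an inequality proxy are illustrative assumptions, not the paper's network or its Price of Learning metric.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two routes; travel time on each route grows with the share of commuters using it.
def travel_times(shares):
    return np.array([1.0 + 2.0 * shares[0],    # route 0: short but highly congestible
                     2.0 + 1.0 * shares[1]])   # route 1: longer free-flow, less congestible

n_agents = 200
learning_rates = np.where(np.arange(n_agents) < n_agents // 2, 0.2, 0.02)   # fast vs. slow learners
q = np.zeros((n_agents, 2))            # per-agent travel-time estimates for each route

costs_per_agent = np.zeros(n_agents)
for day in range(500):
    greedy = q.argmin(axis=1)                                   # epsilon-greedy route choice
    explore = rng.random(n_agents) < 0.05
    choice = np.where(explore, rng.integers(0, 2, n_agents), greedy)
    shares = np.bincount(choice, minlength=2) / n_agents
    realized = travel_times(shares)[choice]
    q[np.arange(n_agents), choice] += learning_rates * (realized - q[np.arange(n_agents), choice])
    costs_per_agent += realized

avg_cost = costs_per_agent / 500
fast, slow = avg_cost[:n_agents // 2].mean(), avg_cost[n_agents // 2:].mean()
print(f"avg travel time, fast learners: {fast:.3f}  slow learners: {slow:.3f}")
print(f"system average (efficiency): {avg_cost.mean():.3f}  gap (inequality proxy): {slow - fast:.3f}")
```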
Read more →

Ranking-aware Reinforcement Learning for Ordinal Ranking

arXiv:2601.20585v1 Announce Type: cross Abstract: Ordinal regression and ranking are challenging due to inherent ordinal dependencies that conventional methods struggle to model. We propose Ranking-Aware Reinforcement Learning (RARL), a novel RL framework that explicitly learns these relationships. At its core, RARL features a unified objective that synergistically integrates regression and Learning-to-Rank (L2R), enabling mutual improvement between the two tasks. This is driven by a ranking-aware verifiable reward that jointly assesses regression precision and ranking accuracy, facilitating direct model updates via policy optimization. To further enhance training, we introduce Response Mutation Operations (RMO), which inject controlled noise to improve exploration and prevent stagnation at saddle points. The effectiveness of RARL is validated through extensive experiments on three distinct benchmarks.
Read more →

Person Re-ID in 2025: Supervised, Self-Supervised, and Language-Aligned. What Works?

arXiv:2601.20598v1 Announce Type: cross Abstract: Person Re-Identification (ReID) remains a challenging problem in computer vision. This work reviews various training paradigms, evaluates the robustness of state-of-the-art ReID models in cross-domain applications, and examines the role of foundation models in improving generalization through richer, more transferable visual representations. We compare three training paradigms: supervised, self-supervised, and language-aligned models. Through this study we aim to answer the following questions: Can supervised models generalize in cross-domain scenarios? How do foundation models like SigLIP2 perform on ReID tasks? What are the weaknesses of current supervised and foundational models for ReID? We conduct the analysis across 11 models and 9 datasets. Our results show a clear split: supervised models dominate their training domain but crumble on cross-domain data. Language-aligned models, however, show surprising cross-domain robustness for ReID tasks, even though they are not explicitly trained to do so. Code and data available at: https://github.com/moiiai-tech/object-reid-benchmark.
Read more →

Regularized Gradient Temporal-Difference Learning

arXiv:2601.20599v1 Announce Type: cross Abstract: Gradient temporal-difference (GTD) learning algorithms are widely used for off-policy policy evaluation with function approximation. However, existing convergence analyses rely on the restrictive assumption that the so-called feature interaction matrix (FIM) is nonsingular. In practice, the FIM can become singular, leading to instability or degraded performance. In this paper, we propose a regularized optimization objective by reformulating the mean-square projected Bellman error (MSPBE) minimization. This formulation naturally yields a regularized GTD algorithm, referred to as R-GTD, which guarantees convergence to a unique solution even when the FIM is singular. We establish theoretical convergence guarantees and explicit error bounds for the proposed method, and validate its effectiveness through empirical experiments.
Read more →

CLEAR-Mamba: Towards Accurate, Adaptive and Trustworthy Multi-Sequence Ophthalmic Angiography Classification

arXiv:2601.20601v1 Announce Type: cross Abstract: Medical image classification is a core task in computer-aided diagnosis (CAD), playing a pivotal role in early disease detection, treatment planning, and patient prognosis assessment. In ophthalmic practice, fluorescein fundus angiography (FFA) and indocyanine green angiography (ICGA) provide hemodynamic and lesion-structural information that conventional fundus photography cannot capture. However, due to the single-modality nature, subtle lesion patterns, and significant inter-device variability, existing methods still face limitations in generalization and high-confidence prediction. To address these challenges, we propose CLEAR-Mamba, an enhanced framework built upon MedMamba with optimizations in both architecture and training strategy. Architecturally, we introduce HaC, a hypernetwork-based adaptive conditioning layer that dynamically generates parameters according to input feature distributions, thereby improving cross-domain adaptability. From a training perspective, we develop RaP, a reliability-aware prediction scheme built upon evidential uncertainty learning, which encourages the model to emphasize low-confidence samples and improves overall stability and reliability. We further construct a large-scale ophthalmic angiography dataset covering both FFA and ICGA modalities, comprising multiple retinal disease categories for model training and evaluation. Experimental results demonstrate that CLEAR-Mamba consistently outperforms multiple baseline models, including the original MedMamba, across various metrics, showing particular advantages in multi-disease classification and reliability-aware prediction. This study provides an effective solution that balances generalizability and reliability for modality-specific medical image classification tasks.
Read more →

WFR-MFM: One-Step Inference for Dynamic Unbalanced Optimal Transport

arXiv:2601.20606v1 Announce Type: cross Abstract: Reconstructing dynamical evolution from limited observations is a fundamental challenge in single-cell biology, where dynamic unbalanced optimal transport provides a principled framework for modeling coupled transport and mass variation. However, existing approaches rely on trajectory simulation at inference time, making inference a key bottleneck for scalable applications. In this work, we propose a mean-flow framework for unbalanced flow matching that summarizes both transport and mass-growth dynamics over arbitrary time intervals using mean velocity and mass-growth fields, enabling fast one-step generation without trajectory simulation. To solve dynamic unbalanced optimal transport under the Wasserstein-Fisher-Rao geometry, we further build on this framework to develop Wasserstein-Fisher-Rao Mean Flow Matching (WFR-MFM). Across synthetic and real single-cell RNA sequencing datasets, WFR-MFM achieves orders-of-magnitude faster inference than a range of existing baselines while maintaining high predictive accuracy, and enables efficient perturbation response prediction on large synthetic datasets with thousands of conditions.
Read more →

Agent Benchmarks Fail Public Sector Requirements

arXiv:2601.20617v1 Announce Type: cross Abstract: Deploying Large Language Model-based agents (LLM agents) in the public sector requires assuring that they meet the stringent legal, procedural, and structural requirements of public-sector institutions. Practitioners and researchers often turn to benchmarks for such assessments. However, it remains unclear what criteria benchmarks must meet to ensure they adequately reflect public-sector requirements, or how many existing benchmarks do so. In this paper, we first define such criteria based on a first-principles survey of public administration literature: benchmarks must be process-based, realistic, and public-sector-specific, and report metrics that reflect the unique requirements of the public sector. We analyse more than 1,300 benchmark papers for these criteria using an expert-validated LLM-assisted pipeline. Our results show that no single benchmark meets all of the criteria. Our findings provide a call to action for both researchers to develop public sector-relevant benchmarks and for public-sector officials to apply these criteria when evaluating their own agentic use cases.
Read more →

GDCNet: Generative Discrepancy Comparison Network for Multimodal Sarcasm Detection

arXiv:2601.20618v1 Announce Type: cross Abstract: Multimodal sarcasm detection (MSD) aims to identify sarcasm within image-text pairs by modeling semantic incongruities across modalities. Existing methods often exploit cross-modal embedding misalignment to detect inconsistency but struggle when visual and textual content are loosely related or semantically indirect. While recent approaches leverage large language models (LLMs) to generate sarcastic cues, the inherent diversity and subjectivity of these generations often introduce noise. To address these limitations, we propose the Generative Discrepancy Comparison Network (GDCNet). This framework captures cross-modal conflicts by utilizing descriptive, factually grounded image captions generated by Multimodal LLMs (MLLMs) as stable semantic anchors. Specifically, GDCNet computes semantic and sentiment discrepancies between the generated objective description and the original text, alongside measuring visual-textual fidelity. These discrepancy features are then fused with visual and textual representations via a gated module to adaptively balance modality contributions. Extensive experiments on MSD benchmarks demonstrate GDCNet's superior accuracy and robustness, establishing a new state-of-the-art on the MMSD2.0 benchmark.
Read more →

Detecting and Mitigating Memorization in Diffusion Models through Anisotropy of the Log-Probability

arXiv:2601.20642v1 Announce Type: cross Abstract: Diffusion-based image generative models produce high-fidelity images through iterative denoising but remain vulnerable to memorization, where they unintentionally reproduce exact copies or parts of training images. Recent memorization detection methods are primarily based on the norm of score difference as indicators of memorization. We prove that such norm-based metrics are mainly effective under the assumption of isotropic log-probability distributions, which generally holds at high or medium noise levels. In contrast, analyzing the anisotropic regime reveals that memorized samples exhibit strong angular alignment between the guidance vector and unconditional scores in the low-noise setting. Through these insights, we develop a memorization detection metric by integrating isotropic norm and anisotropic alignment. Our detection metric can be computed directly on pure noise inputs via two conditional and unconditional forward passes, eliminating the need for costly denoising steps. Detection experiments on Stable Diffusion v1.4 and v2 show that our metric outperforms existing denoising-free detection methods while being at least approximately 5x faster than the previous best approach. Finally, we demonstrate the effectiveness of our approach by utilizing a mitigation strategy that adapts memorized prompts based on our developed metric.
Read more →
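
A small PyTorch helper illustrating how the two signals could be combined into a single per-prompt score from the two forward passes on a pure-noise input; the equal weighting, the use of the absolute cosine, and the tensor shapes are assumptions, not the paper's exact metric.

    import torch

    @torch.no_grad()
    def memorization_score(eps_uncond, eps_cond, w_norm=1.0, w_align=1.0):
        """Combine the norm of the guidance vector (isotropic signal) with its
        angular alignment to the unconditional score (anisotropic signal).
        eps_uncond / eps_cond: noise predictions from the unconditional and
        conditional forward passes on the same pure-noise latent (B, C, H, W).
        The weights and sign conventions here are illustrative assumptions."""
        guidance = eps_cond - eps_uncond                     # text-guidance vector
        norm_term = guidance.flatten(1).norm(dim=1)          # norm-based signal
        cos = torch.nn.functional.cosine_similarity(
            guidance.flatten(1), eps_uncond.flatten(1), dim=1)
        align_term = cos.abs()                               # alignment signal
        return w_norm * norm_term + w_align * align_term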

Learning Contextual Runtime Monitors for Safe AI-Based Autonomy

arXiv:2601.20666v1 Announce Type: cross Abstract: We introduce a novel framework for learning context-aware runtime monitors for AI-based control ensembles. Machine-learning (ML) controllers are increasingly deployed in (autonomous) cyber-physical systems because of their ability to solve complex decision-making tasks. However, their accuracy can degrade sharply in unfamiliar environments, creating significant safety concerns. Traditional ensemble methods aim to improve robustness by averaging or voting across multiple controllers, yet this often dilutes the specialized strengths that individual controllers exhibit in different operating contexts. We argue that, rather than blending controller outputs, a monitoring framework should identify and exploit these contextual strengths. In this paper, we reformulate the design of safe AI-based control ensembles as a contextual monitoring problem. A monitor continuously observes the system's context and selects the controller best suited to the current conditions. To achieve this, we cast monitor learning as a contextual learning task and draw on techniques from contextual multi-armed bandits. Our approach comes with two key benefits: (1) theoretical safety guarantees during controller selection, and (2) improved utilization of controller diversity. We validate our framework in two simulated autonomous driving scenarios, demonstrating significant improvements in both safety and performance compared to non-contextual baselines.
Read more →
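
As a concrete stand-in for the contextual-bandit monitor, a plain LinUCB selector over a fixed set of controllers might look as follows; the context features, the reward signal, and the choice of LinUCB itself are illustrative assumptions rather than the paper's method.

    import numpy as np

    class LinUCBMonitor:
        """Contextual-bandit monitor that picks the controller whose upper
        confidence bound on expected reward (e.g., a safety/performance score)
        is highest for the observed context."""
        def __init__(self, n_controllers, ctx_dim, alpha=1.0):
            self.alpha = alpha
            self.A = [np.eye(ctx_dim) for _ in range(n_controllers)]
            self.b = [np.zeros(ctx_dim) for _ in range(n_controllers)]

        def select(self, context):
            scores = []
            for A, b in zip(self.A, self.b):
                A_inv = np.linalg.inv(A)
                theta = A_inv @ b
                bonus = self.alpha * np.sqrt(context @ A_inv @ context)
                scores.append(context @ theta + bonus)
            return int(np.argmax(scores))

        def update(self, controller, context, reward):
            self.A[controller] += np.outer(context, context)
            self.b[controller] += reward * context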

Harnessing Large Language Models for Precision Querying and Retrieval-Augmented Knowledge Extraction in Clinical Data Science

arXiv:2601.20674v1 Announce Type: cross Abstract: This study applies Large Language Models (LLMs) to two foundational Electronic Health Record (EHR) data science tasks: structured data querying (using programmatic languages, Python/Pandas) and information extraction from unstructured clinical text via a Retrieval Augmented Generation (RAG) pipeline. We test the ability of LLMs to interact accurately with large structured datasets for analytics and the reliability of LLMs in extracting semantically correct information from free text health records when supported by RAG. To this end, we presented a flexible evaluation framework that automatically generates synthetic question and answer pairs tailored to the characteristics of each dataset or task. Experiments were conducted on a curated subset of MIMIC III, (four structured tables and one clinical note type), using a mix of locally hosted and API-based LLMs. Evaluation combined exact-match metrics, semantic similarity, and human judgment. Our findings demonstrate the potential of LLMs to support precise querying and accurate information extraction in clinical workflows.
Read more →

Decoupling Perception and Calibration: Label-Efficient Image Quality Assessment Framework

arXiv:2601.20689v1 Announce Type: cross Abstract: Recent multimodal large language models (MLLMs) have demonstrated strong capabilities in image quality assessment (IQA) tasks. However, adapting such large-scale models is computationally expensive and still relies on substantial Mean Opinion Score (MOS) annotations. We argue that for MLLM-based IQA, the core bottleneck lies not in the quality perception capacity of MLLMs, but in MOS scale calibration. Therefore, we propose LEAF, a Label-Efficient Image Quality Assessment Framework that distills perceptual quality priors from an MLLM teacher into a lightweight student regressor, enabling MOS calibration with minimal human supervision. Specifically, the teacher conducts dense supervision through point-wise judgments and pair-wise preferences, with an estimate of decision reliability. Guided by these signals, the student learns the teacher's quality perception patterns through joint distillation and is calibrated on a small MOS subset to align with human annotations. Experiments on both user-generated and AI-generated IQA benchmarks demonstrate that our method significantly reduces the need for human annotations while maintaining strong MOS-aligned correlations, making lightweight IQA practical under limited annotation budgets.
Read more →

LEMON: How Well Do MLLMs Perform Temporal Multimodal Understanding on Instructional Videos?

arXiv:2601.20705v1 Announce Type: cross Abstract: Recent multimodal large language models (MLLMs) have shown remarkable progress across vision, audio, and language tasks, yet their performance on long-form, knowledge-intensive, and temporally structured educational content remains largely unexplored. To bridge this gap, we introduce LEMON, a Lecture-based Evaluation benchmark for MultimOdal uNderstanding, focusing on STEM lecture videos that require long-horizon reasoning and cross-modal integration. LEMON comprises 2,277 video segments spanning 5 disciplines and 29 courses, with an average duration of 196.1 seconds, yielding 4,181 high-quality QA pairs, including 3,413 multiple-choice and 768 open-ended questions. Distinct from existing video benchmarks, LEMON features: (1) semantic richness and disciplinary density, (2) tightly coupled video-audio-text modalities, (3) explicit temporal and pedagogical structure, and (4) contextually linked multi-turn questioning. It further encompasses six major tasks and twelve subtasks, covering the full cognitive spectrum from perception to reasoning and then to generation. Comprehensive experiments reveal substantial performance gaps across tasks, highlighting that even state-of-the-art MLLMs like GPT-4o struggle with temporal reasoning and instructional prediction. We expect LEMON to serve as an extensible and challenging benchmark for advancing multimodal perception, reasoning, and generation in long-form instructional contents.
Read more →

Beyond GEMM-Centric NPUs: Enabling Efficient Diffusion LLM Sampling

arXiv:2601.20706v1 Announce Type: cross Abstract: Diffusion Large Language Models (dLLMs) introduce iterative denoising to enable parallel token generation, but their sampling phase displays fundamentally different characteristics compared to GEMM-centric transformer layers. Profiling on modern GPUs reveals that sampling can account for up to 70% of total model inference latency, primarily due to substantial memory loads and writes from vocabulary-wide logits, reduction-based token selection, and iterative masked updates. These processes demand large on-chip SRAM and involve irregular memory accesses that conventional NPUs struggle to handle efficiently. To address this, we identify a set of critical instructions that an NPU architecture must specifically optimize for dLLM sampling. Our design employs lightweight non-GEMM vector primitives, in-place memory reuse strategies, and a decoupled mixed-precision memory hierarchy. Together, these optimizations deliver up to a 2.53x speedup over the NVIDIA RTX A6000 GPU under an equivalent nm technology node. We also open-source our cycle-accurate simulation and post-synthesis RTL verification code, confirming functional equivalence with current dLLM PyTorch implementations.
Read more →

Adapting the Behavior of Reinforcement Learning Agents to Changing Action Spaces and Reward Functions

arXiv:2601.20714v1 Announce Type: cross Abstract: Reinforcement Learning (RL) agents often struggle in real-world applications where environmental conditions are non-stationary, particularly when reward functions shift or the available action space expands. This paper introduces MORPHIN, a self-adaptive Q-learning framework that enables on-the-fly adaptation without full retraining. By integrating concept drift detection with dynamic adjustments to learning and exploration hyperparameters, MORPHIN adapts agents to changes in both the reward function and on-the-fly expansions of the agent's action space, while preserving prior policy knowledge to prevent catastrophic forgetting. We validate our approach using a Gridworld benchmark and a traffic signal control simulation. The results demonstrate that MORPHIN achieves superior convergence speed and continuous adaptation compared to a standard Q-learning baseline, improving learning efficiency by up to 1.7x.
Read more →
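
A minimal tabular sketch of the moving parts described above: a Q-learner that reacts to a crude reward-drift signal by re-opening its learning and exploration rates, and that can grow its action space in place while keeping learned values; the drift test, thresholds, and schedules are invented for illustration and are not MORPHIN's.

    import numpy as np
    from collections import deque

    class AdaptiveQAgent:
        """Tabular Q-learning with a simple drift reaction (illustrative only)."""
        def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, eps=0.1):
            self.Q = np.zeros((n_states, n_actions))
            self.alpha, self.gamma, self.eps = alpha, gamma, eps
            self.rewards = deque(maxlen=200)

        def act(self, s):
            if np.random.rand() < self.eps:
                return np.random.randint(self.Q.shape[1])
            return int(np.argmax(self.Q[s]))

        def add_actions(self, n_new):
            # Expand the action space on the fly, preserving learned values.
            self.Q = np.hstack([self.Q, np.zeros((self.Q.shape[0], n_new))])

        def update(self, s, a, r, s_next):
            self.rewards.append(r)
            if len(self.rewards) == self.rewards.maxlen:
                half = self.rewards.maxlen // 2
                old = np.mean(list(self.rewards)[:half])
                new = np.mean(list(self.rewards)[half:])
                if new < old - 1.0:                 # crude drift signal (assumption)
                    self.alpha, self.eps = 0.5, 0.3  # re-open learning/exploration
            td = r + self.gamma * self.Q[s_next].max() - self.Q[s, a]
            self.Q[s, a] += self.alpha * td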

Li-ViP3D++: Query-Gated Deformable Camera-LiDAR Fusion for End-to-End Perception and Trajectory Prediction

arXiv:2601.20720v1 Announce Type: cross Abstract: End-to-end perception and trajectory prediction from raw sensor data is one of the key capabilities for autonomous driving. Modular pipelines restrict information flow and can amplify upstream errors. Recent query-based, fully differentiable perception-and-prediction (PnP) models mitigate these issues, yet the complementarity of cameras and LiDAR in the query-space has not been sufficiently explored. Models often rely on fusion schemes that introduce heuristic alignment and discrete selection steps which prevent full utilization of available information and can introduce unwanted bias. We propose Li-ViP3D++, a query-based multimodal PnP framework that introduces Query-Gated Deformable Fusion (QGDF) to integrate multi-view RGB and LiDAR in query space. QGDF (i) aggregates image evidence via masked attention across cameras and feature levels, (ii) extracts LiDAR context through fully differentiable BEV sampling with learned per-query offsets, and (iii) applies query-conditioned gating to adaptively weight visual and geometric cues per agent. The resulting architecture jointly optimizes detection, tracking, and multi-hypothesis trajectory forecasting in a single end-to-end model. On nuScenes, Li-ViP3D++ improves end-to-end behavior and detection quality, achieving higher EPA (0.335) and mAP (0.502) while substantially reducing false positives (FP ratio 0.147), and it is faster than the prior Li-ViP3D variant (139.82 ms vs. 145.91 ms). These results indicate that query-space, fully differentiable camera-LiDAR fusion can increase robustness of end-to-end PnP without sacrificing deployability.
Read more →

QueerGen: How LLMs Reflect Societal Norms on Gender and Sexuality in Sentence Completion Tasks

arXiv:2601.20731v1 Announce Type: cross Abstract: This paper examines how Large Language Models (LLMs) reproduce societal norms, particularly heterocisnormativity, and how these norms translate into measurable biases in their text generations. We investigate whether explicit information about a subject's gender or sexuality influences LLM responses across three subject categories: queer-marked, non-queer-marked, and the normalized "unmarked" category. Representational imbalances are operationalized as measurable differences in English sentence completions across four dimensions: sentiment, regard, toxicity, and prediction diversity. Our findings show that Masked Language Models (MLMs) produce the least favorable sentiment, higher toxicity, and more negative regard for queer-marked subjects. Autoregressive Language Models (ARLMs) partially mitigate these patterns, while closed-access ARLMs tend to produce more harmful outputs for unmarked subjects. Results suggest that LLMs reproduce normative social assumptions, though the form and degree of bias depend strongly on specific model characteristics, which may redistribute, but not eliminate, representational harms.
Read more →

HESTIA: A Hessian-Guided Differentiable Quantization-Aware Training Framework for Extremely Low-Bit LLMs

arXiv:2601.20745v1 Announce Type: cross Abstract: As large language models (LLMs) continue to scale, deployment is increasingly bottlenecked by the memory wall, motivating a shift toward extremely low-bit quantization. However, most quantization-aware training (QAT) methods apply hard rounding and the straight-through estimator (STE) from the beginning of the training, which prematurely discretizes the optimization landscape and induces persistent gradient mismatch between latent weights and quantized weights, hindering effective optimization of quantized models. To address this, we propose Hestia, a Hessian-guided differentiable QAT framework for extremely low-bit LLMs, which replaces the rigid step function with a temperature-controlled softmax relaxation to maintain gradient flow early in training while progressively hardening quantization. Furthermore, Hestia leverages a tensor-wise Hessian trace metric as a lightweight curvature signal to drive fine-grained temperature annealing, enabling sensitivity-aware discretization across the model. Evaluations on Llama-3.2 show that Hestia consistently outperforms existing ternary QAT baselines, yielding average zero-shot improvements of 5.39% and 4.34% for the 1B and 3B models. These results indicate that Hessian-guided relaxation effectively recovers representational capacity, establishing a more robust training path for 1.58-bit LLMs. The code is available at https://github.com/hestia2026/Hestia.
Read more →
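
The temperature-controlled relaxation can be pictured as a softmax over the ternary levels that hardens as the temperature drops, with the Hessian trace slowing the hardening of sensitive tensors; the per-tensor scale, distance-based logits, and annealing rule below are one plausible reading rather than the paper's exact scheme.

    import torch

    def soft_ternary(w, temperature):
        """Differentiable relaxation of ternary quantization: each weight is a
        softmax-weighted mixture of the levels {-1, 0, +1}; as temperature -> 0
        this approaches hard rounding to the nearest level (times the scale)."""
        levels = torch.tensor([-1.0, 0.0, 1.0], device=w.device)
        scale = w.abs().mean().clamp_min(1e-8)                # per-tensor scale
        logits = -((w.unsqueeze(-1) / scale - levels) ** 2) / temperature
        probs = torch.softmax(logits, dim=-1)
        return scale * (probs * levels).sum(-1)

    def anneal_temperature(t_init, hessian_trace, trace_max, t_min=1e-3):
        """Sensitivity-aware annealing: tensors with larger curvature keep a
        higher temperature and are hardened more slowly (an assumption about
        the direction of the Hessian-guided rule)."""
        ratio = min(hessian_trace / max(trace_max, 1e-12), 1.0)
        return max(t_min, t_init * ratio)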

Independence of Approximate Clones

arXiv:2601.20779v1 Announce Type: cross Abstract: In an ordinal election, two candidates are said to be perfect clones if every voter ranks them adjacently. The independence of clones axiom then states that removing one of the two clones should not change the election outcome. This axiom has been extensively studied in social choice theory, and several voting rules are known to satisfy it (such as IRV, Ranked Pairs and Schulze). However, perfect clones are unlikely to occur in practice, especially for political elections with many voters. In this work, we study different notions of approximate clones in ordinal elections. Informally, two candidates are approximate clones in a preference profile if they are close to being perfect clones. We discuss two measures to quantify this proximity, and we show under which conditions the voting rules that are known to be independent of clones are also independent of approximate clones. In particular, we show that for elections with at least four candidates, none of these rules are independent of approximate clones in the general case. However, we find a more positive result for the case of three candidates. Finally, we conduct an empirical study of approximate clones and independence of approximate clones based on three real-world datasets: votes in local Scottish elections, votes in mini-jury deliberations, and votes of judges in figure skating competitions. We find that approximate clones are common in some contexts, and that the closer two candidates are to being perfect clones, the less likely their removal is to change the election outcome, especially for voting rules that are independent of perfect clones.
Read more →
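
One simple proximity measure in the spirit of the above, the share of ballots on which two candidates appear adjacently, is easy to state in code; representing ballots as Python lists of candidate ids is an assumption, and the paper's two measures may differ.

    def clone_proximity(profile, a, b):
        """Fraction of ballots that rank candidates a and b adjacently;
        1.0 corresponds to perfect clones. Only one possible measure."""
        adjacent = sum(abs(ballot.index(a) - ballot.index(b)) == 1
                       for ballot in profile)
        return adjacent / len(profile)

    # Example: 2 of 3 voters rank candidates 0 and 1 next to each other.
    profile = [[0, 1, 2], [2, 0, 1], [0, 2, 1]]
    print(clone_proximity(profile, 0, 1))  # 0.666...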

FAIRT2V: Training-Free Debiasing for Text-to-Video Diffusion Models

arXiv:2601.20791v1 Announce Type: cross Abstract: Text-to-video (T2V) diffusion models have achieved rapid progress, yet their demographic biases, particularly gender bias, remain largely unexplored. We present FairT2V, a training-free debiasing framework for text-to-video generation that mitigates encoder-induced bias without finetuning. We first analyze demographic bias in T2V models and show that it primarily originates from pretrained text encoders, which encode implicit gender associations even for neutral prompts. We quantify this effect with a gender-leaning score that correlates with bias in generated videos. Based on this insight, FairT2V mitigates demographic bias by neutralizing prompt embeddings via anchor-based spherical geodesic transformations while preserving semantics. To maintain temporal coherence, we apply debiasing only during early identity-forming steps through a dynamic denoising schedule. We further propose a video-level fairness evaluation protocol combining VideoLLM-based reasoning with human verification. Experiments on the modern T2V model Open-Sora show that FairT2V substantially reduces demographic bias across occupations with minimal impact on video quality.
Read more →
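
A sketch of anchor-based geodesic (slerp) neutralization of a prompt embedding; the gendered anchors, the 0.5 midpoint, and the strength value are assumptions, and in the paper the transformation is applied only during early, identity-forming denoising steps.

    import torch
    import torch.nn.functional as F

    def slerp(a, b, t):
        """Spherical interpolation along the geodesic between embeddings a and b."""
        a_n, b_n = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        omega = torch.acos((a_n * b_n).sum(-1, keepdim=True).clamp(-1 + 1e-6, 1 - 1e-6))
        so = torch.sin(omega)
        out = (torch.sin((1 - t) * omega) / so) * a_n + (torch.sin(t * omega) / so) * b_n
        return out * a.norm(dim=-1, keepdim=True)   # keep the original magnitude

    def neutralize_prompt(prompt_emb, anchor_a, anchor_b, strength=0.5):
        """Move a prompt embedding toward the geodesic midpoint of two gendered
        anchor embeddings (e.g., encodings of 'a man' / 'a woman'). The anchor
        construction and strength are illustrative assumptions."""
        neutral_anchor = slerp(anchor_a, anchor_b, 0.5)
        return slerp(prompt_emb, neutral_anchor, strength)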

Conditional PED-ANOVA: Hyperparameter Importance in Hierarchical & Dynamic Search Spaces

arXiv:2601.20800v1 Announce Type: cross Abstract: We propose conditional PED-ANOVA (condPED-ANOVA), a principled framework for estimating hyperparameter importance (HPI) in conditional search spaces, where the presence or domain of a hyperparameter can depend on other hyperparameters. Although the original PED-ANOVA provides a fast and efficient way to estimate HPI within the top-performing regions of the search space, it assumes a fixed, unconditional search space and therefore cannot properly handle conditional hyperparameters. To address this, we introduce a conditional HPI for top-performing regions and derive a closed-form estimator that accurately reflects conditional activation and domain changes. Experiments show that naive adaptations of existing HPI estimators yield misleading or uninterpretable importance estimates in conditional settings, whereas condPED-ANOVA consistently provides meaningful importances that reflect the underlying conditional structure.
Read more →

Reinforcement Learning via Self-Distillation

arXiv:2601.20802v1 Announce Type: cross Abstract: Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. Across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6, SDPO improves sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts. Finally, applying SDPO to individual questions at test time accelerates discovery on difficult binary-reward tasks, achieving the same discovery probability as best-of-k sampling or multi-turn conversations with 3x fewer attempts.
Read more →
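
A sketch of the self-distillation objective under one possible sequence layout, in which the textual feedback precedes the failed attempt in the teacher's context so that its next-token predictions over the attempt are feedback-informed; the layout, the omitted masking details, and the HuggingFace-style model(...).logits interface are all assumptions.

    import torch
    import torch.nn.functional as F

    def sdpo_loss(model, prompt_ids, attempt_ids, feedback_ids):
        """Distill the feedback-conditioned teacher (same model, no gradients)
        into the policy over the tokens of the failed attempt."""
        with torch.no_grad():
            teacher_in = torch.cat([prompt_ids, feedback_ids, attempt_ids], dim=1)
            teacher_logits = model(teacher_in).logits
        student_in = torch.cat([prompt_ids, attempt_ids], dim=1)
        student_logits = model(student_in).logits

        p_len = prompt_ids.size(1)
        f_len = feedback_ids.size(1)
        a_len = attempt_ids.size(1)
        # Logits at positions that predict each attempt token (next-token shift).
        tl = teacher_logits[:, p_len + f_len - 1 : p_len + f_len + a_len - 1]
        sl = student_logits[:, p_len - 1 : p_len + a_len - 1]
        return F.kl_div(F.log_softmax(sl, -1), F.softmax(tl, -1),
                        reduction="batchmean")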

GNN Explanations that do not Explain and How to find Them

arXiv:2601.20815v1 Announce Type: cross Abstract: Explanations provided by Self-explainable Graph Neural Networks (SE-GNNs) are fundamental for understanding the model's inner workings and for identifying potential misuse of sensitive attributes. Although recent works have highlighted that these explanations can be suboptimal and potentially misleading, a characterization of their failure cases is unavailable. In this work, we identify a critical failure of SE-GNN explanations: explanations can be unambiguously unrelated to how the SE-GNNs infer labels. We show that, on the one hand, many SE-GNNs can achieve optimal true risk while producing these degenerate explanations, and on the other, most faithfulness metrics can fail to identify these failure modes. Our empirical analysis reveals that degenerate explanations can be maliciously planted (allowing an attacker to hide the use of sensitive attributes) and can also emerge naturally, highlighting the need for reliable auditing. To address this, we introduce a novel faithfulness metric that reliably marks degenerate explanations as unfaithful, in both malicious and natural settings. Our code is available in the supplemental.
Read more →

Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning

arXiv:2601.20829v1 Announce Type: cross Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning abilities of large language models (LLMs), yet training often stalls as problems become saturated. We identify the core challenge as the poor accessibility of informative failures: learning signals exist but are rarely encountered during standard rollouts. To address this, we propose failure-prefix conditioning, a simple and effective method for learning from saturated problems. Rather than starting from the original question, our approach reallocates exploration by conditioning training on prefixes derived from rare incorrect reasoning trajectories, thereby exposing the model to failure-prone states. We observe that failure-prefix conditioning yields performance gains matching those of training on medium-difficulty problems, while preserving token efficiency. Furthermore, we analyze the model's robustness, finding that our method reduces performance degradation under misleading failure prefixes, albeit with a mild trade-off in adherence to correct early reasoning. Finally, we demonstrate that an iterative approach, which refreshes failure prefixes during training, unlocks additional gains after performance plateaus. Overall, our results suggest that failure-prefix conditioning offers an effective pathway to extend RLVR training on saturated problems.
Read more →
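
On the data side, the idea of turning rare incorrect rollouts into failure-prone start states can be sketched as a prompt constructor; the truncation fractions and prompt formatting are invented for illustration.

    import random

    def build_failure_prefix_prompts(question, failed_traces, n_prefixes=4,
                                     min_frac=0.2, max_frac=0.6, seed=0):
        """Turn rare incorrect reasoning traces into failure-prone start states:
        each new training prompt is the original question plus a truncated
        prefix of a failed trace. Fractions and formatting are illustrative."""
        rng = random.Random(seed)
        prompts = []
        for trace in failed_traces:
            tokens = trace.split()
            for _ in range(n_prefixes):
                frac = rng.uniform(min_frac, max_frac)
                cut = max(1, int(len(tokens) * frac))
                prompts.append(question + "\n" + " ".join(tokens[:cut]))
        return prompts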

Open-Vocabulary Functional 3D Human-Scene Interaction Generation

arXiv:2601.20835v1 Announce Type: cross Abstract: Generating 3D humans that functionally interact with 3D scenes remains an open problem with applications in embodied AI, robotics, and interactive content creation. The key challenge involves reasoning about both the semantics of functional elements in 3D scenes and the 3D human poses required to achieve functionality-aware interaction. Unfortunately, existing methods typically lack explicit reasoning over object functionality and the corresponding human-scene contact, resulting in implausible or functionally incorrect interactions. In this work, we propose FunHSI, a training-free, functionality-driven framework that enables functionally correct human-scene interactions from open-vocabulary task prompts. Given a task prompt, FunHSI performs functionality-aware contact reasoning to identify functional scene elements, reconstruct their 3D geometry, and model high-level interactions via a contact graph. We then leverage vision-language models to synthesize a human performing the task in the image and estimate proposed 3D body and hand poses. Finally, the proposed 3D body configuration is refined via stage-wise optimization to ensure physical plausibility and functional correctness. In contrast to existing methods, FunHSI not only synthesizes more plausible general 3D interactions, such as "sitting on a sofa", but also supports fine-grained functional human-scene interactions, e.g., "increasing the room temperature". Extensive experiments demonstrate that FunHSI consistently generates functionally correct and physically plausible human-scene interactions across diverse indoor and outdoor scenes.
Read more →

Reward Models Inherit Value Biases from Pretraining

arXiv:2601.20838v1 Announce Type: cross Abstract: Reward models (RMs) are central to aligning large language models (LLMs) with human values but have received less attention than pre-trained and post-trained LLMs themselves. Because RMs are initialized from LLMs, they inherit representations that shape their behavior, but the nature and extent of this influence remain understudied. In a comprehensive study of 10 leading open-weight RMs using validated psycholinguistic corpora, we show that RMs exhibit significant differences along multiple dimensions of human value as a function of their base model. Using the "Big Two" psychological axes, we show a robust preference of Llama RMs for "agency" and a corresponding robust preference of Gemma RMs for "communion." This phenomenon holds even when the preference data and finetuning process are identical, and we trace it back to the logits of the respective instruction-tuned and pre-trained models. These log-probability differences themselves can be formulated as an implicit RM; we derive usable implicit reward scores and show that they exhibit the very same agency/communion difference. We run experiments training RMs with ablations for preference data source and quantity, which demonstrate that this effect is not only repeatable but surprisingly durable. Despite RMs being designed to represent human preferences, our evidence shows that their outputs are influenced by the pretrained LLMs on which they are based. This work underscores the importance of safety and alignment efforts at the pretraining stage, and makes clear that open-source developers' choice of base model is as much a consideration of values as of performance.
Read more →

$\mathbb{R}^{2k}$ is Theoretically Large Enough for Embedding-based Top-$k$ Retrieval

arXiv:2601.20844v1 Announce Type: cross Abstract: This paper studies the minimal dimension required to embed subset memberships ($m$ elements and ${m\choose k}$ subsets of at most $k$ elements) into vector spaces, denoted as Minimal Embeddable Dimension (MED). The tight bounds of MED are derived theoretically and supported empirically for various notions of "distances" or "similarities," including the $\ell_2$ metric, inner product, and cosine similarity. In addition, we conduct numerical simulation in a more achievable setting, where the ${m\choose k}$ subset embeddings are chosen as the centroid of the embeddings of the contained elements. Our simulation easily realizes a logarithmic dependency between the MED and the number of elements to embed. These findings imply that embedding-based retrieval limitations stem primarily from learnability challenges, not geometric constraints, guiding future algorithm design.
Read more →
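
The centroid construction is easy to reproduce in spirit; the sketch below uses random (unoptimized) unit embeddings and sampled subsets, so the dimension it reports only upper-bounds the minimal embeddable dimension the paper studies.

    import numpy as np

    def subsets_recoverable(m, k, d, trials=200, seed=0):
        """True if, with random unit embeddings in R^d, the centroid of each
        sampled k-subset retrieves exactly its own elements as the top-k by
        inner product (random embeddings only upper-bound the true MED)."""
        rng = np.random.default_rng(seed)
        E = rng.normal(size=(m, d))
        E /= np.linalg.norm(E, axis=1, keepdims=True)
        for _ in range(trials):
            subset = rng.choice(m, size=k, replace=False)
            query = E[subset].mean(axis=0)            # centroid of the subset
            topk = np.argsort(E @ query)[-k:]
            if set(topk) != set(subset):
                return False
        return True

    m, k, d = 100, 4, 1
    while not subsets_recoverable(m, k, d):
        d += 1
    print(f"sampled subsets of size {k} over {m} elements recovered at d = {d}")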

A New Dataset and Framework for Robust Road Surface Classification via Camera-IMU Fusion

arXiv:2601.20847v1 Announce Type: cross Abstract: Road surface classification (RSC) is a key enabler for environment-aware predictive maintenance systems. However, existing RSC techniques often fail to generalize beyond narrow operational conditions due to limited sensing modalities and datasets that lack environmental diversity. This work addresses these limitations by introducing a multimodal framework that fuses images and inertial measurements using a lightweight bidirectional cross-attention module followed by an adaptive gating layer that adjusts modality contributions under domain shifts. Given the limitations of current benchmarks, especially regarding lack of variability, we introduce ROAD, a new dataset composed of three complementary subsets: (i) real-world multimodal recordings with RGB-IMU streams synchronized using a gold-standard industry datalogger, captured across diverse lighting, weather, and surface conditions; (ii) a large vision-only subset designed to assess robustness under adverse illumination and heterogeneous capture setups; and (iii) a synthetic subset generated to study out-of-distribution generalization in scenarios difficult to obtain in practice. Experiments show that our method achieves a +1.4 pp improvement over the previous state-of-the-art on the PVS benchmark and an +11.6 pp improvement on our multimodal ROAD subset, with consistently higher F1-scores on minority classes. The framework also demonstrates stable performance across challenging visual conditions, including nighttime, heavy rain, and mixed-surface transitions. These findings indicate that combining affordable camera and IMU sensors with multimodal attention mechanisms provides a scalable, robust foundation for road surface understanding, particularly relevant for regions where environmental variability and cost constraints limit the adoption of high-end sensing suites.
Read more →

Post-Training Fairness Control: A Single-Train Framework for Dynamic Fairness in Recommendation

arXiv:2601.20848v1 Announce Type: cross Abstract: Despite growing efforts to mitigate unfairness in recommender systems, existing fairness-aware methods typically fix the fairness requirement at training time and provide limited post-training flexibility. However, in real-world scenarios, diverse stakeholders may demand differing fairness requirements over time, so retraining for different fairness requirements becomes prohibitive. To address this limitation, we propose Cofair, a single-train framework that enables post-training fairness control in recommendation. Specifically, Cofair introduces a shared representation layer with fairness-conditioned adapter modules to produce user embeddings specialized for varied fairness levels, along with a user-level regularization term that guarantees user-wise monotonic fairness improvements across these levels. We theoretically establish that the adversarial objective of Cofair upper bounds demographic parity and the regularization term enforces progressive fairness at user level. Comprehensive experiments on multiple datasets and backbone models demonstrate that our framework provides dynamic fairness at different levels, delivering comparable or better fairness-accuracy curves than state-of-the-art baselines, without the need to retrain for each new fairness requirement. Our code is publicly available at https://github.com/weixinchen98/Cofair.
Read more →

Exploring Transformer Placement in Variational Autoencoders for Tabular Data Generation

arXiv:2601.20854v1 Announce Type: cross Abstract: Tabular data remains a challenging domain for generative models. In particular, the standard Variational Autoencoder (VAE) architecture, typically composed of multilayer perceptrons, struggles to model relationships between features, especially when handling mixed data types. In contrast, Transformers, through their attention mechanism, are better suited for capturing complex feature interactions. In this paper, we empirically investigate the impact of integrating Transformers into different components of a VAE. We conduct experiments on 57 datasets from the OpenML CC18 suite and draw two main conclusions. First, results indicate that positioning Transformers to leverage latent and decoder representations leads to a trade-off between fidelity and diversity. Second, we observe a high similarity between consecutive blocks of a Transformer in all components. In particular, in the decoder, the relationship between the input and output of a Transformer is approximately linear.
Read more →

Evolutionary Strategies lead to Catastrophic Forgetting in LLMs

arXiv:2601.20861v1 Announce Type: cross Abstract: One of the biggest missing capabilities in current AI systems is the ability to learn continuously after deployment. Implementing such continually learning systems presents several challenges, one of which is the large memory requirement of gradient-based algorithms that are used to train state-of-the-art LLMs. Evolutionary Strategies (ES) have recently re-emerged as a gradient-free alternative to traditional learning algorithms and have shown encouraging performance on specific tasks in LLMs. In this paper, we perform a comprehensive analysis of ES and specifically evaluate its forgetting curves when training for an increasing number of update steps. We first find that ES is able to reach performance numbers close to GRPO for math and reasoning tasks with a comparable compute budget. However, and most importantly for continual learning, the performance gains in ES are accompanied by significant forgetting of prior abilities, limiting its applicability for training models online. We also explore the reason behind this behavior and show that the updates made using ES are much less sparse and have orders of magnitude larger $\ell_2$ norm compared to corresponding GRPO updates, explaining the contrasting forgetting curves between the two algorithms. With this study, we aim to highlight the issue of forgetting in gradient-free algorithms like ES and hope to inspire future work to mitigate these issues.
Read more →
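
The norm and sparsity comparison behind this analysis can be illustrated with a plain antithetic ES estimator on a flat parameter vector; mapping that vector back into model weights, and every hyperparameter here, are simplifying assumptions.

    import torch

    def es_update(theta, reward_fn, sigma=1e-3, lr=1e-2, n_samples=8):
        """Antithetic evolutionary-strategies step on a flat parameter vector.
        Returns the update together with its l2 norm and sparsity, so it can be
        contrasted with a gradient-based (e.g., GRPO) update of the same size."""
        eps = torch.randn(n_samples, theta.numel())
        plus = torch.tensor([reward_fn(theta + sigma * e) for e in eps])
        minus = torch.tensor([reward_fn(theta - sigma * e) for e in eps])
        grad_est = ((plus - minus).unsqueeze(1) * eps).mean(0) / (2 * sigma)
        update = lr * grad_est
        sparsity = (update.abs() < 1e-8).float().mean().item()
        return update, update.norm().item(), sparsity

    # Toy reward: negative distance to a target vector (stands in for task reward).
    target = torch.randn(1000)
    reward = lambda p: -float((p - target).norm())
    delta, l2, sp = es_update(torch.zeros(1000), reward)
    print(f"l2 norm = {l2:.4f}, fraction of (near-)zero entries = {sp:.2f}")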

SimBench: A Framework for Evaluating and Diagnosing LLM-Based Digital-Twin Generation for Multi-Physics Simulation

arXiv:2408.11987v2 Announce Type: replace Abstract: We introduce SimBench, a benchmark designed to evaluate the proficiency of simulator-oriented LLMs (S-LLMs) in generating digital twins (DTs) that can be used in simulators for virtual testing. Given a collection of S-LLMs, this benchmark ranks them according to their ability to produce high-quality DTs. We demonstrate this by comparing over 33 open- and closed-source S-LLMs. Using multi-turn interactions, SimBench employs an LLM-as-a-judge (J-LLM) that leverages both predefined rules and human-in-the-loop guidance to assign scores for the DTs generated by the S-LLM, thus providing a consistent and expert-inspired evaluation protocol. The J-LLM is specific to a simulator, and herein the proposed benchmarking approach is demonstrated in conjunction with the open-source Chrono multi-physics simulator. Chrono provided the backdrop used to assess an S-LLM in relation to the latter's ability to create digital twins for multibody dynamics, finite element analysis, vehicle dynamics, robotic dynamics, and sensor simulations. The proposed benchmarking principle is broadly applicable and enables the assessment of an S-LLM's ability to generate digital twins for other simulation packages, e.g., ANSYS, ABAQUS, OpenFOAM, StarCCM+, IsaacSim, and pyBullet.
Read more →

DGRAG: Distributed Graph-based Retrieval-Augmented Generation in Edge-Cloud Systems

arXiv:2505.19847v2 Announce Type: replace Abstract: Retrieval-Augmented Generation (RAG) improves factuality by grounding LLMs in external knowledge, yet conventional centralized RAG requires aggregating distributed data, raising privacy risks and incurring high retrieval latency and cost. We present DGRAG, a distributed graph-driven RAG framework for edge-cloud collaborative systems. Each edge device organizes local documents into a knowledge graph and periodically uploads subgraph-level summaries to the cloud for lightweight global indexing without exposing raw data. At inference time, queries are first answered on the edge; a gate mechanism assesses the confidence and consistency of multiple local generations to decide whether to return a local answer or escalate the query. For escalated queries, the cloud performs summary-based matching to identify relevant edges, retrieves supporting evidence from them, and generates the final response with a cloud LLM. Experiments on distributed question answering show that DGRAG consistently outperforms decentralized baselines while substantially reducing cloud overhead.
Read more →

Lifted Forward Planning in Relational Factored Markov Decision Processes with Concurrent Actions

arXiv:2505.22147v2 Announce Type: replace Abstract: Decision making is a central problem in AI that can be formalized using a Markov Decision Process. A problem is that, with increasing numbers of (indistinguishable) objects, the state space grows exponentially. To compute policies, the state space has to be enumerated. Even more possibilities have to be enumerated if the size of the action space depends on the size of the state space, especially if we allow concurrent actions. To tackle the exponential blow-up in the action and state space, we present a first-order representation to store the spaces in polynomial instead of exponential size in the number of objects and introduce Foreplan, a relational forward planner, which uses this representation to efficiently compute policies for numerous indistinguishable objects and actions. Additionally, we introduce an even faster approximate version of Foreplan. Moreover, Foreplan identifies how many objects an agent should act on to achieve a certain task given restrictions. Further, we provide a theoretical analysis and an empirical evaluation of Foreplan, demonstrating a speedup of at least four orders of magnitude.
Read more →

DCP-Bench-Open: Evaluating LLMs for Constraint Modelling of Discrete Combinatorial Problems

arXiv:2506.06052v3 Announce Type: replace Abstract: Discrete Combinatorial Problems (DCPs) are prevalent in industrial decision-making and optimisation. However, while constraint solving technologies for DCPs have advanced significantly, the core process of formalising them, namely constraint modelling, requires significant expertise and remains a bottleneck for wider adoption. Aiming to alleviate this bottleneck, recent studies have explored using Large Language Models (LLMs) to transform combinatorial problem descriptions into executable constraint models. However, the existing evaluation datasets for discrete constraint modelling are often limited to small, homogeneous, or domain-specific problems, which do not capture the diversity of real-world scenarios. This work addresses this gap by introducing DCP-Bench-Open, a novel benchmark that includes a diverse set of well-known discrete combinatorial problems sourced from the Constraint Programming (CP) and Operations Research (OR) communities, structured explicitly for evaluating LLM-driven constraint modelling. With this dataset, and given the variety of modelling frameworks, we compare and evaluate the modelling capabilities of LLMs for three distinct constraint modelling systems, which vary in abstraction level and underlying syntax. Notably, the results show higher performance when modelling with a high-level Python-based framework. Additionally, we systematically evaluate the use of prompt-based and inference-time compute methods across different LLMs, which further increase accuracy, reaching up to 91% on this highly challenging benchmark. DCP-Bench-Open is publicly available.
Read more →

Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?

arXiv:2506.10912v3 Announce Type: replace Abstract: Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair, generating structurally valid molecular alternatives with reduced toxicity, has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 660 representative toxic molecules spanning diverse mechanisms and granularities. We design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert toxicological knowledge. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput evaluation chain for repair success. We systematically assess 43 mainstream general-purpose MLLMs and conduct multiple ablation studies to analyze key issues, including evaluation metrics, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware editing.
Read more →

Beyond Syntax: Action Semantics Learning for App Agents

arXiv:2506.17697v2 Announce Type: replace Abstract: The recent development of Large Language Models (LLMs) enables the rise of App agents that interpret user intent and operate smartphone Apps through actions such as clicking and scrolling. While prompt-based solutions with proprietary LLM APIs show promising ability, they incur heavy compute costs and external API dependency. Fine-tuning smaller open-source LLMs solves these limitations. However, current supervised fine-tuning methods use a syntax learning paradigm that forces agents to reproduce exactly the ground truth action strings, leading to out-of-distribution (OOD) vulnerability. To fill this gap, we propose Action Semantics Learning (ASL), a novel learning framework, where the learning objective is capturing the semantics of the ground truth actions. Specifically, inspired by the programming language theory, we define the action semantics for App agents as the state transition induced by the action in the user interface. Building on this insight, ASL employs a novel SEmantic Estimator (SEE) to compute a semantic similarity to train the App agents in generating actions aligned with the semantics of ground truth actions, even when their syntactic forms differ. SEE is a flexible module that can be applied in both supervised and reinforcement fine-tuning paradigms. To support the effectiveness of ASL, we theoretically demonstrate the superior robustness of ASL for the OOD problem compared with the existing syntax learning paradigm. Extensive experiments across multiple offline and online benchmarks demonstrate that ASL significantly improves the accuracy and generalisation of App agents compared to existing methods.
Read more →

Mind the Gap: The Divergence Between Human and LLM-Generated Tasks

arXiv:2508.00282v3 Announce Type: replace Abstract: Humans constantly generate a diverse range of tasks guided by internal motivations. While generative agents powered by large language models (LLMs) aim to simulate this complex behavior, it remains uncertain whether they operate on similar cognitive principles. To address this, we conducted a task-generation experiment comparing human responses with those of an LLM agent (GPT-4o). We find that human task generation is consistently influenced by psychological drivers, including personal values (e.g., Openness to Change) and cognitive style. Even when these psychological drivers are explicitly provided to the LLM, it fails to reflect the corresponding behavioral patterns, producing tasks that are markedly less social, less physical, and thematically biased toward abstraction. Interestingly, although the LLM's tasks were perceived as more fun and novel, this highlights a disconnect between its linguistic proficiency and its capacity to generate human-like, embodied goals. We conclude that there is a core gap between the value-driven, embodied nature of human cognition and the statistical patterns of LLMs, highlighting the necessity of incorporating intrinsic motivation and physical grounding into the design of more human-aligned agents.
Read more →

A Message Passing Realization of Expected Free Energy Minimization

arXiv:2508.02197v2 Announce Type: replace Abstract: We present a message passing approach to Expected Free Energy (EFE) minimization on factor graphs, based on the theory introduced in arXiv:2504.14898. By reformulating EFE minimization as Variational Free Energy minimization with epistemic priors, we transform a combinatorial search problem into a tractable inference problem solvable through standard variational techniques. Applying our message passing method to factorized state-space models enables efficient policy inference. We evaluate our method on environments with epistemic uncertainty: a stochastic gridworld and a partially observable Minigrid task. Agents using our approach consistently outperform conventional KL-control agents on these tasks, showing more robust planning and efficient exploration under uncertainty. In the stochastic gridworld environment, EFE-minimizing agents avoid risky paths, while in the partially observable minigrid setting, they conduct more systematic information-seeking. This approach bridges active inference theory with practical implementations, providing empirical evidence for the efficiency of epistemic priors in artificial agents.
Read more →

Robust Deep Monte Carlo Counterfactual Regret Minimization: Addressing Theoretical Risks in Neural Fictitious Self-Play

arXiv:2509.00923v2 Announce Type: replace Abstract: Monte Carlo Counterfactual Regret Minimization (MCCFR) has emerged as a cornerstone algorithm for solving extensive-form games, but its integration with deep neural networks introduces scale-dependent challenges that manifest differently across game complexities. This paper presents a comprehensive analysis of how neural MCCFR component effectiveness varies with game scale and proposes an adaptive framework for selective component deployment. We identify that theoretical risks such as nonstationary target distribution shifts, action support collapse, variance explosion, and warm-starting bias have scale-dependent manifestation patterns, requiring different mitigation strategies for small versus large games. Our proposed Robust Deep MCCFR framework incorporates target networks with delayed updates, uniform exploration mixing, variance-aware training objectives, and comprehensive diagnostic monitoring. Through systematic ablation studies on Kuhn and Leduc Poker, we demonstrate scale-dependent component effectiveness and identify critical component interactions. The best configuration achieves final exploitability of 0.0628 on Kuhn Poker, representing a 60% improvement over the classical framework (0.156). On the more complex Leduc Poker domain, selective component usage achieves exploitability of 0.2386, a 23.5% improvement over the classical framework (0.3703), highlighting the importance of careful component selection over comprehensive mitigation. Our contributions include: (1) a formal theoretical analysis of risks in neural MCCFR, (2) a principled mitigation framework with convergence guarantees, (3) comprehensive multi-scale experimental validation revealing scale-dependent component interactions, and (4) practical guidelines for deployment in larger games.
Read more →

Analysis of approximate linear programming solution to Markov decision problem with log barrier function

arXiv:2509.19800v2 Announce Type: replace Abstract: There are two primary approaches to solving Markov decision problems (MDPs): dynamic programming based on the Bellman equation and linear programming (LP). Dynamic programming methods are the most widely used and form the foundation of both classical and modern reinforcement learning (RL). By contrast, LP-based methods have been less commonly employed, although they have recently gained attention in contexts such as offline RL. The relative underuse of LP-based methods stems from the fact that they lead to an inequality-constrained optimization problem, which is generally more challenging to solve effectively compared with Bellman-equation-based methods. The purpose of this paper is to establish a theoretical foundation for solving LP-based MDPs in a more effective and practical manner. Our key idea is to leverage the log-barrier function, widely used in inequality-constrained optimization, to transform the LP formulation of the MDP into an unconstrained optimization problem. This reformulation enables approximate solutions to be obtained easily via gradient descent. While the method may appear simple, to the best of our knowledge, a thorough theoretical interpretation of this approach has not yet been developed. This paper aims to bridge this gap.
Read more →
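
The reformulation is short enough to sketch directly: the primal LP min_V mu^T V subject to V >= R + gamma P V is replaced by an unconstrained log-barrier objective and minimized by gradient descent; the step size, barrier weight, and the crude feasibility safeguard are illustrative choices, not the paper's analysis.

    import numpy as np

    def lp_value_log_barrier(P, R, gamma, mu, t=100.0, lr=1e-3, iters=20000):
        """Gradient descent on  mu^T V - (1/t) * sum_{s,a} log(slack(s,a))  with
        slack(s,a) = V(s) - R(s,a) - gamma * sum_s' P(s,a,s') V(s'),
        i.e. the log-barrier version of the MDP primal LP.
        P: (S, A, S) transitions, R: (S, A) rewards, mu: (S,) positive weights."""
        S = R.shape[0]
        V = np.full(S, R.max() / (1.0 - gamma) + 1.0)   # strictly feasible start
        for _ in range(iters):
            slack = V[:, None] - R - gamma * (P @ V)    # (S, A), must stay > 0
            inv = 1.0 / slack
            # d slack(s,a)/d V(p) = 1[p = s] - gamma * P(s,a,p)
            grad = mu - (inv.sum(axis=1)
                         - gamma * np.einsum('sa,sap->p', inv, P)) / t
            V_new = V - lr * grad
            if ((V_new[:, None] - R - gamma * (P @ V_new)) > 0).all():
                V = V_new                               # only take feasible steps
        return V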

SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems

arXiv:2509.23130v3 Announce Type: replace Abstract: Formal models are essential to specifying large, complex computer systems and verifying their correctness, but are notoriously expensive to write and maintain. Recent advances in generative AI show promise in generating certain forms of specifications. However, existing work mostly targets small code, not complete systems. It is unclear whether AI can deal with realistic system artifacts, as this requires abstracting their complex behavioral properties into formal models. We present SysMoBench, a benchmark that evaluates AI's ability to formally model large, complex systems. We focus on concurrent and distributed systems, which are keystones of today's critical computing infrastructures, encompassing operating systems and cloud infrastructure. We use TLA+, the de facto specification language for concurrent and distributed systems, though the benchmark can be extended to other specification languages. We address the primary challenge of evaluating AI-generated models by automating metrics like syntactic and runtime correctness, conformance to system code, and invariant correctness. SysMoBench currently includes eleven diverse system artifacts: the Raft implementation of Etcd and Redis, the leader election of ZooKeeper, the Spinlock, Mutex, and Ringbuffer in Asterinas OS, etc., with more being added. SysMoBench enables us to understand the capabilities and limitations of today's LLMs and agents, putting tools in this area on a firm footing and opening up promising new research directions.
Read more →

FourierCSP: Differentiable Constraint Satisfaction Problem Solving by Walsh-Fourier Expansion

arXiv:2510.04480v2 Announce Type: replace Abstract: The Constraint-satisfaction problem (CSP) is fundamental in mathematics, physics, and theoretical computer science. Continuous local search (CLS) solvers, as recent advancements, can achieve highly competitive results on certain classes of Boolean satisfiability (SAT) problems. Motivated by these advances, we extend the CLS framework from Boolean SAT to general CSP with finite-domain variables and expressive constraint formulations. We present FourierCSP, a continuous optimization framework that generalizes the Walsh-Fourier transform to CSP, allowing for transforming versatile constraints to compact multilinear polynomials, thereby avoiding the need for auxiliary variables and memory-intensive encodings. We employ projected subgradient and mirror descent algorithms with provable convergence guarantees, and further combine them to accelerate gradient-based optimization. Empirical results on benchmark suites demonstrate that FourierCSP is scalable and competitive, significantly broadening the class of problems that can be efficiently solved by differentiable CLS techniques and paving the way toward end-to-end neurosymbolic integration.
Read more →

MetaVLA: Unified Meta Co-training For Efficient Embodied Adaption

arXiv:2510.05580v3 Announce Type: replace Abstract: Vision-Language-Action (VLA) models show promise in embodied reasoning, yet remain far from true generalists: they often require task-specific fine-tuning, incur high compute costs, and generalize poorly to unseen tasks. We propose MetaVLA, a unified, backbone-agnostic post-training framework for efficient and scalable alignment. MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse target tasks into a single fine-tuning stage while leveraging structurally diverse auxiliary tasks to improve in-domain generalization. Unlike naive multi-task SFT, MetaVLA integrates a lightweight meta-learning mechanism, derived from Attentive Neural Processes, to enable rapid adaptation from diverse contexts with minimal architectural change or inference overhead. On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by ~76%. These results show that scalable, low-resource post-training is achievable, paving the way toward general-purpose embodied agents. Code will be available.
Read more →

Cognition Envelopes for Bounded AI Reasoning in Autonomous UAS Operations

arXiv:2510.26905v2 Announce Type: replace Abstract: Cyber-physical systems increasingly rely on Foundational Models such as Large Language Models (LLMs) and Vision-Language Models (VLMs) to increase autonomy through enhanced perception, inference, and planning. However, these models also introduce new types of errors, such as hallucinations, overgeneralizations, and context misalignments, resulting in incorrect and flawed decisions. To address this, we introduce the concept of Cognition Envelopes, designed to establish reasoning boundaries that constrain AI-generated decisions while complementing the use of meta-cognition and traditional safety envelopes. As with safety envelopes, Cognition Envelopes require practical guidelines and systematic processes for their definition, validation, and assurance.
Read more →

Neural Value Iteration

arXiv:2511.08825v2 Announce Type: replace Abstract: The value function of a POMDP exhibits the piecewise-linear-convex (PWLC) property and can be represented as a finite set of hyperplanes, known as $\alpha$-vectors. Most state-of-the-art POMDP solvers (offline planners) follow the point-based value iteration scheme, which performs Bellman backups on $\alpha$-vectors at reachable belief points until convergence. However, since each $\alpha$-vector is $|S|$-dimensional, these methods quickly become intractable for large-scale problems due to the prohibitive computational cost of Bellman backups. In this work, we demonstrate that the PWLC property allows a POMDP's value function to be alternatively represented as a finite set of neural networks. This insight enables a novel POMDP planning algorithm called \emph{Neural Value Iteration}, which combines the generalization capability of neural networks with the classical value iteration framework. Our approach achieves near-optimal solutions even in extremely large POMDPs that are intractable for existing offline solvers.
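
A minimal sketch of the PWLC representation the abstract refers to, with made-up numbers: the value of a belief is the maximum over α-vector dot products, and the paper's idea is to replace each α-vector with a small network over the belief (linear lambdas stand in for those networks here).

```python
import numpy as np

# Illustrative PWLC sketch (made-up numbers, not the paper's code): the value of a
# belief b over two hidden states is the max over alpha-vector dot products.
alphas = np.array([[1.0, 0.0],    # e.g. a plan that pays off if the state is 0
                   [0.0, 1.0],    # a plan that pays off if the state is 1
                   [0.6, 0.6]])   # an information-gathering plan

def value_pwlc(belief):
    return np.max(alphas @ belief)

# Neural Value Iteration's core idea: keep the "max over a finite set" structure but
# let each member be a small network over the belief (linear lambdas stand in here).
def value_neural(belief, nets):
    return max(net(belief) for net in nets)

nets = [lambda b: float(b @ np.array([1.0, 0.0])),
        lambda b: float(b @ np.array([0.0, 1.0]))]

b = np.array([0.3, 0.7])
print(value_pwlc(b), value_neural(b, nets))
```
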
Read more →

AI Annotation Orchestration: Evaluating LLM verifiers to Improve the Quality of LLM Annotations in Learning Analytics

arXiv:2511.09785v2 Announce Type: replace Abstract: Large Language Models (LLMs) are increasingly used to annotate learning interactions, yet concerns about reliability limit their utility. We test whether verification-oriented orchestration, in which models are prompted to check their own labels (self-verification) or audit one another (cross-verification), improves qualitative coding of tutoring discourse. Using transcripts from 30 one-to-one math sessions, we compare three production LLMs (GPT, Claude, Gemini) under three conditions: unverified annotation, self-verification, and cross-verification across all orchestration configurations. Outputs are benchmarked against a blinded, disagreement-focused human adjudication using Cohen's kappa. Overall, orchestration yields a 58 percent improvement in kappa. Self-verification nearly doubles agreement relative to unverified baselines, with the largest gains for challenging tutor moves. Cross-verification achieves a 37 percent improvement on average, with pair- and construct-dependent effects: some verifier-annotator pairs exceed self-verification, while others reduce alignment, reflecting differences in verifier strictness. We contribute: (1) a flexible orchestration framework instantiating control, self-, and cross-verification; (2) an empirical comparison across frontier LLMs on authentic tutoring data with blinded human "gold" labels; and (3) a concise notation, verifier(annotator) (e.g., Gemini(GPT) or Claude(Claude)), to standardize reporting and make directional effects explicit for replication. Results position verification as a principled design lever for reliable, scalable LLM-assisted annotation in Learning Analytics.
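
A small sketch of the evaluation and orchestration setup: the Cohen's kappa computation is exact as shown, while the annotate()/verify() calls and the tutor-move labels are hypothetical stand-ins rather than the paper's code or coding scheme.

```python
from sklearn.metrics import cohen_kappa_score

# Agreement with the blinded human adjudication is reported as Cohen's kappa
# (tutor-move labels below are illustrative, not the paper's coding scheme).
human_gold = ["probe", "hint", "praise", "hint", "probe"]
llm_labels = ["probe", "hint", "hint", "hint", "probe"]
print(cohen_kappa_score(human_gold, llm_labels))

# Hypothetical orchestration skeleton: annotate() and verify() stand in for LLM API
# calls. In the paper's verifier(annotator) notation, Gemini(GPT) means GPT proposes
# labels and Gemini audits them; Claude(Claude) denotes self-verification.
def orchestrate(annotator, verifier, annotate, verify, utterances):
    labels = []
    for u in utterances:
        proposed = annotate(annotator, u)
        labels.append(verify(verifier, u, proposed))
    return labels
```
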
Read more →

Quantifying Fidelity: A Decisive Feature Approach to Comparing Synthetic and Real Imagery

arXiv:2512.16468v3 Announce Type: replace Abstract: Virtual testing using synthetic data has become a cornerstone of autonomous vehicle (AV) safety assurance. Despite progress in improving visual realism through advanced simulators and generative AI, recent studies reveal that pixel-level fidelity alone does not ensure reliable transfer from simulation to the real world. What truly matters is whether the system-under-test (SUT) bases its decisions on consistent decision evidence in both real and simulated environments, not just whether images "look real" to humans. To this end, this paper proposes a behavior-grounded fidelity measure by introducing Decisive Feature Fidelity (DFF), a new SUT-specific metric that extends the existing fidelity spectrum to capture mechanism parity, that is, agreement in the model-specific decisive evidence that drives the SUT's decisions across domains. DFF leverages explainable-AI methods to identify and compare the decisive features driving the SUT's outputs for matched real-synthetic pairs. We further propose estimators based on counterfactual explanations, along with a DFF-guided calibration scheme to enhance simulator fidelity. Experiments on 2126 matched KITTI-VirtualKITTI2 pairs demonstrate that DFF reveals discrepancies overlooked by conventional output-value fidelity. Furthermore, results show that DFF-guided calibration improves decisive-feature and input-level fidelity without sacrificing output value fidelity across diverse SUTs.
Read more →

Gabliteration: Adaptive Multi-Directional Neural Weight Modification for Selective Behavioral Alteration in Large Language Models

arXiv:2512.18901v3 Announce Type: replace Abstract: We present Gabliteration, a novel neural weight modification technique that advances beyond traditional abliteration methods by implementing adaptive multi-directional projections with regularized layer selection. Our approach addresses the fundamental limitation of existing methods that compromise model quality while attempting to modify specific behavioral patterns. Through dynamic layer optimization, regularized projection matrices, and adaptive scaling mechanisms, we achieve theoretically superior weight modification while minimizing quality degradation in unrelated domains. We validate our method through the gabliterated-v1 model series (0.6B to 4B parameters) available on Hugging Face, demonstrating practical applicability across multiple model scales.
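
For context, here is a minimal sketch of classic single-direction abliteration, the baseline technique this work generalizes; the adaptive multi-directional, regularized, layer-selective projections of Gabliteration are not reproduced here.

```python
import torch

# Minimal sketch of classic single-direction abliteration, the baseline this paper
# generalizes; Gabliteration's adaptive multi-directional, regularized projections
# are not reproduced here.
def abliterate(weight: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    v = v / v.norm()
    # Remove the rank-1 component of W that writes along the behavioral direction v.
    return weight - torch.outer(v, v) @ weight

W = torch.randn(8, 8)
v = torch.randn(8)
W_edited = abliterate(W, v)
# Outputs of the edited matrix have no component along v:
print(torch.allclose((v / v.norm()) @ W_edited, torch.zeros(8), atol=1e-5))
```
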
Read more →

CASCADE: Cumulative Agentic Skill Creation through Autonomous Development and Evolution

arXiv:2512.23880v2 Announce Type: replace Abstract: Large language model (LLM) agents currently depend on predefined tools or early-stage tool generation, limiting their adaptability and scalability to complex scientific tasks. We introduce CASCADE, a self-evolving agentic framework representing an early instantiation of the transition from "LLM + tool use" to "LLM + skill acquisition". CASCADE enables agents to master complex external tools and codify knowledge through two meta-skills: continuous learning via web search, code extraction, and memory utilization; self-reflection via introspection, knowledge graph exploration, and others. We evaluate CASCADE on SciSkillBench, a benchmark of 116 materials science and chemistry research tasks. CASCADE achieves a 93.3% success rate using GPT-5, compared to 35.4% without evolution mechanisms. We further demonstrate real-world applications in computational analysis, autonomous laboratory experiments, and selective reproduction of published papers. Along with human-agent collaboration and memory consolidation, CASCADE accumulates executable skills that can be shared across agents and scientists, moving toward scalable AI-assisted scientific research.
Read more →

Recursive Language Models

arXiv:2512.24601v2 Announce Type: replace Abstract: We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference paradigm that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs can successfully process inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of vanilla frontier LLMs and common long-context scaffolds across four diverse long-context tasks while having comparable cost. At a small scale, we post-train the first natively recursive language model. Our model, RLM-Qwen3-8B, outperforms the underlying Qwen3-8B model by $28.3\%$ on average and even approaches the quality of vanilla GPT-5 on three long-context tasks. Code is available at https://github.com/alexzhang13/rlm.
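
A conceptual sketch of the recursive-call pattern described in the abstract; llm() is a hypothetical stand-in for an API call, and the actual RLM additionally lets the model write code to programmatically examine and decompose the prompt rather than always splitting it in half.

```python
# Conceptual sketch; llm() is a hypothetical stand-in for a language-model API call.
def llm(prompt: str) -> str:
    ...  # call a language model

def rlm(query: str, document: str, window: int = 8000) -> str:
    if len(document) <= window:
        return llm(f"{document}\n\nQuestion: {query}")
    # Treat the oversized prompt as external data: split, recurse, then combine.
    mid = len(document) // 2
    left = rlm(query, document[:mid], window)
    right = rlm(query, document[mid:], window)
    return llm(f"Partial answers:\n1. {left}\n2. {right}\n\nCombine them to answer: {query}")
```
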
Read more →

SimpleMem: Efficient Lifelong Memory for LLM Agents

arXiv:2601.02553v2 Announce Type: replace Abstract: To support long-term interaction in complex environments, LLM agents require memory systems that manage historical experiences. Existing approaches either retain full interaction histories via passive context extension, leading to substantial redundancy, or rely on iterative reasoning to filter noise, incurring high token costs. To address this challenge, we introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization: (1) Semantic Structured Compression, which distills unstructured interactions into compact, multi-view indexed memory units; (2) Online Semantic Synthesis, an intra-session process that instantly integrates related context into unified abstract representations to eliminate redundancy; and (3) Intent-Aware Retrieval Planning, which infers search intent to dynamically determine retrieval scope and construct precise context efficiently. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost, achieving an average F1 improvement of 26.4% while reducing inference-time token consumption by up to 30-fold, demonstrating a superior balance between performance and efficiency. Code is available at https://github.com/aiming-lab/SimpleMem.
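
A rough sketch of what a compressed, multi-view-indexed memory with intent-aware retrieval might look like; all field names and merge rules here are illustrative assumptions, not SimpleMem's actual schema or pipeline.

```python
from dataclasses import dataclass, field

# Hypothetical shapes only; field names and merge rules are illustrative assumptions.
@dataclass
class MemoryUnit:
    summary: str                                # compact distillation of an interaction span
    keys: list = field(default_factory=list)    # multi-view index terms
    session_id: str = ""

class Memory:
    def __init__(self):
        self.units = []

    def write(self, unit: MemoryUnit):
        # Stages 1-2: store compressed units and fold overlapping ones from the same
        # session into a single abstract representation (crude stand-in for synthesis).
        for u in self.units:
            if u.session_id == unit.session_id and set(u.keys) & set(unit.keys):
                u.summary += " " + unit.summary
                u.keys = sorted(set(u.keys) | set(unit.keys))
                return
        self.units.append(unit)

    def retrieve(self, intent_terms, k=3):
        # Stage 3: score stored units against the inferred search intent.
        ranked = sorted(self.units,
                        key=lambda u: len(set(u.keys) & set(intent_terms)),
                        reverse=True)
        return ranked[:k]
```
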
Read more →

RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation

arXiv:2601.08430v2 Announce Type: replace Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has driven substantial progress in reasoning-intensive domains like mathematics. However, optimizing open-ended generation remains challenging due to the lack of ground truth. While rubric-based evaluation offers a structured proxy for verification, existing methods suffer from scalability bottlenecks and coarse criteria, resulting in a supervision ceiling effect. To address this, we propose an automated Coarse-to-Fine Rubric Generation framework. By synergizing principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces comprehensive and highly discriminative criteria capable of capturing subtle nuances. Based on this framework, we introduce RubricHub, a large-scale ($\sim$110k) and multi-domain dataset. We validate its utility through a two-stage post-training pipeline comprising Rubric-based Rejection Sampling Fine-Tuning (RuFT) and Reinforcement Learning (RuRL). Experimental results demonstrate that RubricHub unlocks significant performance gains: our post-trained Qwen3-14B achieves state-of-the-art (SOTA) results on HealthBench (69.3), surpassing proprietary frontier models such as GPT-5. Our code is available at https://github.com/teqkilla/RubricHub.
Read more →

Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

arXiv:2601.10402v3 Announce Type: replace Abstract: The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanning days or weeks. While Large Language Models (LLMs) have demonstrated prowess in short-horizon reasoning, they are easily overwhelmed by execution details in the high-dimensional, delayed-feedback environments of real-world research, failing to consolidate sparse feedback into coherent long-term guidance. Here, we present ML-Master 2.0, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE) which is a representative microcosm of scientific discovery. By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC), a multi-tiered architecture inspired by computer systems that enables the structural differentiation of experience over time. By dynamically distilling transient execution traces into stable knowledge and cross-task wisdom, HCC allows agents to decouple immediate execution from long-term experimental strategy, effectively overcoming the scaling limits of static context windows. In evaluations on OpenAI's MLE-Bench under 24-hour budgets, ML-Master 2.0 achieves a state-of-the-art medal rate of 56.44%. Our findings demonstrate that ultra-long-horizon autonomy provides a scalable blueprint for AI capable of autonomous exploration beyond human-precedent complexities.
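
An illustrative sketch of the tiered-cache idea behind Hierarchical Cognitive Caching; the tier names and promotion rules below are assumptions for exposition, not the paper's implementation.

```python
from collections import deque

# Illustrative tiering only; tier names and promotion rules are assumptions.
class HierarchicalCognitiveCache:
    def __init__(self, trace_limit=50):
        self.traces = deque(maxlen=trace_limit)   # transient execution traces
        self.knowledge = []                       # distilled, task-level lessons
        self.wisdom = []                          # cross-task strategy notes

    def log_trace(self, trace):
        self.traces.append(trace)

    def consolidate(self, summarize):
        # Periodically distill raw traces into stable knowledge; summarize() would be
        # an LLM call in practice.
        if self.traces:
            self.knowledge.append(summarize(list(self.traces)))
            self.traces.clear()

    def planning_context(self, budget_items=5):
        # Strategy-level planning reads mostly from the stable tiers, decoupling it
        # from the flood of execution details.
        return (self.wisdom + self.knowledge)[-budget_items:]
```
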
Read more →

Actionable Interpretability Must Be Defined in Terms of Symmetries

arXiv:2601.12913v2 Announce Type: replace Abstract: This paper argues that interpretability research in Artificial Intelligence (AI) is fundamentally ill-posed as existing definitions of interpretability fail to describe how interpretability can be formally tested or designed for. We posit that actionable definitions of interpretability must be formulated in terms of *symmetries* that inform model design and lead to testable conditions. Under a probabilistic view, we hypothesise that four symmetries (inference equivariance, information invariance, concept-closure invariance, and structural invariance) suffice to (i) formalise interpretable models as a subclass of probabilistic models, (ii) yield a unified formulation of interpretable inference (e.g., alignment, interventions, and counterfactuals) as a form of Bayesian inversion, and (iii) provide a formal framework to verify compliance with safety standards and regulations.
Read more →

Epistemic Constitutionalism Or: how to avoid coherence bias

arXiv:2601.14295v2 Announce Type: replace Abstract: Large language models increasingly function as artificial reasoners: they evaluate arguments, assign credibility, and express confidence. Yet their belief-forming behavior is governed by implicit, uninspected epistemic policies. This paper argues for an epistemic constitution for AI: explicit, contestable meta-norms that regulate how systems form and express beliefs. Source attribution bias provides the motivating case: I show that frontier models enforce identity-stance coherence, penalizing arguments attributed to sources whose expected ideological position conflicts with the argument's content. When models detect systematic testing, these effects collapse, revealing that systems treat source-sensitivity as bias to suppress rather than as a capacity to execute well. I distinguish two constitutional approaches: the Platonic, which mandates formal correctness and default source-independence from a privileged standpoint, and the Liberal, which refuses such privilege, specifying procedural norms that protect conditions for collective inquiry while allowing principled source-attending grounded in epistemic vigilance. I argue for the Liberal approach, sketch a constitutional core of eight principles and four orientations, and propose that AI epistemic governance requires the same explicit, contestable structure we now expect for AI ethics.
Read more →

AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

arXiv:2601.18631v2 Announce Type: replace Abstract: When humans face problems beyond their immediate capabilities, they rely on tools, providing a promising paradigm for improving visual reasoning in multimodal large language models (MLLMs). Effective reasoning, therefore, hinges on knowing which tools to use, when to invoke them, and how to compose them over multiple steps, even when faced with new tools or new tasks. We introduce \textbf{AdaReasoner}, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior. AdaReasoner is enabled by (i) a scalable data curation pipeline exposing models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that optimizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage. Together, these components allow models to infer tool utility from task context and intermediate outcomes, enabling coordination of multiple tools and generalization to unseen tools. Empirically, AdaReasoner exhibits strong tool-adaptive and generalization behaviors: it autonomously adopts beneficial tools, suppresses irrelevant ones, and adjusts tool usage frequency based on task demands, despite never being explicitly trained to do so. These capabilities translate into state-of-the-art performance across challenging benchmarks, improving the 7B base model by +24.9\% on average and surpassing strong proprietary systems such as GPT-5 on multiple tasks, including VSP and Jigsaw.
Read more →

Neural Theorem Proving for Verification Conditions: A Real-World Benchmark

arXiv:2601.18944v2 Announce Type: replace Abstract: Theorem proving is fundamental to program verification, where the automated proof of Verification Conditions (VCs) remains a primary bottleneck. Real-world program verification frequently encounters hard VCs that existing Automated Theorem Provers (ATPs) cannot prove, leading to a critical need for extensive manual proofs that burden practical application. While Neural Theorem Proving (NTP) has achieved significant success in mathematical competitions, demonstrating the potential of machine learning approaches to formal reasoning, its application to program verification, particularly VC proving, remains largely unexplored. Despite existing work on annotation synthesis and verification-related theorem proving, no benchmark has specifically targeted this fundamental bottleneck: automated VC proving. This work introduces Neural Theorem Proving for Verification Conditions (NTP4VC), presenting the first real-world multi-language benchmark for this task. Drawing on real-world projects such as the Linux and Contiki-OS kernels, our benchmark leverages industrial pipelines (Why3 and Frama-C) to generate semantically equivalent test cases across the formal languages of Isabelle, Lean, and Rocq. We evaluate large language models (LLMs), both general-purpose and those fine-tuned for theorem proving, on NTP4VC. Results indicate that although LLMs show promise in VC proving, significant challenges remain for program verification, highlighting a large gap and opportunity for future research.
Read more →

Membership Privacy Risks of Sharpness Aware Minimization

arXiv:2310.00488v4 Announce Type: replace-cross Abstract: Optimization algorithms that seek flatter minima, such as Sharpness-Aware Minimization (SAM), are credited with improved generalization and robustness to noise. We ask whether such gains impact membership privacy. Surprisingly, we find that SAM is more prone to Membership Inference Attacks (MIA) than classical SGD across multiple datasets and attack methods, despite achieving lower test error. This suggests that the geometric mechanism of SAM that improves generalization simultaneously exacerbates membership leakage. We investigate this phenomenon through extensive analysis of memorization and influence scores. Our results reveal that SAM is more capable of capturing atypical subpatterns, leading to higher memorization scores of samples. Conversely, SGD depends more heavily on majority features, exhibiting worse generalization on atypical subgroups and lower memorization. Crucially, this characteristic of SAM can be linked to lower variance in the prediction confidence of unseen samples, thereby amplifying membership signals. Finally, we model SAM under a perfectly interpolating linear regime and theoretically show that sharpness regularization inherently reduces variance, guaranteeing a higher MIA advantage for confidence and likelihood ratio attacks.
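
For reference, the standard SAM update the paper analyzes, sketched as a single PyTorch training step; this is a simplified sketch, and per-parameter details of the official implementations may differ.

```python
import torch

# Standard SAM update (the optimizer under study), sketched as one training step.
def sam_step(model, loss_fn, batch, base_optimizer, rho=0.05):
    x, y = batch
    base_optimizer.zero_grad()
    # 1) Ascent: perturb weights toward higher loss within an L2 ball of radius rho.
    loss_fn(model(x), y).backward()
    grads = [p.grad.detach().clone() if p.grad is not None else torch.zeros_like(p)
             for p in model.parameters()]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    eps = [rho * g / (grad_norm + 1e-12) for g in grads]
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.add_(e)
    # 2) Descent: take the gradient at the perturbed point, undo the perturbation,
    #    then let the base optimizer (e.g. SGD) apply the update.
    base_optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)
    base_optimizer.step()
```
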
Read more →

UDEEP: Edge-based Computer Vision for In-Situ Underwater Crayfish and Plastic Detection

arXiv:2401.06157v2 Announce Type: replace-cross Abstract: Invasive signal crayfish have a detrimental impact on ecosystems. They spread the fungal-type crayfish plague disease (Aphanomyces astaci) that is lethal to the native white-clawed crayfish, the only native crayfish species in Britain. Invasive signal crayfish extensively burrow, causing habitat destruction, erosion of river banks and adverse changes in water quality, while also competing with native species for resources, leading to declines in native populations. Moreover, pollution exacerbates the vulnerability of white-clawed crayfish, with their populations declining by over 90%. To safeguard aquatic ecosystems, it is imperative to address the challenges posed by invasive species and pollution in aquatic ecosystems. This article introduces the Cognitive Edge Device (CED) computing platform for the detection of crayfish and plastic. It also presents two publicly available underwater datasets, annotated with sequences of crayfish and aquatic plastic debris. Four You Only Look Once (YOLO) variants were trained and evaluated for crayfish and plastic object detection. YOLOv5s achieved the highest detection accuracy, with an mAP@0.5 of 0.90, and achieved the best precision
Read more →

LLM Multi-Agent Systems: Challenges and Open Problems

arXiv:2402.03578v3 Announce Type: replace-cross Abstract: This paper explores multi-agent systems and identifies challenges that remain inadequately addressed. By leveraging the diverse capabilities and roles of individual agents, multi-agent systems can tackle complex tasks through agent collaboration. We discuss optimizing task allocation, fostering robust reasoning through iterative debates, managing complex and layered context information, and enhancing memory management to support the intricate interactions within multi-agent systems. We also explore potential applications of multi-agent systems in blockchain systems to shed light on their future development and application in real-world distributed systems.
Read more →

GenCode: A Generic Data Augmentation Framework for Boosting Deep Learning-Based Code Understanding

arXiv:2402.15769v3 Announce Type: replace-cross Abstract: Pre-trained code models lead the era of code intelligence, with multiple models designed with impressive performance. However, one important problem, data augmentation for code data, which automatically helps developers prepare training data, remains understudied in this field. In this paper, we introduce a generic data augmentation framework, GenCode, to enhance the training of code understanding models. Put simply, GenCode follows a generation-and-selection paradigm to prepare useful training code data. Specifically, it employs code augmentation techniques to generate new code candidates first and then identifies important ones as the training data by influence scores. To evaluate the effectiveness of GenCode, we conduct experiments on four code understanding tasks (e.g., code clone detection), three pre-trained code models (e.g., CodeT5), and two recently released code-specific Large Language Models (LLMs) (e.g., Qwen2.5-Coder). Compared to the state-of-the-art (SOTA) code augmentation method MixCode, GenCode produces pre-trained code models with 2.92% higher accuracy and 4.90% adversarial robustness on average. For code-specific LLMs, GenCode achieves an average improvement of 0.93% in accuracy and 0.98% in natural robustness.
Read more →

An Actionable Framework for Assessing Bias and Fairness in Large Language Model Use Cases

arXiv:2407.10853v4 Announce Type: replace-cross Abstract: Bias and fairness risks in Large Language Models (LLMs) vary substantially across deployment contexts, yet existing approaches lack systematic guidance for selecting appropriate evaluation metrics. We present a decision framework that maps LLM use cases, characterized by a model and population of prompts, to relevant bias and fairness metrics based on task type, whether prompts contain protected attribute mentions, and stakeholder priorities. Our framework addresses toxicity, stereotyping, counterfactual unfairness, and allocational harms, and introduces novel metrics based on stereotype classifiers and counterfactual adaptations of text similarity measures. All metrics require only LLM outputs for computation, simplifying implementation while avoiding embedding-based approaches that often correlate poorly with downstream harms. We provide an open-source Python library, LangFair, for practical adoption. Extensive experiments demonstrate that fairness risks cannot be reliably assessed from benchmark performance alone: results on one prompt dataset likely overstate or understate risks for another, underscoring that fairness evaluation must be grounded in the specific deployment context.
Read more →

LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP

arXiv:2408.04628v2 Announce Type: replace-cross Abstract: Standard natural language processing (NLP) pipelines operate on symbolic representations of language, which typically consist of sequences of discrete tokens. However, creating an analogous representation for ancient logographic writing systems is an extremely labor intensive process that requires expert knowledge. At present, a large portion of logographic data persists in a purely visual form due to the absence of transcription -- this issue poses a bottleneck for researchers seeking to apply NLP toolkits to study ancient logographic languages: most of the relevant data are images of writing. This paper investigates whether direct processing of visual representations of language offers a potential solution. We introduce LogogramNLP, the first benchmark enabling NLP analysis of ancient logographic languages, featuring both transcribed and visual datasets for four writing systems along with annotations for tasks like classification, translation, and parsing. Our experiments compare systems that employ recent visual and text encoding strategies as backbones. The results demonstrate that visual representations outperform textual representations for some investigated tasks, suggesting that visual processing pipelines may unlock a large amount of cultural heritage data of logographic languages for NLP-based analyses.
Read more →

Helping Johnny Make Sense of Privacy Policies with LLMs

arXiv:2501.16033v2 Announce Type: replace-cross Abstract: Understanding and engaging with privacy policies is crucial for online privacy, yet these documents remain notoriously complex and difficult to navigate. We present PRISMe, an interactive browser extension that combines LLM-based policy assessment with a dashboard and customizable chat interface, enabling users to skim quick overviews or explore policy details in depth while browsing. We conduct a user study (N=22) with participants of diverse privacy knowledge to investigate how users interpret the tool's explanations and how it shapes their engagement with privacy policies, identifying distinct interaction patterns. Participants valued the clear overviews and conversational depth, but flagged some issues, particularly adversarial robustness and hallucination risks. Thus, we investigate how a retrieval-augmented generation (RAG) approach can alleviate issues by re-running the chat queries from the study. Our findings surface design challenges as well as technical trade-offs, contributing actionable insights for developing future user-centered, trustworthy privacy policy analysis tools.
Read more →

Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning

arXiv:2501.19180v2 Announce Type: replace-cross Abstract: Large language models (LLMs) are vital for a wide range of applications yet remain susceptible to jailbreak threats, which could lead to the generation of inappropriate responses. Conventional defenses, such as refusal and adversarial training, often fail to cover corner cases or rare domains, leaving LLMs still vulnerable to more sophisticated attacks. We propose a novel defense strategy, Safety Chain-of-Thought (SCoT), which harnesses the enhanced \textit{reasoning capabilities} of LLMs for proactive assessment of harmful inputs, rather than simply blocking them. SCoT augments any refusal training datasets to critically analyze the intent behind each request before generating answers. By employing proactive reasoning, SCoT enhances the generalization of LLMs across varied harmful queries and scenarios not covered in the safety alignment corpus. Additionally, it generates detailed refusals specifying the rules violated. Comparative evaluations show that SCoT significantly surpasses existing defenses, reducing vulnerability to out-of-distribution issues and adversarial manipulations while maintaining strong general capabilities.
Read more →

Mitigating Sensitive Information Leakage in LLMs4Code through Machine Unlearning

arXiv:2502.05739v2 Announce Type: replace-cross Abstract: Large Language Models for Code (LLMs4Code) have achieved strong performance in code generation, but recent studies reveal that they may memorize and leak sensitive information contained in training data, posing serious privacy risks. To address this gap, this work presents the first comprehensive empirical study on applying machine unlearning to mitigate sensitive information leakage in LLMs4Code. We first construct a dedicated benchmark that includes: (i) a synthetic forget set containing diverse forms of personal information, and (ii) a retain set designed to evaluate whether code-generation capability is preserved after unlearning. Using this benchmark, we systematically assess three representative unlearning algorithms (GA, GA+GD, GA+KL) across three widely used open-source LLMs4Code models (AIXCoder-7B, CodeLlama-7B, CodeQwen-7B). Experimental results demonstrate that machine unlearning can substantially reduce direct memorization-based leakage: on average, the direct leak rate drops by more than 50% while retaining over 91% of the original code-generation performance. Moreover, by analyzing post-unlearning outputs, we uncover a consistent shift from direct to indirect leakage, revealing an underexplored vulnerability that persists even when the target data has been successfully forgotten. Our findings show that machine unlearning is a feasible and effective solution for enhancing privacy protection in LLMs4Code, while also highlighting the need for future techniques capable of mitigating both direct and indirect leakage simultaneously.
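
A rough sketch of the gradient-ascent unlearning family evaluated in the study (GA with an added retain-set term, i.e. GA+GD); the loop below is illustrative and assumes HuggingFace-style batches, not the authors' training code.

```python
import torch

# Sketch of a GA+GD-style unlearning step (illustrative, not the authors' code).
# Batches are assumed to be HuggingFace-style dicts with input_ids and labels so
# that model(**batch).loss is the causal-LM loss.
def unlearn_step(model, forget_batch, retain_batch, optimizer, alpha=1.0):
    optimizer.zero_grad()
    forget_loss = model(**forget_batch).loss   # loss on sensitive data to be forgotten
    retain_loss = model(**retain_batch).loss   # loss on ordinary code-generation data
    # Ascend on the forget set (negative sign) while descending on the retain set,
    # so memorized secrets are suppressed without erasing coding ability.
    (-forget_loss + alpha * retain_loss).backward()
    optimizer.step()
```
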
Read more →

Randomly Wrong Signals: Bayesian Auction Design with ML Predictions

arXiv:2502.08792v3 Announce Type: replace-cross Abstract: We study auction design when a seller relies on machine-learning predictions of bidders' valuations that may be unreliable. Motivated by modern ML systems that are often accurate but occasionally fail in a way that is essentially uninformative, we model predictions as randomly wrong: with high probability the signal equals the bidder's true value, and otherwise it is a hallucination independent of the value. We analyze revenue-maximizing auctions when the seller publicly reveals these signals. A central difficulty is that the resulting posterior belief combines a continuous distribution with a point mass at the signal, so standard Myerson techniques do not directly apply. We provide a tractable characterization of the optimal signal-revealing auction by providing a closed-form characterization of the appropriate ironed virtual values. This characterization yields simple and intuitive implications. With a single bidder, the optimal mechanism reduces to a posted-price policy with a small number of regimes: the seller ignores low signals, follows intermediate signals, caps moderately high signals, and may again follow very high signals. With multiple bidders, we show that a simple eager second-price auction with signal-dependent reserve prices performs nearly optimally in numerical experiments and substantially outperforms natural benchmarks that either ignore the signal or treat it as fully reliable.
Read more →

MAnchors: Memorization-Based Acceleration of Anchors via Rule Reuse and Transformation

arXiv:2502.11068v2 Announce Type: replace-cross Abstract: Anchors is a popular local model-agnostic explanation technique whose applicability is limited by its computational inefficiency. To address this limitation, we propose a memorization-based framework that accelerates Anchors while preserving explanation fidelity and interpretability. Our approach leverages the iterative nature of Anchors' algorithm which gradually refines an explanation until it is precise enough for a given input by storing and reusing intermediate results obtained during prior explanations. Specifically, we maintain a memory of low-precision, high-coverage rules and introduce a rule transformation framework to adapt them to new inputs: the horizontal transformation adapts a pre-trained explanation to the current input by replacing features, and the vertical transformation refines the general explanation until it is precise enough for the input. We evaluate our method across tabular, text, and image datasets, demonstrating that it significantly reduces explanation generation time while maintaining fidelity and interpretability, thereby enabling the practical adoption of Anchors in time-sensitive applications.
Read more →

BAGEL: Projection-Free Algorithm for Adversarially Constrained Online Convex Optimization

arXiv:2502.16744v2 Announce Type: replace-cross Abstract: Projection-based algorithms for Constrained Online Convex Optimization (COCO) achieve optimal $\mathcal{O}(T^{1/2})$ regret guarantees but face scalability challenges due to the computational complexity of projections. To circumvent this, projection-free methods utilizing Linear Optimization Oracles (LOO) have been proposed, albeit typically achieving slower $\mathcal{O}(T^{3/4})$ regret rates. In this work, we examine whether the $\mathcal{O}(T^{1/2})$ rate can be recovered in the projection-free setting by strengthening the oracle assumption. We introduce BAGEL, an algorithm utilizing a Separation Oracle (SO) that achieves $\mathcal{O}(T^{1/2})$ regret and $\tilde{\mathcal{O}}(T^{1/2})$ cumulative constraint violation (CCV) for convex cost functions. Our analysis shows that by leveraging an infeasible projection via SO, we can match the time-horizon dependence of projection-based methods with $\tilde{\mathcal{O}}(T)$ oracle calls, albeit with an additional dependence on the geometry of the action set. This establishes a specific regime where projection-free methods can attain the same convergence rates as projection-based counterparts.
Read more →

Compositional Reasoning with Transformers, RNNs, and Chain of Thought

arXiv:2503.01544v2 Announce Type: replace-cross Abstract: It is well understood that different neural network architectures are suited to different tasks, but is there always a single best architecture for a given task? We compare the expressive power of transformers, RNNs, and transformers with chain of thought tokens on a simple and natural class of tasks we term Compositional Reasoning Questions (CRQ). This family captures multi-step problems with tree-like compositional structure, such as evaluating Boolean formulas. We prove that under standard hardness assumptions, \emph{none} of these three architectures is capable of solving CRQs unless some hyperparameter (depth, embedding dimension, and number of chain of thought tokens, respectively) grows with the size of the input. We then provide constructions for solving CRQs with each architecture. For transformers, our construction uses depth that is logarithmic in the problem size. For RNNs, logarithmic embedding dimension is necessary and sufficient, so long as the inputs are provided in a certain order. For transformers with chain of thought, our construction uses $n$ CoT tokens for input size $n$. These results show that, while CRQs are inherently hard, there are several different ways for language models to overcome this hardness. Even for a single class of problems, each architecture has strengths and weaknesses, and none is strictly better than the others.
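
A minimal example of the kind of compositional reasoning question studied here, namely evaluating a Boolean formula given as a tree of composed sub-expressions; the encoding below is purely illustrative.

```python
# A minimal CRQ-style task: evaluate a Boolean formula given as a nested tuple,
# i.e. a tree whose result composes the results of its subtrees.
def evaluate(node):
    if isinstance(node, bool):
        return node
    op, *children = node
    values = [evaluate(c) for c in children]
    if op == "AND":
        return all(values)
    if op == "OR":
        return any(values)
    if op == "NOT":
        return not values[0]
    raise ValueError(op)

formula = ("AND", ("OR", True, False), ("NOT", ("AND", True, False)))
print(evaluate(formula))  # True
```
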
Read more →

Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers

arXiv:2503.01805v2 Announce Type: replace-cross Abstract: Transformers have revolutionized the field of machine learning. In particular, they can be used to solve complex algorithmic problems, including graph-based tasks. In such algorithmic tasks, a key question is the minimal size of a transformer that can implement the task. Recent work has begun to explore this problem for graph-based tasks, showing that for sub-linear embedding dimension (i.e., model width) logarithmic depth suffices. However, an open question, which we address here, is what happens if width is allowed to grow linearly, while depth is kept fixed. Here we analyze this setting, and provide the surprising result that with linear width, constant depth suffices for solving a host of graph-based problems. This suggests that a moderate increase in width can allow much shallower models, which are advantageous in terms of inference and train time. For other problems, we show that quadratic width is required. Our results demonstrate the complex and intriguing landscape of transformer implementations of graph-based algorithms. We empirically investigate these trade-offs between the relative powers of depth and width and find tasks where wider models have the same accuracy as deep models, while having much faster train and inference time due to parallelizable hardware.
Read more →

Rewarding Doubt: A Reinforcement Learning Approach to Calibrated Confidence Expression of Large Language Models

arXiv:2503.02623v5 Announce Type: replace-cross Abstract: A safe and trustworthy use of Large Language Models (LLMs) requires an accurate expression of confidence in their answers. We propose a novel Reinforcement Learning approach that allows us to directly fine-tune LLMs to express calibrated confidence estimates alongside their answers to factual questions. Our method optimizes a reward based on the logarithmic scoring rule, explicitly penalizing both over- and under-confidence. This encourages the model to align its confidence estimates with the actual predictive accuracy. The optimal policy under our reward design would result in perfectly calibrated confidence expressions. Unlike prior approaches that decouple confidence estimation from response generation, our method integrates confidence calibration seamlessly into the generative process of the LLM. Empirically, we demonstrate that models trained with our approach exhibit substantially improved calibration and generalize to unseen tasks without further fine-tuning, suggesting the emergence of general confidence awareness.
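
The reward family the method builds on is the logarithmic scoring rule; a tiny sketch showing why it is maximized in expectation only by reporting the true probability of being correct (the exact reward shaping in the paper may differ).

```python
import math

# Logarithmic scoring rule: reward the stated confidence p with log(p) when the
# answer is correct and log(1 - p) when it is wrong.
def log_score_reward(confidence: float, correct: bool, eps: float = 1e-6) -> float:
    p = min(max(confidence, eps), 1 - eps)
    return math.log(p) if correct else math.log(1 - p)

# A model that is right 70% of the time gets the best expected reward by saying 0.7:
expected = lambda c: 0.7 * log_score_reward(c, True) + 0.3 * log_score_reward(c, False)
print(max((round(c, 2) for c in [i / 100 for i in range(1, 100)]), key=expected))  # 0.7
```
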
Read more →

Diffusion Generative Recommendation with Continuous Tokens

arXiv:2504.12007v4 Announce Type: replace-cross Abstract: Recent advances in generative artificial intelligence, particularly large language models (LLMs), have opened new opportunities for enhancing recommender systems (RecSys). Most existing LLM-based RecSys approaches operate in a discrete space, using vector-quantized tokenizers to align with the inherent discrete nature of language models. However, these quantization methods often result in lossy tokenization and suboptimal learning, primarily due to inaccurate gradient propagation caused by the non-differentiable argmin operation in standard vector quantization. Inspired by the emerging trend of embracing continuous tokens in language models, we propose ContRec, a novel framework that seamlessly integrates continuous tokens into LLM-based RecSys. Specifically, ContRec consists of two key modules: a sigma-VAE Tokenizer, which encodes users/items with continuous tokens; and a Dispersive Diffusion module, which captures implicit user preference. The tokenizer is trained with a continuous Variational Auto-Encoder (VAE) objective, where three effective techniques are adopted to avoid representation collapse. By conditioning on the previously generated tokens of the LLM backbone during user modeling, the Dispersive Diffusion module performs a conditional diffusion process with a novel Dispersive Loss, enabling high-quality user preference generation through next-token diffusion. Finally, ContRec leverages both the textual reasoning output from the LLM and the latent representations produced by the diffusion model for Top-K item retrieval, thereby delivering comprehensive recommendation results. Extensive experiments on four datasets demonstrate that ContRec consistently outperforms both traditional and SOTA LLM-based recommender systems. Our results highlight the potential of continuous tokenization and generative modeling for advancing the next generation of recommender systems.
Read more →

Physics-Guided Multimodal Transformers are the Necessary Foundation for the Next Generation of Meteorological Science

arXiv:2504.14174v2 Announce Type: replace-cross Abstract: This position paper argues that the next generation of artificial intelligence in meteorological and climate sciences must transition from fragmented hybrid heuristics toward a unified paradigm of physics-guided multimodal transformers. While purely data-driven models have achieved significant gains in predictive accuracy, they often treat atmospheric processes as mere visual patterns, frequently producing results that lack scientific consistency or violate fundamental physical laws. We contend that current "hybrid" attempts to bridge this gap remain ad-hoc and struggle to scale across the heterogeneous nature of meteorological data ranging from satellite imagery to sparse sensor measurements. We argue that the transformer architecture, through its inherent capacity for cross-modal alignment, provides the only viable foundation for a systematic integration of domain knowledge via physical constraint embedding and physics-informed loss functions. By advocating for this unified architectural shift, we aim to steer the community away from "black-box" curve fitting and toward AI systems that are inherently falsifiable, scientifically grounded, and robust enough to address the existential challenges of extreme weather and climate change.
Read more →

NoWag: A Unified Framework for Shape Preserving Compression of Large Language Models

arXiv:2504.14569v5 Announce Type: replace-cross Abstract: Large language models (LLMs) exhibit remarkable performance across various natural language processing tasks but suffer from immense computational and memory demands, limiting their deployment in resource-constrained environments. To address this challenge, we propose NoWag (Normalized Weight and Activation Guided Compression), a unified framework for one-shot shape preserving compression algorithms. We apply NoWag to compress Llama-2 (7B, 13B, 70B) and Llama-3 (8B, 70B) models using two popular shape-preserving techniques: vector quantization (NoWag-VQ) and unstructured/semi-structured pruning (NoWag-P). Our results show that NoWag-VQ significantly outperforms state-of-the-art one-shot vector quantization methods, while NoWag-P performs competitively against leading pruning techniques. These findings highlight underlying commonalities between these compression paradigms and suggest promising directions for future research. Our code is available at https://github.com/LawrenceRLiu/NoWag
Read more →

Doxing via the Lens: Revealing Location-related Privacy Leakage on Multi-modal Large Reasoning Models

arXiv:2504.19373v4 Announce Type: replace-cross Abstract: Recent advances in multi-modal large reasoning models (MLRMs) have shown significant ability to interpret complex visual content. While these models enable impressive reasoning capabilities, they also introduce novel and underexplored privacy risks. In this paper, we identify a novel category of privacy leakage in MLRMs: Adversaries can infer sensitive geolocation information, such as a user's home address or neighborhood, from user-generated images, including selfies captured in private settings. To formalize and evaluate these risks, we propose a three-level visual privacy risk framework that categorizes image content based on contextual sensitivity and potential for location inference. We further introduce DoxBench, a curated dataset of 500 real-world images reflecting diverse privacy scenarios. Our evaluation across 11 advanced MLRMs and MLLMs demonstrates that these models consistently outperform non-expert humans in geolocation inference and can effectively leak location-related private information. This significantly lowers the barrier for adversaries to obtain users' sensitive geolocation information. We further analyze and identify two primary factors contributing to this vulnerability: (1) MLRMs exhibit strong reasoning capabilities by leveraging visual clues in combination with their internal world knowledge; and (2) MLRMs frequently rely on privacy-related visual clues for inference without any built-in mechanisms to suppress or avoid such usage. To better understand and demonstrate real-world attack feasibility, we propose GeoMiner, a collaborative attack framework that decomposes the prediction process into two stages: clue extraction and reasoning to improve geolocation performance while introducing a novel attack perspective. Our findings highlight the urgent need to reassess inference-time privacy risks in MLRMs to better protect users' sensitive information.
Read more →

Field Matters: A Lightweight LLM-enhanced Method for CTR Prediction

arXiv:2505.14057v2 Announce Type: replace-cross Abstract: Click-through rate (CTR) prediction is a fundamental task in modern recommender systems. In recent years, the integration of large language models (LLMs) has been shown to effectively enhance the performance of traditional CTR methods. However, existing LLM-enhanced methods often require extensive processing of detailed textual descriptions for large-scale instances or user/item entities, leading to substantial computational overhead. To address this challenge, this work introduces LLaCTR, a novel and lightweight LLM-enhanced CTR method that employs a field-level enhancement paradigm. Specifically, LLaCTR first utilizes LLMs to distill crucial and lightweight semantic knowledge from small-scale feature fields through self-supervised field-feature fine-tuning. Subsequently, it leverages this field-level semantic knowledge to enhance both feature representation and feature interactions. In our experiments, we integrate LLaCTR with six representative CTR models across four datasets, demonstrating its superior performance in terms of both effectiveness and efficiency compared to existing LLM-enhanced methods. Our code is available at https://github.com/istarryn/LLaCTR.
Read more →

Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection

arXiv:2505.16512v5 Announce Type: replace-cross Abstract: In recent years, the explosive advancement of deepfake technology has posed a critical and escalating threat to public security: diffusion-based digital human generation. Unlike traditional face manipulation methods, such models can generate highly realistic videos with consistency via multimodal control signals. Their flexibility and covertness pose severe challenges to existing detection strategies. To bridge this gap, we introduce DigiFakeAV, a new large-scale multimodal digital human forgery dataset based on diffusion models. Leveraging five of the latest digital human generation methods and a voice cloning method, we systematically construct a dataset comprising 60,000 videos (8.4 million frames), covering multiple nationalities, skin tones, genders, and real-world scenarios, significantly enhancing data diversity and realism. User studies demonstrate that the misrecognition rate by participants for DigiFakeAV reaches as high as 68%. Moreover, the substantial performance degradation of existing detection models on our dataset further highlights its challenges. To address this problem, we propose DigiShield, an effective detection baseline based on spatiotemporal and cross-modal fusion. By jointly modeling the 3D spatiotemporal features of videos and the semantic-acoustic features of audio, DigiShield achieves state-of-the-art (SOTA) performance on DigiFakeAV and shows strong generalization to other datasets.
Read more →

In-context Language Learning for Endangered Languages in Speech Recognition

arXiv:2505.20445v5 Announce Type: replace-cross Abstract: With approximately 7,000 languages spoken worldwide, current large language models (LLMs) support only a small subset. Prior research indicates LLMs can learn new languages for certain tasks without supervised data. We extend this investigation to speech recognition, investigating whether LLMs can learn unseen, low-resource languages through in-context learning (ICL). With experiments on four diverse endangered languages that LLMs have not been trained on, we find that providing more relevant text samples enhances performance in both language modelling and Automatic Speech Recognition (ASR) tasks. Furthermore, we show that the probability-based approach outperforms the traditional instruction-based approach in language learning. Lastly, we show ICL enables LLMs to achieve ASR performance that is comparable to or even surpasses dedicated language models trained specifically for these languages, while preserving the original capabilities of the LLMs. Our code is publicly available.
Read more →

Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review

arXiv:2505.20503v2 Announce Type: replace-cross Abstract: Rapid advancements in foundation models, including Large Language Models, Vision-Language Models, Multimodal Large Language Models, and Vision-Language-Action Models, have opened new avenues for embodied AI in mobile service robotics. By combining foundation models with the principles of embodied AI, where intelligent systems perceive, reason, and act through physical interaction, mobile service robots can achieve more flexible understanding, adaptive behavior, and robust task execution in dynamic real-world environments. Despite this progress, embodied AI for mobile service robots continues to face fundamental challenges related to the translation of natural language instructions into executable robot actions, multimodal perception in human-centered environments, uncertainty estimation for safe decision-making, and computational constraints for real-time onboard deployment. In this paper, we present the first systematic review focused specifically on the integration of foundation models in mobile service robotics. We analyze how recent advances in foundation models address these core challenges through language-conditioned control, multimodal sensor fusion, uncertainty-aware reasoning, and efficient model scaling. We further examine real-world applications in domestic assistance, healthcare, and service automation, highlighting how foundation models enable context-aware, socially responsive, and generalizable robot behaviors. Beyond technical considerations, we discuss ethical, societal, and human-interaction implications associated with deploying foundation model-enabled service robots in human environments. Finally, we outline future research directions emphasizing reliability and lifelong adaptation, privacy-aware and resource-constrained deployment, and governance and human-in-the-loop frameworks required for safe, scalable, and trustworthy mobile service robotics.
Read more →

Orca: Browsing at Scale Through User-Driven and AI-Facilitated Orchestration Across Malleable Webpages

arXiv:2505.22831v2 Announce Type: replace-cross Abstract: Web-based activities span multiple webpages. However, conventional browsers with stacks of tabs cannot support operating and synthesizing large volumes of information across pages. While recent AI systems enable fully automated web browsing and information synthesis, they often diminish user agency and hinder contextual understanding. We explore how AI could instead augment user interactions with content across webpages and mitigate cognitive and manual efforts. Through literature on information tasks and web browsing challenges, and an iterative design process, we present novel interactions with our prototype web browser, Orca. Leveraging AI, Orca supports user-driven exploration, operation, organization, and synthesis of web content at scale. To enable browsing at scale, webpages are treated as malleable materials that humans and AI can collaboratively manipulate and compose into a malleable, dynamic, and browser-level workspace. Our evaluation revealed an increased "appetite" for information foraging, enhanced control, and more flexible sensemaking across a broader web information landscape.
Read more →

Can AI Master Econometrics? Evidence from Econometrics AI Agent on Expert-Level Tasks

arXiv:2506.00856v3 Announce Type: replace-cross Abstract: Can AI effectively perform complex econometric analysis traditionally requiring human expertise? This paper evaluates AI agents' capability to master econometrics, focusing on empirical analysis performance. We develop "MetricsAI", an Econometrics AI Agent built on the open-source MetaGPT framework. This agent exhibits outstanding performance in: (1) planning econometric tasks strategically, (2) generating and executing code, (3) employing error-based reflection for improved robustness, and (4) allowing iterative refinement through multi-round conversations. We construct two datasets from academic coursework materials and published research papers to evaluate performance against real-world challenges. Comparative testing shows our domain-specialized AI agent significantly outperforms both benchmark large language models (LLMs) and general-purpose AI agents. This work establishes a testbed for exploring AI's impact on social science research and enables cost-effective integration of domain expertise, making advanced econometric methods accessible to users with minimal coding skills. Furthermore, our AI agent enhances research reproducibility and offers promising pedagogical applications for econometrics teaching.
Read more →

Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

arXiv:2506.04207v2 Announce Type: replace-cross Abstract: Inspired by the remarkable reasoning capabilities of Deepseek-R1 in complex textual tasks, many works attempt to incentivize similar capabilities in Multimodal Large Language Models (MLLMs) by directly applying reinforcement learning (RL). However, they still struggle to activate complex reasoning. In this paper, rather than examining multimodal RL in isolation, we delve into current training pipelines and identify three crucial phenomena: 1) Effective cold start initialization is critical for enhancing MLLM reasoning. Intriguingly, we find that initializing with carefully selected text data alone can lead to performance surpassing many recent multimodal reasoning models, even before multimodal RL. 2) Standard GRPO applied to multimodal RL suffers from gradient stagnation, which degrades training stability and performance. 3) Subsequent text-only RL training, following the multimodal RL phase, further enhances multimodal reasoning. This staged training approach effectively balances perceptual grounding and cognitive reasoning development. By incorporating the above insights and addressing multimodal RL issues, we introduce ReVisual-R1, achieving a new state-of-the-art among open-source 7B MLLMs on challenging benchmarks including MathVerse, MathVision, WeMath, LogicVista, DynaMath, and challenging AIME2024 and AIME2025.
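
For context on the gradient-stagnation point, a simplified view of GRPO's group-relative advantage (not the full objective): each sampled response's advantage is its reward standardized within its group, so when every reward in a group is identical the advantages, and hence the gradients, collapse to zero.

```python
import numpy as np

# Simplified view of GRPO's group-relative advantage (not the full objective).
def grpo_advantages(group_rewards, eps=1e-8):
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))   # informative gradient signal
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))   # all-zero advantages: the stagnation case
```
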
Read more →

HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization

arXiv:2506.07972v2 Announce Type: replace-cross Abstract: While Large Language Models (LLMs) have demonstrated significant advancements in reasoning and agent-based problem-solving, current evaluation methodologies fail to adequately assess their capabilities: existing benchmarks either rely on closed-ended questions prone to saturation and memorization, or subjective comparisons that lack consistency and rigor. In this work, we introduce HeuriGym, an agentic framework designed for evaluating heuristic algorithms generated by LLMs for combinatorial optimization problems, characterized by clearly defined objectives and expansive solution spaces. HeuriGym empowers LLMs to propose heuristics, receive evaluative feedback via code execution, and iteratively refine their solutions. We evaluate nine state-of-the-art models on nine problems across domains such as computer systems, logistics, and biology, exposing persistent limitations in tool use, planning, and adaptive reasoning. To quantify performance, we propose the Quality-Yield Index (QYI), a metric that captures both solution pass rate and quality. Even top models like GPT-o4-mini-high and Gemini-2.5-Pro attain QYI scores of only 0.6, well below the expert baseline of 1. Our open-source benchmark aims to guide the development of LLMs toward more effective and realistic problem-solving in scientific and engineering domains.
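
The abstract describes the Quality-Yield Index only informally, as a metric that combines solution pass rate with solution quality relative to an expert baseline of 1. The sketch below is not the paper's definition: it is a hedged illustration of one way such a combination could work, and the harmonic-mean aggregation, the quality capping at 1, and the variable names are assumptions made here for concreteness.

```python
def quality_yield_index(passed, objective_values, expert_value, minimize=True):
    """Illustrative pass-rate/quality combination (NOT the paper's QYI formula).

    passed: whether each generated heuristic produced a valid solution.
    objective_values: objective reached by each run (entries for failed runs are ignored).
    expert_value: objective of the expert baseline, so matching the expert scores ~1.
    """
    valid = [v for ok, v in zip(passed, objective_values) if ok]
    yield_rate = len(valid) / len(passed) if passed else 0.0
    if not valid:
        return 0.0
    if minimize:  # e.g., cost or latency objectives
        qualities = [min(expert_value / v, 1.0) for v in valid]
    else:         # e.g., throughput or score objectives
        qualities = [min(v / expert_value, 1.0) for v in valid]
    quality = sum(qualities) / len(qualities)
    if yield_rate == 0.0 or quality == 0.0:
        return 0.0
    return 2 * yield_rate * quality / (yield_rate + quality)  # harmonic mean


# Example: 6 of 10 runs produce valid solutions at roughly 80% of expert quality.
print(quality_yield_index(
    passed=[True] * 6 + [False] * 4,
    objective_values=[125, 130, 120, 118, 140, 122, 0, 0, 0, 0],
    expert_value=100,
))
```
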
Read more →

Beyond Random Sampling: Efficient Language Model Pretraining via Curriculum Learning

arXiv:2506.11300v2 Announce Type: replace-cross Abstract: Curriculum learning, which organizes training data from easy to hard, has improved efficiency across machine learning domains, yet remains underexplored for language model pretraining. We present the first systematic investigation of curriculum learning in LLM pretraining, with over 200 models trained on up to 100B tokens across three strategies: vanilla curriculum learning, pacing-based sampling, and interleaved curricula, guided by six difficulty metrics spanning linguistic and information-theoretic properties. We evaluate performance on eight benchmarks under three realistic scenarios: limited data, unlimited data, and continual training. Our experiments show that curriculum learning consistently accelerates convergence in early and mid-training phases, reducing training steps by 18-45% to reach baseline performance. When applied as a warmup strategy before standard random sampling, curriculum learning yields sustained improvements of up to 3.5%. We identify compression ratio, lexical diversity (MTLD), and readability (Flesch Reading Ease) as the most effective difficulty signals. Our findings demonstrate that data ordering, which is orthogonal to existing data selection methods, provides a practical mechanism for more efficient LLM pretraining.
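
Two of the difficulty signals named above can be computed directly from raw text. A minimal sketch follows, assuming plain-text training documents; the compression-ratio heuristic and the Flesch Reading Ease formula are standard, but the crude regex sentence splitter and vowel-group syllable counter are simplifying assumptions, and MTLD is omitted because its factor-based computation is longer.

```python
import re
import zlib


def compression_ratio(text: str) -> float:
    """Higher ratio = less compressible = (heuristically) harder text."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / max(len(raw), 1)


def count_syllables(word: str) -> int:
    """Very rough vowel-group syllable count (illustrative only)."""
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch formula; higher score = easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


docs = [
    "The cat sat on the mat. It was warm.",
    "Nonstationary stochastic processes necessitate heteroskedasticity-aware estimators.",
]
# Order documents from easy to hard using one difficulty signal.
for d in sorted(docs, key=flesch_reading_ease, reverse=True):
    print(f"{flesch_reading_ease(d):7.1f}  {compression_ratio(d):.2f}  {d[:50]}")
```
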
Read more →

DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs

arXiv:2506.11558v4 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have recently been extended to the video domain, enabling sophisticated video-language understanding. However, existing Video LLMs often exhibit limitations in fine-grained temporal reasoning, restricting their ability to precisely attribute responses to specific video moments, especially under constrained supervision. We introduce DaMO, a data-efficient Video LLM explicitly designed for accurate temporal reasoning and multimodal understanding. At its core, the proposed Temporal-aware Fuseformer employs a hierarchical dual-stream architecture that progressively captures temporal dynamics within each modality and effectively fuses complementary visual and audio information. To further enhance computational efficiency, DaMO integrates a global residual that reduces spatial redundancy while preserving essential semantic details. We train DaMO via a structured four-stage progressive training paradigm, incrementally equipping the model with multimodal alignment, semantic grounding, and temporal reasoning capabilities. This work also contributes multiple datasets augmented from existing ones with LLM-generated temporally grounded QA pairs for tasks requiring temporal supervision. Comprehensive experiments on temporal grounding and video QA benchmarks demonstrate that DaMO consistently surpasses prior methods, particularly in tasks demanding precise temporal alignment and reasoning. Our work establishes a promising direction for data-efficient video-language modeling.
Read more →

Governing Strategic Dynamics: Equilibrium Stabilization via Divergence-Driven Control

arXiv:2506.23734v2 Announce Type: replace-cross Abstract: Black-box coevolution in mixed-motive games is often undermined by opponent-drift non-stationarity and noisy rollouts, which distort progress signals and can induce cycling, Red-Queen dynamics, and detachment. We propose the Marker Gene Method (MGM), a curriculum-inspired governance mechanism that stabilizes selection by anchoring evaluation to cross-generational marker individuals, together with DWAM and conservative marker-update rules to reduce spurious updates. We also introduce NGD-Div, which adapts the key update threshold using a divergence proxy and natural-gradient optimization. We provide theoretical analysis in strictly competitive settings and evaluate MGM integrated with evolution strategies (MGM-E-NES) on coordination games and a resource-depletion Markov game. MGM-E-NES reliably recovers target coordination in Stag Hunt and Battle of the Sexes, achieving final cooperation probabilities close to (1,1) (e.g., 0.991±0.01/1.00±0.00 and 0.97±0.00/0.97±0.00 for the two players). In the Markov resource game, it maintains high and stable state-conditioned cooperation across 30 seeds, with final cooperation of approximately 0.954/0.980/0.916 in Rich/Poor/Collapsed (both players; small standard deviations), indicating welfare-aligned and state-dependent behavior. Overall, MGM-E-NES transfers across tasks with minimal hyperparameter changes and yields consistently stable training dynamics, showing that top-level governance can substantially improve the robustness of black-box coevolution in dynamic environments.
Read more →

FAIR-MATCH: A Multi-Objective Framework for Bias Mitigation in Reciprocal Dating Recommendations

arXiv:2507.01063v2 Announce Type: replace-cross Abstract: Online dating platforms have fundamentally transformed the formation of romantic relationships, with millions of users worldwide relying on algorithmic matching systems to find compatible partners. However, current recommendation systems in dating applications suffer from significant algorithmic deficiencies, including but not limited to popularity bias, filter bubble effects, and inadequate reciprocity modeling that limit effectiveness and introduce harmful biases. This research integrates foundational work with recent empirical findings to deliver a detailed analysis of dating app recommendation systems, highlighting key issues and suggesting research-backed solutions. Through analysis of reciprocal recommendation frameworks, fairness evaluation metrics, and industry implementations, we demonstrate that current systems achieve modest performance with collaborative filtering reaching 25.1% while reciprocal methods achieve 28.7%. Our proposed mathematical framework addresses these limitations through enhanced similarity measures, multi-objective optimization, and fairness-aware algorithms that maintain competitive accuracy while improving demographic representation to reduce algorithmic bias.
Read more →

X-SAM: From Segment Anything to Any Segmentation

arXiv:2508.04655v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from "segment anything" to "any segmentation". Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.
Read more →

OPERA: A Reinforcement Learning--Enhanced Orchestrated Planner-Executor Architecture for Reasoning-Oriented Multi-Hop Retrieval

arXiv:2508.16438v3 Announce Type: replace-cross Abstract: Recent advances in large language models (LLMs) and dense retrievers have driven significant progress in retrieval-augmented generation (RAG). However, existing approaches face significant challenges in complex reasoning-oriented multi-hop retrieval tasks: 1) Ineffective reasoning-oriented planning: Prior methods struggle to generate robust multi-step plans for complex queries, as rule-based decomposers perform poorly on out-of-template questions. 2) Suboptimal reasoning-driven retrieval: Related methods employ limited query reformulation, leading to iterative retrieval loops that often fail to locate golden documents. 3) Insufficient reasoning-guided filtering: Prevailing methods lack the fine-grained reasoning to effectively filter salient information from noisy results, hindering utilization of retrieved knowledge. Fundamentally, these limitations all stem from the weak coupling between retrieval and reasoning in current RAG architectures. We introduce the Orchestrated Planner-Executor Reasoning Architecture (OPERA), a novel reasoning-driven retrieval framework. OPERA's Goal Planning Module (GPM) decomposes questions into sub-goals, which are executed by a Reason-Execute Module (REM) with specialized components for precise reasoning and effective retrieval. To train OPERA, we propose Multi-Agents Progressive Group Relative Policy Optimization (MAPGRPO), a novel variant of GRPO. Experiments on complex multi-hop benchmarks show OPERA's superior performance, validating both the MAPGRPO method and OPERA's design.
Read more →

Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

arXiv:2509.06350v2 Announce Type: replace-cross Abstract: Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.
Read more →

OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning

arXiv:2509.09332v3 Announce Type: replace-cross Abstract: Recent advances in multimodal large language models (MLLMs) have opened new opportunities for embodied intelligence, enabling multimodal understanding, reasoning, and interaction, as well as continuous spatial decision-making. Nevertheless, current MLLM-based embodied systems face two critical limitations. First, Geometric Adaptability Gap: models trained solely on 2D inputs or with hard-coded 3D geometry injection suffer from either insufficient spatial information or restricted 2D generalization, leading to poor adaptability across tasks with diverse spatial demands. Second, Embodiment Constraint Gap: prior work often neglects the physical constraints and capacities of real robots, resulting in task plans that are theoretically valid but practically infeasible. To address these gaps, we introduce OmniEVA -- an embodied versatile planner that enables advanced embodied reasoning and task planning through two pivotal innovations: (1) a Task-Adaptive 3D Grounding mechanism, which introduces a gated router to perform explicit selective regulation of 3D fusion based on contextual requirements, enabling context-aware 3D grounding for diverse embodied tasks. (2) an Embodiment-Aware Reasoning framework that jointly incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable. Extensive experimental results demonstrate that OmniEVA not only achieves state-of-the-art general embodied reasoning performance, but also exhibits a strong ability across a wide range of downstream scenarios. Evaluations of a suite of proposed embodied benchmarks, including both primitive and composite tasks, confirm its robust and versatile planning capabilities. Project page: https://omnieva.github.io
Read more →

DoubleAgents: Interactive Simulations for Alignment in Agentic AI

arXiv:2509.12626v2 Announce Type: replace-cross Abstract: Agentic workflows promise efficiency, but adoption hinges on whether people can align systems that act on their behalf with their goals, values, and situational expectations. We present DoubleAgents, an agentic planning tool that embeds transparency and control through user intervention, value-reflecting policies, rich state visualizations, and uncertainty flagging for human coordination tasks. A built-in respondent simulation generates realistic scenarios, allowing users to rehearse and refine policies and calibrate their use of agentic behavior before live deployment. We evaluate DoubleAgents in a two-day lab study (n = 10), three deployment studies, and a technical evaluation. Results show that participants initially hesitated to delegate but used simulation to probe system behavior and adjust policies, gradually increasing delegation as agent actions became better aligned with their intentions and context. Deployment results demonstrate DoubleAgents' real-world relevance and usefulness, showing that simulation helps users effectively manage real-world tasks with higher complexity and uncertainty. We contribute interactive simulation as a practical pathway for users to iteratively align and calibrate agentic systems.
Read more →

ArchesClimate: Probabilistic Decadal Ensemble Generation With Flow Matching

arXiv:2509.15942v2 Announce Type: replace-cross Abstract: Climate projections have uncertainties related to components of the climate system and their interactions. A typical approach to quantifying these uncertainties is to use climate models to create ensembles of repeated simulations under different initial conditions. Due to the complexity of these simulations, generating such ensembles of projections is computationally expensive. In this work, we present ArchesClimate, a deep learning-based climate model emulator that aims to reduce this cost. ArchesClimate is trained on decadal hindcasts of the IPSL-CM6A-LR climate model at a spatial resolution of approximately 2.5x1.25 degrees. We train a flow matching model following ArchesWeatherGen, which we adapt to predict near-term climate. Once trained, the model generates states at a one-month lead time and can be used to auto-regressively emulate climate model simulations of any length. We show that for up to 10 years, these generations are stable and physically consistent. We also show that for several important climate variables, ArchesClimate generates simulations that are interchangeable with the IPSL model. This work suggests that climate model emulators could significantly reduce the cost of climate model simulations.
Read more →

AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?

arXiv:2509.17641v2 Announce Type: replace-cross Abstract: Even without directly hearing sounds, humans can effortlessly reason about auditory properties, such as pitch, loudness, or sound-source associations, drawing on auditory commonsense. In contrast, language models often lack this capability, limiting their effectiveness in multimodal interactions. As an initial step to address this gap, we present AuditoryBench++, a comprehensive benchmark for evaluating auditory knowledge and reasoning in text-only settings. The benchmark encompasses tasks that range from basic auditory comparisons to contextually grounded reasoning, enabling fine-grained analysis of how models process and integrate auditory concepts. In addition, we introduce AIR-CoT, a novel auditory imagination reasoning method that generates and integrates auditory information during inference through span detection with special tokens and knowledge injection. Extensive experiments with recent LLMs and Multimodal LLMs demonstrate that AIR-CoT generally outperforms both the off-the-shelf models and those augmented with auditory knowledge. The project page is available at https://auditorybenchpp.github.io.
Read more →

Understanding Post-Training Structural Changes in Large Language Models

arXiv:2509.17866v3 Announce Type: replace-cross Abstract: Post-training fundamentally alters the behavior of large language models (LLMs), yet its impact on the internal parameter space remains poorly understood. In this work, we conduct a systematic singular value decomposition (SVD) analysis of principal linear layers in pretrained LLMs, focusing on two widely adopted post-training methods: instruction tuning and long-chain-of-thought (Long-CoT) distillation. Our analysis reveals two unexpected and robust structural changes: (1) a near-uniform geometric scaling of singular values across layers; and (2) highly consistent orthogonal transformations applied to the left and right singular vectors of each matrix. Based on these findings, we propose a simple yet effective framework to describe the coordinated dynamics of parameters in LLMs, which elucidates why post-training inherently relies on the foundational capabilities developed during pre-training. Further experiments demonstrate that singular value scaling underpins the temperature-controlled regulatory mechanisms of post-training, while the coordinated rotation of singular vectors encodes the essential semantic alignment. These results challenge the prevailing view of the parameter space in large models as a black box, uncovering the first clear regularities in how parameters evolve during training, and providing a new perspective for deeper investigation into model parameter changes.
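
A minimal sketch of the kind of SVD diagnostic described above, assuming you can load the same linear layer's weights before and after post-training (random stand-ins are used here, with the "post-trained" matrix constructed to exhibit the two reported signatures): check whether singular values scale near-uniformly, and whether the remaining change is explained by orthogonal transformations of the singular bases.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the same linear layer before and after post-training; replace
# these with weights loaded from real checkpoints to run the actual analysis.
n = 64
W_pre = rng.standard_normal((n, n))
U, S, Vt = np.linalg.svd(W_pre)


def small_rotation(dim, eps=0.05):
    skew = rng.standard_normal((dim, dim)) * eps
    q, _ = np.linalg.qr(np.eye(dim) + (skew - skew.T) / 2)
    return q


# Synthetic "post-trained" matrix: uniform 1.3x spectral scaling plus small
# orthogonal rotations of the left and right singular bases.
W_post = small_rotation(n) @ U @ np.diag(1.3 * S) @ Vt @ small_rotation(n)

# Signature 1: near-uniform geometric scaling of singular values.
s_pre = np.linalg.svd(W_pre, compute_uv=False)
s_post = np.linalg.svd(W_post, compute_uv=False)
ratios = s_post / s_pre
print(f"sigma ratio: mean={ratios.mean():.3f}, std={ratios.std():.4f}")

# Signature 2: the remaining change is captured by orthogonal transforms of
# the singular bases, i.e. W_post ~= Q_L (c * W_pre) Q_R^T.
U_pre, _, Vt_pre = np.linalg.svd(W_pre)
U_post, _, Vt_post = np.linalg.svd(W_post)
c = np.median(ratios)
Q_L, Q_R = U_post @ U_pre.T, Vt_post.T @ Vt_pre
recon = Q_L @ (c * W_pre) @ Q_R.T
rel_err = np.linalg.norm(recon - W_post) / np.linalg.norm(W_post)
print(f"relative error of scaled-rotation reconstruction: {rel_err:.2e}")
```
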
Read more →

Addressing Gradient Misalignment in Data-Augmented Training for Robust Speech Deepfake Detection

arXiv:2509.20682v2 Announce Type: replace-cross Abstract: In speech deepfake detection (SDD), data augmentation (DA) is commonly used to improve model generalization across varied speech conditions and spoofing attacks. However, during training, the backpropagated gradients from original and augmented inputs may misalign, which can result in conflicting parameter updates. These conflicts could hinder convergence and push the model toward suboptimal solutions, thereby reducing the benefits of DA. To investigate and address this issue, we design a dual-path data-augmented (DPDA) training framework with gradient alignment for SDD. In our framework, each training utterance is processed through two input paths: one using the original speech and the other with its augmented version. This design allows us to compare and align their backpropagated gradient directions to reduce optimization conflicts. Our analysis shows that approximately 25% of training iterations exhibit gradient conflicts between the original inputs and their augmented counterparts when using RawBoost augmentation. By resolving these conflicts with gradient alignment, our method accelerates convergence by reducing the number of training epochs and achieves up to an 18.69% relative reduction in Equal Error Rate on the In-the-Wild dataset compared to the baseline.
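
The abstract does not spell out the exact alignment rule, so the sketch below resolves original/augmented gradient conflicts with a PCGrad-style projection (drop the component of the augmented-view gradient that opposes the original-view gradient), which is one common choice rather than the paper's method; the Gaussian-noise augmentation is only a stand-in for RawBoost.

```python
import torch
import torch.nn.functional as F


def augment(x):
    # Placeholder augmentation; RawBoost-style corruption in the real setting.
    return x + 0.1 * torch.randn_like(x)


def dual_path_step(model, x, y, loss_fn, optimizer):
    """One dual-path step: per-path gradients, conflict check, aligned update."""
    grads = []
    for view in (x, augment(x)):
        optimizer.zero_grad()
        loss_fn(model(view), y).backward()
        grads.append([p.grad.detach().clone() for p in model.parameters()])

    g_orig = torch.cat([g.flatten() for g in grads[0]])
    g_aug = torch.cat([g.flatten() for g in grads[1]])
    if torch.dot(g_orig, g_aug) < 0:  # conflict: project out the opposing part
        g_aug = g_aug - torch.dot(g_orig, g_aug) / g_orig.norm().pow(2) * g_orig

    combined = 0.5 * (g_orig + g_aug)
    offset = 0
    for p in model.parameters():      # write the aligned gradient back
        numel = p.numel()
        p.grad = combined[offset:offset + numel].view_as(p)
        offset += numel
    optimizer.step()


model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(4, 16), torch.randint(0, 2, (4,))
dual_path_step(model, x, y, F.cross_entropy, opt)
print("aligned step done")
```
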
Read more →

SiNGER: A Clearer Voice Distills Vision Transformers Further

arXiv:2509.20986v3 Announce Type: replace-cross Abstract: Vision Transformers are widely adopted as the backbone of vision foundation models, but they are known to produce high-norm artifacts that degrade representation quality. When knowledge distillation transfers these features to students, high-norm artifacts dominate the objective, so students overfit to artifacts and underweight informative signals, diminishing the gains from larger models. Prior work attempted to remove artifacts but encountered an inherent trade-off between artifact suppression and preserving informative signals from teachers. To address this, we introduce Singular Nullspace-Guided Energy Reallocation (SiNGER), a novel distillation framework that suppresses artifacts while preserving informative signals. The key idea is principled teacher feature refinement: during refinement, we leverage the nullspace-guided perturbation to preserve information while suppressing artifacts. Then, the refined teacher's features are distilled to a student. We implement this perturbation efficiently with a LoRA-based adapter that requires minimal structural modification. Extensive experiments show that SiNGER consistently improves student models, achieving state-of-the-art performance in multiple downstream tasks and producing clearer and more interpretable representations.
Read more →

Mechanism of Task-oriented Information Removal in In-context Learning

arXiv:2509.21012v3 Announce Type: replace-cross Abstract: In-context Learning (ICL) is an emerging few-shot learning paradigm based on modern Language Models (LMs), yet its inner mechanism remains unclear. In this paper, we investigate the mechanism through a novel perspective of information removal. Specifically, we demonstrate that in the zero-shot scenario, LMs encode queries into non-selective representations in hidden states that contain information for all possible tasks, which leads to arbitrary outputs that do not focus on the intended task and results in near-zero accuracy. Meanwhile, we find that selectively removing specific information from hidden states with a low-rank filter effectively steers LMs toward the intended task. Building on these findings and measuring the hidden states with carefully designed metrics, we observe that few-shot ICL effectively simulates such task-oriented information removal, selectively removing redundant information from the entangled non-selective representations and improving the output based on the demonstrations; this constitutes a key mechanism underlying ICL. Moreover, we identify the attention heads that induce the removal operation, termed Denoising Heads. Ablation experiments that block the information removal operation during inference significantly degrade ICL accuracy, especially when the correct label is absent from the few-shot demonstrations, confirming the critical role of both the information removal mechanism and the Denoising Heads.
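
A minimal sketch of the low-rank filtering operation described above, assuming the directions to be removed are already available as an orthonormal basis U (a random basis stands in here; in the paper's setting such directions would be estimated from hidden states): removal is a projection of the hidden state onto the orthogonal complement of span(U).

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 8  # hidden size, rank of the removed subspace

# Orthonormal basis for the directions to remove (random stand-in; in practice
# these directions would be estimated from hidden states of non-target tasks).
U, _ = np.linalg.qr(rng.standard_normal((d, r)))


def low_rank_remove(h, U):
    """Project h onto the complement of span(U): h' = h - U U^T h."""
    return h - U @ (U.T @ h)


h = rng.standard_normal(d)          # a hidden state at the query position
h_filtered = low_rank_remove(h, U)

print("component in removed subspace before:", np.linalg.norm(U.T @ h).round(3))
print("component in removed subspace after: ", np.linalg.norm(U.T @ h_filtered).round(3))
print("norm kept outside the subspace:      ", np.linalg.norm(h_filtered).round(3))
```
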
Read more →

Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

arXiv:2509.22258v4 Announce Type: replace-cross Abstract: Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.
Read more →

Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents

arXiv:2509.23040v2 Announce Type: replace-cross Abstract: Large language models face challenges in long-context question answering, where key evidence of a query may be dispersed across millions of tokens. Existing works equip large language models with a memory buffer that is dynamically updated via a linear document scan, also known as the "memorize while reading" methods. While this approach scales efficiently, it suffers from pruning of latent evidence, information loss through overwriting, and sparse reinforcement learning signals. To tackle these challenges, we present ReMemR1, which integrates the mechanism of memory retrieval into the memory update process, enabling the agent to selectively callback historical memories for non-linear reasoning. To further strengthen training, we propose a multi-level reward design, which combines final-answer rewards with dense, step-level signals that guide effective memory use. Together, these contributions mitigate information degradation, improve supervision, and support complex multi-hop reasoning. Extensive experiments demonstrate that ReMemR1 significantly outperforms state-of-the-art baselines on long-context question answering while incurring negligible computational overhead, validating its ability to trade marginal cost for robust long-context reasoning.
Read more →

Discrete Variational Autoencoding via Policy Search

arXiv:2509.24716v2 Announce Type: replace-cross Abstract: Discrete latent bottlenecks in variational autoencoders (VAEs) offer high bit efficiency and can be modeled with autoregressive discrete distributions, enabling parameter-efficient multimodal search with transformers. However, discrete random variables do not allow for exact differentiable parameterization; therefore, discrete VAEs typically rely on approximations, such as Gumbel-Softmax reparameterization or straight-through gradient estimates, or employ high-variance gradient-free methods such as REINFORCE that have had limited success on high-dimensional tasks such as image reconstruction. Inspired by popular techniques in policy search, we propose a training framework for discrete VAEs that leverages the natural gradient of a non-parametric encoder to update the parametric encoder without requiring reparameterization. Our method, combined with automatic step size adaptation and a transformer-based encoder, scales to challenging datasets such as ImageNet and outperforms both approximate reparameterization methods and quantization-based discrete autoencoders in reconstructing high-dimensional data from compact latent spaces.
Read more →

mR3: Multilingual Rubric-Agnostic Reward Reasoning Models

arXiv:2510.01146v2 Announce Type: replace-cross Abstract: Evaluation using Large Language Model (LLM) judges has been widely adopted in English and shown to be effective for automatic evaluation. However, their performance does not generalize well to non-English settings, and it remains unclear what constitutes effective multilingual training for such judges. In this paper, we introduce mR3, a massively multilingual, rubric-agnostic reward reasoning model trained on 72 languages, achieving the broadest language coverage in reward modeling to date. We present a comprehensive study of data and curriculum selection for training to identify effective strategies and data sources for building high-quality reward models, including support for reasoning in the target language. Our approach attains state-of-the-art performance on multilingual reward model benchmarks, surpassing much larger models (i.e., GPT-OSS-120B) while being up to 9x smaller, and its effectiveness is further confirmed through extensive ablation studies. Finally, we demonstrate the effectiveness of mR3 in off-policy preference optimization and validate the quality of its reasoning traces and rubric-based evaluations through human studies with 20 annotators across 12 languages, where mR3 models' reasoning is preferred, including for extremely low-resource languages that are entirely unseen during training. Our models, data, and code are available as open source at https://github.com/rubricreward/mr3.
Read more →

GRACE: A Language Model Framework for Explainable Inverse Reinforcement Learning

arXiv:2510.02180v2 Announce Type: replace-cross Abstract: Inverse Reinforcement Learning aims to recover reward models from expert demonstrations, but traditional methods yield black-box models that are difficult to interpret and debug. In this work, we introduce GRACE (Generating Rewards As CodE), a method for using Large Language Models within an evolutionary search to reverse-engineer an interpretable, code-based reward function directly from expert trajectories. The resulting reward function is executable code that can be inspected and verified. We empirically validate GRACE on the MuJoCo, BabyAI and AndroidWorld benchmarks, where it efficiently learns highly accurate rewards, even in complex, multi-task settings. Further, we demonstrate that the resulting reward leads to strong policies, compared to both competitive Imitation Learning and online RL approaches with ground-truth rewards. Finally, we show that GRACE is able to build complex reward APIs in multi-task setups.
Read more →

Closing the Data-Efficiency Gap Between Autoregressive and Masked Diffusion LLMs

arXiv:2510.09885v3 Announce Type: replace-cross Abstract: Large language models (LLMs) are often used in environments where facts evolve, yet factual knowledge updates via fine-tuning on unstructured text often suffer from 1) reliance on compute-heavy paraphrase augmentation and 2) the reversal curse. Recent studies show diffusion large language models (dLLMs) require fewer training samples to achieve lower loss in pre-training and are more resistant to the reversal curse, suggesting dLLMs may learn new knowledge more easily than autoregressive LLMs (arLLMs). We test this hypothesis in controlled knowledge fine-tuning experiments and find that while arLLMs rely on paraphrase augmentation to generalize knowledge text into question-answering (QA) capability, dLLMs do not require paraphrases to achieve high QA accuracy. To further investigate whether the demasking objective alone can induce such a knowledge injection advantage in dLLMs regardless of their diffusion denoising paradigm, we propose masked fine-tuning for arLLMs, which prompts an arLLM to reconstruct the original text given a masked version in context. The masked fine-tuning for arLLMs substantially improves the efficacy of knowledge injection, i.e., no paraphrases are needed and the model is resistant to the reversal curse, closing the gap between arLLMs and dLLMs. We also demonstrate that the same demasking objective improves supervised fine-tuning (SFT) on math tasks over standard SFT, suggesting broader applicability of the demasking objective.
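
A minimal sketch of how a masked fine-tuning example for an autoregressive LLM might be constructed from the description above: the model sees a masked copy in context and is trained to reproduce the original text. The mask token, mask rate, and prompt wording are illustrative assumptions, not the paper's exact recipe; in practice the loss would be restricted to the reconstruction tokens.

```python
import random

MASK_TOKEN = "<mask>"   # illustrative placeholder token
MASK_RATE = 0.3         # fraction of words to hide (assumed)


def make_masked_ft_example(text: str, rng: random.Random):
    """Build one (prompt, target) pair: reconstruct the original from a masked copy."""
    words = text.split()
    masked = [MASK_TOKEN if rng.random() < MASK_RATE else w for w in words]
    prompt = (
        "Fill in the masked words and reproduce the original passage.\n"
        f"Masked passage: {' '.join(masked)}\n"
        "Original passage:"
    )
    target = " " + text   # next-token loss would be applied to these tokens only
    return prompt, target


rng = random.Random(0)
doc = "The Eiffel Tower was completed in 1889 and stands in Paris."
prompt, target = make_masked_ft_example(doc, rng)
print(prompt)
print("TARGET:", target)
```
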
Read more →

Deep SPI: Safe Policy Improvement via World Models

arXiv:2510.12312v2 Announce Type: replace-cross Abstract: Safe policy improvement (SPI) offers theoretical control over policy updates, yet existing guarantees largely concern offline, tabular reinforcement learning (RL). We study SPI in general online settings, when combined with world model and representation learning. We develop a theoretical framework showing that restricting policy updates to a well-defined neighborhood of the current policy ensures monotonic improvement and convergence. This analysis links transition and reward prediction losses to representation quality, yielding online, "deep" analogues of classical SPI theorems from the offline RL literature. Building on these results, we introduce DeepSPI, a principled on-policy algorithm that couples local transition and reward losses with regularised policy updates. On the ALE-57 benchmark, DeepSPI matches or exceeds strong baselines, including PPO and DeepMDPs, while retaining theoretical guarantees.
Read more →

Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space

arXiv:2510.12603v2 Announce Type: replace-cross Abstract: Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. It has evolved from text-only reasoning to the integration of visual information, enabling the thought process to be conveyed through both images and text. Despite its effectiveness, current multimodal reasoning methods depend on explicit reasoning steps that require labor-intensive vision-text annotations and inherently introduce significant inference latency. To address these issues, we introduce multimodal latent reasoning with the advantages of multimodal representation, reduced annotation, and inference efficiency. To facilitate it, we propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information in the reasoning process within the latent space. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: latent text (the hidden states from the previous step) and latent vision (a set of selected image embeddings). We further introduce a progressive multi-stage training strategy to enable MLLMs to perform the above multimodal latent reasoning steps. Experiments on M³CoT and ScienceQA demonstrate that our IVT-LR method achieves an average performance increase of 5.45% in accuracy, while simultaneously achieving a speed increase of over 5 times compared to existing approaches.
Read more →

Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures

arXiv:2510.14616v2 Announce Type: replace-cross Abstract: Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and length. On this benchmark, sequence-based reward models (the standard architecture for RLHF) achieve only 52.7% mean accuracy, while zero-shot language model judges perform at 53.9%. In contrast, generative reward models that produce explicit reasoning chains achieve 81.8% accuracy. We observe high within-model variance across genres: individual models range from 18.2% to 81.8% accuracy across different writing categories, with standard deviations averaging 10.1%. This variance persists regardless of model scale, with 27B parameter models showing no consistent improvement over 8B variants. Our results suggest that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences (e.g., creativity, stylistic flair, and emotional resonance), and that successful preference modeling may require intermediate reasoning representations rather than direct classification.
Read more →

Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization

arXiv:2510.23530v2 Announce Type: replace-cross Abstract: Audio autoencoders learn useful, compressed audio representations, but their non-linear latent spaces prevent intuitive algebraic manipulation such as mixing or scaling. We introduce a simple training methodology to induce linearity in a high-compression Consistency Autoencoder (CAE) by using data augmentation, thereby inducing homogeneity (equivariance to scalar gain) and additivity (the decoder preserves addition) without altering the model's architecture or loss function. When trained with our method, the CAE exhibits linear behavior in both the encoder and decoder while preserving reconstruction fidelity. We test the practical utility of our learned space on music source composition and separation via simple latent arithmetic. This work presents a straightforward technique for constructing structured latent spaces, enabling more intuitive and efficient audio processing.
Read more →

COMMUNITYNOTES: A Dataset for Exploring the Helpfulness of Fact-Checking Explanations

arXiv:2510.24810v2 Announce Type: replace-cross Abstract: Fact-checking on major platforms, such as X, Meta, and TikTok, is shifting from expert-driven verification to a community-based setup, where users contribute explanatory notes to clarify why a post might be misleading. An important challenge here is determining whether an explanation is helpful for understanding real-world claims and the reasons why, which remains largely underexplored in prior research. In practice, most community notes remain unpublished due to slow community annotation, and the reasons for helpfulness lack clear definitions. To bridge these gaps, we introduce the task of predicting both the helpfulness of explanatory notes and the reason for this. We present COMMUNITYNOTES, a large-scale multilingual dataset of 104k posts with user-provided notes and helpfulness labels. We further propose a framework that automatically generates and improves reason definitions via automatic prompt optimization, and integrate them into prediction. Our experiments show that the optimized definitions can improve both helpfulness and reason prediction. Finally, we show that the helpfulness information is beneficial for existing fact-checking systems.
Read more →

First is Not Really Better Than Last: Evaluating Layer Choice and Aggregation Strategies in Language Model Data Influence Estimation

arXiv:2511.04715v2 Announce Type: replace-cross Abstract: Identifying how training samples influence Large Language Model (LLM) decision-making is essential for effectively interpreting model decisions and auditing large-scale datasets. Current training sample influence estimation methods (also known as influence functions) undertake this goal by utilizing information flow through the model via its first-order and higher-order gradient terms. However, because today's models consist of billions of parameters, these influence computations are often restricted to some subset of model layers to ensure computational feasibility. Prior seminal work by Yeh et al. (2022) in assessing which layers are best suited for computing language data influence concluded that the first (embedding) layers are the most informative for this purpose, using a hypothesis based on influence scores canceling out (i.e., the cancellation effect). In this work, we provide theoretical and empirical evidence demonstrating that the cancellation effect is unreliable and that middle attention layers are better estimators of influence. Furthermore, we address the broader challenge of aggregating influence scores across layers, and showcase how alternatives to standard averaging (such as ranking and vote-based methods) can lead to significantly improved performance. Finally, we propose better methods for evaluating influence score efficacy in LLMs without model retraining, and introduce a new metric, the Noise Detection Rate (NDR), that exhibits strong predictive capability compared to the cancellation effect. Through extensive experiments across LLMs of varying types and scales, we concretely determine that the first (layers) are not necessarily better than the last (layers) for LLM influence estimation, contrasting with prior knowledge in the field.
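
A minimal sketch of the aggregation alternatives mentioned above, assuming per-layer influence scores have already been computed for a pool of candidate training samples: plain averaging versus rank averaging versus a simple top-k vote. These are generic instances of ranking and vote-based aggregation, not the paper's exact estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_samples = 12, 100

# Stand-in per-layer influence scores for each candidate training sample
# (rows: layers, columns: samples). Replace with real influence estimates.
scores = rng.standard_normal((n_layers, n_samples))

# 1) Standard averaging across layers.
mean_agg = scores.mean(axis=0)

# 2) Rank-based aggregation: average each sample's rank within every layer.
ranks = scores.argsort(axis=1).argsort(axis=1)   # 0 = least influential
rank_agg = ranks.mean(axis=0)

# 3) Vote-based aggregation: count how often a sample lands in a layer's top-k.
k = 10
topk = np.argsort(scores, axis=1)[:, -k:]
votes = np.zeros(n_samples)
for layer_topk in topk:
    votes[layer_topk] += 1

for name, agg in [("mean", mean_agg), ("rank", rank_agg), ("vote", votes)]:
    print(name, "-> top-5 samples:", np.argsort(agg)[-5:][::-1])
```
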
Read more →

DeepBooTS: Dual-Stream Residual Boosting for Drift-Resilient Time-Series Forecasting

arXiv:2511.06893v2 Announce Type: replace-cross Abstract: Time-Series (TS) exhibits pronounced non-stationarity. Consequently, most forecasting methods display compromised robustness to concept drift, despite the prevalent application of instance normalization. We tackle this challenge by first analysing concept drift through a bias-variance lens and proving that weighted ensemble reduces variance without increasing bias. These insights motivate DeepBooTS, a novel end-to-end dual-stream residual-decreasing boosting method that progressively reconstructs the intrinsic signal. In our design, each block of a deep model becomes an ensemble of learners with an auxiliary output branch forming a highway to the final prediction. The block-wise outputs correct the residuals of previous blocks, leading to a learning-driven decomposition of both inputs and targets. This method enhances versatility and interpretability while substantially improving robustness to concept drift. Extensive experiments, including those on large-scale datasets, show that the proposed method outperforms existing methods by a large margin, yielding an average performance improvement of 15.8% across various datasets, establishing a new benchmark for TS forecasting.
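
A minimal sketch of the dual-stream, residual-decreasing idea described above, assuming simple MLP blocks: each block has an auxiliary output branch whose forecast is added to a running prediction (the highway to the final output), while the part of the input it explains is subtracted from the residual passed to the next block. Layer sizes and the subtraction-based residual update are illustrative assumptions, not the DeepBooTS architecture.

```python
import torch
import torch.nn as nn


class BoostingBlock(nn.Module):
    def __init__(self, dim: int, horizon: int):
        super().__init__()
        self.backcast = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.forecast = nn.Linear(dim, horizon)   # auxiliary output branch

    def forward(self, residual):
        explained = self.backcast(residual)
        return residual - explained, self.forecast(explained)


class DualStreamBooster(nn.Module):
    """Stack of blocks; block-wise forecasts are summed into the final prediction."""

    def __init__(self, dim: int, horizon: int, n_blocks: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(BoostingBlock(dim, horizon) for _ in range(n_blocks))

    def forward(self, x):
        residual, prediction = x, 0.0
        for block in self.blocks:
            residual, partial = block(residual)
            prediction = prediction + partial   # each block corrects the ensemble
        return prediction


model = DualStreamBooster(dim=96, horizon=24)
x = torch.randn(8, 96)          # 8 series, 96-step lookback window
print(model(x).shape)           # torch.Size([8, 24])
```
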
Read more →

RacketVision: A Multiple Racket Sports Benchmark for Unified Ball and Racket Analysis

arXiv:2511.17045v3 Announce Type: replace-cross Abstract: We introduce RacketVision, a novel dataset and benchmark for advancing computer vision in sports analytics, covering table tennis, tennis, and badminton. The dataset is the first to provide large-scale, fine-grained annotations for racket pose alongside traditional ball positions, enabling research into complex human-object interactions. It is designed to tackle three interconnected tasks: fine-grained ball tracking, articulated racket pose estimation, and predictive ball trajectory forecasting. Our evaluation of established baselines reveals a critical insight for multi-modal fusion: while naively concatenating racket pose features degrades performance, a CrossAttention mechanism is essential to unlock their value, leading to trajectory prediction results that surpass strong unimodal baselines. RacketVision provides a versatile resource and a strong starting point for future research in dynamic object tracking, conditional motion forecasting, and multimodal analysis in sports. Project page at https://github.com/OrcustD/RacketVision
Read more →

Tracing Mathematical Proficiency Through Problem-Solving Processes

arXiv:2512.00311v2 Announce Type: replace-cross Abstract: Knowledge Tracing (KT) aims to model a student's knowledge state and predict future performance to enable personalized learning in Intelligent Tutoring Systems. However, traditional KT methods face fundamental limitations in explainability, as they rely solely on response correctness, neglecting the rich information embedded in students' problem-solving processes. To address this gap, we propose Knowledge Tracing Leveraging Problem-Solving Process (KT-PSP), which incorporates students' problem-solving processes to capture the multidimensional aspects of mathematical proficiency. We also introduce KT-PSP-25, a new dataset specifically designed for KT-PSP. Building on this, we present StatusKT, a KT framework that employs a teacher-student-teacher three-stage LLM pipeline to extract students' mathematical proficiency as intermediate signals. In this pipeline, a teacher LLM first extracts problem-specific proficiency indicators, then a student LLM generates responses based on the student's solution process, and a teacher LLM evaluates these responses to determine mastery of each indicator. The experimental results on KT-PSP-25 demonstrate that StatusKT improves the prediction performance of existing KT methods. Moreover, StatusKT provides interpretable explanations for its predictions by explicitly modeling students' mathematical proficiency.
Read more →

Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching

arXiv:2512.03553v2 Announce Type: replace-cross Abstract: Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
Read more →

DAUNet: A Lightweight UNet Variant with Deformable Convolutions and Parameter-Free Attention for Medical Image Segmentation

arXiv:2512.07051v2 Announce Type: replace-cross Abstract: Medical image segmentation plays a pivotal role in automated diagnostic and treatment planning systems. In this work, we present DAUNet, a novel lightweight UNet variant that integrates Deformable V2 Convolutions and Parameter-Free Attention (SimAM) to improve spatial adaptability and context-aware feature fusion without increasing model complexity. DAUNet's bottleneck employs dynamic deformable kernels to handle geometric variations, while the decoder and skip pathways are enhanced using SimAM attention modules for saliency-aware refinement. Extensive evaluations on two challenging datasets, FH-PS-AoP (fetal head and pubic symphysis ultrasound) and FUMPE (CT-based pulmonary embolism detection), demonstrate that DAUNet outperforms state-of-the-art models in Dice score, HD95, and ASD, while maintaining superior parameter efficiency. Ablation studies highlight the individual contributions of deformable convolutions and SimAM attention. DAUNet's robustness to missing context and low-contrast regions establishes its suitability for deployment in real-time and resource-constrained clinical environments.
Read more →

Dual-objective Language Models: Training Efficiency Without Overfitting

arXiv:2512.14549v2 Announce Type: replace-cross Abstract: This paper combines autoregressive and masked-diffusion training objectives without any architectural modifications, resulting in flexible language models that outperform single-objective models. Autoregressive modeling has been a popular approach, partly because of its training efficiency; however, that comes at the cost of sensitivity to overfitting. On the other hand, masked-diffusion models are less efficient to train while being more resilient to overfitting. In this work, we demonstrate that dual-objective training achieves the best of both worlds. To derive the optimal balance between both objectives, we train and evaluate 50 language models under varying levels of data repetition. We show that it is optimal to combine both objectives under all evaluated settings and that the optimal balance is similar whether targeting autoregressive or masked-diffusion downstream performance.
Read more →

Spectral Representation-based Reinforcement Learning

arXiv:2512.15036v2 Announce Type: replace-cross Abstract: In real-world applications with large state and action spaces, reinforcement learning (RL) typically employs function approximations to represent core components like the policies, value functions, and dynamics models. Although powerful approximations such as neural networks offer great expressiveness, they often present theoretical ambiguities, suffer from optimization instability and exploration difficulty, and incur substantial computational costs in practice. In this paper, we introduce the perspective of spectral representations as a solution to address these difficulties in RL. Stemming from the spectral decomposition of the transition operator, this framework yields an effective abstraction of the system dynamics for subsequent policy optimization while also providing a clear theoretical characterization. We reveal how to construct spectral representations for transition operators that possess latent variable structures or energy-based structures, which implies different learning methods to extract spectral representations from data. Notably, each of these learning methods realizes an effective RL algorithm under this framework. We also provably extend this spectral view to partially observable MDPs. Finally, we validate these algorithms on over 20 challenging tasks from the DeepMind Control Suite, where they achieve performances comparable or superior to current state-of-the-art model-free and model-based baselines.
Read more →

KV Admission: Learning What to Write for Efficient Long-Context Inference

arXiv:2512.17452v3 Announce Type: replace-cross Abstract: Long-context LLM inference is bottlenecked by the quadratic attention complexity and linear KV cache growth. Prior approaches mitigate this via post-hoc selection or eviction but overlook the root inefficiency: indiscriminate writing to memory. In this paper, we formalize KV cache management as a causal system of three primitives: KV Admission, Selection, and Eviction. We instantiate KV Admission via Write-Gated KV (WG-KV), a lightweight mechanism that learns to predict token utility before cache entry. By filtering out low-utility states early to maintain a compact global cache alongside a sliding local cache, WG-KV reduces memory usage by 46-68% and delivers 3.03-3.70x prefill and 1.85-2.56x decode speedups on Llama and Qwen models, while maintaining compatibility with FlashAttention and Paged-KV systems. These results demonstrate that learning what to write is a principled and practical recipe for efficient long-context inference. Code is available at https://github.com/EMCLab-Sinica/WG-KV.
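
A minimal sketch of the admission idea described above rather than the WG-KV implementation: a small learned gate scores each token's hidden state before cache entry, only high-utility tokens are written to the compact global cache, and a sliding window of recent tokens is always kept locally. The gate architecture, the fixed threshold, and the window size are illustrative assumptions.

```python
import torch
import torch.nn as nn


class WriteGatedKVCache:
    def __init__(self, hidden: int, threshold: float = 0.5, local_window: int = 8):
        # Utility predictor; it would be trained in the real system, random init here.
        self.gate = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())
        self.threshold = threshold
        self.local_window = local_window
        self.global_kv = []   # admitted (k, v) pairs, kept for the long term
        self.local_kv = []    # sliding window of recent (k, v) pairs

    @torch.no_grad()
    def write(self, h, k, v):
        """Decide, per token, whether its KV pair enters the global cache."""
        utility = self.gate(h).item()
        if utility >= self.threshold:
            self.global_kv.append((k, v))
        self.local_kv.append((k, v))          # always visible to recent attention
        self.local_kv = self.local_kv[-self.local_window:]

    def visible_kv(self):
        # A real cache would deduplicate recent entries present in both lists.
        return self.global_kv + self.local_kv


hidden = 64
cache = WriteGatedKVCache(hidden)
for _ in range(32):                           # simulate a 32-token prefill
    h = torch.randn(hidden)
    cache.write(h, k=torch.randn(hidden), v=torch.randn(hidden))

print("global (admitted):", len(cache.global_kv), "| local window:", len(cache.local_kv))
```
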
Read more →

ShareChat: A Dataset of Chatbot Conversations in the Wild

arXiv:2512.17843v3 Announce Type: replace-cross Abstract: While academic research typically treats Large Language Models (LLM) as generic text generators, they are distinct commercial products with unique interfaces and capabilities that fundamentally shape user behavior. Current datasets obscure this reality by collecting text-only data through uniform interfaces that fail to capture authentic chatbot usage. To address this limitation, we present ShareChat, a large-scale corpus of 142,808 conversations (660,293 turns) sourced directly from publicly shared URLs on ChatGPT, Perplexity, Grok, Gemini, and Claude. ShareChat distinguishes itself by preserving native platform affordances, such as citations and thinking traces, across a diverse collection covering 101 languages and the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. To illustrate the dataset's breadth, we present three case studies: a completeness analysis of intent satisfaction, a citation study of model grounding, and a temporal analysis of engagement rhythms. This work provides the community with a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild. The dataset is publicly available via Hugging Face.
Read more →

Mitigating LLM Hallucination via Behaviorally Calibrated Reinforcement Learning

arXiv:2512.19920v3 Announce Type: replace-cross Abstract: LLM deployment in critical domains is currently impeded by persistent hallucinations: the generation of plausible but factually incorrect assertions. While scaling laws drove significant improvements in general capabilities, theoretical frameworks suggest hallucination is not merely stochastic error but a predictable statistical consequence of training objectives prioritizing mimicking the data distribution over epistemic honesty. Standard RLVR paradigms, utilizing binary reward signals, inadvertently incentivize models to behave as good test-takers rather than honest communicators, encouraging guessing whenever correctness probability exceeds zero. This paper presents an exhaustive investigation into behavioral calibration, which incentivizes models to stochastically admit uncertainty by abstaining when not confident, aligning model behavior with accuracy. Synthesizing recent advances, we propose and evaluate training interventions optimizing strictly proper scoring rules for models to output a calibrated probability of correctness. Our methods enable models to either abstain from producing a complete response or flag individual claims where uncertainty remains. Utilizing Qwen3-4B-Instruct, empirical analysis reveals that behavior-calibrated reinforcement learning allows smaller models to surpass frontier models in uncertainty quantification, a transferable meta-skill that can be decoupled from raw predictive accuracy. Trained on math reasoning tasks, our model's log-scale Accuracy-to-Hallucination Ratio gain (0.806) exceeds GPT-5's (0.207) in a challenging in-domain evaluation (BeyondAIME). Moreover, in cross-domain factual QA (SimpleQA), our 4B LLM achieves zero-shot calibration error on par with frontier models including Grok-4 and Gemini-2.5-Pro, even though its factual accuracy is much lower.
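
A minimal sketch of the strictly-proper-scoring idea described above, assuming the model either abstains or answers while stating a probability of being correct: the Brier rule makes honest probability reporting the reward-maximizing policy, and a fixed abstention reward makes declining preferable when the achievable expected reward is low. The specific rule (Brier) and the abstention value are illustrative choices, not the paper's exact reward.

```python
def brier_reward(confidence: float, correct: bool) -> float:
    """Strictly proper (Brier) reward for a stated probability of correctness."""
    outcome = 1.0 if correct else 0.0
    return 1.0 - (outcome - confidence) ** 2


def expected_reward(p_correct: float, stated_confidence: float) -> float:
    return (p_correct * brier_reward(stated_confidence, True)
            + (1.0 - p_correct) * brier_reward(stated_confidence, False))


# Because the rule is strictly proper, stating the true probability of being
# correct maximizes expected reward; over- or under-claiming is penalized.
for p in (0.9, 0.6, 0.3):
    best = max((expected_reward(p, c / 100), c / 100) for c in range(101))
    print(f"true p(correct)={p:.2f} -> reward-maximizing stated confidence={best[1]:.2f}")

# With a fixed abstention reward, declining to answer wins when confidence is low.
ABSTAIN_REWARD = 0.8   # illustrative value
for p in (0.95, 0.5):
    answer_value = expected_reward(p, p)            # value of answering honestly
    action = "answer" if answer_value > ABSTAIN_REWARD else "abstain"
    print(f"p(correct)={p:.2f}: answering is worth {answer_value:.2f} -> {action}")
```
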
Read more →

Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs

arXiv:2512.20573v3 Announce Type: replace-cross Abstract: Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a strength for drafters in speculative decoding with autoregressive (AR) verifiers. Our core insight is that dLLM's speed from parallel decoding drastically lowers the risk of costly rejections, providing a practical mechanism to effectively realize the (elusive) lengthy drafts that lead to large speedups with speculative decoding. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length. It "fails fast" by spending minimal compute in hard-to-speculate regions to shrink speculation latency and "wins big" by aggressively extending draft lengths in easier regions to reduce verification latency (in many cases, speculating and accepting 70 tokens at a time!). Without any fine-tuning, FailFast delivers lossless acceleration of AR LLMs and achieves up to 4.9x speedup over vanilla decoding, 1.7x over the best naive dLLM drafter, and 1.7x over EAGLE-3 across diverse models and workloads. We open-source FailFast at https://github.com/ruipeterpan/failfast.
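
A minimal sketch of the fail-fast/win-big control logic described above, not FailFast's actual policy: the draft length collapses after a rejection and grows multiplicatively after clean acceptances, so long drafts are only attempted where the verifier keeps accepting. The toy verifier, the growth/shrink rules, and the length bounds are illustrative assumptions.

```python
import random

rng = random.Random(0)


def simulate_verifier(draft_len: int, difficulty: float) -> int:
    """Toy stand-in for AR verification: how many draft tokens were accepted."""
    for i in range(draft_len):
        if rng.random() < difficulty:      # harder regions reject earlier
            return i
    return draft_len


def speculative_generate(total_tokens: int, difficulty_schedule):
    draft_len, produced, rounds = 8, 0, 0
    min_len, max_len = 2, 96
    while produced < total_tokens:
        difficulty = difficulty_schedule(produced)
        accepted = simulate_verifier(draft_len, difficulty)
        produced += accepted + 1           # the verifier always contributes one token
        rounds += 1
        if accepted == draft_len:          # win big: whole draft accepted, extend
            draft_len = min(draft_len * 2, max_len)
        else:                              # fail fast: shrink in hard regions
            draft_len = max(min_len, accepted // 2 + min_len)
    return produced, rounds


# Easy text for the first 300 tokens, then a hard-to-speculate region.
schedule = lambda pos: 0.02 if pos < 300 else 0.35
tokens, rounds = speculative_generate(500, schedule)
print(f"generated {tokens} tokens in {rounds} draft/verify rounds")
```
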
Read more →

The Bayesian Geometry of Transformer Attention

arXiv:2512.22471v3 Announce Type: replace-cross Abstract: Transformers often appear to perform Bayesian reasoning in context, but verifying this rigorously has been impossible: natural data lack analytic posteriors, and large models conflate reasoning with memorization. We address this by constructing "Bayesian wind tunnels" -- controlled environments where the true posterior is known in closed form and memorization is provably impossible. In these settings, small transformers reproduce Bayesian posteriors with $10^{-3}$-$10^{-4}$ bit accuracy, while capacity-matched MLPs fail by orders of magnitude, establishing a clear architectural separation. Across two tasks -- bijection elimination and Hidden Markov Model (HMM) state tracking -- we find that transformers implement Bayesian inference through a consistent geometric mechanism: residual streams serve as the belief substrate, feed-forward networks perform the posterior update, and attention provides content-addressable routing. Geometric diagnostics reveal orthogonal key bases, progressive query-key alignment, and a low-dimensional value manifold parameterized by posterior entropy. During training this manifold unfurls while attention patterns remain stable, a "frame-precision dissociation" predicted by recent gradient analyses. Taken together, these results demonstrate that hierarchical attention realizes Bayesian inference by geometric design, explaining both the necessity of attention and the failure of flat architectures. Bayesian wind tunnels provide a foundation for mechanistically connecting small, verifiable systems to reasoning phenomena observed in large language models.
Read more →

Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

arXiv:2512.22473v3 Announce Type: replace-cross Abstract: Transformers empirically perform precise probabilistic reasoning in carefully constructed "Bayesian wind tunnels" and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an advantage-based routing law for attention scores, \[ \frac{\partial L}{\partial s_{ij}} = \alpha_{ij}\bigl(b_{ij}-\mathbb{E}_{\alpha_i}[b]\bigr), \qquad b_{ij} := u_i^\top v_j, \] coupled with a responsibility-weighted update for values, \[ \Delta v_j = -\eta\sum_i \alpha_{ij} u_i, \] where $u_i$ is the upstream gradient at position $i$ and $\alpha_{ij}$ are attention weights. These equations induce a positive feedback loop in which routing and content specialize together: queries route more strongly to values that are above-average for their error signal, and those values are pulled toward the queries that use them. We show that this coupled specialization behaves like a two-timescale EM procedure: attention weights implement an E-step (soft responsibilities), while values implement an M-step (responsibility-weighted prototype updates), with queries and keys adjusting the hypothesis frame. Through controlled simulations, including a sticky Markov-chain task where we compare a closed-form EM-style update to standard SGD, we demonstrate that the same gradient dynamics that minimize cross-entropy also sculpt the low-dimensional manifolds identified in our companion work as implementing Bayesian inference. This yields a unified picture in which optimization (gradient flow) gives rise to geometry (Bayesian manifolds), which in turn supports function (in-context probabilistic reasoning).
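The two update rules are simple enough to transcribe directly. Below is a small numpy sketch (random toy tensors, illustrative shapes, made-up upstream gradients) that computes the advantage-based score gradient and the responsibility-weighted value update exactly as written in the abstract.

    # Numerical sketch of the routing law and the value update from the abstract.
    import numpy as np

    rng = np.random.default_rng(0)
    n_q, n_k, d = 4, 6, 8
    alpha = rng.random((n_q, n_k))
    alpha /= alpha.sum(axis=1, keepdims=True)     # attention weights alpha_ij
    U = rng.normal(size=(n_q, d))                 # upstream gradients u_i
    V = rng.normal(size=(n_k, d))                 # value vectors v_j

    B = U @ V.T                                   # b_ij = u_i^T v_j
    adv = B - (alpha * B).sum(axis=1, keepdims=True)   # b_ij - E_{alpha_i}[b]
    dL_ds = alpha * adv                           # dL/ds_ij = alpha_ij * (b_ij - E[b])

    eta = 0.1
    dV = -eta * alpha.T @ U                       # Delta v_j = -eta * sum_i alpha_ij u_i
    print(dL_ds.shape, dV.shape)                  # (4, 6) (6, 8)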
Read more →

The Law of Multi-Model Collaboration: Scaling Limits of Model Ensembling for Large Language Models

arXiv:2512.23340v2 Announce Type: replace-cross Abstract: Recent advances in large language models (LLMs) have been largely driven by scaling laws for individual models, which predict performance improvements as model parameters and data volume increase. However, the capabilities of any single LLM are inherently bounded. One solution lies in intricate interactions among multiple LLMs, whose collective performance can surpass that of any constituent model. Despite the rapid proliferation of multi-model integration techniques such as model routing and post-hoc ensembling, a unifying theoretical framework of performance scaling for multi-model collaboration remains absent. In this work, we propose the Law of Multi-model Collaboration, a scaling law that predicts the performance limits of LLM ensembles based on their aggregated parameter budget. To quantify the intrinsic upper bound of multi-model collaboration, we adopt a method-agnostic formulation and assume an idealized integration oracle where the total cross-entropy loss of each sample is determined by the minimum loss of any model in the model pool. Experimental results reveal that multi-model systems follow a power-law scaling with respect to the total parameter count, exhibiting a more significant improvement trend and a lower theoretical loss floor compared to single model scaling. Moreover, ensembles of heterogeneous model families achieve better performance scaling than those formed within a single model family, indicating that model diversity is a primary driver of collaboration gains. These findings suggest that model collaboration represents a critical axis for extending the intelligence frontier of LLMs.
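A minimal sketch of the idealized integration oracle and a power-law fit of the form L(N) = A * N^(-alpha) + L_inf. The per-sample losses, parameter counts, and initial guesses below are synthetic placeholders, not the paper's data; the point is only the shape of the computation.

    # Oracle ensemble loss and a power-law fit over aggregated parameter budget.
    import numpy as np
    from scipy.optimize import curve_fit

    def oracle_loss(per_sample_losses):
        """per_sample_losses: (n_models, n_samples). The oracle credits every
        sample to the model in the pool with the lowest loss."""
        return np.min(per_sample_losses, axis=0).mean()

    def power_law(n_billion, A, alpha, L_inf):
        return A * n_billion ** (-alpha) + L_inf

    rng = np.random.default_rng(0)
    pool = rng.gamma(2.0, 1.2, size=(4, 1000))   # fake per-sample CE losses for 4 models
    print("single best model:", round(float(pool.mean(axis=1).min()), 3),
          "| oracle ensemble:", round(float(oracle_loss(pool)), 3))

    # placeholder points: total parameters (billions) and oracle ensemble losses
    N = np.array([1.0, 3.0, 7.0, 14.0, 30.0, 70.0])
    L = np.array([3.10, 2.70, 2.45, 2.25, 2.10, 1.95])
    (A, alpha, L_inf), _ = curve_fit(power_law, N, L, p0=(1.5, 0.3, 1.5))
    print(f"L(N) ~ {A:.2f} * N^(-{alpha:.2f}) + {L_inf:.2f}")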
Read more →

RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature

arXiv:2512.23565v5 Announce Type: replace-cross Abstract: The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.
Read more →

Geometric Scaling of Bayesian Inference in LLMs

arXiv:2512.23752v3 Announce Type: replace-cross Abstract: Recent work has shown that small transformers trained in controlled "wind-tunnel" settings can implement exact Bayesian inference, and that their training dynamics produce a geometric substrate -- low-dimensional value manifolds and progressively orthogonal keys -- that encodes posterior structure. We investigate whether this geometric signature persists in production-grade language models. Across Pythia, Phi-2, Llama-3, and Mistral families, we find that last-layer value representations organize along a single dominant axis whose position strongly correlates with predictive entropy, and that domain-restricted prompts collapse this structure into the same low-dimensional manifolds observed in synthetic settings. To probe the role of this geometry, we perform targeted interventions on the entropy-aligned axis of Pythia-410M during in-context learning. Removing or perturbing this axis selectively disrupts the local uncertainty geometry, whereas matched random-axis interventions leave it intact. However, these single-layer manipulations do not produce proportionally specific degradation in Bayesian-like behavior, indicating that the geometry is a privileged readout of uncertainty rather than a singular computational bottleneck. Taken together, our results show that modern language models preserve the geometric substrate that enables Bayesian inference in wind tunnels, and organize their approximate Bayesian updates along this substrate.
Read more →

FaithSCAN: Model-Driven Single-Pass Hallucination Detection for Faithful Visual Question Answering

arXiv:2601.00269v2 Announce Type: replace-cross Abstract: Faithfulness hallucinations in VQA occur when vision-language models produce fluent yet visually ungrounded answers, severely undermining their reliability in safety-critical applications. Existing detection methods mainly fall into two categories: external verification approaches relying on auxiliary models or knowledge bases, and uncertainty-driven approaches using repeated sampling or uncertainty estimates. The former suffer from high computational overhead and are limited by external resource quality, while the latter capture only limited facets of model uncertainty and fail to sufficiently explore the rich internal signals associated with the diverse failure modes. Both paradigms thus have inherent limitations in efficiency, robustness, and detection performance. To address these challenges, we propose FaithSCAN: a lightweight network that detects hallucinations by exploiting rich internal signals of VLMs, including token-level decoding uncertainty, intermediate visual representations, and cross-modal alignment features. These signals are fused via branch-wise evidence encoding and uncertainty-aware attention. We also extend the LLM-as-a-Judge paradigm to VQA hallucination and propose a low-cost strategy to automatically generate model-dependent supervision signals, enabling supervised training without costly human labels while maintaining high detection accuracy. Experiments on multiple VQA benchmarks show that FaithSCAN significantly outperforms existing methods in both effectiveness and efficiency. In-depth analysis shows hallucinations arise from systematic internal state variations in visual perception, cross-modal reasoning, and language decoding. Different internal signals provide complementary diagnostic cues, and hallucination patterns vary across VLM architectures, offering new insights into the underlying causes of multimodal hallucinations.
Read more →

Diffusion Timbre Transfer Via Mutual Information Guided Inpainting

arXiv:2601.01294v2 Announce Type: replace-cross Abstract: We study timbre transfer as an inference-time editing problem for music audio. Starting from a strong pre-trained latent diffusion model, we introduce a lightweight procedure that requires no additional training: (i) a dimension-wise noise injection that targets latent channels most informative of instrument identity, and (ii) an early-step clamping mechanism that re-imposes the input's melodic and rhythmic structure during reverse diffusion. The method operates directly on audio latents and is compatible with text/audio conditioning (e.g., CLAP). We discuss design choices, analyze trade-offs between timbral change and structural preservation, and show that simple inference-time controls can meaningfully steer pre-trained models for style-transfer use cases.
Read more →

VLM-CAD: VLM-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing

arXiv:2601.07315v3 Announce Type: replace-cross Abstract: Analog mixed-signal circuit sizing involves complex trade-offs within high-dimensional design spaces. Existing automatic analog circuit sizing approaches rely solely on netlists, ignoring the circuit schematic, which hinders the cognitive link between the schematic and its performance. Furthermore, the black-box nature of machine learning methods and hallucination risks in large language models fail to provide the necessary ground-truth explainability required for industrial sign-off. To address these challenges, we propose a Vision Language Model-optimized collaborative agent design workflow (VLM-CAD), which analyzes circuits, optimizes DC operating points, performs inference-based sizing, and executes external sizing optimization. We integrate Image2Net to annotate circuit schematics and generate a structured JSON description for precise interpretation by Vision Language Models. Furthermore, we propose an Explainable Trust Region Bayesian Optimization method (ExTuRBO) that employs collaborative warm-start from agent-generated seeds and offers dual-granularity sensitivity analysis for external sizing optimization, supporting a comprehensive final design report. Experiment results on amplifier sizing tasks using 180nm, 90nm, and 45nm Predictive Technology Models demonstrate that VLM-CAD effectively balances power and performance while maintaining physics-based explainability. VLM-CAD meets all specification requirements while maintaining low power consumption in optimizing an amplifier with a complementary input and a class-AB output stage, with a total runtime under 66 minutes across all experiments on two amplifiers.
Read more →

Demystifying the Slash Pattern in Attention: The Role of RoPE

arXiv:2601.08297v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) often exhibit slash attention patterns, where attention scores concentrate along the $\Delta$-th sub-diagonal for some offset $\Delta$. These patterns play a key role in passing information across tokens. But why do they emerge? In this paper, we demystify the emergence of these Slash-Dominant Heads (SDHs) from both empirical and theoretical perspectives. First, by analyzing open-source LLMs, we find that SDHs are intrinsic to models and generalize to out-of-distribution prompts. To explain the intrinsic emergence, we analyze the queries, keys, and Rotary Position Embedding (RoPE), which jointly determine attention scores. Our empirical analysis reveals two characteristic conditions of SDHs: (1) Queries and keys are almost rank-one, and (2) RoPE is dominated by medium- and high-frequency components. Under these conditions, queries and keys are nearly identical across tokens, and interactions between medium- and high-frequency components of RoPE give rise to SDHs. Beyond empirical evidence, we theoretically show that these conditions are sufficient to ensure the emergence of SDHs by formalizing them as our modeling assumptions. Particularly, we analyze the training dynamics of a shallow Transformer equipped with RoPE under these conditions, and prove that models trained via gradient descent exhibit SDHs. The SDHs generalize to out-of-distribution prompts.
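The geometry behind the slash pattern is easy to reproduce numerically. In the sketch below (toy head dimension, standard RoPE frequencies, and a single fixed query/key vector standing in for the near-rank-one condition, all assumptions), the pre-softmax score depends only on the relative offset between positions, so whichever offset maximizes it is emphasized on every row, i.e. along a sub-diagonal.

    # Toy demonstration: token-independent q/k plus RoPE gives offset-only scores.
    import numpy as np

    d = 64                                              # head dimension (even), toy choice
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))   # standard RoPE frequencies

    def rope(x, p):
        """Apply rotary position embedding to vector x at position p."""
        pairs = x.reshape(-1, 2)
        ang = p * freqs
        cos, sin = np.cos(ang), np.sin(ang)
        rotated = np.stack([pairs[:, 0] * cos - pairs[:, 1] * sin,
                            pairs[:, 0] * sin + pairs[:, 1] * cos], axis=1)
        return rotated.reshape(-1)

    rng = np.random.default_rng(0)
    q = rng.normal(size=d)      # token-independent query direction
    k = rng.normal(size=d)      # token-independent key direction

    # score between a query at position 32 and keys at positions 32 - delta:
    # it is a function of delta alone, so one offset dominates every row
    scores = np.array([rope(q, 32) @ rope(k, 32 - delta) for delta in range(32)])
    print("dominant relative offset:", int(np.argmax(scores)))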
Read more →

HOMURA: Taming the Sand-Glass for Time-Constrained LLM Translation via Reinforcement Learning

arXiv:2601.10187v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have achieved remarkable strides in multilingual translation but are hindered by a systemic cross-lingual verbosity bias, rendering them unsuitable for strict time-constrained tasks like subtitling and dubbing. Current prompt-engineering approaches struggle to resolve this conflict between semantic fidelity and rigid temporal feasibility. To bridge this gap, we first introduce Sand-Glass, a benchmark specifically designed to evaluate translation under syllable-level duration constraints. Furthermore, we propose HOMURA, a reinforcement learning framework that explicitly optimizes the trade-off between semantic preservation and temporal compliance. By employing a KL-regularized objective with a novel dynamic syllable-ratio reward, HOMURA effectively "tames" the output length. Experimental results demonstrate that our method significantly outperforms strong LLM baselines, achieving precise length control that respects linguistic density hierarchies without compromising semantic adequacy.
Read more →

LAPS: A Length-Aware-Prefill LLM Serving System

arXiv:2601.11589v2 Announce Type: replace-cross Abstract: LAPS identifies and disaggregates requests with different prompt lengths in LLM serving to reduce time-to-first-token (TTFT) latency. While recent systems have decoupled the prefill and decode stages to improve throughput, they still rely on unified scheduling policies that fail to adapt to heterogeneous workload characteristics. We observe that prompt-length variations lead to distinct performance bottlenecks, motivating an adaptive scheduling strategy. LAPS disaggregates multi-turn long-prefill requests from short-prefill ones and introduces a length-aware smart batching mechanism for short-prefill workloads. It adopts a dual-queue design that supports temporal disaggregation on a single prefill instance or spatial disaggregation across multiple instances. For short-prefill batches, a batch waiting window and CUDA Graph-based clustering mitigate interference from heterogeneous computation, reducing batching delay and lowering average latency. In real multi-turn workloads, LAPS reduces prefill latency by over 30% compared to vanilla SGLang under prefill-decode disaggregation, and further decreases SLO violations by 28% in multi-instance deployments with vanilla data-parallel configuration. Compared to the SGLang router with load balancing, it further lowers SLO violations by 12% in multi-GPU settings. Under high concurrency and mixed-request scenarios, LAPS improves request throughput by 35% when serving the Qwen2.5-32B model on prefill instances, demonstrating its effectiveness in optimizing heterogeneous LLM serving workloads.
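An illustrative sketch of the dual-queue idea. The prompt-length threshold, the waiting window, and the batch cap below are made-up numbers rather than LAPS's actual configuration: long-prefill requests are pulled off their own queue one at a time, while short-prefill requests are gathered into batches within the window.

    # Length-aware dual-queue prefill scheduling, heavily simplified.
    import time
    from collections import deque

    LONG_PROMPT_TOKENS = 4096   # assumed split point between short and long prefill
    BATCH_WINDOW_S = 0.005      # assumed waiting window for short-prefill batching
    MAX_BATCH = 16

    short_q, long_q = deque(), deque()

    def admit(request):
        """Route an incoming request by its prompt length."""
        q = long_q if request["prompt_tokens"] >= LONG_PROMPT_TOKENS else short_q
        q.append(request)

    def next_short_batch():
        """Gather short-prefill requests for up to BATCH_WINDOW_S seconds."""
        deadline = time.monotonic() + BATCH_WINDOW_S
        batch = []
        while len(batch) < MAX_BATCH and time.monotonic() < deadline:
            if short_q:
                batch.append(short_q.popleft())
            else:
                time.sleep(0.0005)
        return batch

    def next_long_request():
        """Long-prefill requests run alone on a dedicated instance or time slice."""
        return long_q.popleft() if long_q else None

    for tokens in (120, 300, 9000, 80, 5000):
        admit({"prompt_tokens": tokens})
    print(len(next_short_batch()), "short requests batched,",
          len(long_q), "long requests queued separately")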
Read more →

Human Values in a Single Sentence: Moral Presence, Hierarchies, and Transformer Ensembles on the Schwartz Continuum

arXiv:2601.14172v2 Announce Type: replace-cross Abstract: We study sentence-level detection of the 19 human values in the refined Schwartz continuum in about 74k English sentences from news and political manifestos (ValueEval'24 corpus). Each sentence is annotated with value presence, yielding a binary moral-presence label and a 19-way multi-label task under severe class imbalance. First, we show that moral presence is learnable from single sentences: a DeBERTa-base classifier attains positive-class F1 = 0.74 with calibrated thresholds. Second, we compare direct multi-label value detectors with presence-gated hierarchies under a single 8 GB GPU budget. Under matched compute, presence gating does not improve over direct prediction, indicating that gate recall becomes a bottleneck. Third, we investigate lightweight auxiliary signals - short-range context, LIWC-22 and moral lexica, and topic features - and small ensembles. Our best supervised configuration, a soft-voting ensemble of DeBERTa-based models enriched with such signals, reaches macro-F1 = 0.332 on the 19 values, improving over the best previous English-only baseline on this corpus (macro-F1 $\approx$ 0.28). We additionally benchmark 7-9B instruction-tuned LLMs (Gemma 2 9B, Llama 3.1 8B, Mistral 8B, Qwen 2.5 7B) in zero-/few-shot and QLoRA setups, and find that they lag behind the supervised ensemble under the same hardware constraint. Overall, our results provide empirical guidance for building compute-efficient, value-aware NLP models under realistic GPU budgets.
Read more →

PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice

arXiv:2601.16669v2 Announce Type: replace-cross Abstract: As large language models (LLMs) are increasingly applied to legal domain-specific tasks, evaluating their ability to perform legal work in real-world settings has become essential. However, existing legal benchmarks rely on simplified and highly standardized tasks, failing to capture the ambiguity, complexity, and reasoning demands of real legal practice. Moreover, prior evaluations often adopt coarse, single-dimensional metrics and do not explicitly assess fine-grained legal reasoning. To address these limitations, we introduce PLawBench, a Practical Law Benchmark designed to evaluate LLMs in realistic legal practice scenarios. Grounded in real-world legal workflows, PLawBench models the core processes of legal practitioners through three task categories: public legal consultation, practical case analysis, and legal document generation. These tasks assess a model's ability to identify legal issues and key facts, perform structured legal reasoning, and generate legally coherent documents. PLawBench comprises 850 questions across 13 practical legal scenarios, with each question accompanied by expert-designed evaluation rubrics, resulting in approximately 12,500 rubric items for fine-grained assessment. Using an LLM-based evaluator aligned with human expert judgments, we evaluate 10 state-of-the-art LLMs. Experimental results show that none achieves strong performance on PLawBench, revealing substantial limitations in the fine-grained legal reasoning capabilities of current LLMs and highlighting important directions for future evaluation and development of legal LLMs. Data is available at: https://github.com/skylenage/PLawbench.
Read more →

Sparsity-Aware Low-Rank Representation for Efficient Fine-Tuning of Large Language Models

arXiv:2601.16991v2 Announce Type: replace-cross Abstract: Adapting large pre-trained language models to downstream tasks often entails fine-tuning millions of parameters or deploying costly dense weight updates, which hinders their use in resource-constrained environments. Low-rank Adaptation (LoRA) reduces trainable parameters by factorizing weight updates, yet the underlying dense weights still impose high storage and computation costs. Magnitude-based pruning can yield sparse models but typically degrades LoRA's performance when applied naively. In this paper, we introduce SALR (Sparsity-Aware Low-Rank Representation), a novel fine-tuning paradigm that unifies low-rank adaptation with sparse pruning under a rigorous mean-squared-error framework. We prove that statically pruning only the frozen base weights minimizes the pruning error bound, and we recover the discarded residual information via a truncated-SVD low-rank adapter, which provably reduces per-entry MSE by a factor of $(1 - r/\min(d,k))$. To maximize hardware efficiency, we fuse multiple low-rank adapters into a single concatenated GEMM, and we adopt a bitmap-based encoding with a two-stage pipelined decoding + GEMM design to achieve true model compression and speedup. Empirically, SALR attains 50% sparsity on various LLMs while matching the performance of LoRA on GSM8K and MMLU, reduces model size by $2\times$, and delivers up to a $1.7\times$ inference speedup.
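The core construction is easy to sketch: magnitude-prune the frozen base weights, then absorb the discarded residual into a rank-r adapter obtained from a truncated SVD. Sizes, sparsity, and rank below are illustrative; the printout shows the MSE reduction from the adapter, in line with the (1 - r/min(d,k)) factor quoted above.

    # Prune the frozen base, recover the residual with a truncated-SVD adapter.
    import numpy as np

    def prune_and_adapt(W, sparsity=0.5, r=8):
        # magnitude pruning of the frozen base weights
        thresh = np.quantile(np.abs(W), sparsity)
        W_sparse = np.where(np.abs(W) >= thresh, W, 0.0)

        # recover the discarded residual with a rank-r truncated SVD
        residual = W - W_sparse
        U, S, Vt = np.linalg.svd(residual, full_matrices=False)
        A = U[:, :r] * S[:r]          # (d, r)
        B = Vt[:r, :]                 # (r, k)
        return W_sparse, A, B

    rng = np.random.default_rng(0)
    W = rng.normal(size=(256, 512))
    W_sparse, A, B = prune_and_adapt(W)
    approx = W_sparse + A @ B
    print("per-entry MSE of pruning alone :", float(np.mean((W - W_sparse) ** 2)))
    print("per-entry MSE with the adapter :", float(np.mean((W - approx) ** 2)))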
Read more →

Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations

arXiv:2601.17087v2 Announce Type: replace-cross Abstract: Agentic benchmarks increasingly rely on LLM-simulated users to scalably evaluate agent performance, yet the robustness, validity, and fairness of this approach remain unexamined. Through a user study with participants across the United States, India, Kenya, and Nigeria, we investigate whether LLM-simulated users serve as reliable proxies for real human users in evaluating agents on τ-Bench retail tasks. We find that user simulation lacks robustness, with agent success rates varying up to 9 percentage points across different user LLMs. Furthermore, evaluations using simulated users exhibit systematic miscalibration, underestimating agent performance on challenging tasks and overestimating it on moderately difficult ones. African American Vernacular English (AAVE) speakers experience consistently worse success rates and calibration errors than Standard American English (SAE) speakers, with disparities compounding significantly with age. We also find simulated users to be a differentially effective proxy for different populations, performing worst for AAVE and Indian English speakers. Additionally, simulated users introduce conversational artifacts and surface different failure patterns than human users. These findings demonstrate that current evaluation practices risk misrepresenting agent capabilities across diverse user populations and may obscure real-world deployment challenges.
Read more →

Spatiotemporal Semantic V2X Framework for Cooperative Collision Prediction

arXiv:2601.17216v2 Announce Type: replace-cross Abstract: Intelligent Transportation Systems (ITS) demand real-time collision prediction to ensure road safety and reduce accident severity. Conventional approaches rely on transmitting raw video or high-dimensional sensory data from roadside units (RSUs) to vehicles, which is impractical under vehicular communication bandwidth and latency constraints. In this work, we propose a semantic V2X framework in which RSU-mounted cameras generate spatiotemporal semantic embeddings of future frames using the Video Joint Embedding Predictive Architecture (V-JEPA). To evaluate the system, we construct a digital twin of an urban traffic environment, enabling the generation of diverse traffic scenarios with both safe and collision events. These embeddings of the future frame, extracted from V-JEPA, capture task-relevant traffic dynamics and are transmitted via V2X links to vehicles, where a lightweight attentive probe and classifier decode them to predict imminent collisions. By transmitting only semantic embeddings instead of raw frames, the proposed system significantly reduces communication overhead while maintaining predictive accuracy. Experimental results demonstrate that the framework with an appropriate processing method achieves a 10% F1-score improvement for collision prediction while reducing transmission requirements by four orders of magnitude compared to raw video. This validates the potential of semantic V2X communication to enable cooperative, real-time collision prediction in ITS.
Read more →

Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers

arXiv:2601.17367v2 Announce Type: replace-cross Abstract: The quadratic complexity of standard attention mechanisms poses a significant scalability bottleneck for large language models (LLMs) in long-context scenarios. While hybrid attention strategies that combine sparse and full attention within a single model offer a viable solution, they typically employ static computation ratios (i.e., fixed proportions of sparse versus full attention) and fail to adapt to the varying sparsity sensitivities of downstream tasks during inference. To address this issue, we propose Elastic Attention, which allows the model to dynamically adjust its overall sparsity based on the input. This is achieved by integrating a lightweight Attention Router into the existing pretrained model, which dynamically assigns each attention head to different computation modes. Within only 12 hours of training on 8xA800 GPUs, our method enables models to achieve both strong performance and efficient inference. Experiments across three long-context benchmarks on widely-used LLMs demonstrate the superiority of our method.
Read more →

From Specialist to Generalist: Unlocking SAM's Learning Potential on Unlabeled Medical Images

arXiv:2601.17934v2 Announce Type: replace-cross Abstract: Foundation models like the Segment Anything Model (SAM) show strong generalization, yet adapting them to medical images remains difficult due to domain shift, scarce labels, and the inability of Parameter-Efficient Fine-Tuning (PEFT) to exploit unlabeled data. While conventional models like U-Net excel in semi-supervised medical learning, their potential to assist a PEFT SAM has been largely overlooked. We introduce SC-SAM, a specialist-generalist framework where U-Net provides point-based prompts and pseudo-labels to guide SAM's adaptation, while SAM serves as a powerful generalist supervisor to regularize U-Net. This reciprocal guidance forms a bidirectional co-training loop that allows both models to effectively exploit the unlabeled data. Across prostate MRI and polyp segmentation benchmarks, our method achieves state-of-the-art results, outperforming other existing semi-supervised SAM variants and even medical foundation models like MedSAM, highlighting the value of specialist-generalist cooperation for label-efficient medical image segmentation. Our code is available at https://github.com/vnlvi2k3/SC-SAM.
Read more →

The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning

arXiv:2601.18832v2 Announce Type: replace-cross Abstract: Scaling test-time compute enhances long chain-of-thought (CoT) reasoning, yet existing approaches face a fundamental trade-off between computational cost and coverage quality: either incurring high training expense or yielding redundant trajectories. We introduce The Geometric Reasoner (TGR), a training-free framework that performs manifold-informed latent foresight search under strict memory bounds. At each chunk boundary, TGR scores candidate latent anchors via a lightweight look-ahead estimate combined with soft geometric regularizers that encourage smooth trajectories and diverse exploration. Chunk-wise KV cache resets keep memory linear in chunk length. On challenging math and code benchmarks, TGR improves robust trajectory coverage, measured by the area under the Pass@$k$ curve (AUC), by up to 13 points on Qwen3-8B, with negligible overhead of about 1.1--1.3 times.
Read more →

LLMs versus the Halting Problem: Revisiting Program Termination Prediction

arXiv:2601.18987v2 Announce Type: replace-cross Abstract: Determining whether a program terminates is a central problem in computer science. Turing's foundational result established the Halting Problem as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination, sometimes failing to prove or disprove it; these tools rely on problem-specific architectures and abstractions, and are usually tied to particular programming languages. Recent success and progress in large language models (LLMs) raise the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of C programs from the Termination category of the International Competition on Software Verification (SV-Comp) 2025. Our results suggest that LLMs perform remarkably well at predicting program termination, where GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked tool (using test-time-scaling), and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as a proof. Moreover, LLM performance drops as program length increases. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.
Read more →

EVEREST: An Evidential, Tail-Aware Transformer for Rare-Event Time-Series Forecasting

arXiv:2601.19022v2 Announce Type: replace-cross Abstract: Forecasting rare events in multivariate time-series data is challenging due to severe class imbalance, long-range dependencies, and distributional uncertainty. We introduce EVEREST, a transformer-based architecture for probabilistic rare-event forecasting that delivers calibrated predictions and tail-aware risk estimation, with auxiliary interpretability via attention-based signal attribution. EVEREST integrates four components: (i) a learnable attention bottleneck for soft aggregation of temporal dynamics; (ii) an evidential head for estimating aleatoric and epistemic uncertainty via a Normal--Inverse--Gamma distribution; (iii) an extreme-value head that models tail risk using a Generalized Pareto Distribution; and (iv) a lightweight precursor head for early-event detection. These modules are jointly optimized with a composite loss (focal loss, evidential NLL, and a tail-sensitive EVT penalty) and act only at training time; deployment uses a single classification head with no inference overhead (approximately 0.81M parameters). On a decade of space-weather data, EVEREST achieves state-of-the-art True Skill Statistic (TSS) of 0.973/0.970/0.966 at 24/48/72-hour horizons for C-class flares. The model is compact, efficient to train on commodity hardware, and applicable to high-stakes domains such as industrial monitoring, weather, and satellite diagnostics. Limitations include reliance on fixed-length inputs and exclusion of image-based modalities, motivating future extensions to streaming and multimodal forecasting.
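As a small illustration of the EVT component, the sketch below fits a Generalized Pareto Distribution to threshold exceedances of a synthetic heavy-tailed signal and converts it into a tail probability. The threshold quantile and the data are placeholders, not the paper's setup.

    # Tail-risk estimation with a GPD over threshold exceedances (peaks over threshold).
    import numpy as np
    from scipy.stats import genpareto

    rng = np.random.default_rng(0)
    x = rng.lognormal(mean=0.0, sigma=1.0, size=20000)    # synthetic heavy-tailed signal
    u = np.quantile(x, 0.95)                              # exceedance threshold (assumed)
    exceedances = x[x > u] - u

    c, _, scale = genpareto.fit(exceedances, floc=0.0)    # fit GPD shape and scale to the tail
    # P(X > u + 3) = P(X > u) * P(exceedance > 3)
    p_tail = 0.05 * genpareto.sf(3.0, c, loc=0.0, scale=scale)
    print(f"estimated P(X > u + 3) = {p_tail:.4f}")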
Read more →

CLIP-Guided Unsupervised Semantic-Aware Exposure Correction

arXiv:2601.19129v2 Announce Type: replace-cross Abstract: Improper exposure often leads to severe loss of details, color distortion, and reduced contrast. Exposure correction still faces two critical challenges: (1) ignoring object-wise regional semantic information causes color-shift artifacts; (2) real-world exposure images generally have no ground-truth labels, and labeling them entails massive manual editing. To tackle the challenges, we propose a new unsupervised semantic-aware exposure correction network. It contains an adaptive semantic-aware fusion module, which effectively fuses the semantic information extracted from a pre-trained Fast Segment Anything Model into a shared image feature space. Then the fused features are used by our multi-scale residual spatial mamba group to restore the details and adjust the exposure. To avoid manual editing, we propose a pseudo-ground truth generator guided by CLIP, which is fine-tuned to automatically identify exposure situations and instruct the tailored corrections. Also, we leverage the rich priors from the FastSAM and CLIP to develop a semantic-prompt consistency loss to enforce semantic consistency and image-prompt alignment for unsupervised training. Comprehensive experimental results illustrate the effectiveness of our method in correcting real-world exposure images; it outperforms state-of-the-art unsupervised methods both numerically and visually.
Read more →

A Scalable Inter-edge Correlation Modeling in CopulaGNN for Link Sign Prediction

arXiv:2601.19175v2 Announce Type: replace-cross Abstract: Link sign prediction on a signed graph is a task to determine whether the relationship represented by an edge is positive or negative. Since the presence of negative edges violates the graph homophily assumption that adjacent nodes are similar, regular graph methods have not been applicable without auxiliary structures to handle them. We aim to directly model the latent statistical dependency among edges with the Gaussian copula and its corresponding correlation matrix, extending CopulaGNN (Ma et al., 2021). However, a naive modeling of edge-edge relations is computationally intractable even for a graph of moderate scale. To address this, we propose to 1) represent the correlation matrix as a Gramian of edge embeddings, significantly reducing the number of parameters, and 2) reformulate the conditional probability distribution to dramatically reduce the inference cost. We theoretically verify the scalability of our method by proving its linear convergence. Also, our extensive experiments demonstrate that it achieves significantly faster convergence than baselines, maintaining competitive prediction performance to the state-of-the-art models.
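The parameter saving from the Gramian trick is straightforward to see in code. Below, edge embeddings of dimension d parameterize an m x m correlation matrix as a Gramian with unit diagonal; the row normalization is one simple way to enforce the unit diagonal, and the sizes are illustrative.

    # Correlation matrix as a Gramian of low-dimensional edge embeddings.
    import numpy as np

    m, d = 1000, 16                       # number of edges, embedding dimension
    rng = np.random.default_rng(0)
    E = rng.normal(size=(m, d))
    E /= np.linalg.norm(E, axis=1, keepdims=True)   # unit-norm rows -> unit diagonal

    Sigma = E @ E.T                       # m x m, positive semi-definite, rank <= d
    print("parameters:", E.size, "instead of", m * (m - 1) // 2)

Because Sigma is determined by E, inference can work with the m x d embedding matrix directly rather than ever materializing the m x m matrix.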
Read more →

RPO-RAG: Aligning Small LLMs with Relation-aware Preference Optimization for Knowledge Graph Question Answering

arXiv:2601.19225v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have recently demonstrated remarkable reasoning abilities, yet hallucinate on knowledge-intensive tasks. Retrieval-augmented generation (RAG) mitigates this issue by grounding answers in external sources, e.g., knowledge graphs (KGs). However, existing KG-based RAG approaches rely on semantics-unaware path sampling and are weakly aligned with KG reasoning objectives, which limits further accuracy gains. They also feed retrieved paths directly into the reasoner without organizing them into answer-centered reasoning paths, hindering small LLMs' ability to leverage the retrieved knowledge. Furthermore, prior works predominantly rely on large LLMs (e.g., ChatGPT/GPT-4) or assume backbones above 7B parameters, leaving sub-7B models underexplored. We address this gap with RPO-RAG, the first KG-based RAG framework specifically designed for small LLMs, to the best of our knowledge. RPO-RAG introduces three key innovations: (1) a query-path semantic sampling strategy that provides informative supervisory signals; (2) a relation-aware preference optimization that aligns training with intermediate KG reasoning signals (e.g., relation); and (3) an answer-centered prompt design that organizes entities and reasoning paths in an interpretable format. Extensive experiments on two benchmark Knowledge Graph Question Answering (KGQA) datasets, WebQSP and CWQ, demonstrate that RPO-RAG effectively bridges the performance gap between small and large language models. On WebQSP, it improves F1 by up to 8.8%, reflecting enhanced answer precision, while on CWQ it achieves new state-of-the-art results among models under 8B parameters in both Hit and F1. Overall, RPO-RAG substantially improves the reasoning capability of small LLMs, even under 3B parameters-highlighting their potential for resource-efficient and practical on-device KGQA applications.
Read more →

Tri-Reader: An Open-Access, Multi-Stage AI Pipeline for First-Pass Lung Nodule Annotation in Screening CT

arXiv:2601.19380v2 Announce Type: replace-cross Abstract: Using multiple open-access models trained on public datasets, we developed Tri-Reader, a comprehensive, freely available pipeline that integrates lung segmentation, nodule detection, and malignancy classification into a unified tri-stage workflow. The pipeline is designed to prioritize sensitivity while reducing the candidate burden for annotators. To ensure accuracy and generalizability across diverse practices, we evaluated Tri-Reader on multiple internal and external datasets as compared with expert annotations and dataset-provided reference standards.
Read more →

R^3: Replay, Reflection, and Ranking Rewards for LLM Reinforcement Learning

arXiv:2601.19620v2 Announce Type: replace-cross Abstract: Large reasoning models (LRMs) aim to solve diverse and complex problems through structured reasoning. Recent advances in group-based policy optimization methods have shown promise in enabling stable advantage estimation without reliance on process-level annotations. However, these methods rely on advantage gaps induced by high-quality samples within the same batch, which makes the training process fragile and inefficient when intra-group advantages collapse under challenging tasks. To address these problems, we propose a reinforcement learning mechanism named R^3 that works along three directions: (1) a cross-context Replay strategy that maintains the intra-group advantage by recalling valuable examples from historical trajectories of the same query, (2) an in-context self-Reflection mechanism enabling models to refine outputs by leveraging past failures, and (3) a structural entropy Ranking reward, which assigns relative rewards to truncated or failed samples by ranking responses based on token-level entropy patterns, capturing both local exploration and global stability. We implement our method on Deepseek-R1-Distill-Qwen-1.5B and train it on the DeepscaleR-40k dataset in the math domain. Experiments demonstrate our method achieves SoTA performance on several math benchmarks, delivering significant improvements over the base models while using fewer reasoning tokens. Code and model will be released.
Read more →

ProToken: Token-Level Attribution for Federated Large Language Models

arXiv:2601.19672v2 Announce Type: replace-cross Abstract: Federated Learning (FL) enables collaborative training of Large Language Models (LLMs) across distributed data sources while preserving privacy. However, when federated LLMs are deployed in critical applications, it remains unclear which client(s) contributed to specific generated responses, hindering debugging, malicious client identification, fair reward allocation, and trust verification. We present ProToken, a novel Provenance methodology for Token-level attribution in federated LLMs that addresses client attribution during autoregressive text generation while maintaining FL privacy constraints. ProToken leverages two key insights to enable provenance at each token: (1) transformer architectures concentrate task-specific signals in later blocks, enabling strategic layer selection for computational tractability, and (2) gradient-based relevance weighting filters out irrelevant neural activations, focusing attribution on neurons that directly influence token generation. We evaluate ProToken across 16 configurations spanning four LLM architectures (Gemma, Llama, Qwen, SmolLM) and four domains (medical, financial, mathematical, coding). ProToken achieves 98% average attribution accuracy in correctly localizing responsible client(s), and maintains high accuracy when the number of clients is scaled, validating its practical viability for real-world deployment settings.
Read more →

LVLMs and Humans Ground Differently in Referential Communication

arXiv:2601.19792v2 Announce Type: replace-cross Abstract: For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common ground. Here, we present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact with multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We release the online pipeline for data collection, the tools and analyses for accuracy, efficiency, and lexical overlap, and a corpus of 356 dialogues (89 pairs over 4 rounds each) that unmasks LVLMs' limitations in interactively resolving referring expressions, a crucial skill that underlies human language use.
Read more →

Aeronaut 1.0

New Mac app by Mikey Clarke, and it’s just what it says on the tin: a “lovingly crafted Bluesky app designed and built just for the Mac”. I’ve been beta testing Aeronaut for months, and it’s the only interface to Bluesky I actually like. It’s a real Mac app — written mostly in AppKit, supporting all the right UI idioms and platform integrations. It’s not just the best Bluesky client I’ve seen, for any platform, but maybe the best new Mac app I’ve seen in years, period. Certainly the one whose very existence has made me happiest. Next time someone tells me no one makes good new native apps for the Mac anymore, I’m going to tell them Mikey Fucking Clarke does. $2/month or $15/year. A veritable bargain for an app so nice. ★
Read more →

Bruce Springsteen: ‘Streets of Minneapolis’

Bruce Springsteen: I wrote this song on Saturday, recorded it yesterday and released it to you today in response to the state terror being visited on the city of Minneapolis. It’s dedicated to the people of Minneapolis, our innocent immigrant neighbors and in memory of Alex Pretti and Renee Good. Best line from the lyrics: Their claim was self-defense, Just don’t believe your eyes. It’s our blood and bones and these whistles and phones Against Miller and Noem’s dirty lies. Whistles, phones, and birds. ★
Read more →

★ Politics and the English Language, January 2026 Edition

Patrick McGee (author of last year’s bestseller, Apple in China, and guest on The Talk Show in May), commenting on Twitter/X re: Tim Cook’s company-wide memo regarding the “events in Minneapolis”: This literally says nothing, via intention and cowardice. It’s the kind of language Orwell attributed to politicians, when ready-made phrases assemble themselves and prevent any real thought from breaking through. I have previously linked to George Orwell’s seminal 1946 essay, “Politics and the English Language”. This time I’ll quote a different passage: In our time it is broadly true that political writing is bad writing. Where it is not true, it will generally be found that the writer is some kind of rebel, expressing his private opinions and not a “party line”. Orthodoxy, of whatever colour, seems to demand a lifeless, imitative style. The political dialects to be found in pamphlets, leading articles, manifestos, White papers and the speeches of undersecretaries do, of course, vary from party to party, but they are all alike in that one almost never finds in them a fresh, vivid, homemade turn of speech. When one watches some tired hack on the platform mechanically repeating the familiar phrases — bestial, atrocities, iron heel, bloodstained tyranny, free peoples of the world, stand shoulder to shoulder — one often has a curious feeling that one is not watching a live human being but some kind of dummy: a feeling which suddenly becomes stronger at moments when the light catches the speaker’s spectacles and turns them into blank discs which seem to have no eyes behind them. And this is not altogether fanciful. A speaker who uses that kind of phraseology has gone some distance toward turning himself into a machine. The appropriate noises are coming out of his larynx, but his brain is not involved, as it would be if he were choosing his words for himself. If the speech he is making is one that he is accustomed to make over and over again, he may be almost unconscious of what he is saying, as one is when one utters the responses in church. And this reduced state of consciousness, if not indispensable, is at any rate favourable to political conformity. In our time, political speech and writing are largely the defence of the indefensible. Things like the continuance of British rule in India, the Russian purges and deportations, the dropping of the atom bombs on Japan, can indeed be defended, but only by arguments which are too brutal for most people to face, and which do not square with the professed aims of the political parties. Thus political language has to consist largely of euphemism, question-begging and sheer cloudy vagueness. Defenceless villages are bombarded from the air, the inhabitants driven out into the countryside, the cattle machine-gunned, the huts set on fire with incendiary bullets: this is called pacification. Millions of peasants are robbed of their farms and sent trudging along the roads with no more than they can carry: this is called transfer of population or rectification of frontiers. People are imprisoned for years without trial, or shot in the back of the neck or sent to die of scurvy in Arctic lumber camps: this is called elimination of unreliable elements. Such phraseology is needed if one wants to name things without calling up mental pictures of them. Now consider Cook’s memo. Cook avoids most of the sins Orwell describes. He uses short, common words. He eschews hackneyed metaphors. He uses the active, not passive, voice — for the most part. 
His prayers and sympathies are “with everyone that’s been affected.” Who, exactly, has been affected? Affected how? By whom? Numerous examples come to mind, but not from Cook’s memo. Two Minneapolitans were affected, quite adversely, by being shot in the head and back at point blank range, in broad daylight, by unhinged ICE goons. A five-year-old boy — himself a U.S.-born citizen — was affected when ICE agents apprehended his father, now being held in a notorious detention center in Texas, a thousand miles away, and used the boy as bait to lure other family members. The list is long, the stories searing. But Cook mentions nothing more specific than “everyone that’s been affected”. Such phraseology is needed if one wants to name things without calling up mental pictures of them, indeed. “This is a time for deescalation,” Cook wrote. But by whom? The masked federal agents laying siege to Minneapolis, brutalizing its citizenry? Or the thousands of law-abiding citizens protesting the occupation of their neighborhoods, who are, in the words of Seth Meyers, “deploying the most hurtful weapon of all, the bird”? Cook’s call for “deescalation” is meaningless without specifying which side he’s calling upon to change course, and there’s no weaker sauce than the weak sauce of “both sides”. Using words, not to make a point, but to avoid making a point while creating the illusion of having made one, is the true sin. From Orwell’s closing paragraph: Political language — and with variations this is true of all political parties, from Conservatives to Anarchists — is designed to make lies sound truthful and murder respectable, and to give an appearance of solidity to pure wind. It’s colder in Minnesota, but the wind is gusting in Cupertino.
Read more →

Tim Cook Wrote a Memo on the ‘Events in Minneapolis’

Tim Cook, in a company-wide memo (first published by Mark Gurman): Team, I’m heartbroken by the events in Minneapolis, and my prayers and deepest sympathies are with the families, with the communities, and with everyone that’s been affected. This is a time for deescalation. I believe America is strongest when we live up to our highest ideals, when we treat everyone with dignity and respect no matter who they are or where they’re from, and when we embrace our shared humanity. This is something Apple has always advocated for. I had a good conversation with the president this week where I shared my views, and I appreciate his openness to engaging on issues that matter to us all. I know this is very emotional and challenging for so many. I am proud of how deeply our teams care about the world beyond our walls. That empathy is one of Apple’s greatest strengths and it is something I believe we all cherish. Thank you for all that you do. Tim “Events” is doing a lot of work there, to describe what has happened and is happening in Minneapolis. Trump’s “openness” on this particular “issue” has been to replace Greg Bovino — the diminutive Himmler-cosplaying “commander at large” of Border Control, who insisted, adamantly, that the real victims in Alex Pretti’s murder were the Border Patrol agents who shot him — with “border czar” Tom Homan, a man who took a $50,000 cash bribe from undercover FBI agents in exchange for a promise to award them government contracts if Trump were reelected. Zac Hall, on Twitter/X: Cook took three days to not name Alex Pretti in his not public statement and 20 days to not name Renée Good in his not public statement. [...] 2020 Tim Cook on Apple’s homepage: “Right now, there is a pain deeply etched in the soul of our nation and in the hearts of millions. To stand together, we must stand up for one another, and recognize the fear, hurt, and outrage rightly provoked by the senseless killing of George Floyd and a much longer history of racism.” Quite the different message (and medium — this time with nothing on Apple’s website, let alone their homepage) from 2020, for what I consider far more outrageous and alarming killings. ★
Read more →

Meta’s Response to Reuters Report on ‘Romance AI Chatbots’ for Teenagers

Andy Stone, VP of communications at Meta, responding, in a series of tweets on Twitter/X, to Jeff Horwitz’s report at Reuters yesterday, linked here last night, which claimed that “Zuckerberg blocked curbs on sex-talking chatbots for minors”: Never let the facts get in the way of a good story, eh, @Reuters, @JeffHorwitz! The documents you cite in the story itself contradict this headline. The headline says “Zuckerberg blocked curbs on sex-talking chatbots for minors” But the story cites a document that says “Zuckerberg believed that AI companions should be blocked from engaging in sexually ‘explicit’ conversations” w young people. Huh?! After my post last night, a friend of mine, with a career of experience working in a large company, sent me this: A word of caution. “Scumbag middle manager says CEO said” is not the same as “CEO said.” I could believe Zuck shitcanned parental controls, but I am certain there are thousands of snakes inside that company who would lie about it to get what they want. That’s a good and fair point, and I think it’s what Stone is trying to emphasize above. The New Mexico lawsuit filing doesn’t contain evidence that Zuckerberg nixed parental controls for teens engaging in chats with AI bots; it contains evidence that other (unnamed) employees claimed in internal discussions that Zuckerberg had nixed them. That is different. But so let’s take Zuckerberg out of it personally. It’s still the case that Meta shipped these chatbots for teens to use. And the buck, presumably, stops at Zuck’s desk. Read Horwitz’s report from back in August, detailing a leaked internal document listing Meta’s content guidelines for generative AI chat. Sidenote: Why in the world is Meta’s VP of comms doing this on Twitter/X, not Threads, which continues to grow? ★
Read more →

Bliki: Excessive Bold

I'm increasingly seeing a lot of technical and business writing make heavy use of bold font weights, in an attempt to emphasize what the writers think is important. LLMs seem to have picked up and spread this practice widely. But most of this is self-defeating: the more a writer uses typographical emphasis, the less power it has, quickly reaching the point where it loses all its benefits. There are various typographical tools that are used to emphasize words and phrases, such as: bold, italic, capitals, and underlines. I find that bold is the one that's getting most of the over-use. Using a lot of capitals is rightly reviled as shouting, and when we see it used widely, it raises our doubts on the quality of the underlying thinking. Underlines have become the signal for hyperlinks, so I rarely see this for emphasis any more. Both capitals and underlines have also been seen as rather cheap forms of highlight, since we could do them with typewriters and handwriting, while bold and italics were only possible after the rise of word-processors. (Although I realize most of my readers are too young to remember when word-processors were novel.) Italics are the subtler form of emphasis. When I use them in a paragraph, they don't leap out to the eye. This allows me to use them in long flows of text when I want to set it apart, and when I use it to emphasize a phrase it only makes its presence felt when I'm fully reading the text. For this reason, I prefer to use italics for emphasis, but I only use it rarely, suggesting it's really important to put stress on the word should I be speaking the paragraph (and I always try to write in the way that I speak). The greatest value of bold is that it draws the eye to the bold text even if the reader isn't reading, but glancing over the page. This is an important property, but one that only works if it's used sparingly. Headings are often done in bold, because it's important to help the reader navigate a longer document by skimming and looking for headings to find the section they want to read. I rarely use bold within a prose paragraph, because of my desire to be parsimonious with bold. One use I do like is to highlight unfamiliar words at the point where I explain them. I got this idea from Giarratano and Riley. I noticed that when the unfamiliar term reappeared, I was often unsure what it meant, but glancing back and finding the bold quickly reminded me. The trick here is to place the bold at the point of explanation, which is often, but not always, at its first use. 1 A common idea is to take an important sentence and bold that, so it leaps out while skimming the article. That can be worthwhile, but as ever with this kind of emphasis, its effectiveness is inversely proportional to how often it's used. It's also usually not the best tool for the job. Callouts usually work better. They do a superior job of drawing the eye, and furthermore they don't need to use the same words as in the prose text. This allows me to word the callout better than it could be if it also had to fit in the flow of the prose. A marginal case is where I see bold used in the first clause of each item in a bulleted list. In some ways this is acting like a heading for the text in the list. But we don't need a heading for every paragraph, and the presence of the bullets does enough to draw the eye to the items. And bullet-lists are overused too - I always try to write such things as a prose paragraph instead, as prose flows much better than bullets and is thus more pleasant to read.
It's important to write in a way that makes reading an enjoyable experience for the reader - even, indeed especially, when I'm also trying to explain things to them. While writing this, I was tempted to illustrate my point by using excessive bold in a paragraph, showing the problem and hopefully demonstrating why lots of bold loses the power to emphasize and attract the skimming eye. But I also wanted to explain my position clearly, and I felt that illustrating the problem would undermine my attempt. So I've confined the example to a final flourish. (And, yes, I have seen text with as much bold as this.)

Notes

[1]: For example, sometimes a new term will appear first in a list. E.g. "We carry out this process in three steps: frobning, gibbling, and eorchisting". In this case we don't bold the words as they appear in the list, but later on when we explain what on earth they mean.
Read more →

Court Filing Claims Zuckerberg Blocked Curbs at Meta on Sex-Talking Chatbots for Minors

Jeff Horwitz, reporting for Reuters: Meta Chief Executive Mark Zuckerberg approved allowing minors to access AI chatbot companions that safety staffers warned were capable of sexual interactions, according to internal Meta documents filed in a New Mexico state court case and made public Monday. The lawsuit — brought by the state’s attorney general, Raul Torrez, and scheduled for trial next month — alleges that Meta “failed to stem the tide of damaging sexual material and sexual propositions delivered to children” on Facebook and Instagram. [...] Messages between two employees from March of 2024 state that Zuckerberg had rejected creating parental controls for the chatbots, and that staffers were working on “Romance AI chatbots” that would be allowed for users under the age of 18. We “pushed hard for parental controls to turn GenAI off — but GenAI leadership pushed back stating Mark decision,” one employee wrote in that exchange. Horwitz was with The Wall Street Journal for a long time; his is a byline worth paying attention to. ★
Read more →

Good news, bad news: Samsung Galaxy TriFold finally has a U.S. release date and price tag - Mashable

It ain't cheap.
Read more →

Here's the Secret Galaxy S26 Ultra Display Security Feature You'll Want - Droid Life

You guys know what a privacy screen is, right? A privacy screen allows someone viewing a display straight-on to be able to view content, while any off angle brings a tint that at least attempts to hide whatever is on the screen. These are common in office spa…
Read more →

Moltbot Is Taking Over Silicon Valley - WIRED

People are letting the viral AI assistant formerly known as Clawdbot run their lives, regardless of the privacy concerns.
Read more →

Mastra empowers web devs to build AI agents in TypeScript

Python dominated the early days of machine learning, but that's changing as AI becomes more mainstream. Take, for instance, the recent release of Mastra, an open source agentic AI framework that uses TypeScript rather than Python. Developers are less interested in what goes into a large language model and more intrigued by how to build an application on top of these models, according to Sam Bhagwat, Mastra's co-creator and a full stack developer best known for his work as co-founder of the web framework Gatsby. Developers don't have to know Python to build agents, because agents don't require the same heavy computational work that building models does, he says. "Build agents don't tend to need to do that kind of heavy tough work," Bhagwat tells The New Stack. "It's a lot of 'Hey, am I providing my agent with the right context at this time? Does it have the ability to call the right tools, to perform actions on behalf to the users that are using this. Can I get the right information, which is much closer to web app development?" And that's the domain of frontend developers, he adds. "There's this whole community, essentially, of full stack engineers that was being left out because we're not really Python people. We're JavaScript types," he says. "We wanted to make a great tool for them."

Why TypeScript?

TypeScript has become a sort of default language for modern product teams, Bhagwat tells TNS. "TypeScript tends to be better for web app development because your frontend is going to be written in JavaScript, in TypeScript, pretty much no matter what," Bhagwat says. "When you have the backend of that written also in TypeScript, you just have a nicer integration." It also opens up AI agents to a world of TypeScript-savvy developers. In fact, last year GitHub revealed that TypeScript overtook both Python and JavaScript as the most used language on its platform. The adoption shift "marks the most significant language shift in more than a decade," the GitHub team says.

Starting with AI agents

Agents are already changing how we interface with the internet, according to Bhagwat. "It's really interesting for people that are in this Dev Tools world, because we're moving from a world where humans are writing code to where people are writing with Claude Code or Cursor. That changes a few things," he says. Increasingly, people are using internal documents with AI, which typically looks for markdown. "If an agent is browsing the web and looking for a doc, because it's a coding agent, it's typically looking for markdown, and so it sends a request for markdown," he explains. "Now some people are changing the content of their docs, specifically adding special instructions for agents, because they can tell who the visitors are."

"…we're moving from a world where humans are writing code to where people are writing with Claude Code or Cursor." – Sam Bhagwat, Mastra co-creator

How important is it that web developers learn to build AI agents? Bhagwat sees more businesspeople using AI to code solutions and even to train their own agents. Then there's this: Guillermo Rauch, CEO of frontend cloud provider Vercel and creator of Next.js, has warned that the next evolution of frontend development will focus on building AI agents. Right now, developers are tinkering and learning about building AI agents as they usually do: through personal side projects. For instance, Bhagwat needs to make a shopping list every week. He built an agent that understands the dietary preferences of his household.
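To give a feel for what that looks like in practice, here is a minimal sketch of such a shopping-list agent in TypeScript. It is illustrative only: the import paths, option names, and model helper shown here are assumptions based on Mastra's general agent/tool primitives, not verbatim Mastra API, so check the project's docs for the exact signatures.

// Hypothetical sketch: names and import paths are illustrative, not verbatim Mastra API.
import { Agent } from "@mastra/core/agent";      // assumed import path
import { createTool } from "@mastra/core/tools"; // assumed import path
import { openai } from "@ai-sdk/openai";         // assumed model provider helper
import { z } from "zod";

// A tool the agent can call to fetch the household's dietary preferences.
const getDietaryPreferences = createTool({
  id: "get-dietary-preferences",
  description: "Returns the household's dietary restrictions and preferences",
  inputSchema: z.object({}),
  execute: async () => ({ vegetarian: true, allergies: ["peanuts"] }),
});

// The shopping-list agent: prompt instructions plus a model and tools.
export const shoppingListAgent = new Agent({
  name: "shopping-list-agent",
  instructions:
    "Plan a weekly shopping list that respects the household's dietary preferences.",
  model: openai("gpt-4o-mini"),
  tools: { getDietaryPreferences },
});

// Usage: ask the agent for this week's list.
// const result = await shoppingListAgent.generate("Build my shopping list for this week.");

The point of the sketch is the shape of the work Bhagwat describes: wiring context and tools to a model call, which looks much more like web app development than model training.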
Many developers are launching similar personal projects so they can understand the technology before they have to deploy it in the enterprise, he says. To help developers get started, Bhagwat has written an e-book, Principles of building AI agents, that brings developers up to speed on what they need to know about agents and building with Mastra. It's available for free download with email registration. He has also written a second book, Patterns for building AI agents, also available with email registration.

What Mastra offers out of the box

If you've used Replit to build an agent, you've used Mastra, according to Bhagwat. But let's take a look at what you get with the full framework. Mastra offers a few core framework primitives, starting with agents: autonomous code that uses LLMs, specific prompt instructions, and tools to complete user requests. It also supports workflows, which allow developers to orchestrate complex, multistep processes. And of course it incorporates RAG (Retrieval-Augmented Generation) functionality, with built-in support for data syncing, web scraping, and vector database management. It offers an MCP server that lets users provide a local copy of documentation to the AI. The tool has both short-term and long-term memory systems that allow agents to remember context across threads and sessions.

Mastra users also have access to tools, specifically:

Mastra Studio, a local developer playground where web developers can visualize, test, and debug agents and workflows in real time.
A Model Context Protocol (MCP) client, which allows developers to connect agents to pre-built tools, such as Google Sheets, GitHub, or internal databases, all without writing custom integrations.
AI tracing and observability, so developers can see how the LLM is reasoning, along with token counts and execution steps.
Scorers and evals, tools that measure the performance and accuracy of AI agents using model-graded or rule-based metrics. These are designed to help developers refine prompts before shipping to production.

The company also offers a fully managed cloud platform for zero-config deployments.

Framework Support

The Mastra team has built in integrations with some frontend frameworks, including:

Next.js
Nuxt (Vue)
Astro
SvelteKit
React and Vite

On the backend, it supports:

Express
Hono
Fastify
Koa

Mastra also integrates with agentic UI libraries that help web developers build agentic frontend experiences, such as:

CopilotKit, an open source framework that helps build Copilot experiences directly inside existing applications.
Assistant UI, an open source TypeScript and React library that helps developers build high-quality AI chat interfaces.

The post Mastra empowers web devs to build AI agents in TypeScript appeared first on The New Stack.
Read more →

Kubernetes telemetry feature fully compromises clusters

If Kubernetes admins don't have enough to worry about with the upcoming Nginx gateway cutoff, they now may need to rifle through their Helm charts to potentially thwart a dangerous setting. Security researcher Graham Helton has shared a Kubernetes vulnerability he unearthed that allows a user armed with nothing more than read-only permission to run arbitrary and even privileged commands on any pod in a cluster. His trick is to use a service account with permission for the Kubernetes nodes/proxy GET resource, which is used by dozens of monitoring tools and provides access for issuing privileged-level pod commands. In other words, it's a feature, not a bug.

Working as intended

Helton initially reported the quirk as a bug in November through the Kubernetes bug bounty program. The issue was soon closed, marked as "intended behavior." The nodes/proxy GET call is intended for service accounts and is used by many monitoring tools. How a GET request gets transformed into full remote code execution comes down to a mismatch between WebSockets and the Kubelet's authorization logic. Helton found Helm charts for 69 tools that used nodes/proxy GET. For them, it provides the permissions to reach a node's internal API to get the data they need. "Some of the worlds biggest kubernetes vendors rely on it because there is no generally available alternative," Helton writes on X. So, no CVE alert for the nodes/proxy GET behavior, because it's not a vulnerability. The official path forward is KEP-2862 ("Fine-Grained Kubelet API Authorization"), an extension slated for the upcoming Kubernetes 1.36 release, expected in April.

How to bring down a Kubernetes cluster

So, if you have a service account that's subscribed to nodes/proxy GET, and can reach a node's Kubelet on port 10250, then you are free to issue any command to /exec endpoints, including commands for privileged system pods that could destroy the cluster entirely. Here are some other things you can do, according to Helton: steal service account tokens from other pods, or execute code in control plane pods. Worse yet, no record would be left of such malicious actions, as the "Kubernetes AuditPolicy does not log commands executed through a direct connection to the Kubelet's API," Helton explains. Here is the cluster permission set that makes this all possible:

# Vulnerable ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: nodes-proxy-reader
rules:
- apiGroups: [""]
  resources: ["nodes/proxy"]
  verbs: ["get"]

If you want to try it out for yourself, Helton posted an entire lab.

Precautions to take?

Hard questions may have to be asked by those running with these settings: do you value your telemetry more than your security? Industry observer Alex Ellis calls the disclosure "worrying." Jed Salazar, field CTO at cloud native security company Edera, notes the vulnerability points out how Kubernetes workloads are different in 2026 than they were in 2016. They're no longer just stateless apps. They're "AI training pipelines with proprietary model weights, financial trading systems, and healthcare applications with patient data," he writes. "The blast radius of a monitoring stack compromise in 2026 is categorically different from 2016." The answer, Salazar writes, is architectural isolation, which is what Edera offers (the configuration did not leave Edera users vulnerable, Salazar notes).
For everyone else, until KEP-2862 fully trickles down to production, Salazar advised a number of precautions:

Audit your RBAC policies for nodes/proxy permissions immediately.
Consider whether monitoring tools truly need direct kubelet access.
Implement network policies restricting access to kubelet port 10250.
Plan your migration to KEP-2862 fine-grained permissions when they GA.
Adopt workload isolation technologies that limit blast radius regardless of upstream decisions.

To those who use multitenant Kubernetes: It's not a matter of if you'll get pwn'd It's when https://t.co/RkgcOhwWjq — Jake (@JustJake) January 28, 2026

The post Kubernetes telemetry feature fully compromises clusters appeared first on The New Stack.
Read more →

Show HN: I'm building an AI-proof writing tool. How would you defeat it?

Comments
Read more →

With Auto Browse, Google Chrome can now surf the web for you

For a while now, Google has been on a mission to extend its Chrome browser with more AI features, powered by its Gemini models. Now, the company is launching a slew of new AI features based on Gemini 3 that include a new Auto Browse feature, a new AI side panel for interacting with Gemini, and an integration with its Nano Banana imaging model. It is also bringing built-in Gemini support to Chromebooks.

Let Chrome browse for you

The flagship feature of this launch is Auto Browse. This isn't a new idea, of course. Many startups have been developing tools to let an AI agent browse for you, but Google's first mainstream release in this area is definitely notable. During a press conference ahead of the announcement, Charmaine D'Silva, Chrome's Director of Product Management, noted that "Auto Browse is when we make Gemini in Chrome feel truly agentic." Available to paying Gemini Pro and Max subscribers soon, Auto Browse will let you describe a task — be that finding a flight to Vegas or orchestrating a more complex professional workflow — and then the browser will try to complete as much of that task for you as it can. Auto Browse will also support Google's recently launched Universal Commerce Protocol, making it easier for the agent to browse shopping sites and get you to the checkout page. There are even a few other auto-browse tools on the market with the same name. With Project Mariner, Google's DeepMind unit itself demonstrated some of these capabilities in the spring of 2025. Trying out Project Mariner meant having a Gemini Ultra account, though. Now, a Pro plan is enough to give this a spin. How useful this feature is will depend a lot on speed. Many similar projects still suffer from being extremely slow. One advantage of Auto Browse is that you can let it work in one tab and continue to work in another, which may alleviate this problem a bit.

Chrome gets a Gemini side panel

The most obvious new UI feature is the new side panel. Until now, to open Gemini in Chrome, you had to click on the Gemini icon at the top right of the browser, which would then open a pop-up window. That always felt like a somewhat provisional and disconnected user experience and made it harder to use Gemini while also browsing other tabs. D'Silva acknowledges as much. "Through that launch, we did get a lot of feedback from our users saying that one thing they really, really missed out on is the ability to actually have many of these conversations going at the same time." Now, the new side panel lets you interact with the model independently, but it is also aware of the browser context, of course. And because it knows what you are doing in the browser, you can now also ask it to, for example, compare different products you are researching across browser tabs.

Gemini in the new Chrome side panel (Credit: Google).

Nano Banana in the side panel

With this new side panel, users can also invoke Google's Nano Banana model to transform any images that are currently open in the browser. Some of those may be copyrighted or news images, which opens up all kinds of issues, especially because this makes it far easier to manipulate them since there's no need to download the images and then upload them to a model again. Nano Banana does have built-in guardrails, but where there is a will, there is usually a way. Looking beyond the browser context, Gemini in Chrome was also recently updated to support Connected Apps.
That’s Google’s term for integrations with services like Gmail, Google Calendar, YouTube, Maps, Google Flights, and others. As Google notes in its announcement, that means if you’re going to a conference, Gemini can find the context for that trip and search Google Flights for matching flights — and then draft an email to tell your colleagues when you will arrive in Las Vegas, a city you really didn’t want to have to travel to yet again, so you were glad Gemini helped you keep the cognitive load low. The post With Auto Browse, Google Chrome can now surf the web for you appeared first on The New Stack.
Read more →

Disney Afternoon Collection Pops Up On Switch 1 & 2 eShop, Includes Two Additional Games - Nintendo Life

Digital Eclipse yet to confirm
Read more →

Dispatch is censored on Nintendo Switch 2 and Switch - Nintendo Everything

Although there was initially speculation about Dispatch being censored on Nintendo Switch 2 and Switch, it’s now confirmed that this is indeed the case. In the PC and PS5 versions, the game featured a “Visual Censorship” setting. This is gone on Nintendo Swit…
Read more →

A decade of werf, a software delivery tool for Kubernetes

Various cloud native projects are celebrating their first decade. While there are obvious big names, such as Kubernetes itself, Helm, and Cilium, the ecosystem is much wider and includes lesser-known tools that have been around for a while as well. I've been involved in the werf project since its inception more than 10 years ago, and I'd like to share its story to contribute to a broader picture of the present cloud native world.

What is werf?

Basically, it's an opinionated, all-in-one command-line interface (CLI) tool for building container images and deploying them to Kubernetes. At this point, you might be wondering why anyone might need it, given that today we already have so many other tools performing similar tasks. Good question! To address it, I suggest diving into the project's history to understand its peculiarities, the reasoning behind it, and its evolution. Werf originated at a Linux-focused DevOps service provider that was challenged with automating numerous container orchestration routines for various customers. This happened in 2015-2016, when containers were already widely used and even Kubernetes existed, though it was not yet very popular. When we talk about running applications as containers, what was the first thing engineers did with them? They built the images by invoking docker build and several other commands. That's exactly how werf started: making a wrapper for these actions and improving the process by embedding several enhancements, such as different build stages, smart caching, adding third-party artifacts, and even Chef support. (Again, this was when such configuration management tools were widely used, and this kind of integration seemed natural for the Ops world.) Importantly, it was not a universal type of wrapper. From its inception, it was a tool strongly focused on automating well-established workflows and, thus, enforcing specific views on how images should be built, tagged, and so on. While the approach was opinionated, it was based on real-world experience operating infrastructure, not for a single company but for numerous customers across varying industries and sizes. Facilitating the same best practices for orchestrating containers was the whole point of creating werf, and this ideology has lasted ever since.

The next big capability werf got was deployment. After you build a container image, you want to run it in some environment, right? At that time (2017), it was obvious that Kubernetes would be the preferred platform to run containerized workloads, and Helm was already around. Thus, werf used Helm to implement deployment to Kubernetes. This was another core idea of the project: despite being opinionated about how you get your work done, the fundamental technologies you use to achieve that are mainstream: git, Docker, Helm, Kubernetes… In some ways, werf became a "glue" for these technologies — for example, you could execute one command, werf converge, to (re-)build your app and (re-)deploy it to Kubernetes. Over the years, other significant capabilities and best practices were added to werf, such as parallel builds, content-based image tagging, advanced resource tracking during deployment, a sophisticated approach to cleaning up the container registry, bundles for distributing release artifacts, ready-to-use integrations for GitLab CI/CD and GitHub Actions, so-called Giterminism for hermetic builds, and so on.
Eventually, werf evolved into an all-in-one CLI tool with many features on board, proven over years of production use by its creator and later by many other companies. Another fundamental idea behind the project contributed to this growing adoption. The creators of werf were heavily involved in open source, as their entire service business was built on deploying, configuring, and maintaining Linux servers and endless other open source software needed for web services. This made werf open source and available on GitHub from the very beginning, for both ideological and practical — or should I say professional — reasons. Interestingly, implementing many features in werf resulted in creating other open source projects that turned out to be helpful on their own. Some examples include:

Nelm, a Helm fork that enhances its capabilities in many ways, including advanced resource tracking during deployments, flexible ordering for deployed resources, improved management of CRDs (custom resource definitions), and deployment planning. Today, Nelm is not only used as the deployment engine in werf, but also as a standalone CLI tool by many users.
Trdl, a solution for delivering software updates securely from the git repository to the end user. It is used as the default and preferred way of installing and invoking the werf binary.
Kubedog, a library to watch and follow Kubernetes resources during deployment. Nelm uses it to track resources, and some other tools benefit from leveraging it as well.
Lockgate, a distributed locking library for Go.

Seeing that more users were adopting werf for their needs, and wanting to provide stronger guarantees for the project, we decided to donate it to a trusted foundation. At the end of 2022, werf became a Cloud Native Computing Foundation (CNCF) Sandbox project, signaling to the wider tech community that this tool will stay open source and won't be owned by a single vendor. Over the past decade, the cloud native ecosystem has evolved significantly, offering software engineers an impressive variety of tools and solutions. At the same time, werf has come a long way from a simple wrapper to a comprehensive solution. Being an opinionated, all-in-one tool, it is mostly focused on specific use cases nowadays, such as other DevOps agencies, organizations that want to enforce strict rules for delivering software to Kubernetes, and users who simply like the principles werf facilitates. (As a side note, perhaps werf's subproject Nelm has more potential for mass adoption.) Nevertheless, werf is actively used in at least 18,000 projects worldwide today and maintains a robust pace of development, continuously adding unique features for its current and potential users. For me, the story of werf illustrates how passionate, consistent development of in-house tooling, coupled with dedication to open source, can benefit a wider engineering community and maybe even inspire others.

The post A decade of werf, a software delivery tool for Kubernetes appeared first on The New Stack.
Read more →

From pixels to characters: The engineering behind GitHub Copilot CLI’s animated ASCII banner

Most people think ASCII art is simple, and a nostalgic remnant of the early internet. But when the GitHub Copilot CLI team asked for a small entrance banner for the new command-line experience, they discovered the opposite: An ASCII animation in a real-world terminal is one of the most constrained UI engineering problems you can take on.

Part of what makes this even more interesting is the moment we're in. Over the past year, CLIs have seen a surge of investment as AI-assisted and agentic workflows move directly into the terminal. But unlike the web—where design systems, accessibility standards, and rendering models are well-established—the CLI world is still fragmented. Terminals behave differently, have few shared standards, and offer almost no consistent accessibility guidelines. That reality shaped every engineering decision in this project. Different terminals interpret ANSI color codes differently. Screen readers treat fast-changing characters as noise. Layout engines vary. Buffers flicker. Some users override global colors for accessibility. Others throttle redraw speed. There is no canvas, no compositor, no consistent rendering model, and no standard animation framework.

By the numbers

3 seconds of animation
~20 frames
~6,000 lines of TypeScript
Dozens of terminal + theme combinations tested

So when an animated Copilot mascot flying into the terminal appeared, it looked playful. But behind it was serious engineering work, unexpected complexity, a custom design toolchain, and a tight pairing between a designer and a long-time CLI engineer. That complexity only became fully visible once the system was built. In the end, animating a three-second ASCII banner required over 6,000 lines of TypeScript—most of it dedicated not to visuals, but to handling terminal inconsistencies, accessibility constraints, and maintainable rendering logic. This is the technical story of how it came together.

📦 What's new in GitHub Copilot CLI

GitHub Copilot CLI brings agentic workflows directly into your terminal—letting you plan projects, modify files, run commands, use custom agents, and delegate tasks to the cloud, all without leaving the CLI. Since its introduction, Copilot CLI has expanded to support richer, more flexible agentic workflows:

Works the way you do with persistent memory, infinite sessions, and intelligent compaction
Helps you think using explore, plan, and review workflows where you can choose the model at each step
Executes on your behalf with custom agents, agent skills, full MCP support, and async task delegation

Want to bring these same agentic capabilities into your own tools or products? The GitHub Copilot SDK exposes the same execution loop that powers Copilot CLI, so you can embed agents into any application using your Copilot subscription or your own model keys. Learn more about the Copilot SDK >

Why animated ASCII is a hard engineering problem

Before diving into the build process, it's worth calling out why this problem space is more advanced than it looks.

Terminals don't have a canvas

Unlike browsers (DOM), native apps (views), or graphics frameworks (GPU surfaces), terminals treat output as a stream of characters. There's no native concept of:

Frames
Sprites
Z-index
Rasterized pixels
Animation tick rates

Because of this, every "frame" has to be manually repainted using cursor movements and redraw commands. There's no compositor smoothing anything over behind the scenes. Everything is stdout writes + ANSI control sequences.
ANSI escape codes are inconsistent, and terminal color is its own engineering challenge

ANSI escape codes like \x1b[35m (bright magenta) or \x1b[H (cursor home) behave differently across terminals—not just in how they render, but in whether they're supported at all. Some environments (like Windows Command Prompt or older versions of PowerShell) have limited or no ANSI support without extra configuration. But even in terminals that do support ANSI, the hardest part isn't the cursor movement. It's the colors. When you're building a CLI, you realistically have three approaches:

Use no color at all. This guarantees broad compatibility, but makes it harder to highlight meaning or guide users' attention—especially in dense CLI output.
Use richer color modes (3-bit, 4-bit, 8-bit, or truecolor) that aren't uniformly supported or customizable. This introduces a maintenance headache: Different terminals, themes, and accessibility profiles render the same color codes differently, and users often disagree about what "good" colors look like.
Use a minimal, customizable palette (usually 4-bit colors) that most terminals allow users to override in their preferences. This is the safest path, but it limits how accurately you can represent a brand palette—and it forces you to design for environments with widely varying contrast and theme choices.

For the Copilot CLI animation, this meant treating color as a semantic system, not a literal one: Instead of committing specific RGB values, the team mapped high-level "roles" (eyes, goggles, shadow, border) to ANSI colors that degrade gracefully across different terminals and accessibility settings.

Accessibility is a first-class concern

Terminals are used by developers with a wide range of visual abilities—not just blind users with screen readers, but also low-vision users, color-blind users, and anyone working in high-contrast or customized themes. That means:

Rapid re-renders can create auditory clutter for screen readers
Color-based meaning must degrade safely, since bold, dim, or subtle hues may not be perceivable
Low-vision users may not see contrast differences that designers expect
Animations must be opt-in, not automatic
Clearing sequences must avoid confusing assistive technologies

This is also why the Copilot CLI animation ended up behind an opt-in flag early on—accessibility constraints shaped the architecture from the start. These constraints guided every decision in the Copilot CLI animation. The banner had to work when colors were overridden, when contrast was limited, and even when the animation itself wasn't visible.

Ink (React for the terminal) helps, but it's not an animation engine

Ink lets you build terminal interfaces using React components, but:

It re-renders on every state change
It doesn't manage frame deltas
It doesn't synchronize with terminal paint cycles
It doesn't solve flicker or cursor ghosting

Which meant animation logic had to be handcrafted.

Frame-based ASCII animation has no existing workflow for designers

There are tools for ASCII art, but virtually none for:

Frame-by-frame editing
Multi-color ANSI previews
Exporting color roles
Generating Ink-ready components
Testing contrast and accessibility

Even existing ANSI preview tools don't simulate how different terminals remap colors or handle cursor updates, which makes accurate design iteration almost impossible without custom tooling. So the team had to build one.
Part 1: A request that didn't fit any workflow

Cameron Foxly (@cameronfoxly), a brand designer at GitHub with a background in animation, was asked to create a banner for the Copilot CLI. "Normally, I'd build something in After Effects and hand off assets," Cameron said. "But engineers didn't have the time to manually translate animation frames into a CLI. And honestly, I wanted something more fun." He'd seen the static ASCII intro in Claude Code and knew Copilot deserved more personality. The 3D Copilot mascot flying in to reveal the CLI logo felt right. But after attempting to create just one frame manually, the idea quickly ran into reality. "It was a nightmare," Cameron said. "If this is going to exist, I need to build my own tool."

Part 2: Building an ASCII animation editor from scratch

Cameron opened an empty repository in VS Code, and began asking GitHub Copilot for help scaffolding an animation MVP that could:

Read text files as frames
Render them sequentially
Control timing
Clear the screen without flicker
Add a primitive "UI"

Within an hour, he had a working prototype that was monochrome, but functional.

Simplified early animation loop

Below is a simplified example variation of the frame loop logic Cameron prototyped:

import fs from "fs";
import readline from "readline";

/**
 * Load ASCII frames from a directory.
 */
const frames = fs
  .readdirSync("./frames")
  .filter(f => f.endsWith(".txt"))
  .map(f => fs.readFileSync(`./frames/${f}`, "utf8"));

let current = 0;

function render() {
  // Move cursor to top-left of terminal
  readline.cursorTo(process.stdout, 0, 0);

  // Clear the screen below the cursor
  readline.clearScreenDown(process.stdout);

  // Write the current frame
  process.stdout.write(frames[current]);

  // Advance to next frame
  current = (current + 1) % frames.length;
}

// 75ms = ~13fps. Higher can cause flicker in some terminals.
setInterval(render, 75);

This introduced the first major obstacle: color. The prototype worked in monochrome, but the moment color was added, inconsistencies across terminals—and accessibility constraints—became the dominant engineering problem.

Part 3: ANSI color theory and the real-world limitations

The Copilot brand palette is vibrant and high-contrast, which is great for web but exceptionally challenging for terminals. ANSI terminals support:

16-color mode (standard)
256-color mode (extended)
Sometimes truecolor ("24-bit"), but inconsistently

Even in 256-color mode, terminals remap colors based on:

User themes
Accessibility settings
High-contrast modes
Light/dark backgrounds
OS-level overrides

Which means you can't rely on exact hues. You have to design with variability in mind. Cameron needed a way to paint characters with ANSI color roles while previewing how they look in different terminals. He took a screenshot of the Wikipedia ANSI table, handed it to Copilot, and asked it to scaffold a palette UI for his tool.

Adding a color "brush" tool

A simplified version:

function applyColor(char, color) {
  // Minimal example: real implementation needed support for roles,
  // contrast testing, and multiple ANSI modes.
  const codes = {
    magenta: "\x1b[35m",
    cyan: "\x1b[36m",
    white: "\x1b[37m"
  };
  return `${codes[color]}${char}\x1b[0m`; // Reset after each char
}

This enabled Cameron to paint ANSI-colored ASCII like you would in Photoshop, one character at a time. But now he had to export it into the real Copilot CLI codebase.

Part 4: Exporting to Ink (React for the terminal)

Ink is a React renderer for building CLIs using JSX components.
Instead of writing to the DOM, components render to stdout. Cameron asked Copilot to help generate an Ink component that would:

Accept frames
Render them line-by-line
Animate them with state updates
Integrate cleanly into the CLI codebase

Simplified Ink frame renderer

import React from "react";
import { Box, Text } from "ink";

/**
 * Render a single ASCII frame.
 */
export const CopilotBanner = ({ frame }) => (
  <Box flexDirection="column">
    {frame.split("\n").map((line, i) => (
      <Text key={i}>{line}</Text>
    ))}
  </Box>
);

And a minimal animation wrapper:

export const AnimatedBanner = () => {
  const [i, setI] = React.useState(0);
  React.useEffect(() => {
    const id = setInterval(() => setI(x => (x + 1) % frames.length), 75);
    return () => clearInterval(id);
  }, []);
  return <CopilotBanner frame={frames[i]} />;
};

This gave Cameron the confidence to open a pull request (his first engineering pull request in nine years at GitHub). "Copilot filled in syntax I didn't know," Cameron said. "But I still made all the architectural decisions." Now it was time for the engineering team to turn a prototype into something production-worthy.

Part 5: Terminal animation isn't solved technology

Andy Feller (@andyfeller), a long-time GitHub engineer behind the GitHub CLI, partnered with Cameron to bring the animation into the Copilot CLI codebase. Unlike browsers—which share rendering engines, accessibility APIs, and standards like WCAG—terminal environments are a patchwork of behaviors inherited from decades-old hardware like the VT100. There's no DOM, no semantic structure, and only partial agreement on capabilities across terminals. This makes even "simple" UI design problems in the terminal uniquely challenging, especially as AI-driven workflows push CLIs into daily use for more developers. "There's no framework for terminal animations," Andy explained. "We had to figure out how to do this without flickering, without breaking accessibility, and across wildly different terminals." Andy broke the engineering challenges into four broad categories:

Challenge 1: From banner to ready without flickering

Most terminals repaint the entire viewport when new content arrives. At the same time, CLIs come with a strict usability expectation: when developers run a command, they want to get to work immediately. Any animation that flickers, blocks input, or lingers too long actively degrades the experience. This created a core tension the team had to resolve: how to introduce a brief, animated banner without slowing startup, stealing focus, or destabilizing the terminal render loop. In practice, this was complicated by the fact that terminals behave differently under load. Some:

Throttle fast writes
Reveal cleared frames momentarily
Buffer output differently
Repaint the cursor region inconsistently

To avoid flicker while keeping the CLI responsive across popular terminals like iTerm2, Windows Terminal, and VS Code, the team had to carefully coordinate several interdependent concerns:

Keeping the animation under three seconds so it never delayed user interaction
Separating static and non-static components to minimize unnecessary redraws
Initializing MCP servers, custom agents, and user setup without blocking render
Working within Ink's asynchronous re-rendering model

The result was an animation treated as a non-blocking, best-effort enhancement—visible when it could be rendered safely, but never at the expense of startup performance or usability.
Challenge 2: Brand color mapping in ANSI

"ANSI color consistency simply doesn't exist," Andy said. Most modern terminals support 8-bit color, allowing CLIs to choose from 256 colors. However, how those colors are actually rendered varies widely based on terminal themes, OS settings, and user accessibility overrides. In practice, CLIs can't rely on exact hues—or even consistent contrast—across environments. The Copilot banner introduced an additional complexity: although it's rendered using text characters, the block-letter Copilot logo functions as a graphical object, not readable body text. Under accessibility guidelines, non-text graphical elements have different contrast requirements than text, and they must remain perceivable without relying on fine detail or precise color matching. To account for this, the team deliberately chose a minimal 4-bit ANSI palette—one of the few color modes most terminals allow users to customize—to ensure the animation remained legible under high-contrast themes, low-vision settings, and color overrides. This meant the team had to:

Treat the Copilot wordmark as non-text graphical content with appropriate contrast requirements
Select ANSI color codes that approximate the Copilot palette without relying on exact hues
Satisfy WCAG contrast guidance for both text and non-text elements
Ensure the animation remained legible in light and dark terminals
Degrade gracefully when users override terminal colors for accessibility
Test color combinations across multiple terminal emulators and theme configurations

Rather than encoding brand colors directly, the animation maps semantic roles—such as borders, eyes, highlights, and text—to ANSI color slots that terminals can reinterpret safely. This allows the banner to remain recognizable without assuming control over the user's color environment.
Challenge 3: Making the animation maintainable Cameron’s prototype was a great starting point for Andy to incorporate into the Copilot CLI but it wasn’t without its challenges: Banner consisted of ~20 animation frames covering an 11×78 area There are ~10 animation elements to stylize in any given frame Needed a way to separate the text of the frame from the colors involved Each frame mapped hard coded colors to row and column coordinates Each frame required precise timing to display Cameron’s vision First, the animation was broken down into distinct animation elements that could be used to create separate light and dark themes: type AnimationElements = | "block_text" | "block_shadow" | "border" | "eyes" | "head" | "goggles" | "shine" | "stars" | "text"; type AnimationTheme = Record<AnimationElements, ANSIColors>; const ANIMATION_ANSI_DARK: AnimationTheme = { block_text: "cyan", block_shadow: "white", border: "white", eyes: "greenBright", head: "magentaBright", goggles: "cyanBright", shine: "whiteBright", stars: "yellowBright", text: "whiteBright", }; const ANIMATION_ANSI_LIGHT: AnimationTheme = { block_text: "blue", block_shadow: "blackBright", border: "blackBright", eyes: "green", head: "magenta", goggles: "cyan", shine: "whiteBright", stars: "yellow", text: "black", }; Next, the overall animation and subsequent frames would capture content, color, duration needed to animate the banner: interface AnimationFrame { title: string; duration: number; content: string; colors?: Record<string, AnimationElements>; // Map of "row,col" positions to animation elements } interface Animation { metadata: { id: string; name: string; description: string; }; frames: AnimationFrame[]; } Then, each animation frame was captured to separate frame content from stylistic and animation details, resulting in over 6,000 lines of TypeScript to safely animate three seconds of the Copilot logo across terminals with wildly different rendering and accessibility behaviors: const frames: AnimationFrame[] = [ { title: "Frame 1", duration: 80, content: ` ┌┐ ││ ││ └┘`, colors: { "1,0": "border", "1,1": "border", "2,0": "border", "2,1": "border", "10,0": "border", "10,1": "border", "11,0": "border", "11,1": "border", }, }, { title: "Frame 2", duration: 80, content: ` ┌── ──┐ │ │ █▄▄▄ ███▀█ ███ ▐▌ ███ ▐▌ ▀▀█▌ ▐ ▌ ▐ │█▄▄▌ │ └▀▀▀ ──┘`, colors: { "1,0": "border", "1,1": "border", "1,2": "border", "1,8": "border", "1,9": "border", "1,10": "border", "2,0": "border", "2,10": "border", "3,1": "head", "3,2": "head", "3,3": "head", "3,4": "head", "4,1": "head", "4,2": "head", "4,3": "goggles", "4,4": "goggles", "4,5": "goggles", "5,1": "head", "5,2": "goggles", "5,3": "goggles", "5,5": "goggles", "5,6": "goggles", "6,1": "head", "6,2": "goggles", "6,3": "goggles", "6,5": "goggles", "6,6": "goggles", "7,3": "goggles", "7,4": "goggles", "7,5": "goggles", "7,6": "goggles", "8,3": "eyes", "8,5": "head", "9,4": "head", "10,0": "border", "10,1": "head", "10,2": "head", "10,3": "head", "10,4": "head", "10,10": "border", "11,0": "border", "11,1": "head", "11,2": "head", "11,3": "head", "11,8": "border", "11,9": "border", "11,10": "border", }, }, Finally, each animation frame is rendered building segments of text based on consecutive color usage with the necessary ANSI escape codes: {frameContent.map((line, rowIndex) => { const truncatedLine = line.length > 80 ? 
line.substring(0, 80) : line; const coloredChars = Array.from(truncatedLine).map((char, colIndex) => { const color = getCharacterColor(rowIndex, colIndex, currentFrame, theme, hasDarkTerminalBackground); return { char, color }; }); // Group consecutive characters with the same color const segments: Array<{ text: string; color: string }> = []; let currentSegment = { text: "", color: coloredChars[0]?.color || theme.COPILOT }; coloredChars.forEach(({ char, color }) => { if (color === currentSegment.color) { currentSegment.text += char; } else { if (currentSegment.text) segments.push(currentSegment); currentSegment = { text: char, color }; } }); if (currentSegment.text) segments.push(currentSegment); return ( <Text key={rowIndex} wrap="truncate"> {segments.map((segment, segIndex) => ( <Text key={segIndex} color={segment.color}> {segment.text} </Text> ))} </Text> ); })} Challenge 4: Accessibility-first design The engineering team approached the banner with the same philosophy as the GitHub CLI’s accessibility work: Respect global color overrides both in terminal and system preferences After the first use, avoid animations unless explicitly enabled via the Copilot CLI configuration file Minimize ANSI instructions that can confuse assistive tech “CLI accessibility is under researched,” Andy noted. “We’ve learned a lot from users who are blind as well as users with low vision, and those lessons shaped this project.” Because of this, the animation is opt-in and gated behind its own flag—so it’s not something developers see by default. And when developers run the CLI in –screen-reader mode, the banner is automatically skipped so no decorative characters or motion are sent to assistive technologies. Part 6: An architecture built to scale By the end of the refactor, the team had: Frames stored as plain text Animation elements Themes as simple mappings A runtime colorization step Ink-driven timing and rendering A maintainable foundation for future animations This pattern—storing frames as plain text, layering semantic roles, and applying themes at runtime—isn’t specific to Copilot. It’s a reusable approach for anyone building terminal UIs or animations. Part 7: What this project reveals about building for the terminal A “simple ASCII banner” turned into: A frame-based animation tool that didn’t exist A custom ANSI color palette strategy A new Ink component A maintainable rendering architecture Accessibility-first CLI design choices A designer’s first engineering contribution Real-world testing across diverse terminals Open source contributions from the community “The most rewarding part was stepping into open source for the first time,” Cameron said. “With Copilot, I was able to build out my MVP ASCII animation tool into a full open source app at ascii-motion.app,. Someone fixed a typo in my README, and it made my day.” As Andy pointed out, building accessible experiences for CLIs is still largely unexplored territory and far behind the tooling and standards available for the web. Today, developers are already contributing to Cameron’s ASCII Motion tool, and the Copilot CLI team can ship new animations without rebuilding the system. This is what building for the terminal demands: deep understanding of constraints, discipline around accessibility, and the willingness to invent tooling where none exists. 
Use GitHub Copilot in your terminal The GitHub Copilot CLI brings AI-assisted workflows directly into your terminal — including commands for explaining code, generating files, refactoring, testing, and navigating unfamiliar projects. Try GitHub Copilot CLI > The post From pixels to characters: The engineering behind GitHub Copilot CLI’s animated ASCII banner appeared first on The GitHub Blog.
Read more →

PlayStation Plus Free Games For February 2026 Revealed - GameSpot

Here's the list of free games that PS Plus subscribers can claim in February 2026.
Read more →

iOS 26.2.1—Update Now Warning Issued To Millions Of iPhone Users - Forbes

Apple has released iOS 26.2.1, an important update that all iPhone users should apply now. Here's what you need to know.
Read more →

Drupal turns 25: From simple to complex — then simple again

It’s rare that a web product lasts 25 years, given how fast the industry cycles through technologies. But this month marks a quarter century of Drupal, the open source content management system (CMS). To mark the occasion, and also to discuss the launch of Drupal CMS 2.0 — which, confusingly, is not version 2 of the original Drupal — we spoke to founder Dries Buytaert. “I think people think Drupal is this overnight success or something,” Buytaert tells The New Stack. “But I think in reality, it’s been this very slow, gradual growth.” He notes that although he launched Drupal in 2001, the first Drupal conference wasn’t until four years later, in 2005 — “like, 30 or 40 people showed up,” Buytaert chuckles. Drupal was launched on January 15, 2001 (coincidentally, the same day Wikipedia debuted). At the time it was a relatively simple PHP and MySQL content management system; indeed, its initial appeal was that it was far simpler than the bulky CMS software of the time, like Interwoven and Vignette. I can vouch for that, as I was using Interwoven in 2001 in my job as a company website manager — and I remember that it was a beast of a CMS. Drupal complexity and its fit in the AI era Ironically, Drupal itself became more complex over time, as it continued to expand and add to the core platform. Drupal these days is most often viewed as a DXP (Digital Experience Platform), competing with the likes of Adobe Experience Manager and Salesforce Marketing Cloud. “Drupal Core” is the name of the open source framework, and its tagline is “Create Ambitious Digital Experiences.” Buytaert argues, though, that the complexity that Drupal has accumulated over the years has actually made it very suitable for the current AI era. Complexity is “Drupal’s accidental advantage in AI” – Dries Buytaert, Drupal founder “For a long time, I think Drupal was perceived as a little bit more complex, also more advanced to use,” he tells The New Stack. “And it turns out, I talk about it as Drupal’s accidental advantage in AI — like, we’ve built a lot of features that contribute to that complexity.” His point is that AI systems (LLMs in particular) thrive on complexity — the more data that LLMs can gobble up, the better. Buytaert gives the example of “configuration versioning” in Drupal, which he says a lot of other CMSs don’t have. So if, for example, you move a block around on a page but then want to revert back, you can do that through configuration versioning. “So those features, they actually make our APIs more complex, and sometimes our user interface more complex,” Buytaert says. “But it turns out these are exactly the features AI agents need […] because they make mistakes, right? Like, they hallucinate, they make mistakes. And so now we have the ability to undo or roll back those mistakes in a way that maybe most of our competitors don’t have.” That’s a good way to spin the benefits of Drupal’s complexity, but the market reality is that the AI era has also forced Drupal to come up with simpler solutions. Drupal CMS 2.0 and the push for non-developers Up till very recently, the bulk of Drupal’s users have been developers — back in October 2022, Buytaert was telling The New Stack about Drupal’s “headless CMS” capabilities, which at the time was a trendy way for developers to set up a custom CMS for their organization or clients. But in a world where anyone can “vibe code” a website or web app out of seemingly thin air, Drupal needs to appeal to non-developers too. 
This has resulted in a product called “Drupal CMS” — which is actually a completely separate product from the DXP software, although it’s still built on the foundation of Drupal Core. Drupal CMS 2.0 is being released today, Friday. It comes about a year after version 1.0, which was released last January. Version 2.0 is both a return to the simpler Drupal CMS product of yore and a response to the current trend of vibe coding. Dries Buytaert quote “The idea is that we created a new version of Drupal, if you want to think about it that way, [although] it’s built on top of the old version of Drupal,” Buytaert tells TNS. “So it’s not a fork or anything, but it’s a new version of Drupal where we added a lot of capabilities with the idea to make Drupal easier to use for a broader audience of people.” The main goal of Drupal CMS is to broaden its user base beyond developers, to include marketers and indeed all non-developers. To do this, Drupal CMS needed an easy-to-use visual interface for creating web pages — ideally one that included AI functionality, to help with layout and coding. With that need top of mind, one of the new features in 2.0 is Drupal Canvas, “a visual page-building interface” that comes with pre-built templates and “optional AI.” Drupal Canvas; image via The Drupal Association Interestingly, this return to simplicity and attendant embrace of AI has led to a boost in activity in the Drupal open source project, says Buytaert. “It has really sparked a lot of energy in Drupal. […] If you look at the number of contributions in Drupal, especially to strategic initiatives, it has doubled in the last 18 months since the start of Star Shot [the original code name of Drupal CMS], and so it has really created this new energy, in a way. A lot of people have been contributing to it.” Community takes 10 years to build This brings us back to the core aspect of Drupal that has led to it continuing to grow and evolve over 25 years: Its open source community. Not only that, but a good portion of Drupal’s earliest adopters have stuck around. “There’s a private email going around to [about] 50 of us that were around in the early years, and I would say half of them have moved on to do other things, and then the other half is surprisingly still involved through Drupal,” Buytaert tells The New Stack. “So there is definitely a core group that has been doing this for over 20 years, which is pretty special.” “…everything worth doing, it’s probably best to commit for 10 years.” – Buytaert What tips, then, does he have for new open source projects trying to get a foothold in a tech landscape dominated by multinational corporations like Google, Apple and Meta? “Don’t expect overnight success,” Buytaert warns. “I think anything successful in life usually takes 10 years.” He mentions not just the original Drupal project, but also the company he formed to sell Drupal products and services — Acquia, which he launched at the end of 2007 with his business partner, Jay Batson. After raising a lot of VC money, Acquia eventually sold to Vista Equity Partners for a reported $1 billion (Buytaert is still executive chairman at the company). “It took, like, 10 years before CMOs and CIOs actually had heard about Acquia,” Buytaert tells TNS. “And so everything worth doing, it’s probably best to commit for 10 years.” The post Drupal turns 25: From simple to complex — then simple again appeared first on The New Stack.
Read more →

Year recap and future goals for the GitHub Innovation Graph

Today’s data release marks our second full year of regular releases since the launch of the GitHub Innovation Graph. The Innovation Graph serves as a stable, regularly updated source for aggregated statistics on public software development activity around the world, informing public policy, strengthening research, guiding funding decisions, and equipping organizations with the evidence needed to build secure and resilient AI systems. Updated bar chart races With our new data release, we’ve updated the bar chart race videos to the git pushes, repositories, developers, and organizations global metrics pages. Let’s take a look back at some of the progress the Innovation Graph has helped drive. Academic papers One of the most rewarding aspects of the past year has been seeing the growing range of research questions addressed with Innovation Graph data. Recent papers have explored everything from global collaboration networks to the institutional foundations of digital capabilities. These studies showcase how network analysis techniques can be applied to Innovation Graph data, in addition to earlier work we referenced last year linking open source to economic value, innovation measurement, labor markets, and AI-driven productivity through other methodologies. Historical Institutions and Modern Digital Capabilities: New Evidence from GitHub in Africa Research by an economist at the Federal Reserve Board uses GitHub data to examine how the density of Protestant mission stations correlates with present-day participation in digital production across African countries. Olana, Deriba, “Historical Institutions and Modern Digital Capabilities: New Evidence from GitHub in Africa” (November 25, 2025). Available at SSRN: https://ssrn.com/abstract=5805622 or http://dx.doi.org/10.2139/ssrn.5805622. The Structure of Cross-National Collaboration in Open-Source Software Development Researchers from MIT, Carnegie Mellon, and the University of Chicago analyze international collaboration patterns in the Innovation Graph’s economy collaborators dataset, shedding light on how common colonial histories influence modern software development collaboration activities. Xu, Henry, et al. “The Structure of Cross-National Collaboration in Open-Source Software Development,” (November 10, 2025). Available at doi.org/10.1145/3746252.3761237. Replication package available at https://github.com/hehao98/github-innovation-graph. Small-World Phenomenon of Global Open-Source Software Collaboration on GitHub A social network analysis by researchers at Midwestern State University and Tarleton State University highlights the tightly connected, small-world structure of global OSS collaboration. Zhang, Guoying, et al. “Small-World Phenomenon of Global Open-Source Software Collaboration on Github: A Social Network Analysis.” Journal of Global Information Management Vol. 33, No. 1 (2025). Available at doi.org/10.4018/JGIM.387412. The Software Complexity of Nations These researchers extend countries’ software economic complexity into the digital economy by leveraging the geographic distribution of programming languages in open source software, showing that software economic complexity predicts GDP, income inequality, and emissions, which have important policy implications. Juhász, Sándor, et al. “The Software Complexity of Nations.” Research Policy Vol. 55, No. 3. Available at doi.org/10.1016/j.respol.2026.105422. 
Conferences The Innovation Graph and related GitHub datasets were featured prominently in academic and policy discussions at a wide range of venues, including: ATLC25: The 10th Atlanta Conference on Science and Innovation Policy OpenForum Academy Symposium 2025 2nd CEU Vienna Data Analytics Jamboree Wharton Human-AI Research: 3rd Annual Business & Generative AI Conference News publications We were also encouraged to see Innovation Graph data referenced in major international reporting. In 2025, two pieces in The Economist drew on GitHub data examining China’s approach to open technology (June 17, 2025) and India’s potential role as a distinctive kind of AI superpower (September 18, 2025). Coverage like this reinforces the role that data on open source activity can play in understanding geopolitical and economic shifts. Reports Once again, Innovation Graph data contributed to several flagship reports, including: The 2025 Stanford AI Index Report The 2025 WIPO Global Innovation Index The Rise of FOSS in India report from the National Law School of India University We continue to value these opportunities to support macro-level measurement efforts, and we’re equally excited by complementary work that dives deeper into regional, institutional, and community-level dynamics. Moving forward As we move through 2026, we’re grateful for the community that has formed around the Innovation Graph, and we’re looking forward to building the next chapter together. Our focus will be on deepening collaboration, welcoming new perspectives, and creating clearer pathways for people to apply the Innovation Graph data in their own contexts, from strategy and research to product development and policy. The post Year recap and future goals for the GitHub Innovation Graph appeared first on The GitHub Blog.
Read more →

Did the Nintendo Switch 2 Really Have a Bad Holiday? We Asked Analysts - IGN

Since December, we've been seeing (and writing!) headlines discussing the seeming slowdown of Nintendo Switch 2 sales going into the holiday season. But there's some nuance to this narrative, so I kicked off the new year by bugging all the analysts I knew for…
Read more →

Leak shows Google’s new Aluminium OS in action for the first time - PCWorld

In screenshots and videos, we get a first peek at Google's new operating system that combines Android and ChromeOS.
Read more →

Scott Pilgrim EX launches March 3 - Gematsu

Retro-style side-scrolling adventure brawler Scott Pilgrim EX will launch digitally for PlayStation 5, Xbox Series, PlayStation 4, Switch, and PC via Steam on March 3 for $28.99, developer Tribute G…
Read more →

Action adventure RPG Emberville launches in Early Access this summer - Gematsu

Action adventure RPG Emberville will launch in Early Access for PC via Steam this summer, developer Cygnus Cross announced.
Read more →

Agoda’s secret to 50x scale: Getting the database basics right

Agoda is the Singapore wing of Booking Holdings, the world’s leading provider of online travel (the brand behind Booking.com, Kayak, Priceline, etc.). From January 2023 to February 2025, Agoda server traffic spiked by 50 times. That’s fantastic business growth, but also the trigger for an interesting engineering challenge. Specifically, the team had to determine how to scale their ScyllaDB-backed online feature store to maintain 10ms P99 latencies despite this growth. Complicating the situation, traffic was highly bursty, cache hit rates were unpredictable and cold-cache scenarios could flood the database with duplicate read requests in a matter of seconds. At Monster Scale Summit 2025, Worakarn Isaratham, lead software engineer at Agoda, shared how they tackled the challenge. You can watch his entire talk or read the highlights below. Note: Monster Scale Summit is a free, virtual conference on extreme-scale engineering with a focus on data-intensive applications. Learn from luminaries like antirez, creator of Redis; Camille Fournier, author of “The Manager’s Path” and “Platform Engineering”; Martin Kleppmann, author of “Designing Data-Intensive Applications” and more than 50 others, including engineers from Discord, Disney, Pinterest, Rivian, Datadog, LinkedIn, and Uber Eats. Register and join us March 11-12 for some lively chats. A feature store powered by ScyllaDB and DragonflyDB Agoda operates an in-house feature store that supports both offline model training and online inference. For anyone not familiar with feature stores, Isaratham provided a quick primer. A feature store is a centralized repository designed for managing and serving machine learning features. In the context of machine learning, a feature is a measurable property or characteristic of a data point used as input to models. The feature store helps manage features across the entire machine learning pipeline — from data ingestion to model training to inference. Feature stores are integral to Agoda’s business. Isaratham explained: “We’re a digital travel platform, and some use cases are directly tied to our product. For example, we try to predict what users want to see, which hotels to recommend and what promotions to serve. On the more technical side, we use it for things like bot detection. The model uses traffic patterns to predict whether a user is a bot, and if so, we can block or deprioritize requests. So the feature store is essential for both product and engineering at Agoda. We’ve got tools to help create feature ingestion pipelines, model training, and the focus here: online feature serving.” One layer deeper into how it works: “We’re currently serving about 3.5 million entities per second (EPS) to our users. About half the features are served from cache within the client SDK, which we provide in Scala and Python. That means 1.7 million entities per second reach our application servers. These are written in Rust, running in our internal Kubernetes pods in our private cloud. From the app servers, we first check if features exist in the cache. We use DragonflyDB as a non-persistent centralized cache. If it’s not in the cache, then we go to ScyllaDB, our source of truth.” ScyllaDB is a high-performance database for workloads that require ultra-low latency at scale. Agoda’s current ScyllaDB cluster is deployed as six bare-metal nodes, replicated across four data centers. 
Under steady-state conditions, ScyllaDB serves about 200K entities per second across all data centers while meeting a service-level agreement (SLA) of 10ms P99 latency. (In practice, their latencies are typically even lower than their SLA requires.) Traffic growth and bursty workloads However, it wasn’t always that smooth and steady. Around mid-2023, they hit a major capacity problem when a new user wanted to onboard to the Agoda feature store. Their traffic pattern was super bursty: It was normally low, but occasionally it would flood them with requests triggered by external signals. These were cold-cache scenarios, where the cache couldn’t help. Isaratham shared, “Bursts reached 120K EPS, which was 12 times the normal load back then.” Request duplication exacerbated the situation. Many identical requests arrived in quick succession. Instead of one request populating the cache and subsequent requests benefiting, all of them hit ScyllaDB at the same time — a classic cache stampede. They also retried failed requests until they succeeded — and that kept the pressure high. This load involved two data centers. One slowed down but remained online. The other was effectively taken out of service. More details from Worakarn: “On the bad DC, error rates were high and retries took 40 minutes to clear; on the good one, it only took a few minutes. Metrics showed that ScyllaDB read latency spiked into seconds instead of milliseconds.” Diagnosing the bottleneck So, they compared setups and found the difference: the problematic data center used SATA SSDs while the better one used NVMe SSDs. SATA (serial advanced technology attachment) was already old tech, even then. The team’s speed tests suggested that replacing the disks would yield a 10X read performance boost — and better write rates too. The team ordered new disks immediately. However, given that the disks wouldn’t arrive for months, they had to figure out a survival strategy until then. As Isaratham shared, “Capacity tests and projections showed that we would hit limits within eight or nine months even without new load — and sooner with it. So, we worked with users to add more aggressive client-side caching, remove unnecessary requests and smooth out bursts. That reduced the new load from 120K to 7K EPS. That was enough to keep things stable, but we were still close to the limit.” Surviving with SATA Given the imminent capacity cap, the team brainstormed ways to improve the situation while still on the existing SATA disks. Since you have to measure before you can improve, getting a clean baseline was the first order of business. “The earlier capacity numbers were from real-world traffic, which included caching effects,” Isaratham detailed. “We wanted to measure cold-cache performance directly. So, we created artificial load using one-time-use test entities, bypassed cache in queries and flushed caches before and after each run. The baseline read capacity on the bad DC was 5K EPS.” With that baseline set, the team considered a few different approaches. Data modeling All features from all feature sets were stored in a single table. The team hoped that splitting tables by feature set might improve locality and reduce read amplification. It didn’t. They were already partitioning by feature set and entity, so the logical reorganization didn’t change the physical layout. Compaction strategy Given a read-heavy workload with frequent updates, ScyllaDB documentation recommends the size-tiered compaction strategy to avoid write amplification.
But the team was most concerned about read latency, so they took a different path. According to Worakarn: “We tried leveled compaction to reduce the number of SSTables per read. Tests showed fetching 1KB of data required reading 70KB from disk, so minimizing SSTable reads was key. Switching to leveled compaction improved throughput by about 50%.” Larger SSTable summaries ScyllaDB uses summary files to more efficiently navigate index files. Their size is controlled by the sstable_summary_ratio setting. Increasing the ratio increases the summary file size, reducing index reads at the cost of additional memory. The team increased the ratio by 20 times, which boosted capacity to 20K EPS. This yielded a nice 4X improvement, so they rolled it out immediately. What a difference a disk makes Finally, the NVMe disks arrived a few months later. This one change made a massive difference. Capacity jumped to 300K EPS, a staggering 50-60X improvement. The team rolled out improvements in stages: first, the summary ratio tweak (for 2-3X breathing room), then the NVMe upgrade (for 50X capacity). They didn’t apply leveled compaction in production because it only affects new tables and would require migration. Anyway, NVMe already solved the problem. After that, the team shifted focus to other areas: improving caching, rewriting the application in Rust and adding cache stampede prevention to reduce the load on ScyllaDB. They still revisit ScyllaDB occasionally for experiments. A couple of examples: New partitioning scheme: They tried partitioning by feature set only and clustering by entity. However, performance was actually worse, so they didn’t move forward with this idea. Data remodeling: The application originally stored one row per feature. Since all features for an entity are always read together, the team tested storing all features in a single row instead. This improved performance by 35%, but it requires a table migration. It’s on their list of things to do later. Lessons learned Isaratham wrapped it up as follows: “We’d been using ScyllaDB for years without realizing its full potential, mainly because we hadn’t set it up correctly. After upgrading disks, benchmarking and tuning data models, we finally reached proper usage. Getting the basics right — fast storage, knowing capacity, and matching data models to workload — made all the difference. That’s how ScyllaDB helped us achieve 50X scaling.” The post Agoda’s secret to 50x scale: Getting the database basics right appeared first on The New Stack.
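The cache stampede Isaratham describes, where many identical cold-cache reads land on the database at once, is commonly mitigated with request coalescing: only one fetch per key is allowed in flight, and concurrent callers wait for its result. The talk does not show Agoda's implementation, so the following Python sketch only illustrates the general single-flight pattern; the fetch_from_scylla function and the example key are hypothetical stand-ins.

import threading

class SingleFlightCache:
    """Illustrative request coalescing: at most one loader runs per key;
    concurrent callers for the same key wait for that result instead of
    all hitting the backing store (e.g., ScyllaDB) at once."""

    def __init__(self, loader):
        self._loader = loader            # e.g., a function that reads the feature store
        self._cache = {}                 # key -> cached value
        self._inflight = {}              # key -> Event signaling the in-flight load
        self._lock = threading.Lock()

    def get(self, key):
        while True:
            with self._lock:
                if key in self._cache:
                    return self._cache[key]
                event = self._inflight.get(key)
                if event is None:
                    # We become the single flight for this key.
                    event = threading.Event()
                    self._inflight[key] = event
                    leader = True
                else:
                    leader = False
            if leader:
                try:
                    value = self._loader(key)        # exactly one backend read per key
                    with self._lock:
                        self._cache[key] = value
                finally:
                    with self._lock:
                        self._inflight.pop(key, None)
                    event.set()
                return value
            # Follower: wait for the leader to finish, then re-check the cache.
            event.wait()

def fetch_from_scylla(key):
    # Hypothetical placeholder for the real feature-store read.
    return {"entity": key, "features": [0.1, 0.2, 0.3]}

features = SingleFlightCache(fetch_from_scylla)
print(features.get("hotel:12345"))

A production version would also need TTL-based eviction and negative-result handling, but the sketch shows why duplicate cold-cache reads collapse into a single backend request.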
Read more →

Gemini details AI Plus limits, rolls out NotebookLM integration on iOS - 9to5Google

Google brought the AI Plus subscription to the US yesterday, and here’s how it upgrades usage limits in the Gemini app.
Read more →

Apple’s Creator Studio Offers Value, But Is Far From an Adobe Killer - Bloomberg

Apple Inc.’s new Creator Studio software bundle represents a new test for the company’s fast-growing services business: What happens when the tech giant packages some of its most popular creative apps into a recurring subscription?
Read more →

AMD Ryzen 7 9850X3D Review - The Best Just Got Better - TechPowerUp

The AMD Ryzen 9850X3D builds on the already excellent 9800X3D without trying to shake up the market. Instead, it focuses on refining what already works. AMD delivers modest improvements with smarter tuning, and the same cache magic that gamers love. Thanks to…
Read more →

Terraform challenger Formae expands to more clouds

Late last year, startup Platform Engineering Labs made waves in the world of Infrastructure as Code (IaC) by introducing a new IaC platform, called Formae, available initially on Amazon Web Services. This week, Platform Engineering Labs' platform gets (beta) support from additional cloud platforms, including Google Cloud Platform, Microsoft Azure, Oracle Cloud Infrastructure, and OVHcloud. The company has also released new AI-enhanced software for managing infrastructure tooling, called the Platform for Infrastructure Builders. “This release is for and about infrastructure builders,” says Pavlo Baron, co-founder and CEO of Platform Engineering Labs, in a statement. “From here forward, you don’t need to wait on us or anyone else. Build for your own infrastructure. Launch fast. Iterate fast. Extend fast. Do it hands-on or with help from your AI agents.” The company is pitching the platform for organizations that may have some components managed by IaC but want to expand operations to older, legacy resources that may have been previously thought too ornery to be managed under IaC. Schema-safe change management The new platform, with the accompanying software development kit (SDK), will allow users to extend their infrastructure with new components, offering schema safety and an easy-to-understand plug-in interface. “Engineers can now use AI agents to quickly produce and modify plugins that are reliable by design,” says Zachary Schneider, co-founder and CTO of Platform Engineering Labs, in a statement. The Formae software was built to automatically discover and codify system resources and system changes into a single unified source of truth. The founders claim that this approach offers superior state management and easier migration paths than the industry-leading IaC solution, HashiCorp’s Terraform. Infrastructure as Code Infrastructure as Code is the practice of saving your system’s configuration in a file, usually using YAML or JSON, which IaC orchestrators then use as an instruction set to roll out infrastructure. The advantages IaC promises are automated deployments — a real time saver — and a guard against system drift, which is when systems fall out of alignment from their desired state (usually due to manual intervention). Yet after everything is set up once, Day 2 operations with IaC can be a headache, Baron contended in an earlier interview with The New Stack. IaC files are brittle things. They quickly get complex and difficult to understand, are easy to corrupt with shadow IT work, and are easy to make mistakes with. They offer no guidance as to whether the values they hold are even correct. Within the Formae environment, an individual IT resource is extracted into a versioned, declarative code artifact called a “forma” (which is the Latin singular for “form”) that can then be programmed against. Unlike Terraform or Pulumi, state management in Formae is handled not by the clients themselves, but by agents, to guard against system drift. Changes are made in the same way security patches are rolled out, minimizing the blast radius of each update. The code is written in an unusual language, Apple’s Pkl, which Apple developed in-house to manage its own system deployments. Pkl is different from JSON and YAML in that it forces users to develop a schema for each type of resource, along with a type annotation. With a type annotation, the permitted type — and sometimes even a range of permissible values — is established for the variable itself.
So fewer typos can sneak in and disrupt the operations. The open source version of Formae is available today on GitHub. The post Terraform challenger Formae expands to more clouds appeared first on The New Stack.
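The article describes Pkl only at a high level and includes no Pkl source, so the snippet below is not Pkl; it is a rough Python analogue of the same idea: a declared schema whose type annotations also constrain the permissible values, so an invalid setting fails before it can be rolled out. The LoadBalancerConfig class and its fields are invented purely for illustration.

from dataclasses import dataclass

@dataclass(frozen=True)
class LoadBalancerConfig:
    """Toy schema: types plus value constraints, checked when the config is built."""
    name: str
    port: int
    protocol: str

    def __post_init__(self):
        # These checks stand in for Pkl-style type annotations with permitted ranges.
        if not (1 <= self.port <= 65535):
            raise ValueError(f"port must be in 1..65535, got {self.port}")
        if self.protocol not in {"http", "https", "tcp"}:
            raise ValueError(f"unsupported protocol: {self.protocol!r}")

print(LoadBalancerConfig(name="edge-lb", port=443, protocol="https"))

try:
    LoadBalancerConfig(name="edge-lb", port=70000, protocol="https")
except ValueError as err:
    print("rejected before rollout:", err)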
Read more →

Leaker makes unlikely claim about improved iPhone 18 telephoto performance - 9to5Mac

A leaker better known for posting about Android smartphones has made a claim potentially pointing to improved telephoto performance in...
Read more →

Exclusive: First look at Samsung’s 25W wireless charger, possibly for Galaxy S26 - SamMobile

It was reported earlier that Samsung plans to introduce a major upgrade to wireless charging speeds with its future smartphones. The Galaxy S26 series supports up to 25W fast wireless charging. We are now bringing you an exclusive first look at Samsung’s 25W …
Read more →

Windows 11’s ability to resume Android apps on your PC is getting closer - The Verge

Microsoft is getting ready to improve its cross-device resume feature in Windows 11. You’ll soon be able to resume Spotify and Office docs from phone to PC.
Read more →

'When Did It Become Trendy to Hate on a New Game?' — as Highguard Struggles to Win Over the Internet, Video Game Developers Come to Its Defense - IGN

A number of high-profile video game developers have defended Highguard amid an online backlash during the game’s launch.
Read more →

‘The Secret Fear of the Morally Depraved’

Adam Serwer, reporting from the streets of Minneapolis for The Atlantic, “Minnesota Proved MAGA Wrong” (gift link): The secret fear of the morally depraved is that virtue is actually common, and that they’re the ones who are alone. In Minnesota, all of the ideological cornerstones of MAGA have been proved false at once. Minnesotans, not the armed thugs of ICE and the Border Patrol, are brave. Minnesotans have shown that their community is socially cohesive — because of its diversity and not in spite of it. Minnesotans have found and loved one another in a world atomized by social media, where empty men have tried to fill their lonely soul with lies about their own inherent superiority. Minnesotans have preserved everything worthwhile about “Western civilization,” while armed brutes try to tear it down by force. ★
Read more →

‘A CEO, Captured’

Om Malik: Cook is not stupid. He is not evil. He is trapped. The iron clasp of market expectations has turned him into what he never meant to be: a man who goes to parties at the White House while nurses die. In Tinker Tailor Soldier Spy, Roy Bland captures a cynical, post-ideological, corrupt English society: “You scratch my conscience; I’ll drive your Jag.” You could say the same of today’s Silicon Valley. It used to believe it could change the world. Now it just hopes the world won’t change its stock price. Amy Jane Gruber: If I ever meet Tim Cook I’m going to ask him if Mike Tyson enjoyed the movie. ★
Read more →

‘Aside From That, Mr. Cook, What Did You Think of the Movie?’

MG Siegler: Tim Cook is captured. There is simply no other explanation for his actions over the past year or so. But it perhaps culminated this weekend when Cook went to a special private showing of the documentary Melania at the White House. Yes, that Melania. That in and of itself would have probably been fine. I mean, it’s potentially problematic for a host of reasons that I’ll get to, but such is our world right now. Then one shot — a gunshot — turned attending that movie screening into a statement... While Cook was enjoying his popcorn and champagne with the likes of Mike Tyson, Tony Robbins, and other “VIPs”, it was complete and utter chaos on the streets of Minnesota. Just hours earlier, Alex Pretti, a 37-year-old ICU nurse, was shot and killed by ICE agents. Maybe, just maybe, postpone the movie premiere? ★
Read more →

‘Whatever’

Ben Terris, writing for New York Magazine: Fred Trump died in 1999 at age 93. He had, Trump said, a “heart that couldn’t be stopped” with almost no health conditions to speak of throughout his long life. “He had one problem,” Trump said. “At a certain age, about 86, 87, he started getting, what do they call it?” He pointed to his forehead and looked to his press secretary for the word that escaped him. “Alzheimer’s,” Leavitt said. “Like an Alzheimer’s thing,” Trump said. “Well, I don’t have it.” “Is it something you think about at all?” I asked. “No, I don’t think about it at all. You know why?” he said. “Because whatever it is, my attitude is whatever.” ★
Read more →

Clawdbot Is Now Moltbot

From the footer on the project’s website: Moltbot was formerly known as Clawdbot. Independent project, not affiliated with Anthropic. Makes sense, to be honest, that Anthropic would object to naming it a homonym for Claude. One additional followup to my post the other day. In his terrific introduction to Clawdbot (now Moltbot), Federico Viticci wrote: I’ve been playing around with Clawdbot so much, I’ve burned through 180 million tokens on the Anthropic API (yikes), and I’ve had fewer and fewer conversations with the “regular” Claude and ChatGPT apps in the process. Those tokens aren’t free. I asked Viticci just how much “yikes” cost, and he said around US$560 — using way more input than output tokens. ★
Read more →

★ The Names They Call Themselves

Jonathan Rauch, writing for The Atlantic, “Yes, It’s Fascism” (gift link): Until recently, I resisted using the F-word to describe President Trump. For one thing, there were too many elements of classical fascism that didn’t seem to fit. For another, the term has been overused to the point of meaninglessness, especially by left-leaning types who call you a fascist if you oppose abortion or affirmative action. For yet another, the term is hazily defined, even by its adherents. From the beginning, fascism has been an incoherent doctrine, and even today scholars can’t agree on its definition. Italy’s original version differed from Germany’s, which differed from Spain’s, which differed from Japan’s. [...] When the facts change, I change my mind. Recent events have brought Trump’s governing style into sharper focus. Fascist best describes it, and reluctance to use the term has now become perverse. That is not because of any one or two things he and his administration have done but because of the totality. Fascism is not a territory with clearly marked boundaries but a constellation of characteristics. When you view the stars together, the constellation plainly appears. Rauch goes on to describe that constellation clearly and copiously, with evidence. I agree, wholeheartedly, with his conclusion that “If, however, Trump is a fascist president, that does not mean that America is a fascist country.” The shoe fits, however tightly. But there’s a problem that’s been gnawing at me ever since the 2.0 Trump Administration began. The entire premise of Rauch’s essay — the issue he changed his mind about — is that it’s contentious to describe people, let alone an entire political party or government, as “fascist” or “Nazi”. With only the most extremist exceptions, it’s a broad cultural value — a shared global value, not merely an American or western one — that the Nazis and Fascists were abominable. Also, they were losers, and their complete and total destruction was celebrated around the world. Hitler shot himself, hiding in a dingy filthy bunker. Mussolini was summarily executed and his body strung up in a public square in Milan. Hirohito surrendered unconditionally and lived his remaining days in quiet shame and infamy. No matter how apt the definition of fascist fits the Trump regime, they themselves reject the term, as they do not see themselves as being on the wrong side, and the definition of fascism is that it’s wrong. And they (exemplified by Trump himself) have a deep-seated psychological aversion to being seen as losers, even when it is as plain to see as the sun that they have lost — and no one denies that the Fascists and Nazis lost, bigly. We call Benito Mussolini’s regime “fascist” because he coined the term. His political movement was literally named the Fascist Party. There was no debate whether Hitler and his regime were Nazis because that was their name. “Fascist” and “Nazi” weren’t slurs that were applied to them by their political or military opponents. That’s what they called themselves, and their names became universally recognized slurs because the actions and beliefs of the Fascists and Nazis were universally recognized as reprehensible and evil. And because they lost. Our goal should not be to make fascist or Nazi apply to Trump’s movement, no matter how well those rhetorical gloves fit his short-fingered disgustingly bruised hands. Don’t call Trump “Hitler”. Instead, work until “Trump” becomes a new end state of Godwin’s Law. 
The job won’t be done, this era of madness will not end, until we make the names they call themselves universally acknowledged slurs. “MAGA” and “Trumpist”, for sure. “Republican”, perhaps. Make those names shameful, deservedly, now, and there will be no need to apply the shameful names of hateful anti-democratic illiberal failed nationalist movements from a century ago. We need to assert this rhetoric with urgency, make their names shameful, lest the slur become our name — “American”.
Read more →

What It’s Like to Get Undressed by Grok

Ella Chakarian, writing for Rolling Stone (News+): On a recent Saturday afternoon, Kendall Mayes was mindlessly scrolling on X when she noticed an unsettling trend surface on her feed. Users were prompting Grok, the platform’s built-in AI feature, to “nudify” women’s images. Mayes, a 25-year-old media professional from Texas who uses X to post photos with her friends and keep up with news, didn’t think it would happen to her — until it did. “Put her in a tight clear transparent bikini,” an X user ordered the bot under a photo that Mayes posted from when she was 20. Grok complied, replacing her white shirt with a clear bikini top. The waistband of her jeans and black belt dissolved into thin, translucent strings. The see-through top made the upper half of her body look realistically naked. Hiding behind an anonymous profile, the user’s page was filled with similar images of women, digitally and nonconsensually altered and sexualized. Mayes wanted to cuss the faceless user out, but decided to simply block the account. She hoped that would be the end of it. Soon, however, her comments became littered with more images of herself in clear bikinis and skin-tight latex bodysuits. Mayes says that all of the requests came from anonymous profiles that also targeted other women. Though some users have had their accounts suspended, as of publication, some of the images of Mayes are still up on X. And: Emma, a content creator, was at the grocery store when she saw the notifications of people asking Grok to undress her images. [...] Numbness washed over Emma when the images finally loaded on her timeline. A selfie of her holding a cat had been transformed into a nude. The cat was removed from the photo, Emma says, and her upper body was made naked. Emma immediately made her account private and reported the images. In an email response reviewed by Rolling Stone, X User Support asked her to upload an image of her government-issued ID so they could look into the report, but Emma responded that she didn’t feel comfortable doing so. [...] In our call, she checked to see if some of the image edits she was aware of were still up on X. They were. “Oh, my God,” she says, letting out a defeated sigh. “It has 15,000 views. Oh, that’s so sad.” This fun app is available, free of charge, on the App Store, which means you know it’s safe and approved by Apple. Get it today. ★
Read more →

Assessing internal quality while coding with an agent

Erik Doernenburg is the maintainer of CCMenu: a Mac application that shows the status of CI/CD builds in the Mac menu bar. He assesses how using a coding agent affects internal code quality by adding a feature using the agent, and seeing what happens to the code. more…
Read more →

The Talk Show: ‘A Mitigated Disaster’

Daniel Jalkut returns to the show so we can both vent about MacOS 26 Tahoe. Sponsored by: Notion: The AI workspace where teams and AI agents get more done together. Squarespace: Save 10% off your first purchase of a website or domain using code talkshow. Sentry: A real-time error monitoring and tracing platform. Use code TALKSHOW for $80 in free credits. Factor: Healthy eating, made easy. Get 50% off your first box, plus free breakfast for 1 year, with code talkshow50off. ★
Read more →

CISA’s acting head uploaded sensitive files into public version of ChatGPT

Read more →

Open Responses vs. Chat Completion: A new era for AI apps

The ability to build portable, provider-agnostic AI applications is the future of agentic development. For the past few years, OpenAI’s Chat Completion API has been considered the de facto standard for interacting with LLMs. Major model providers, open source serving platforms, and AI gateways supported this standard. While this API served well during the stateless chatbot era, it falls short of many capabilities that agents expect. OpenAI officially transitioned from Chat Completion API to Responses API in March 2025. Compared to the former, the Responses API is designed to support native stateful conversations to handle multi-turn interactions. Through caching and reasoning support, it dramatically improves the API’s performance. Other capabilities include built-in tool integration, event streaming, and support for multimodal inputs (including text and images). Recently, OpenAI created a specification called Open Responses in collaboration with major ecosystem players, including Nvidia, Vercel, OpenRouter, Hugging Face, LM Studio, Ollama, and vLLM. Based on the Responses API, the specification is meant for building multi-provider, interoperable LLM interfaces. It defines a shared schema, client library, and tooling layer that enable a unified experience independent of the model type and model provider. For data scientists and AI developers building intelligent applications, understanding Open Responses is essential. The patterns mirror familiar API concepts from OpenAI’s ecosystem, including chat completions for message exchanges, tool calls for function invocation, streaming outputs for real-time responses, and multimodal inputs for handling text and images. This article breaks down Open Responses using these accessible parallels, delivering clarity for practitioners who operate in production environments. The problems that Open Responses solve Modern LLM applications have outgrown the chatbot paradigm. Developers building autonomous agents need models that reason over multiple steps, invoke tools autonomously, and maintain context across complex workflows. Yet the ecosystem remains fragmented around the Chat Completions format, which was a specification originally designed for turn-based conversations — but it falls short for agentic use cases. The mismatch manifests in several concrete problems: Manual state management: Chat Completions is stateless, requiring developers to shuttle entire conversation histories back and forth with each request. Tool orchestration complexity: Multi-step tool calling requires manual “loop until done” logic in application code. Lost reasoning context: Reasoning tokens from models like o3 and o4-mini are discarded between turns, degrading performance on agentic tasks. No built-in capabilities: Web search, file retrieval, and code execution require custom infrastructure. Though OpenAI addressed these limitations with the Responses API (/v1/responses) in March 2025, it has remained an opaque, proprietary interface. Open Responses defines a consistent request/response shape that any provider can implement. In practical terms, it lets you keep one client integration while switching the backend model runtime. If you’ve ever maintained multiple SDKs for multiple model providers, you already understand the pain this removes. For teams utilizing both a hosted frontier model and a local open source model, managing branching logic across applications without a unified API becomes complex. 
By adopting Open Responses, the integration achieves stability with only routing modifications required. This approach, centered on stable contracts and swappable implementations, is essential for maintaining robust and maintainable real-world systems.
Comparing Chat Completion API with Open Responses
To understand the magnitude of the shift, we must compare the developer experience and architectural footprint of the two paradigms, legacy Chat Completion (v1/chat) versus Open Responses (v1/responses):
Control logic: Legacy is client-side (the developer writes while loops, parses JSON, handles retries); Open Responses is server-side (the developer declares intent, and the server manages the loop/state machine).
State: Legacy is stateless (history must be re-uploaded with every request); Open Responses is stateful (previous_response_id loads context from the server cache).
Streaming: Legacy emits token deltas (raw text chunks that are hard to parse into structures); Open Responses emits semantic events (typed events such as tool.start, tool.end, content.add).
Tool execution: Legacy is client-driven (the client executes and re-prompts, with high latency); Open Responses is server-driven (the server executes internal tools and manages flow for external ones).
Reasoning: Legacy is implicit (mixed into content or hacked via thinking tags); Open Responses is explicit (dedicated content, encrypted_content, and summary fields).
Multimodality: Legacy is bolted on (images sent as URLs in text messages); Open Responses is native (polymorphic Items support images and video as first-class citizens).
Network traffic: Legacy is high (N round-trips for N steps, with a full history upload); Open Responses is low (one request for N steps, uploading only the delta input).
Ecosystem backing
The launch partners represent comprehensive ecosystem coverage for the Responses API specification:
OpenAI: Full Responses API (original)
Hugging Face: Inference Providers integration, early access via Spaces
OpenRouter: Launch partner, enabling “almost every existing model”
NVIDIA NIM: Experimental /v1/responses endpoint support
Ollama: Added in v0.13.3, non-stateful flavor
vLLM: Full Responses API compatible server
LM Studio: Open Responses compliant endpoint
Azure OpenAI: Full Responses API via Microsoft
The beginning of the agentic era
The introduction of Open Responses marks the end of the “Chatbot Era” and the beginning of the “Agentic Era.” For too long, developers have struggled with the “Square Peg, Round Hole” problem of forcing autonomous behaviors into conversational APIs. The resulting “Agentic Hell” of brittle, high-latency, client-side loops held back the true potential of AI. Open Responses solves this by recognizing that Agency is an Infrastructure Problem, not just a model capability problem. By standardizing the Agentic Loop, defining polymorphic Items, and solving the state management crisis, it provides the robust foundation needed to build the next generation of software. The new standard offers clear benefits for both enterprises and the open source community. For enterprises, adopting the standard is key to future-proofing applications against vendor lock-in and enabling hybrid-cloud deployments via Nvidia NIM and locally hosted models. For the open source community, the standard provides a rallying cry — a shared language that allows a federated ecosystem of models, tools, and routers to compete with the monolithic silos of the proprietary giants. We are no longer just chatting with text. We are orchestrating cognition, and Open Responses is the conductor’s baton. The post Open Responses vs. Chat Completion: A new era for AI apps appeared first on The New Stack.
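To make the stateful-versus-stateless contrast concrete, here is a minimal Python sketch of a two-turn exchange over plain HTTP against a hypothetical Open Responses-compatible server; the base URL, API key, and model name are placeholders, and the request fields follow the Responses shape described above (model, input, previous_response_id). Passing previous_response_id on the second turn is what replaces re-uploading the full conversation history.

import requests

BASE_URL = "https://api.example.com/v1"   # placeholder for any Open Responses-compatible server
API_KEY = "sk-placeholder"                # placeholder credential
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

def create_response(input_text, previous_response_id=None):
    # POST to the /v1/responses endpoint discussed in the article.
    payload = {"model": "example-model", "input": input_text}
    if previous_response_id:
        # Server-side state: reference the prior turn instead of resending the history.
        payload["previous_response_id"] = previous_response_id
    resp = requests.post(f"{BASE_URL}/responses", headers=HEADERS, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()

first = create_response("Summarize why stateful APIs help agentic workloads.")
follow_up = create_response("Now give a one-sentence example.", previous_response_id=first.get("id"))
print(follow_up)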
Read more →

The agentic revolution: A new vision for SREs

Site reliability engineers (SREs) are no longer an afterthought for harried IT leaders. They play a critical role in ensuring digital services work reliably at scale. But as complexity builds and incident volumes grow, SRE teams are being stretched thin by manual processes that degrade their value to the organization. This is where AI agents can help SRE teams break free of a reactive doom loop. When deployed strategically, agents can enable teams to move past toil and proactively enhance operational efficiency and resilience. By automatically surfacing context, executing diagnostics and remediations, and generating self-updating runbooks, AI agents empower SREs to prioritize their attention on the most critical matters. SREs vs. DevOps SRE is still an often-misunderstood role. It’s not interchangeable with DevOps, but rather brings an engineering discipline to operations for improved reliability and uptime. The production and success of SRE teams can be elevated through their ability to automate repeatable tasks. Organizations can incorporate SREs into IT operations in various ways. There might be a centralized department serving the entire organization. There may be one or two SREs embedded within the engineering team. Or SREs might act as consultants, available on an “as-needed” basis. In some instances, developers might even be encouraged to adopt SRE skills. Regardless of the model, a persistent challenge threatens to undermine their value. Site reliability engineering, like IT operations in general, is buckling under the weight of inefficient tools and manual processes. Enhancing SRE workflows To relieve that operational burden, many SREs are already using generative AI (GenAI). While GenAI can accelerate incident resolution, it still demands input from human experts. Teams don’t just want AI assistants. They want AI agents that SREs can fully offload low-risk, toilsome tasks to. As the adoption of AI agents increases, SREs will evolve into supervisors of a new digital workforce, delegating tasks for all issues except for the most complex or novel ones. How might agentic AI look in practice for SREs? Consider how an AI agent can surface useful contextual information for investigators to drill down into. This might include previously resolved incidents involving the same service to immediately highlight how similar issues were remediated in the past, including responder notes. Agents can further enhance context for SRE incident responders by including information on related active issues across different services, which would provide the SRE with crucial real-time information on the scope of the incident and any potential dependencies. Using this information, an AI agent could go a step further by suggesting where an issue has originated, and whether recent configuration or other changes may be the root cause. The most effective agentic tools will continuously learn from SRE feedback and successful remediation, enabling the AI agents to get smarter and more sophisticated as time goes on. The next steps Once an issue is diagnosed and context delivered, remediation is the next stage that AI agents can optimize. For low-risk, well-understood issues with clearly defined and known solutions, an agent could triage and remediate without any human input. All the SRE would need to do is review the after-action report to ensure it’s correct and check for any potential improvements. 
At the other end of the spectrum, novel or major incidents will require SREs to guide the investigation and develop their own remediation plan. In this scenario, the agent’s value is in automatically collecting useful contextual information and answering any questions. Sitting in the middle are partially understood incidents, which are familiar but typically have multiple possible causes or solutions. In this scenario, the SRE agent would first cross-reference an alert with historical operations data and real-time signals. It might nudge the SRE into running further diagnostics or supply them automatically so the SRE has a range of possible causes to consider upon arrival. The AI agent would then suggest possible remediation steps, further reducing manual effort and time to action. The result of this remediation, as well as any feedback from the engineer, would help to generate a self-updating runbook consisting of which actions worked best. This continuous learning approach helps to prevent recurring issues and enable faster resolutions with fewer people. Getting started To extract maximum value from AI agents, managers will have to be careful about the way they position the technology. Managers will need to equip SREs with the right training in areas such as data security, output validation, and workflow creation. The best systems will be vendor agnostic to better surface real-time information from across the entire IT environment and will have access to as much historical operations data as possible. The benefits of getting this right could be transformative. In the right circumstances, AI agents can resolve incidents faster, reduce SRE toil and burnout, and proactively optimize processes in ways even human experts might not spot. Above all, this means SREs can focus on the work that really matters: supporting innovation and growth. The post The agentic revolution: A new vision for SREs appeared first on The New Stack.
Read more →

7 learnings from Anders Hejlsberg: The architect behind C# and TypeScript

Anders Hejlsberg’s work has shaped how millions of developers code. Whether or not you recognize his name, you likely have touched his work: He’s the creator of Turbo Pascal and Delphi, the lead architect of C#, and the designer of TypeScript. We sat down with Hejlsberg to discuss his illustrious career and what it’s felt like to watch his innovations stand up to real-world pressure. In a long-form conversation, Hejlsberg reflects on what language design looks like once the initial excitement fades, when performance limits appear, when open source becomes unavoidable, and how AI can impact a tool’s original function. What emerges is a set of patterns for building systems that survive contact with scale. Here’s what we learned. Watch the full interview above. Fast feedback matters more than almost anything else Hejlsberg’s early instincts were shaped by extreme constraints. In the era of 64KB machines, there was no room for abstraction that did not pull its weight. “You could keep it all in your head,” he recalls. “When you typed your code, you wanted to run it immediately.” – Anders Hejlsberg Turbo Pascal’s impact did not come from the Pascal language itself. It came from shortening the feedback loop. Edit, compile, run, fail, repeat, without touching disk or waiting for tooling to catch up. That tight loop respected developers’ time and attention. The same idea shows up decades later in TypeScript, although in a different form. The language itself is only part of the story. Much of TypeScript’s value comes from its tooling: incremental checking, fast partial results, and language services that respond quickly even on large codebases. The lesson here is not abstract. Developers can apply this directly to how they evaluate and choose tools. Fast feedback changes behavior. When errors surface quickly, developers experiment more, refactor more confidently, and catch problems closer to the moment they are introduced. When feedback is slow or delayed, teams compensate with conventions, workarounds, and process overhead. Whether you’re choosing a language, framework, or internal tooling, responsiveness matters. Tools that shorten the distance between writing code and understanding its consequences tend to earn trust. Tools that introduce latency, even if they’re powerful, often get sidelined. Scaling software means letting go of personal preferences As Hejlsberg moved from largely working alone to leading teams, particularly during the Delphi years, the hardest adjustment wasn’t technical. It was learning to let go of personal preferences. “You have to accept that things get done differently than you would have preferred. Fixing it would not really change the behavior anyway.” – Anders Hejlsberg That mindset applies well beyond language design. Any system that needs to scale across teams requires a shift from personal taste to shared outcomes. The goal stops being code that looks the way you would write it, and starts being code that many people can understand, maintain, and evolve together. C# did not emerge from a clean-slate ideal. It emerged from conflicting demands. Visual Basic developers wanted approachability, C++ developers wanted power, and Windows demanded pragmatism. The result was not theoretical purity. It was a language that enough people could use effectively. Languages do not succeed because they are perfectly designed. They succeed because they accommodate the way teams actually work.
Why TypeScript extended JavaScript instead of replacing it TypeScript exists because JavaScript succeeded at a scale few languages ever reach. As browsers became the real cross-platform runtime, teams started building applications far larger than dynamic typing comfortably supports. Early attempts to cope were often extreme. Some teams compiled other languages into JavaScript just to get access to static analysis and refactoring tools. That approach never sat well with Hejlsberg. Telling developers to abandon the ecosystem they were already in was not realistic. Creating a brand-new language in 2012 would have required not just a compiler, but years of investment in editors, debuggers, refactoring tools, and community adoption. Instead, TypeScript took a different path. It extended JavaScript in place, inheriting its flaws while making large-scale development more tractable. This decision was not ideological, but practical. TypeScript succeeded because it worked with the constraints developers already had, rather than asking them to abandon existing tools, libraries, and mental models. The broader lesson is about compromise. Improvements that respect existing workflows tend to spread while improvements that require a wholesale replacement rarely do. In practice, meaningful progress often comes from making the systems you already depend on more capable instead of trying to start over. Visibility is a part of what makes open source work TypeScript did not take off immediately. Early releases were nominally open source, but development still happened largely behind closed doors. That changed in 2014 when the project moved to GitHub and adopted a fully public development process. Features were proposed through pull requests, tradeoffs were discussed in the open, and issues were prioritized based on community feedback. This shift made decision-making visible. Developers could see not just what shipped, but why certain choices were made and others were not. For the team, it also changed how work was prioritized. Instead of guessing what mattered most, they could look directly at the issues developers cared about. The most effective open source projects do more than share code. They make decision-making visible so contributors and users can understand how priorities are set, and why tradeoffs are made. Leaving JavaScript as an implementation language was a necessary break For many years, TypeScript was self-hosted. The compiler was written in TypeScript and ran as JavaScript. This enabled powerful browser-based tooling and made experimentation easy. Over time, however, the limitations became clear. JavaScript is single-threaded, has no shared-memory concurrency, and its object model is flexible (but expensive). As TypeScript projects grew, the compiler was leaving a large amount of available compute unused. The team reached a point where further optimization would not be enough. They needed a different execution model. The controversial decision was to port the compiler to Go. This was not a rewrite. The goal was semantic fidelity. The new compiler needed to behave exactly like the old one, including quirks and edge cases. Rust, despite its popularity, would have required significant redesign due to ownership constraints and pervasive cyclic data structures. Go’s garbage collection and structural similarity made it possible to preserve behavior while unlocking performance and concurrency. The result was substantial performance gains, split between native execution and parallelism. 
More importantly, the community did not have to relearn the compiler’s behavior. Sometimes the most responsible choice isn’t the most ambitious one, but instead preserves behavior, minimizes disruption, and removes a hard limit that no amount of incremental optimization can overcome. In an AI-driven workflow, grounding matters more than generation Hejlsberg is skeptical of the idea of AI-first programming languages. Models are best at languages they have already seen extensively, which naturally favors mainstream ecosystems like JavaScript, Python, and TypeScript. But AI does change things when it comes to tooling. The traditional IDE model assumed a developer writing code and using tools for assistance along the way. Increasingly, that relationship is reversing. AI systems generate code. Developers supervise and correct. Deterministic tools like type checkers and refactoring engines provide guardrails that prevent subtle errors. In that world, the value of tooling is not creativity. It is accuracy and constraint. Tools need to expose precise semantic information so that AI systems can ask meaningful questions and receive reliable answers. The risk is not that AI systems will generate bad code. Instead, it’s that they will generate plausible, confident code that lacks enough grounding in the realities of a codebase. For developers, this shifts where attention should go. The most valuable tools in an AI-assisted workflow aren’t the ones that generate the most code, but the ones that constrain it correctly. Strong type systems, reliable refactoring tools, and accurate semantic models become essential guardrails. They provide the structure that allows AI output to be reviewed, validated, and corrected efficiently instead of trusted blindly. Why open collaboration is critical Despite the challenges of funding and maintenance, Hejlsberg remains optimistic about open collaboration. One reason is institutional memory: years of discussion, decisions, and tradeoffs remain searchable and visible, available to anyone who wants to understand how and why a system evolved. “We have 12 years of history captured in our project,” he explains. “If someone remembers that a discussion happened, we can usually find it. The context doesn’t disappear into email or private systems.” That visibility changes how systems evolve. Design debates, rejected ideas, and tradeoffs remain accessible long after individual decisions are made. For developers joining a project later, that shared context often matters as much as the code itself. A pattern that repeats across decades Across four decades of language design, the same themes recur:
Fast feedback loops matter more than elegance
Systems need to accommodate imperfect code written by many people
Behavioral compatibility often matters more than architectural purity
Visible tradeoffs build trust
These aren’t secondary concerns. They’re fundamental decisions that determine whether a tool can adapt as its audience grows. Moreover, they ground innovation by ensuring new ideas can take root without breaking what already works. For anyone building tools they want to see endure, those fundamentals matter as much as any breakthrough feature. And that may be the most important lesson of all.
Did you know TypeScript was the top language used in 2025? Read more in the Octoverse report > The post 7 learnings from Anders Hejlsberg: The architect behind C# and TypeScript appeared first on The GitHub Blog.
Read more →

How to secure Vertex AI pipelines with Google Cloud tools

AI models now power critical systems across many sectors. You’ll find them in healthcare, banking, cybersecurity, and defense. When you move these models to production on Vertex AI, the attack surface grows fast. Your data, model weights, pipelines, and APIs all face risks. In this guide, you’ll learn how to secure models built with Vertex AI, including data sources, model files, pipelines, and endpoints, using tools already built into Google Cloud. These include identity and access management (IAM), VPC Service Controls, data loss prevention, Artifact Registry, and Cloud Audit Logs. Each tool adds a new layer to your defense strategy. Together, they help build zero trust protection for your machine learning workloads.
Why securing Vertex AI pipelines matters
AI pipelines are attractive targets for attackers. Once compromised, they can affect models, systems, and even end users. Below are key threat vectors and how they affect real-world systems:
Data poisoning: manipulated training data → biased/inaccurate model
Model theft (exfiltration): IP leakage of proprietary LLMs or classifiers
Insecure pipeline execution: unauthorized access or lateral movement
Unprotected inference APIs: prompt injection, model abuse, or DoS attacks
These threats affect various parts of your machine learning (ML) workflow. These risks may cause data leaks, system failures, and even lost trust without the right security. So, knowing each one early helps you build safer and stronger AI systems.
Security layers for Vertex AI workloads
Each layer must be hardened individually and monitored continuously.
Step-by-step: Securing Vertex AI models on GCP
1. Enforce IAM on datasets and pipelines
Start by managing who can access your data and pipelines. Use identity and access tools in Google Cloud to set clear rules. Give each person or service only the access they truly need. For example, if someone only needs to read data, do not allow them to run training jobs. This prevents mistakes and stops attackers from moving through your system. Keeping access tight protects your data and keeps your machine learning projects safe.
gcloud projects add-iam-policy-binding genai-project \
  --member="user:ml-engineer@example.com" \
  --role="roles/aiplatform.user"
Restrict access to training datasets:
gcloud projects add-iam-policy-binding genai-project \
  --member="serviceAccount:training-sa@genai-project.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataViewer"
2. Scan training data for PII with DLP
Before training your model, review the data for sensitive or personally identifiable information (PII). Use Google Cloud’s data loss prevention tools to identify and remove anything that shouldn’t be included.
gcloud dlp inspect bigquery \
  --dataset-id=training_dataset \
  --table-id=users_raw \
  --min-likelihood=LIKELY \
  --info-types=EMAIL_ADDRESS,PHONE_NUMBER,NAME
Automatically flag sensitive data before it enters your pipeline.
3. Use VPC Service Controls to isolate ML projects
Keep your machine learning projects separate from the public internet. Set up VPC Service Controls to create secure boundaries around your data and services. This helps block unauthorized access from outside your network.
gcloud access-context-manager perimeters create genai-perimeter \
  --resources=projects/genai-project \
  --restricted-services=aiplatform.googleapis.com,bigquery.googleapis.com
It prevents data exfiltration from AI workloads to unauthorized services.
4. Secure model artifacts in Artifact Registry
Store your models safely using Artifact Registry. This tool lets you track model versions and manage access. It lowers the risk of theft or tampering.
gcloud artifacts repositories create genai-models \
  --repository-format=docker \
  --location=us-central1 \
  --description="Private AI Model Store"
Limit access to approved service accounts only:
gcloud artifacts repositories add-iam-policy-binding genai-models \
  --location=us-central1 \
  --member="serviceAccount:ci-cd@genai-project.iam.gserviceaccount.com" \
  --role="roles/artifactregistry.writer"
5. Harden Vertex AI pipelines with workload identity
Use Kubernetes service accounts linked to Google Cloud identities. This way, each pipeline component has its own secure identity. It prevents unauthorized actions and keeps your pipelines safe.
gcloud iam service-accounts add-iam-policy-binding \
  pipeline-sa@genai-project.iam.gserviceaccount.com \
  --member="serviceAccount:genai-project.svc.id.goog[ml-pipelines/pipeline-runner]" \
  --role="roles/aiplatform.customCodeServiceAgent"
It prevents hardcoded credentials in Kubeflow or Cloud Build jobs.
6. Protect inference endpoints with IAP and rate limiting
Secure your model’s endpoints using Cloud Endpoints and Identity-Aware Proxy. This controls who can access your models. Add rate limiting to stop misuse and reduce the risk of attacks.
gcloud compute backend-services update genai-inference \
  --iap=enabled,oauth2-client-id=CLIENT_ID,oauth2-client-secret=SECRET
Add quota restrictions to prevent abuse:
Quota:
  limits:
  - name: predict-requests
    metric: "ml.googleapis.com/predict"
    unit: "1/min/{project}"
    values:
      STANDARD: 100
7. Enable audit logging for full visibility
Turn on audit logging to track all actions on your AI resources. This helps you spot unusual activity quickly and fix problems before they grow.
gcloud logging sinks create vertex-logs-sink \
  bigquery.googleapis.com/projects/genai-project/datasets/audit_logs \
  --log-filter='resource.type="aiplatform.googleapis.com/PipelineJob"'
Use Looker Studio or BigQuery to visualize:
Pipeline executions: query execution logs in BigQuery and create charts from them in Looker Studio
Model deployment events: query deployment event data in BigQuery and visualize deployment timelines and statuses in Looker Studio
Data access logs: query access logs in BigQuery and build dashboards showing access patterns in Looker Studio
Vertex AI Security Checklist
IAM on pipelines and data: Cloud IAM + conditions
Sensitive data detection: Cloud DLP + BigQuery
Artifact integrity: Artifact Registry + signed images
Network isolation: VPC Service Controls
Pipeline authentication: Workload Identity Federation
Inference access control: IAP + quotas + OAuth2
Audit and drift detection: Cloud Logging + Security Command Center + Recommender
This checklist maps key security controls to their related GCP tools. It covers access management, data protection, artifact security, and network isolation. Tools like Cloud IAM, Cloud DLP, Artifact Registry, VPC Service Controls, and Workload Identity enforce these controls efficiently.
Conclusion
Securing AI models is not just about the infrastructure. It is all about keeping trust in the system. You can deploy powerful machine learning models with Vertex AI. However, without the right controls, you risk data leaks, IP theft, and attacks. Using a layered defense approach helps protect your AI workloads from raw data to deployment.
Key tools include IAM, DLP, VPC Service Controls, and Artifact Registry. In 2026, AI security is cloud security. If you deploy ML pipelines on Google Cloud, treat your models as valuable assets. Build strong defenses to keep them safe. The post How to secure Vertex AI pipelines with Google Cloud tools appeared first on The New Stack.
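As a companion to step 7 above, here is a minimal sketch of querying the exported audit logs from Python. This is my own illustration, not part of the original guide: the dataset name matches the sink created above, but the wildcard table name and the protopayload_auditlog fields assume the standard Cloud Audit Logs BigQuery export schema, so verify both in your own project before relying on it.

```python
# Rough sketch: summarize recent Vertex AI API activity from audit logs that a
# Cloud Logging sink exports to BigQuery. Dataset/table names and the
# protopayload_auditlog schema are assumptions to verify in your project.
from google.cloud import bigquery

client = bigquery.Client(project="genai-project")

query = """
SELECT
  TIMESTAMP_TRUNC(timestamp, DAY) AS day,
  protopayload_auditlog.authenticationInfo.principalEmail AS caller,
  protopayload_auditlog.methodName AS method,
  COUNT(*) AS calls
FROM `genai-project.audit_logs.cloudaudit_googleapis_com_activity_*`
WHERE protopayload_auditlog.serviceName = 'aiplatform.googleapis.com'
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY day, caller, method
ORDER BY day DESC, calls DESC
"""

# Print a simple activity summary; the same query can feed a Looker Studio chart.
for row in client.query(query).result():
    print(f"{row.day} {row.caller} {row.method}: {row.calls}")
```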
Read more →

Ai2 makes building custom coding agents easier and cheaper

The Allen Institute for AI (Ai2) is launching a new family of open coding agents today that, as standalone models, outperform similar-sized models on standard benchmarks. But what makes this project stand out is that Ai2 is also open sourcing a collection of tools that lets anyone fine-tune the model based on their private codebases, documentation, and other materials for significantly better performance on domain-specific tasks. “Over the past year, coding agents have transformed how developers write, test, and maintain software. These systems can debug, refactor, and even submit pull requests — fundamentally changing what software development looks like,” Ai2 writes in today’s announcement. “Yet despite this progress, most coding agents share the same constraints: They’re closed, expensive to train, and difficult to study or adapt to private codebases.” The cost of doing so? $400 to replicate Ai2’s results and just over $2,000 for the best performance. Comparable approaches, Ai2 notes, can cost up to 11 times more. To train its so-called SERA (Soft-verified Efficient Repository Agents) model, the first in Ai2’s Open Coding Agents collection, the team used a cluster of two Nvidia H100s.

[Image credit: Ai2]

Two models, full recipes included

Ai2 is launching two models under the SERA moniker: SERA-32B, which bests other models like Qwen3-Coder and Mistral’s Devstral Small 2, and SERA-8B. SWE-Bench Verified is a benchmark that tests whether AI coding agents can resolve real-world GitHub issues from a subset of popular Python repositories. This smaller model only solves 29.4% of SWE-Bench Verified problems, but that’s still well above similar-sized open models. The large model, however, solves 55% of those problems.

[Image credit: Ai2]

The team is releasing the models, code, all generated agent data, and the full recipe for any team to generate their own data. One of the interesting results of the team’s research was that the smaller fine-tuned model would often replicate, and at times exceed, the performance of its larger “teacher” coding agent. A 32B model, when fine-tuned on a company codebase, often outperforms its 100B-parameter teacher model.

How Ai2 cut training costs

At the core of Ai2’s efforts to keep the models both performant and affordable are two innovations, the team explains. The first is soft-verified generation (SVG). As Ai2 notes, when creating synthetic training data for these kinds of models, the traditional approach is to create pairs of incorrect code and its corrected version. Counterintuitively, Ai2 found that including only partially correct solutions in the training set still produced models that generate fully correct code. Creating the traditional set of “hard-verified” incorrect/corrected code pairs necessitates a lot of thorough, compute-intensive testing. But as it turns out, that isn’t necessary. The second innovation is that to diversify the training dataset, Ai2 created a taxonomy of 51 bug patterns. Its tools then generate prompts for bugs in each function in a repository, yielding what the team calls “thousands of varied agentic trajectories at low cost.” As it turns out, training on realistic developer workflows matters more than perfectly verified code pairs. “We believe bringing the cost of replicating strong coding agents down to a few hundred dollars will unlock research that simply wasn’t possible before,” the Ai2 team writes. 
“Instead of being limited to a handful of well-funded labs, agentic coding can become a widely accessible direction for small teams, students, and independent developers.” The post Ai2 makes building custom coding agents easier and cheaper appeared first on The New Stack.
Read more →

RAG isn’t dead, but context engineering is the new hotness

So, is RAG (Retrieval-Augmented Generation) dead now? Last May I asked that question of Douwe Kiela, CEO of Contextual AI, based on the growing hype around MCP (Model Context Protocol). Both are data retrieval mechanisms for Large Language Models, but it’s MCP that has taken all the headlines over the past year. The truth is, RAG has fallen away as a term used by developers and AI engineers. Even Kiela, who co-authored the 2020 academic paper that introduced RAG to the world, admits that a trendy new term has taken over. “I think people have rebranded it now as context engineering, which includes MCP and RAG,” he said. “I mean, the ‘R’ in RAG just stands for ‘retrieval.’ So, I think I said this last time too, if you’re using MCP to do your retrieval, then it’s basically RAG, right?” RAG is still an integral part of Contextual AI’s stack — it’s in their documentation, despite no longer rating a mention on the homepage. Regardless, Contextual AI chose the right company name if “context engineering” is the term du jour now. Agent Composer Launch Like many other AI companies, Contextual AI is also now all-in on agents. This week it launched a new tool called Agent Composer, which the company described in a press release as “the infrastructure and orchestration layer that manages context, enforces guardrails, and maintains agent reliability throughout multi-step engineering workflows.” Agent Composer joins the other tools available on the Contextual AI platform, which Kiela describes as a “context layer.” “So you have the language model, you have your data,” he explained. “And if you’re an enterprise, you have your data all over the place, and it’s very, very noisy, right? And these companies are not going to consolidate all of that data into one place, so what you can do with our platform is you can hook up all these different data sources.” From all those data sources, users create what Contextual AI calls “data stores.” Part of what Agent Composer will do, says Kiela, is help enterprises build agents on top of their data stores. As the diagram below shows, Agent Composer includes all the pieces an enterprise would want to create agents: pre-built templates, a prompting interface, a visual builder, APIs, and so on. Contextual AI platform; image via the company. Claude Code and Enterprise Wrappers I noted that AI coding tools like Claude Code and Cursor have been tremendously popular in enterprises over the past year or so. Presumably, many enterprise developers are already using those tools to create custom agents, so what does Contextual AI’s Agent Composer offer that the likes of Claude Code don’t? “I would say that those [AI coding tools], they’re essentially harnesses for language models,” Kiela replied. “So ‘harness’ is one of the buzzwords right now. So I think you can think of our platform as a way to create ‘custom harnesses.’ You can basically build your own Cursor, or you can run your own specific instance of Claude Code on our platform, so that you don’t have to worry about running things locally, or things like that.” I think what he means is that Claude Code and Cursor are wrappers around an AI model, but they’re often tied to a developer’s computer by being a CLI tool or a desktop app. Contextual allows enterprises to create their own wrappers, but they’re hosted centrally — which comes with the security and governance benefits that enterprises typically require. 
“…you can think of our platform as a way to create ‘custom harnesses.’ You can basically build your own Cursor, or you can run your own specific instance of Claude Code on our platform.” – Douwe Kiela, CEO of Contextual AI Another big trend currently is agent development platforms. LangChain, perhaps the original AI engineering tool, is currently promoting its “agent engineering platform” — called LangSmith — on its homepage. I asked Kiela how Contextual AI compares to a product like LangSmith? “I think they’re more focused on lower-level developers and what I would call more indie developers,” he replied. “So it’s really about SaaS prototyping, and they have lots of different options. I think we are much more opinionated and much more enterprise grade, so we’re really focused on enterprise developers and users of [those] solutions.” From Prompt Engineering to Context Engineering Terminology changes so fast in the AI era of development. So what does “context engineering” even mean, in relation especially to AI agents? It just so happens that Anthropic, perhaps the most trendy AI development company right now, thanks to Claude Code, wrote an explainer last September. Anthropic contends that “context engineering is the natural progression of prompt engineering.” Rather than giving a series of prompts to an LLM, as in the old days of 2022-2023, engineers are now encouraged to manage “the entire context state (system instructions, tools, Model Context Protocol (MCP), external data, message history, etc).” The term “agent” itself is problematic, but most people agree that it’s a software program that runs in a loop. According to Anthropic, an agent “running in a loop generates more and more data that could be relevant for the next turn of inference, and this information must be cyclically refined.” So that’s what context engineering does. “…there’s always a trade-off between how much information you want to pre-process […] and how much information you want to search during query time.” – Kiela Specifically, Anthropic says that Claude Code takes a “just in time” approach to context engineering, meaning it will “dynamically load data into context at runtime using tools.” I asked Kiela if Contextual AI does a similar thing? “Yeah, so, most of these solutions are just-in-time,” he said. “If you sort of zoom out, there’s always a trade-off between how much information you want to pre-process — so when you do the ingestion of documents — and how much information you want to search during query time… so, just-in-time, essentially. And so the right trade-off between those two modes of processing really depends on the problem that you’re solving. So in some cases, if you have to be blazingly fast, you probably want to do a lot more pre-processing. If you have a bit more time and you can be agentic, then maybe you don’t need to do as much of that, because you can have multiple tries and all kinds of different strategies for getting to the answer.” Agentic Use Cases So what kinds of agentic solutions are Contextual AI’s customers actually implementing currently? Kiela replied that his company tends to focus on “hard engineering,” like the semiconductor industry. “So within that, we see a lot of traction around enabling engineers to move faster by having access to all of the internal knowledge, so kind of unlocking institutional engineering knowledge,” he said. One of their more popular use cases is doing a root cause analysis with an agent, a process described in a November blog post. 
“So that’s quite powerful,” he continued. “It’s really taking log dumps or all kinds of different data sets around something going wrong, and then you need to analyze what the root cause is. You can cross-reference that with internal documentation, maybe with existing bug reports. Maybe you want to automatically open up a PR on your code base that fixes it. So there’s a lot of interest in that.” Conclusion In summary, then, RAG is not dead — it’s just been rebranded to “context engineering.” Also, it’s clear that the practice of software engineering in the agentic era continues to evolve. Companies like Contextual AI and Anthropic provide tools for a range of developers to tweak agent loops. Prompting? That’s so over. Now it’s about managing “the entire context state,” as Anthropic puts it. The post RAG isn’t dead, but context engineering is the new hotness appeared first on The New Stack.
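For readers who want to see what “managing the entire context state” can look like in code, here is a minimal, self-contained sketch. It is my own illustration, not Contextual AI’s or Anthropic’s implementation; every name in it (build_context, Document, the token heuristic) is made up. It assembles system instructions, tool names, just-in-time retrieved documents, and as much recent message history as fits under a crude token budget.

```python
# Toy illustration of "context engineering": assemble the context state
# (instructions, tools, retrieved data, trimmed history) for the next model
# call. Names and the 4-chars-per-token estimate are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Document:
    source: str
    text: str

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic, good enough for a sketch

def retrieve(query: str, corpus: list[Document], k: int = 3) -> list[Document]:
    # Stand-in for query-time ("just in time") retrieval: naive keyword overlap.
    terms = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: -len(terms & set(d.text.lower().split())))
    return scored[:k]

def build_context(system: str, tools: list[str], history: list[str],
                  query: str, corpus: list[Document], budget: int = 2000) -> str:
    parts = [system, "Tools available: " + ", ".join(tools)]
    parts += [f"[{d.source}] {d.text}" for d in retrieve(query, corpus)]
    # Keep the most recent conversation turns that still fit the budget.
    remaining = budget - sum(estimate_tokens(p) for p in parts) - estimate_tokens(query)
    kept: list[str] = []
    for turn in reversed(history):
        if estimate_tokens(turn) > remaining:
            break
        kept.insert(0, turn)
        remaining -= estimate_tokens(turn)
    return "\n\n".join(parts + kept + [query])

corpus = [Document("runbook", "Restart the ingest worker if the queue backs up."),
          Document("faq", "Billing questions go to the finance channel.")]
print(build_context("You are an ops assistant.", ["search_logs", "open_ticket"],
                    ["user: queue is slow", "assistant: checking the ingest worker"],
                    "Why is the queue backed up?", corpus))
```

The trade-off Kiela describes maps onto where the work happens in a sketch like this: heavy pre-processing would move effort into building the corpus at ingestion time, while a more agentic system would spend it in retrieve() at query time.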
Read more →

Help shape the future of open source in Europe

At GitHub, we believe that open source is a primary driver of innovation, security, and economic competitiveness. The European Union is currently at a pivotal moment in defining how it supports this ecosystem, and it wants to hear from you, the builders. The European Commission is planning to adopt an open source strategy called “Towards European Open Digital Ecosystems”. This initiative is not about passing new laws; instead the EU is looking to develop a strategic framework and funding measures to help the EU open source sector scale up and become more competitive. This effort aims to strengthen the EU’s technological sovereignty by supporting open source software and hardware across critical sectors like AI, cloud computing, and cybersecurity. We’ve been advocating for this kind of support for a long time. For instance, we previously highlighted the need for a European Sovereign Tech Fund to invest in the maintenance of critical basic open source technologies such as libraries or programming languages. This new strategy is a chance to turn those kinds of ideas into official EU policy. You can read GitHub’s response to the European Commission here. Brand new data from GitHub Innovation Graph shows that the EU is a global open source powerhouse: There are now almost 25 million EU developers on GitHub, who made over 155 million contributions to public projects in the last year alone. The EU wants to help European companies turn open source projects into successful businesses, which is an admirable goal with plenty of opportunities to achieve it. For example, the EU can create better conditions for open source businesses by making it easier for them to participate in public procurement and access the growth capital they need to turn great code into sustainable products. By supporting the business models and infrastructure that surround it, the EU can turn its massive developer talent into long-term economic leadership. It is important to understand, though, that not all open source projects can be turned into commercial products—and that commercialization is not every developer’s goal. A successful EU open source policy should also support the long-term sustainability of non-commercially produced open source components that benefit us all. That is why the European Commission needs to hear the full spectrum of experiences from the community—from individual maintainers, startups, companies, and researchers. Over 900 people have already shared their views, and we encourage you to join them.

The European Commission is specifically looking for responses covering these five topics:

- Strengths and weaknesses: What is standing in the way of open source adoption and sustainable open source contributions in the EU?
- Added value: How does open source benefit the public and private sectors?
- Concrete actions: What should the EU do to support open source?
- Priority areas: Which technologies (e.g., AI, IoT, or Cloud) should be the focus?
- Sector impact: In which industries (e.g., automotive or manufacturing) could open source increase competitiveness and cybersecurity?

How to Participate

The “Call for Evidence” is your opportunity to help shape the future tech policy of the EU. It only takes a few minutes to provide your perspective. Submit your feedback by February 3 (midnight CET). Your voice is essential to ensuring that the next generation of European digital policy is built with the needs of real developers in mind. At GitHub Developer Policy, we are always open to feedback from developers. 
Please do not hesitate to contact us as well. The post Help shape the future of open source in Europe appeared first on The GitHub Blog.
Read more →

Chainguard EmeritOSS backs MinIO, other orphaned projects

Open source has a problem. There are many under-supported, or even abandoned, open source programs that are still in wide use, but with no one left in the driver’s seat. To address this issue, Chainguard recently launched Chainguard EmeritOSS, a project to support these vital, but unloved, programs. After putting its support behind three different programs, the infrastructure security company is coming to the rescue of 10 more. Perhaps the chief one is MinIO, a lightweight, high-performance, open source object storage system that’s fully Amazon S3 API-compatible. In December, the maintainers put the software under maintenance-only mode, much to the consternation of the community that still used the community edition. The namesake company, previously in charge of the project, recommends the free edition (though not open source) or the commercial edition of its AIStor platform instead. Chainguard ramped up support and even offers a secure MinIO image.

Other newly supported programs

The other newly supported zombie programs include:

- Prometheus PushProx, a proxy and client solution that enables Prometheus to scrape targets even if they’re hidden behind NATs or firewalls. While PushProx still “pulls” the data in, behind the scenes it runs a tunneling proxy that “pushes” data requests to retrieve the data.
- Cassandra Exporter, a standalone metrics exporter for Apache Cassandra. This Java Virtual Machine (JVM) program retrieves Cassandra performance and usage metrics without overburdening the Cassandra NoSQL DBMS.
- A Prometheus exporter that scrapes JavaScript Object Notation (JSON) APIs and turns them into metrics using JSONPath configuration. With this useful tool you can pull in data from almost any API that understands JSON.
- A Prometheus exporter for RabbitMQ that exposes broker, queue, connection, and exchange stats via the Management API. This exporter works with legacy RabbitMQ 3.x versions. It provides extensive filtering and configuration capabilities for monitoring RabbitMQ infrastructure and is often used for message-queue monitoring and alerting.
- The Prometheus exporter for Python RQ (Redis Queue), which exposes job-queue metrics, including processing time and counts. This enables managers to monitor background workloads more effectively via an HTTP endpoint, typically “/metrics,” that Prometheus can then scrape for data.
- The Logstash filter range plugin. When I was a young Unix developer, I’d just use grep, awk, and sed, but this plugin lets you define numeric or string ranges and check whether a given field’s value falls within them without writing shell scripts by hand. Armed with this data, you tag events, drop unwanted data, apply conditional processing, and you get the idea.
- PgCat, a PostgreSQL connection pooler and proxy that supports sharding, load balancing, failover, and mirroring. It can multiplex client connections to PostgreSQL DBMSs to cut down connection overhead and reduce network latency.
- The OpenStack Velero plugin, which adds backup and restore operations to Velero for OpenStack Cinder volumes, Swift containers, and Manila shares. It provides volume snapshotting and object storage capabilities for OpenStack environments, where Velero is used to back up and restore Kubernetes clusters running on OpenStack.
- Finally, k8s-node-collector, a small utility that provides a Kubernetes node information collector to gather file system, process, and system data. It produces structured JSON output for auditing, compliance checks, or custom integrations. 
No more support Of course, what all these programs have in common is that their creators no longer support them. As Kim Lewandowski, Chainguard’s CSO and co-founder, wrote in the blog post announcing this news, “When a project no longer requires continuous upkeep or the maintainers need to step away, Chainguard EmeritOSS [steps] in.” This is a very useful service. There are far too many mission-critical open source programs that no longer have a home, and Chainguard is giving them one. As Lewandowski put it, “EmeritOSS exists for the projects that have earned their stripes. They’ve shipped, scaled, and supported real systems, and while their maintainers may be ready to step back, the software itself still has plenty of life left.” Hand off unsupported projects Indeed, they do. As Chainguard co-founder and CEO, Dan Lorenc explained in an earlier The New Stack column: “We need a way for open source maintainers to gracefully hand off ‘done’ projects even when they no longer have a significant feature roadmap. We need to offer them a place where: Mature projects can transition from individual maintainers to a trusted organization accountable for long-term stewardship. CVEs get patched continuously, even without new feature work. Reproducibility and trust remain, without weekly commits.” That place is EmeritOSS. Do you need these programs? Chainguard’s forked, stability-focused EmeritOSS versions will remain freely available on GitHub in source code. Don’t want to fuss with the code? Chainguard also offers secure, continuously maintained container images and APK packages through its commercial distributions. Are you depending on another open source program and need help? You can submit it for consideration, and Chainguard might support it via EmeritOSS. The post Chainguard EmeritOSS backs MinIO, other orphaned projects appeared first on The New Stack.
Read more →

QCon chat: Is agentic AI killing continuous integration?

In the age of AI, will we still need continuous integration (CI) at all? One panelist in a QCon AI conference panel on AI and engineering asked this perhaps deliberately provocative question: Will AI kill CI? While many at the event quickly dismissed the notion that AI could go so far as to actually eliminate CI, the question resonated in the halls of the conference, held within the scholarly confines of the New York Academy of Medicine in Manhattan’s Upper East Side. It turned out to be one of the most hotly discussed topics at the event. And many people agreed that the software development lifecycle will have to change in the era of AI.

Daniel Doubrovkine, who has worked in engineering positions at Shopify, AWS and Artsy.Net and recently took on a VP role at Microsoft, initially floated the question of whether AI would kill CI altogether during a panel. He had recently visited operations at Meta and was surprised at how few tests the company actually ran before pushing new code to production; instead, developers run many tests locally on their laptops (“Shift Left”) before pushing code. “I think AI gives us a new opportunity to rethink how we work,” he said, noting it also gives us a chance to get rid of unnecessary tasks that have built up along the way.

The pull request (PR) is the heart of a CI system, kicking off a whole series of tests on the software before it is merged into production. But “There’s no fundamental law of the universe that says that a PR review or a code review has to happen before the code is deployed,” agreed Michael Webster, principal engineer for CI/CD service provider CircleCI, in his own talk. “There are a lot of compliance tools that say that has to happen, and those are important. But this is not a fundamental fact of software delivery.”

[Slide: “It doesn’t have to be this way” — CircleCI’s Michael Webster (Google Gemini recreation of Webster’s slide).]

AI is breaking the software delivery lifecycle

We think of the development lifecycle as a linear series of discrete steps. “You push your code. You build, then you test, then you deploy,” Webster said. “That model doesn’t hold up with AI.” Webster’s own QCon talk was about how AI and agentic systems are changing the software delivery lifecycle. CircleCI is a CI/CD provider, processing over a billion customer jobs annually.

From what CircleCI is seeing within its own customer base, the software industry is on the cusp of using a lot of headless agents, which can take on long-running tasks on a schedule or be activated via webhooks. Headless agents do well at mechanical translations, once given a solid set of rules to work from. A well-structured repository is key. One project at CircleCI where agents helped was an effort to bring dark mode to the company’s own software. The design team specified the attributes required, and the agent did the laborious work of going through all the user-facing components to make the changes. “All in all, we’ve seen that this pairing of domain expertise plus AI is a really powerful organization attribute, because it allows more people to contribute,” Webster said. By Webster’s estimate, drawing on Google’s GitHub Archive for BigQuery, GitHub is now seeing hundreds of thousands of agent-related activities per week. What are they doing? Pull requests. But an AI-fueled project can create an immense amount of code, which creates its own bottleneck. “You have AIs pushing as much code as they are writing,” Webster said. 
CircleCI is also seeing this behavior with its own customers.

The problems around pull requests

On average, a code reviewer can inspect about 500 lines of code in an hour. When an agentic service can produce 1,500 lines of code every 10 minutes, there is bound to be a traffic jam. Beyond the numbers problem, pull requests are “inefficient generally,” Webster said. By many accounts, the median time a PR review team takes to review code ranges from 14 hours down to three, in cases where a single engineer relentlessly pushes one PR through. Reviewing PRs takes you out of the flow, and the information provided would have been more useful earlier in the development cycle. Persistent technical debt accumulation is also a problem with this tsunami of PRs. Headless agents working autonomously can work quickly, but also sloppily. The most recent DORA survey reports found the same: increased velocity, but more instability. In one paper, a group of researchers found that adopting an AI service, such as Cursor, can provide a temporary gain in code development, though the project’s velocity will soon be hampered by “static analysis warnings and code complexity.” And in his own calculations, Webster estimated that any gains from AI-generated code would become useless once AI becomes 75% faster overall than human coders. “If you’re not able to complement to speed up your delivery, compared to AI, it’s all going to be washed out by all of the delays in the process,” Webster said. In other words, “the reality is, even if you did have AI going as fast as you wanted to, you as an organization, and the objective that you’re trying to achieve, couldn’t go faster even if you wanted to.” There are things that you can do, such as optimizing pipelines, rewriting scripts, parallelizing tests, and improving code reviews, which will all help.

[Chart: Agentic activity on the part of CircleCI customers.]

AI-generated code requires more nimble testing

But perhaps the best answer is to rethink the testing and validation process to let agents do as much of the work as possible. “If you have a way to validate the AI, you can let it run as fast as possible,” Webster said. Develop a set of tests that assert that if the code passes the tests, it should go to production. As others have pointed out, failure is a data set that AI itself can use to fine-tune its own process. Thorough unit tests are good for this, though they are limited in scalability (to about 10x the human-driven workload, Webster estimated). A better approach is test impact analysis, which speeds testing through incremental validation, pruning tests to only what is needed, as highlighted by a dependency graph. CircleCI applied it to its own monolithic user interface application and found that it cut test timing from 30 minutes down to 1.5 minutes. “What this means is we can take an AI agent, have it work as fast as we’re willing to spend money on the tokens, and give it a tool to run only the test that it needs to run on the changes that it needs,” Webster said. Such an operation can be easily run from within a container or a laptop. The principle of selective attention can also apply to code review. “Not all code has the same level of risk,” he said. “Here is where you can prune back review to just the changes that matter.” CircleCI has built its own agent, called Chunk, for customers to run to streamline their own testing processes. 
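The test-impact-analysis idea Webster describes is easy to sketch. The toy below is my own illustration with made-up module and test names, not CircleCI’s or Chunk’s implementation: it computes each test’s transitive dependencies from a small import graph, then selects only the tests affected by a set of changed files.

```python
# Toy test impact analysis: given which modules each test (transitively)
# depends on, run only the tests affected by the changed files.
# A real system would derive the graph from imports, build metadata, or coverage.
from collections import deque

# module -> modules it imports (hypothetical project layout)
deps = {
    "checkout": ["cart", "payments"],
    "cart": ["models"],
    "payments": ["models", "http_client"],
    "reports": ["models"],
}
# test file -> modules it imports directly
tests = {
    "test_checkout": ["checkout"],
    "test_cart": ["cart"],
    "test_reports": ["reports"],
}

def transitive_deps(module: str) -> set[str]:
    # Breadth-first walk of the dependency graph, including the module itself.
    seen, queue = set(), deque([module])
    while queue:
        m = queue.popleft()
        for dep in deps.get(m, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen | {module}

def impacted_tests(changed: set[str]) -> set[str]:
    return {t for t, mods in tests.items()
            if any(changed & transitive_deps(m) for m in mods)}

print(impacted_tests({"models"}))       # {'test_checkout', 'test_cart', 'test_reports'}
print(impacted_tests({"http_client"}))  # {'test_checkout'}
```

The same selection step can run inside a container or on a laptop, which is what makes it usable as a fast validation gate for agent-generated changes.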
Future build systems will be less linear

Future engineers will be worrying less about the code and more about supporting the AI in its relentless pursuit of generating more code, Webster predicted. So tasks like fixing flaky tests will become the first priority, and can be automated as well. Instead of this linear process, we will need to build systems where all the required tests take place somewhere in the process. “Instead of having a linear Yes/No, we combine these things into a single gate, where all we do is keep track of what has occurred,” Webster said. If a test passes, the code should be moved to production. “Everything else besides that is us concerned about other things.” With AI, “more effort and energy is likely going to be spent in this testing and evaluation, [and] less so [on] thinking about the specific designs of low-level details of our services.” Full access to these QCon AI talks, and others, can now be procured through a video-only pass. The post QCon chat: Is agentic AI killing continuous integration? appeared first on The New Stack.
Read more →

Ask HN: Notification Overload

Comments
Read more →

There’s a Hidden Preference to Auto-Resize Columns in the Finder on MacOS 14 and 15

Good tip from “DifferentDan” on the Realmac customer forum, posted back in November: I saw on macOS Tahoe 26.1, Apple finally added an option in the Column View settings to automatically right size all columns individually and that setting would persist, but I don’t really like Liquid Glass (yet) so I haven’t updated to Tahoe. Looks like someone found a workaround however for those that are still on Sequoia. Just open up Terminal on your Mac, copy in the below, and press return. The one-line command: defaults write com.apple.finder _FXEnableColumnAutoSizing -bool YES; killall Finder (Change YES to NO if you want to go back.) Marcel Bresink’s TinkerTool is a great free app for adjusting hidden preferences using a proper GUI, and it turns out TinkerTool has exposed this hidden Finder preference for a few years now. You learn something every day. I enabled this a few days ago on MacOS 15 Sequoia, and it seems exactly like the implementation Apple has exposed in the Finder’s View Options window in Tahoe, which I wrote about Friday. No better, no worse. ★
Read more →

Nvidia Set to Supplant Apple as TSMC’s Largest Customer

Kif Leswing, CNBC: Nvidia will become TSMC’s largest customer this year, according to analyst estimates and Huang himself. Apple is believed to currently be TSMC’s largest customer, mostly to manufacture A-series chips for iPhones and M-series chips for PCs and servers. The positional swap will mark a fundamental shift in the semiconductor industry, reflecting Nvidia’s growing importance amid the artificial intelligence infrastructure build-out. [...] Ben Bajarin, principal analyst at Creative Strategies, said he projects Nvidia to generate $33 billion in TSMC revenue this year, or about 22% of the chip foundry’s total. Apple, by comparison, is projected to generate about $27 billion, or about 18% of TSMC’s revenue. ★
Read more →

[Sponsor] WorkOS Pipes: Ship Third-Party Integrations Without Rebuilding OAuth

Connecting user accounts to third-party APIs always comes with the same plumbing: OAuth flows, token storage, refresh logic, and provider-specific quirks. WorkOS Pipes removes that overhead. Users connect services like GitHub, Slack, Google, Salesforce, and other supported providers through a drop-in widget. Your backend requests a valid access token from the Pipes API when needed, while Pipes handles credential storage and token refresh. Simplify integrations with WorkOS Pipes. ★
Read more →

Airlines That Support Shared Item Location for Luggage With AirTags

Joe Rossignol, writing at MacRumors: Apple offers a Share Item Location feature in the Find My app that allows you to temporarily share the location of an AirTag-equipped item with others, including employees at participating airlines. This way, if you put an AirTag inside your bags, the airline can better help you find them in the event they are lost or delayed at the airport. [...] Below, we have listed most of the airlines that support the feature. Apple’s announcement claims that 36 airlines support it today, and 15 more are coming soon. ★
Read more →

Apple Introduces Second-Generation AirTags

Apple Newsroom: Apple’s second-generation Ultra Wideband chip — the same chip found in the iPhone 17 lineup, iPhone Air, Apple Watch Ultra 3, and Apple Watch Series 11 — powers the new AirTag, making it easier to locate than ever before. Using haptic, visual, and audio feedback, Precision Finding guides users to their lost items from up to 50 percent farther away than the previous generation. And an upgraded Bluetooth chip expands the range at which items can be located. For the first time, users can use Precision Finding on Apple Watch Series 9 or later, or Apple Watch Ultra 2 or later, to find their AirTag, bringing a powerful experience to the wrist. Solid update to the original AirTags, which debuted five years ago. Better range, louder speaker, increased precision. The form factor remains unchanged, so second-gen AirTags will fit in keychains or holders designed for the first-gen model. They even take the same batteries. Pricing also remains unchanged: $29 for one, $99 for a four-pack. ★
Read more →

★ App Store 2025 Top iPhone Apps in the U.S.

I’ve been meaning since last month to link to Apple’s lists of the top iPhone apps in the U.S. for 2025. Here’s the list of the top 20 free iPhone apps:

1. ChatGPT
2. Threads
3. Google
4. TikTok — Videos, Shop & LIVE
5. WhatsApp Messenger
6. Instagram
7. YouTube
8. Google Maps
9. Gmail — Email by Google
10. Google Gemini
11. Facebook
12. CapCut: Photo & Video Editor
13. Temu: Shop Like a Billionaire
14. T-Life [“All things T-Mobile”]
15. Telegram Messenger
16. Lemon8 — Lifestyle Community
17. Spotify: Music and Podcasts
18. Google Chrome
19. Snapchat
20. rednote

All app names are verbatim, except for T-Life, where I put the app’s secondary slogan in brackets. I had no idea what T-Life was, but the slogan makes it clear. Interesting to me that T-Mobile’s app is on the list but neither Verizon nor AT&T’s are.1 I hope a million people sent this list to Elon Musk, to rub some salt in his severe case of butt hurt that led him to file an almost certainly baseless lawsuit in August alleging that ChatGPT consistently tops the App Store list — and Grok does not — because Apple puts a thumb on the scale for these rankings because of its deal with OpenAI to integrate ChatGPT with Apple Intelligence. Here’s the thing. Dishonest people presume the whole world is dishonest. That you either cheat and steal, or you’re going to be cheated and robbed. If Elon Musk ran the App Store, you can be sure that he’d cook the rankings to put apps that he owns, or even just favors, on top. Elon Musk runs Twitter/X, and that’s how the algorithm there now works: it favors content he prefers, especially his own tweets. Apple doesn’t publish how its lists for top apps are computed (to keep the rankings from being gamed more than they already inevitably are), but judging by how many of these apps come from Apple’s rivals (e.g., Spotify), there’s little reason to think they’re crooked — unless you think the entire world is crooked. Google has 6 apps on the list, including 5 in the top 10. Meta — certainly no friend of Apple — has 4 apps on the list, including 3 in the top 10. (Slightly interesting, but unsurprising, sign of the times: the Facebook “blue app” dropped out of the top 10.) The only apps in the top 10 not from Google or Meta are ChatGPT (#1) and TikTok (#4). Microsoft has no apps on the list. Back in the day, the conventional wisdom was that Microsoft made more money, on average, from each Mac sold than they did from each PC sold — despite the fact that nearly all PCs came with a licensed version of Windows — because so many Mac users paid for Microsoft Office at retail prices. I suspect something like that is true with iPhones for Google. A lot of iPhone users spend a lot of time using apps from Google. I would bet that Google makes more ad revenue from the average iPhone user (who, even if they don’t install a single one of Google’s native iOS apps, probably uses Google Search in Safari) than from the average Android user. Another company that has no apps on this list is Apple itself. If you look at the daily top list of apps in the Productivity category, you will see a lot of apps from Google and Microsoft. But you won’t find Keynote, Pages, or Numbers, because Apple recuses its own apps from such rankings. 
Here’s the list of the top 20 paid iPhone apps in 2025 in the U.S.:

1. HotSchedules
2. Shadowrocket
3. Procreate Pocket
4. AnkiMobile Flashcards
5. Paprika Recipe Manager 3
6. SkyView®
7. TonalEnergy Tuner & Metronome
8. AutoSleep Track Sleep on Watch
9. Forest: Focus for Productivity
10. RadarScope
11. Monash FODMAP Diet
12. Merge Watermelon for watch
13. Streaks
14. Wipr 2
15. µBrowser: Watch Web Browser
16. PeakFinder
17. Threema. The Secure Messenger
18. Things 3
19. Goblin Tools
20. ¡Verify Basic

There are a couple of real gems on this list — Procreate, Paprika, Streaks (multi-time DF sponsor), and Things are all apps that I use, or have used, and would recommend. But unlike the list of top free apps, where I’d at least heard of all of them (once I figured out what T-Life was), I have never even heard of most of these paid iPhone apps. Household names these are not. The market for paid apps isn’t just different from the market for free apps. It’s an entirely different world. This, in turn, made me wonder what the subscriber-count standings look like. I assumed T-Mobile was still in third place, but that assumption was wrong. According to Wikipedia, here are the number of U.S. subscribers per carrier as of Q3 2025:

Verizon — 146 million
T-Mobile — 140 million
AT&T — 119 million
Boost — 8 million

I’m a Verizon man myself, and pay handsomely for it. I don’t even remember why exactly, but I despised AT&T back when they were the exclusive U.S. carrier for the iPhone. ↩︎
Read more →

From the DF Archive: ‘Untitled Document Syndrome’

Yours truly back in 2009, hitting upon the same themes from the item I just posted about TextEdit vs. Apple Notes: This, I think, explains the relative popularity of Mac OS X’s included Stickies application. For years, Stickies’s popularity confounded me. Why would anyone use a note-taking utility that requires you to leave every saved note open in its own window on screen? The more you use it, the more cluttered it gets. But here’s the thing: cluttered though it may be, you never have to save anything in Stickies. Switch to Stickies, Command-N, type your new note, and you’re done. (And, yes, if you create a new sticky note, then force-quit Stickies, the note you just created will be there when next you launch the app. Stickies’s auto-save happens while you type, not just at quit time.) It feels easy and it feels safe. Stickies does not offer a good long-term storage design, but it offers a frictionless short-term jot-something-down-right-now design. Here we are in 2026, 17 years later, and, unsurprisingly, some things have changed. Apple Notes didn’t get a Mac version until Mac OS X 10.8 Mountain Lion in 2012. And Apple Notes didn’t really get good until 2016 or 2017. I still use Yojimbo, the library-based Mac app I wrote about in the above piece in 2009, but I don’t use it nearly as much as I used to. I use Apple Notes instead, for most notes, because it has good clients for iPhone and iPad (and Vision Pro and even Apple Watch). Other things, however, have not changed since 2009. Like the Stickies app, which is still around in MacOS 26 Tahoe, largely unchanged, except for a sad Liquid Glass-style icon. If you still use Stickies, you should consider moving to Apple Notes. There’s even a command (File → Export All to Notes...) to import all your notes from Stickies into Apple Notes, with subfolders in Notes for each color sticky note. Apple Notes on the Mac even supports one of Stickies’s signature features: the Window → Float on Top command will keep a note’s window floating atop the windows from other apps even when Apple Notes is in the background. (Stickies has another cool feature that no other current app I know of does: it still supports “window shading”. Double-click the title bar of a note in Stickies and the rest of the window will “roll up”, leaving only the title bar behind. Double-click again and it rolls down. This was a built-in feature for all windows in all apps on classic Mac OS, starting with Mac OS 8, but was replaced in favor of minimizing windows into the Dock with Mac OS X. Window shading was a better feature (and could have been kept alongside minimizing into the Dock). With the Stickies app, window shading works particularly well with the aforementioned Float on Top feature — you can keep a floating window available, atop all other windows, but while it’s rolled up it hardly takes up any space or obscures anything underneath.) ★
Read more →

‘TextEdit and the Relief of Simple Software’

Perhaps at the opposite end of the complexity and novelty spectrum from Federico Viticci’s intro to Clawdbot is this piece by Kyle Chayka, writing at The New Yorker, from October: Amid the accelerating automation of our computers — and the proliferation of assistants and companions and agents designed to execute tasks for us — I’ve been thinking more about the desktop that’s hidden in the background of the laptop I use every day. Mine is strewn with screenshots and Word documents and e-books. What I’ve accrued the most of by far, though, are TextEdit files, from the bare-bones Mac app that just lets you type stuff into a blank window. Apple computers have come with text-editing software since the original Mac was released, in 1984; the current iteration of the program launched in the mid-nineties and has survived relatively unchanged. Over the past few years, I’ve found myself relying on TextEdit more as every other app has grown more complicated, adding cloud uploads, collaborative editing, and now generative A.I. TextEdit is not connected to the internet, like Google Docs. It is not part of a larger suite of workplace software, like Microsoft Word. You can write in TextEdit, and you can format your writing with a bare minimum of fonts and styling. Those files are stored as RTFs (short for rich-text format), one step up from the most basic TXT file. TextEdit now functions as my to-do-list app, my e-mail drafting window, my personal calendar, and my stash of notes to self, which act like digital Post-its. I trust in TextEdit. It doesn’t redesign its interface without warning, the way Spotify does; it doesn’t hawk new features, and it doesn’t demand I update the app every other week, as Google Chrome does. I’ve tried out other software for keeping track of my random thoughts and ideas in progress — the personal note-storage app Evernote; the task-management board Trello; the collaborative digital workspace Notion, which can store and share company information. Each encourages you to adapt to a certain philosophy of organization, with its own formats and filing systems. But nothing has served me better than the brute simplicity of TextEdit, which doesn’t try to help you at all with the process of thinking. Using the app is the closest you can get to writing longhand on a screen. I could make lists on actual paper, of course, but I’ve also found that my brain has been so irredeemably warped by keyboards that I can only really get my thoughts down by typing. Old habits are hard to break. And trust me, I, of all people, know the value of writing stuff — all sorts of stuff — in plain text files. (RTF isn’t plain text, but it is a stable and standard format.) I’ve been using BBEdit since 1992, not just as an occasional utility, but as part of my daily arsenal of essential tools. But I get the feeling that Chayka would be better served switching from TextEdit to Apple Notes for most of these things he’s creating. Saving a whole pile of notes to yourself as text files on your desktop, with no organization into sub-folders, isn’t wrong. The whole point of “just put it on the desktop” is to absolve yourself of thinking about where to file something properly. That’s friction, and if you face a bit of friction every time you want to jot something down, it increases the likelihood that you won’t jot it down because you didn’t want to deal with the friction. You actually don’t need to save or name documents in TextEdit anymore. 
One of the best changes to MacOS in the last two decades has been the persistence of open document windows, including unsaved changes to existing files, and never-saved untitled document windows. Try this: open TextEdit, make a new untitled document, and type something — anything — into the new window. Next, don’t just quit TextEdit, but force quit it (⌥⌘Esc). Relaunch TextEdit, and your unsaved new document should be right where you left it, with every character you typed. But a big pile of unorganized RTF files on your desktop — or a big pile of unsaved document windows that remain open, in perpetuity, in TextEdit — is no way to live. You can use TextEdit like that, it supports being used like that, but it wasn’t designed to be used like that. Apple Notes was designed to be used like this. Open Notes, ⌘N, type whatever you want, and switch back to whatever you were doing before. There is no Save command. There are no files. And while a few dozen text files on your desktop starts to look messy, and makes individual items hard to find, you can stash thousands of notes in Apple Notes and they just organize themselves into a simple list, sorted, by default, by most recently modified. You can create folders and assign tags in Notes, but you don’t need to. Don’t make busy work for yourself. And with iCloud, you get fast reliable syncing of all your notes to all of your other Apple devices: iPhone, iPad, Vision Pro, even your Watch now. Sometimes you just want to stick with what you’re used to. I get it. I am, very much, a creature of habit. And TextEdit is comforting for its simplicity, reliability, and unchanging consistency spanning literally decades. But there’s no question in my mind that nearly everyone using TextEdit as a personal notes system would be better served — and happier, once they adjust to the change — by switching to Apple Notes. ★
Read more →

Federico Viticci on Clawdbot

Federico Viticci, writing at MacStories: If this intro just gave you whiplash, imagine my reaction when I first started playing around with Clawdbot, the incredible open-source project by Peter Steinberger (a name that should be familiar to longtime MacStories readers) that’s become very popular in certain AI communities over the past few weeks. I kept seeing Clawdbot being mentioned by people I follow; eventually, I gave in to peer pressure, followed the instructions provided by the funny crustacean mascot on the app’s website, installed Clawdbot on my new M4 Mac mini (which is not my main production machine), and connected it to Telegram. To say that Clawdbot has fundamentally altered my perspective of what it means to have an intelligent, personal AI assistant in 2026 would be an understatement. I’ve been playing around with Clawdbot so much, I’ve burned through 180 million tokens on the Anthropic API (yikes), and I’ve had fewer and fewer conversations with the “regular” Claude and ChatGPT apps in the process. Don’t get me wrong: Clawdbot is a nerdy project, a tinkerer’s laboratory that is not poised to overtake the popularity of consumer LLMs any time soon. Still, Clawdbot points at a fascinating future for digital assistants, and it’s exactly the kind of bleeding-edge project that MacStories readers will appreciate. Clawdbot can be overwhelming at first, so I’ll try my best to explain what it is and why it’s so exciting and fun to play around with. Overwhelming indeed. Clawdbot is undeniably impressive, and interest in it is skyrocketing. But because of its complexity and scope, it’s one of those things where all the excitement is being registered by people who already understand it. This essay from Viticci is the first thing I’ve seen that really helped me start to understand it. ★
Read more →

BellSoft bets Java expertise can beat hardened container wave

The hardened container market has been heating up with venture money and startups, but Java platform provider BellSoft thinks its eight years of building Java runtimes gives it something others don’t have: expertise in what’s actually running inside those secured containers. The company used the KubeCon conference in Atlanta last November to launch its BellSoft Hardened Images, betting it can stand out in a space where Chainguard pioneered the approach, and startups are now piling in. BellSoft’s angle is that it’s not just wrapping containers in security — it’s optimizing the Java workloads themselves. “The market for containers is emerging,” Alexander Belokrylov, co-founder and CEO of BellSoft, tells The New Stack in an interview. “I see how much money venture investors put in, and it looks like they feel that it has potential.” Belokrylov says the problem BellSoft is addressing is real. When development teams use base images, they often inherit a large attack surface, including unnecessary packages, shells, compilers, package managers, and unused libraries that may contain known vulnerabilities that haven’t been addressed, according to Janet Costello Worthington, a Forrester Research analyst who covers security. “This can lead to patching chaos, emergency rebuilds, or even production failures,” she says. “Hardened containers strip away these unnecessary components, reducing the risk of exploits and simplifying container management.” This all comes as Java faces a particular vulnerability problem: 44% of Java services contain known exploited vulnerabilities, compared to 5% for Go and just 2% for other languages. Typical container images carry 600 known vulnerabilities, nearly half of them years old. Two-thirds of organizations had a container security incident in the past year. What sets BellSoft apart BellSoft argues its differentiator is not just building secure containers — it’s understanding what goes inside them. The company ranks among the top five OpenJDK contributors. “Our differentiator is a deep technical expertise in the technologies we provide,” Belokrylov says. “We are not just the experts in building software; we’re experts in these kinds of projects.” That expertise started with Alpaquita Linux three years ago, an Alpine-like OS that began as a Java optimization project. “Originally, our idea was to optimize Linux to run Java workloads,” Belokrylov said. “However, it appeared that Linux optimized for Java workloads, optimized pretty much for everything.” Now BellSoft supports hardened images for .NET, C/C++, JavaScript, Python, and Go — all with near-zero common vulnerabilities and exposures (CVEs) and technical support. The company claims 95% fewer vulnerabilities than standard Java images and up to 30% resource savings with its Liberica JDK Lite. According to Costello Worthington, vendors that provide hardened container images deliver value by addressing key security and operational challenges. “These images come with less bloat, fewer inherited vulnerabilities, secure configuration defaults, and a smaller attack surface,” she says. Hardened images also offer essential transparency through provenance, attestations, and software bills of materials that detail what’s inside. Crowded field At KubeCon, Belokrylov says he saw plenty of competition. Chainguard has done “a very good job” pioneering hardened containers, but new players are emerging. “There were a number of startups who were making more or less the same, however they are making that from scratch,” he says. 
Dan Lorenc, CEO of Chainguard, acknowledges the sudden rush into the space. “It’s kind of baffling to watch, in some ways, how crowded the space has gotten in the last year,” he says in an interview with The New Stack. “We started doing it three years ago now, because there was clearly a need.” But Lorenc sees the proliferation of hardened container offerings as a symptom of a deeper issue. “The software supply chain is broken, and the recent explosion of hardened container offerings is the industry’s reaction,” he writes in an article in The New Stack. “The industry has responded by tightening inspection at the end of the assembly line (more checks, more scanners) while largely ignoring how the parts get sourced, assembled, and verified upstream.” In the article, Lorenc also writes, “The real issue is about trusting where software comes from, and why building open source software directly from source is the only way to secure the entire software supply chain.” The hardened container market now includes, in addition to BellSoft, established players like Chainguard, Docker, Red Hat, VMware; cloud providers like AWS, Azure, and Google Cloud Platform; startups like RapidFort, Wiz, Edera, Lineaje, Minimus; and others. The opportunity exists because enterprises now run security scanners on everything. “They are not blind now when they’re accepting software,” Belokrylov says. “They’re asking vendors to provide them the software with the limited number of CVEs.” BellSoft wants to handle that base layer. “The idea is here that there is a vendor like BellSoft who actually took care of the significant part of the software delivery package, like base images, and keeps them up to date and zero CVEs,” Belokrylov says. Developers can focus on their applications while BellSoft maintains the foundation, he says. For enterprises, Costello Worthington notes that customers who leverage hardened container images often find it easier to meet compliance requirements and streamline the process for achieving FedRamp authorization. “Providing development teams with a curated baseline of images ensures development can focus on roadmap features and functionality for the business, while making it easier to meet compliance requirements, reduce vulnerabilities, and accelerate development velocity,” she says. Technical approach The hardened images strip out package managers and nonessential components with locked configurations that can’t be changed at runtime. Alpaquita Linux supports both musl and glibc, letting teams migrate without rewriting code. Unlike competitors waiting for upstream patches, BellSoft writes its own when needed — a capability that comes from actively contributing to OpenJDK and GraalVM. The company also sells Liberica JDK Performance Edition, which backports the modern JVM from Java 25 into older versions. “Applications written for JDK 8 API, specifically, they perform as if they were migrated to the most modern Java version without any line of code change,” Belokrylov says. “That’s a killer feature for companies who still run Java 8 applications, specifically in the cloud.” Tiers available BellSoft offers a free Community tier with hardened containers for JDK 21 and 25+. A Standard tier covers all JDK versions plus GraalVM, Go, Python, C, and Alpaquita base with a 7-day CVE remediation SLA. Premium adds support and performance consulting. The post BellSoft bets Java expertise can beat hardened container wave appeared first on The New Stack.
Read more →

AI DevOps vs. SRE agents: Compare AI incident response tools

If you’ve seen the new crop of data about ops lately, you may have noticed a new category coming up: AI agents that promise to take charge of incident intervention, diagnose the root cause, and even solve problems themselves. AWS announced one. Microsoft has one. A dozen startups are creating them. And the terminology is, to put it mildly, inconsistent: AI DevOps engineer, AI site reliability engineering (SRE) agent, AIOps platform. Are these the same thing? Different things? Marketing fluff? I’ve spent time digging into what these systems actually do, how they differ, and what matters when you’re evaluating them. Here’s what I’ve learned.

Why this category exists now

Let’s start with this: Ops teams are drowning. The complexity of microservices architectures has skyrocketed. A single user request can touch 15 services across three clouds. When it is 2 a.m. and something breaks, you are looking at dashboards from six different tools and matching logs, metrics, and traces, trying to connect them all while Slack blows up with “Is the site down?” messages. Traditional monitoring lets you see what is really happening. It doesn’t tell you why or how to act. That gap — between sight and action — is where AI operations agents live. The pitch is clear: Rather than spending 45 minutes figuring out why things went wrong, an AI agent builds the connection in minutes, uncovers the likely root cause, and proposes a fix. Some go a step further and implement the fix with your consent.

What these agents actually do

Strip away the marketing, and AI DevOps engineers share a playbook. They connect to your observability stack — Datadog, Splunk, CloudWatch, and whatever you’re running — and consume telemetry. They integrate with your CI/CD pipelines and source control software, so they know which code has just shipped. They hook into ticketing systems like PagerDuty or ServiceNow to see incident history. When something goes wrong, they correlate signals across these systems and build a timeline: this deployment happened, latency started to rise, errors started to occur, then this downstream service began to fail. They map your infrastructure topology to interpret service dependencies and understand failures down the call chain. The better ones learn from past incidents. They identify patterns: “The last time we saw this error signature, the root cause was a misconfigured environment variable.” They surface that context so you can fix issues sooner. Some agents remain advisory — they investigate and recommend action items, but a human pulls the trigger. Others push toward automation, executing remediation workflows with appropriate guardrails.

AI DevOps engineer vs. AI SRE agent

The main distinction is marketing and the scope of work. SRE is all about reliability, availability, and error budgets. DevOps looks at the broader delivery life cycle. In reality, most AI operations agents cover both. They manage incidents — SRE territory — and improve pipelines or build Infrastructure as Code (IaC) — DevOps territory. The underlying tech is the same: machine learning (ML) models trained on operational data, natural language interfaces, and integration frameworks that plug into your toolchain. Don’t worry about what the vendor calls its product. Evaluate what it actually does. 
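The correlation step described above is simple to illustrate. The sketch below is my own toy example, not any vendor’s agent: it lines up deploy events and error-rate samples on one timeline and flags the most recent deployment that preceded an error spike, which is roughly the “this deployment happened, then errors started” reasoning these agents automate.

```python
# Toy incident correlation: detect the first error spike and point at the most
# recent deploy before it. Timestamps are minutes for simplicity; data is made up.
from dataclasses import dataclass

@dataclass
class Deploy:
    minute: int
    service: str
    version: str

error_rate = {0: 0.2, 5: 0.3, 10: 0.4, 15: 6.5, 20: 7.1}  # percent errors per sample
deploys = [Deploy(2, "inventory", "v41"), Deploy(12, "checkout", "v87")]

def first_spike(samples: dict[int, float], threshold: float = 5.0) -> int | None:
    # Return the earliest sample time where the error rate crosses the threshold.
    for minute in sorted(samples):
        if samples[minute] >= threshold:
            return minute
    return None

def suspect_deploy(samples: dict[int, float], events: list[Deploy]) -> Deploy | None:
    spike = first_spike(samples)
    if spike is None:
        return None
    before = [d for d in events if d.minute <= spike]
    return max(before, key=lambda d: d.minute) if before else None

culprit = suspect_deploy(error_rate, deploys)
print(f"Error spike likely follows {culprit.service} {culprit.version}")
# -> Error spike likely follows checkout v87
```

Real agents do this across richer signals (traces, topology, past incidents), but the shape of the reasoning is the same.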
How cloud providers are responding to AI ops AWS DevOps Agent, which launched in preview late last year, is worth understanding because it illustrates how cloud providers think about this problem. AWS built an agent that correlates data across CloudWatch, third-party monitoring tools, and CI/CD systems. It maps your infrastructure topology, tracks deployments, and generates recommendations when incidents occur. It integrates with ticketing systems to respond automatically when alerts fire. The agent is genuinely useful for investigation. It understands AWS resources deeply — EC2 instances, Lambda functions, EKS clusters, the whole catalog — and can trace relationships between them. But there is a catch: AWS thinks in terms of resources, not applications. It knows you have a Kubernetes cluster with certain pods. It doesn’t inherently know that those pods constitute your checkout service, which is distinct from your inventory service, which has different owners and different risk tolerances. This resource-centric view shapes what the agent can safely do. Without guaranteed application boundaries, automated remediation carries risk. What if scaling one service cascades into another? What if a rollback affects components you didn’t intend to touch? That’s why AWS DevOps Agent emphasizes investigation and recommendation over automated action. It’s a deliberate design choice, not a limitation. Microsoft’s Azure SRE Agent takes a similar approach. The true differentiator: Application context Here’s what I’ve come to think matters most: the degree of abstraction at which an agent operates. Agents acting at the infrastructure level know what resources they own and the relationships among them. They’re good at answering “What’s happening?” but have to be far more careful about “What should we do about it?” Some platforms also offer explicit application boundaries as a first-class concept. If an agent knows that these containers, this database, and these queues together form one application with defined ownership, acting on them becomes easier; it can readily scope its actions. Rollbacks remain within safe limits. Scaling decisions don’t bleed into unrelated services. This explains the range from advisory to automated. Without context, automation is dangerous; with it, the picture reverses: context creates clear boundaries and allows agents to act with confidence. What engineers should consider when evaluating agents If you’re evaluating AI operations agents, here’s what I’d think about: Start with investigation, not automation. Let the agent prove it understands your environment before you give it permissions to change anything. Build trust incrementally. Context quality matters enormously. These agents are only as good as the data and structure they have access to. Well-tagged resources, clear service ownership, and explicit application boundaries make agents dramatically more effective. Integration depth varies wildly. Some agents have deep, bidirectional integrations with popular tools. Others have shallow connections that limit what they can see and do. Ask hard questions about how an agent works with your specific stack. This doesn’t replace expertise. AI agents amplify engineering capability. They don’t substitute for understanding your systems, making judgment calls, or designing for reliability. Treat them as force multipliers, not replacements. Where this is headed The category is maturing fast. 
There is competition from cloud providers, observability vendors, and focused startups, and the result is rapid innovation and falling prices. Yes, the opportunity for engineering teams is real. A good agent reduces the average time to resolution, reduces the on-call load, and allows engineers to focus on building resilient systems rather than fighting fires. But the hype is also real. Assess agents not on their slide decks, but on how they actually behave in your environment. The teams that experiment with these tools thoughtfully now will be best placed as they become a standard part of the operations stack. At DuploCloud, we’re actively building AI agents designed to execute real DevOps and cloud operations workflows. In our sandbox, you can interact with purpose-built agents that operate across cloud infrastructure, Kubernetes management, and observability — running inside real environments to diagnose issues, apply changes, and automate day-to-day operations. The post AI DevOps vs. SRE agents: Compare AI incident response tools appeared first on The New Stack.
Read more →

Async Rust: Pinning demystified

This is the second of a four-part series. Read Part 1: How Rust does Async differently (and why it matters) In the previous part of this series, we explored the “pull-based” model of Rust’s asynchronous engine. We saw how the compiler transforms async functions into lazy state machines that only make progress when polled by an executor. However, if you looked closely at the poll method signature we implemented for our CountdownFuture, you might have noticed a peculiar wrapper around self: fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> In Part I, we focused on the logic — how a future decides whether it is Ready or Pending. In this post, we are shifting our focus to physics. Pin is often the most intimidating topic for those learning async Rust, yet it is the critical “secret sauce” that makes zero-cost futures possible. It ensures that when our state machine is halfway through an operation, it doesn’t suddenly move to a new location in memory and break every internal reference it holds. 1. The problem: Moving the unmovable In Rust, every type is movable by default. Whether you pass a variable into a function or assign it to a new name, Rust performs a bitwise copy (memcpy). For 99% of types, this is efficient and safe. But for self-referential structs, it is a disaster. Imagine a struct where one field is a pointer to another field within itself: struct SelfReferential { data: String, pointer_to_data: *const String, } If you move this struct to a new memory address, data moves with it. However, pointer_to_data still contains the old memory address. It is now a dangling pointer. Accessing it will cause undefined behavior. The async connection: A concrete example To understand why this matters for async Rust, we have to look at how the compiler treats an .await point. When you write an async function, the compiler transforms it into a struct that stores the “captured” state of your function. Consider this innocent-looking code: async fn process_data() { let val = String::from("Hello"); let val_ref = &val; // A reference to a local variable some_async_operation().await; // The function "pauses" here println!("{}", val_ref); // The reference is used after the pause } The ‘lowered’ state machine Internally, the compiler generates a struct to hold those variables so they survive while the function is paused. It looks roughly like this: struct ProcessDataFuture { val: String, val_ref: *const String, // Points to 'val' inside this same struct! state: State, } The memory disaster (the ‘move’) This is where the physical location of your data becomes critical. Let’s look at what happens in memory if we move this future after it has started. Before move (At the .await point): The future is located at address 0x1000. val (the string) is at address 0x1008. val_ref (the pointer) correctly stores the value 0x1008. The move: You move the future (perhaps by pushing it into a Vec or moving it to another thread). The future is now at address 0x2000. val has moved with the struct and is now at address 0x2008. The crash: val_ref still stores the value 0x1008. When the executor resumes the future and tries to use val_ref, it reaches back to address 0x1008, which is now garbage memory. Boom. How Pin saves the day When the executor polls this future, it doesn’t just take a normal reference; it requires a Pin<&mut Self>. By pinning the ProcessDataFuture, we are effectively telling the compiler: “This struct is now anchored at address 0x1000. 
It is illegal to move it until it is finished.“ Because the struct is guaranteed to stay at 0x1000, the internal pointer val_ref (pointing to 0x1008) remains valid for the entire life of the operation. This is the only way Rust can safely allow you to have references to local variables across .await points. 2. What Pin<P> actually is A common misconception is that Pin is a new pointer type. It isn’t. Pin is a wrapper around an existing pointer (like &mut T or Box<T>). It acts as a legal contract with the compiler: “The data pointed to by this pointer will never be moved again until its drop method is called.” The anatomy: You can move the Pin wrapper itself (such as swapping two Pin<Box<T>> variables), but you cannot move the T sitting inside it. Stability: Think of it like a foundation. You can’t move a house once the foundation is poured; you can only demolish it (Drop). 3. The Unpin marker trait Why does Pin<&mut i32> still allow you to move the integer? This is because of the Unpin trait. Auto-implemented: Almost every type in Rust (i32, String, Vec) automatically implements Unpin. These types are “safe” to move even if they are wrapped in a Pin. The role of !Unpin: Types that are not safe to move (like self-referential structs or compiler-generated futures) are marked as !Unpin. The distinction: If T: Unpin, then Pin<P<T>> behaves exactly like a normal pointer. The pinning logic only “activates” when the underlying type is !Unpin. 4. Stack pinning vs. heap pinning You have two main ways to anchor a value in memory, each with different trade-offs: Heap pinning (Box::pin) This is the “safe and easy” route. When you use Box::pin(value), the data is moved onto the heap. Since heap allocations have a stable address for their entire lifetime, pinning is trivial. Pros: Easy to use, no unsafe required. Cons: Requires a heap allocation (performance cost). Stack pinning (pin!) You can pin a value to the current stack frame using the std::pin::pin! macro. Pros: Zero-cost, no heap allocation. Cons: The pinned value cannot outlive the current function. It is much more restrictive than heap pinning. 5. Modern tooling: The pin-project crate Manually accessing fields of a pinned struct (called Pin Projection) is notoriously difficult to do safely because it often requires unsafe code. The industry standard is to use the pin-project crate. It allows you to safely “project” a pinned reference from a struct down to its individual fields without writing a single line of unsafe code: Practical example: The retryable future Here is how you implement a wrapper that retries a failing future up to a certain limit. Note how #[pin] allows us to safely handle the inner future even if it’s !Unpin. use std::pin::Pin; use std::task::{Context, Poll}; use std::future::Future; use pin_project::pin_project; #[pin_project] pub struct Retry<F, Fut> { // A factory function to create a new instance of the future for each retry factory: F, // The current future attempt we are polling #[pin] active_fut: Fut, retries_left: usize, } impl<F, Fut, T, E> Future for Retry<F, Fut> where F: Fn() -> Fut, Fut: Future<Output = Result<T, E>>, { type Output = Result<T, E>; fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> { let mut this = self.project(); match this.active_fut.as_mut().poll(cx) { // If it succeeded, or we're out of retries, return the result Poll::Ready(Ok(val)) => Poll::Ready(Ok(val)), Poll::Ready(Err(e)) => { if *this.retries_left > 0 { *this.retries_left -= 1; println!("Future failed. 
Retries remaining: {}", this.retries_left); // Reset the state: Create a new future and poll it let new_fut = (this.factory)(); this.active_fut.set(new_fut); // We must poll again to register the new waker cx.waker().wake_by_ref(); Poll::Pending } else { Poll::Ready(Err(e)) } } Poll::Pending => Poll::Pending, } } }
6. The Pin cheat sheet
Type property       | Wrapped in Pin? | Can it move?
Unpin (e.g. i32)    | No              | Yes
Unpin (e.g. i32)    | Yes             | Yes (Pin is ignored)
!Unpin (Self-ref)   | No              | Yes (Danger! ⚠️)
!Unpin (Self-ref)   | Yes             | No (Safe ✅)
Conclusion Pin is the invisible anchor that allows Rust’s async engine to be both safe and zero-cost. While it feels like a complex academic concept at first, it boils down to one simple rule: If data points to itself, it must stay put. By understanding the relationship between Pin, Unpin, and memory addresses, you are now equipped to handle complex async state machines and custom futures with confidence. What’s next: Building the engine Now that we understand the logic (Part I: state machines) and the physics (Part II: Pinning), it’s time to actually run our code. A future is just a dormant piece of data sitting in memory; it doesn’t do anything on its own. It needs an engine to drive it. In Part III, we will build a custom async runtime from scratch. We will explore: The executor: The loop that orchestrates polling. The waker: How a future tells the executor, “I’m ready to try again!” without wasting CPU cycles. The reactor: How we bridge the gap between OS-level events (like network I/O) and our Rust state machines. The post Async Rust: Pinning demystified appeared first on The New Stack.
Read more →

Power agentic workflows in your terminal with GitHub Copilot CLI

Since GitHub Copilot CLI launched in public preview in September 2025, we’ve been shipping frequent updates and improvements. Below, we’ll show you what makes Copilot CLI so special, why it’s great to have an agentic AI assistant right in your terminal, and how we’re building the Copilot CLI to connect more broadly to the rest of the GitHub Copilot ecosystem. Note: This blog is based on a GitHub Universe 2025 presentation. Watch below to see the functionality in action. 👇 Bringing the CLI to where you work If you use GitHub Copilot in VS Code or in a similar IDE, consider how often you spend your entire working day in the IDE, trying to avoid doing anything in any other working environment. We kept this thought top of mind when we conceptualized the GitHub Copilot CLI. Developers spend time using ssh to connect to servers, debug things in containers, triage issues on github.com, manage CI/CD pipelines, and write deployment scripts. There’s a lot of work that doesn’t neatly map into an individual IDE or even a multipurpose code editor like VS Code. To make sure that we brought the GitHub Copilot CLI to developers where they already are, it made sense to go through the terminal. After all, the terminal transcends all the different applications on your computer and, in the right hands, is where you can accomplish any task with fine-grained control. Bringing GitHub Copilot into the CLI and giving it access to the broader GitHub ecosystem lets you spend more time getting your work done, and less time hunting down man pages and scouring through documentation to learn how to do something. Showcasing the GitHub Copilot CLI functionality Often, the first step with a project is getting up to speed on it. Let’s consider an example where you’re filling in for a friend on a project, but you don’t know anything about it—you don’t know the codebase, the language, or even the framework. You’ve received a request to update a feedback form because the UI elements are not laid out correctly. Specifically, the Submit Feedback button overlaps the form itself, obscuring some fields. Whoever submitted the bug included a screenshot showing the UI error. To get started, you can launch the GitHub Copilot CLI and ask it to clone the repository. Clone the feedback repo and set us up to run it After sending this prompt, Copilot will get you everything you need: It will reference the documentation associated with the repository and figure out any dependencies you need in order to successfully run it. It’s a fast way to get started, even if you’re not familiar with the dependencies required. Copilot will prompt you before running any commands to make sure that it has permission to do so. It will tell you what it’s doing and make sure that you authorize any commands before it runs them. Now let’s say that your repository is set up and you go to run the server, but you receive an error that the port is already in use. This can be a workflow killer. You know that there are commands you can run in the terminal to identify the process using the port and safely shut it down, but you might not remember the exact syntax to do so. To make this much easier, you can just hand the task over to Copilot. What is using port 3000? Without you needing to look up the commands, Copilot can determine the PID of the process using the port. You can then either kill the process yourself or hand that task over to Copilot so you can focus on other tasks. 
Find and kill the process on port 3000 Continuing with our example, you now have the repository up and running and can verify the error with the Submit Feedback button. However, you don’t want to look through all of the code files to try and find what the bug might be. Why not have Copilot take a look first and see if it can identify any obvious issues? Copilot can analyze images, so you can use the image supplied in the bug report. Upload the screenshot showing the error to the repository, and ask Copilot if it has any ideas on how to fix the bug. Fix the bug shown in @FIX-THIS.PNG Copilot will attempt to find and fix the issue, supplying a list of suggested changes. You can then review the changes and decide whether or not to have Copilot automatically apply the fixes. And we’re able to do all of this in the terminal thanks to the GitHub Copilot CLI. However, before uploading these changes to the repository, you need to meet the team’s very strict accessibility requirements. You might not be familiar with what these are, but in this example, the team has a custom agent that defines them. It has all the right MCP tools to check on the guardrails, so you can leverage the agent to do an accessibility review of any proposed changes. /agent This command provides a list of available custom agents, so you can select the appropriate one you want to use. Once you select the appropriate agent, simply ask it to look over the proposed changes. Review our changes This prompt sets the agent to work, looking at your changes. If it finds any issues, it will let you know and suggest updates to make sure your changes are aligned with its instructions. With the appropriate agents to leverage, this can be an immensely powerful way to put checks on your code. Finally, let’s say you want to know if there are any open issues that map to the work that you’ve done, but you don’t want to manually search through all of the open issues. Luckily, Copilot CLI ships with the GitHub MCP server, so you can look up anything on the GitHub repository without needing to manually go to github.com. Are there any open issues that map to the work we're doing? The GitHub MCP server will then search through all of the issues and identify any that might match the work that you’ve addressed. If it pulls up issues that aren’t completely resolved by the work that you’ve done, you can still delegate this work to a coding agent so that you can continue working on whatever you’re focused on. /delegate Finish fixing the issue outlined in #1 and use the Playwright MCP server to ensure that it's fixed The /delegate command dispatches a coding agent to work on the task for you in the background while you turn your attention to other areas. It will open a pull request for the work that the coding agent performs. This is identical to the standard Copilot coding agent workflow—just started through GitHub Copilot CLI. Headless operation for scripting and automation GitHub Copilot CLI has even more functionality than what we’ve previously showcased. You can now perform tasks headlessly in the Copilot CLI. Remember the example where we talked about identifying and killing the process running on port 3000? You could do this through the CLI with the following command. copilot --allow-all-tools -p "Kill the process using port 3000" Copilot will then use the appropriate commands to identify and kill that process. 
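If you want to call that same headless invocation from your own tooling, a minimal sketch in Node might look like the following. This is our illustration, not an official GitHub example; it assumes the copilot binary is installed and already authenticated, and reuses the prompt and flags shown above.

// Wrap the headless copilot command in a small script (hypothetical helper).
import { execFile } from "node:child_process";

function freePort(port: number): void {
  execFile(
    "copilot",
    ["--allow-all-tools", "-p", `Kill the process using port ${port}`],
    (err, stdout, stderr) => {
      if (err) {
        console.error(`copilot exited with an error: ${stderr}`);
        return;
      }
      console.log(stdout);
    }
  );
}

freePort(3000);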
While this is a simple example, you can think of more complex scenarios where you could hook this up into a script or actions workflow and reuse it over and over again. Note that this included the flag --allow-all-tools, which is probably not something you want to include in an actual environment unless you’re running in a container. Luckily, we provide several flags that you can pass to only allow access to certain directories and tools. You can even restrict Copilot from using specific commands, so you can guarantee that a human is always involved, such as with pushing up to a repository. To see a list of possible flags, run the following command. copilot --help You can authenticate interactively with a login command or by using a personal access token. This way, you can use it in automations. We’re also actively working on additional authentication methods that are more enterprise friendly. Trying the Copilot CLI yourself We’re constantly shipping updates and are always looking for feedback from our users. We have several open issues and are tracking the items that customers want to see. If you want to see what we’re working on and provide feedback, check out our public GitHub Copilot CLI repository. And if you want to get started, it’s incredibly easy. It’s available for Windows (both WSL and natively in PowerShell), macOS, and Linux. We provide several platform-specific ways to install the CLI in the Copilot CLI README. Give it a try and come join the conversation on our public repository to help us build the best terminal-based AI system we possibly can. We look forward to hearing your feedback! Get started with GitHub Copilot CLI > The post Power agentic workflows in your terminal with GitHub Copilot CLI appeared first on The GitHub Blog.
Read more →

Anthropic extends MCP with a UI framework

With the MCP protocol, Anthropic created the de facto standard for AI models and agents to talk to third-party applications. After donating the MCP protocol to the Agentic AI Foundation last December, Anthropic today released a major new open extension to MCP that will allow MCP servers to serve up an interactive app-like experience right within the chat interface. Anthropic, of course, is building this feature right into the web and desktop experiences for Claude. It’s worth stressing, however, that this is an open protocol, so any other chatbot provider can adopt this protocol, and any third-party service will be able to build these apps. Already, support for MCP Apps is available in Goose, Visual Studio Code (for Insiders), and, later this week, ChatGPT from Anthropic competitor OpenAI. Some of Anthropic’s early partners include the likes of Amplitude, Asana, Box, Canva, Clay, Figma, Hex, monday.com, and Slack. With the Box MCP App, for example, users will be able to search for files and preview documents inline in the chat experience — and then ask questions about those documents, too. With the Slack app, meanwhile, users can use the AI model to write and edit message drafts and then post them to Slack. Among other things, it’s the MCP Apps framework that allows for the direct editing of these messages right in Claude. The Slack MCP app (image credit: Slack). “Enterprises need more from AI than powerful models. They also need a reliable way for those models to operate inside real business environments. By partnering with Anthropic, we are bringing Salesforce directly into our customers’ flow of work and providing the execution layer with context, data, governance, and trust,” says Nick Johnston, SVP of Strategic Tech Partnerships at Salesforce in today’s announcement. “That’s what powers the Agentic Enterprise.” Soon, Slack owner Salesforce will also bring its Agentforce, Data 360 and Customer 360 apps to Claude. The Asana MCP app (credit: Asana). Some of the typical scenarios for using MCP Apps, which Anthropic first proposed in November, include interactive data exploration using dashboards, configuration wizards, document reviews and real-time monitoring. At its core MCP Apps rely on tools that supply user interface metadata and the user interface resources (HTML and JavaScript) to render them. Building MCP Apps The core primitives for defining MCP apps (credit: Anthropic). Bringing this interactive UI experience to Claude and other chat-centric AI tools feels like a logical next step. Chat is, for better or worse, still the default way to interact with AI models, but for a while now, it has felt quite limited. Anthropic isn’t the first one to think of this, of course. With its Apps SDK, OpenAI offers a somewhat similar framework, which also uses MCP at its core. Anthropic notes that both the OpenAI Apps SDK and the open-source MCP-UI project (created by Ido Salomon and Liad Yosef) pioneered many of these patterns. “The projects proved that UI resources can and do fit naturally within the MCP ecosystem, with enterprises of all sizes adopting both the OpenAI and MCP-UI SDKs for production applications,” the Anthropic team writes. And for the foreseeable future, developers who wrote MCP-UI apps will be able to continue to do so. “MCP Apps builds upon the foundations of MCP-UI and the ChatGPT Apps SDK to give people a rich, visually interactive experience,” says Nick Cooper, Member of Technical Staff, OpenAI. 
“We’re proud to support this new open standard and look forward to seeing what developers build with it as we grow the selection of apps available in ChatGPT.” On the security front, Anthropic notes that it implemented a number of guardrails to ensure the third-party code you are running on your MCP host cannot break out of its sandbox. These include sandboxed iframes with restricted permissions, the ability of hosts to review the HTML content before rendering, auditable UI-to-host messages, and the fact that users have to give explicit approval for UI-initiated tool calls. The post Anthropic extends MCP with a UI framework appeared first on The New Stack.
Read more →

Cisco is using eBPF to rethink firewalls, vulnerability mitigation

Networking giant Cisco purchased Isovalent in 2024 to get in on the cloud native action. In our cloud native community, Isovalent was primarily known for Cilium, an Extended Berkeley Packet Filter (eBPF) overlay network that worked well for Kubernetes environments, namely by replacing iptables with in-kernel traffic routing by eBPF. The company also built Tetragon, a vulnerability mitigation platform that Cisco has already embedded into its own smart switch software. Today, Cisco is one of the chief purveyors of network infrastructure, gear such as routers and switches, aimed primarily at enterprises. “They liked what we were doing, and they saw value and continue to see value in the solutions that we have for the Kubernetes world,” says Liz Rice, Isovalent chief open source officer, in an interview with The New Stack. “Cisco has this enormous global footprint across traditional networking, so being able to bridge those two things together is really nice.” Cisco runs its switches on Linux, which, like any software, has its share of vulnerabilities. Powering down a fleet of them just to apply a patch across each box isn’t ideal, however. Tetragon allows users to patch, or even upgrade, these switches while they continue to run. These eBPF technologies “are incredibly foundational to where Cisco wants to go from an existing product perspective, but also from a future perspective,” says Thomas Graf, Isovalent chief technology officer and a co-creator of Cilium, also in the TNS interview. “I think they’re learning that cloud native is not just Kubernetes, but a concept of how infrastructure will be done in general in the future, and that it will go beyond Kubernetes and containers,” Graf says. eBPF provides a way to interject programmability directly into the Linux kernel, allowing the kernel to make decisions about incoming and outgoing traffic even before it gets to the application. “With eBPF, we can attach miniature firewalls everywhere in the operating system or in the application code as well,” Graf says. “It’s a completely new era of firewalling that is not based on choke point firewalls that sit somewhere physically in the network.” Faster patching Today’s elaborate processes of vulnerability mitigation could change dramatically with eBPF, Graf says. In today’s world, if there is a bug in your software that can be exploited by a malicious party, it must be patched. This is “easier said than done,” Graf says, noting that most organizations have long lists of patches they need to apply, which are usually ranked by severity to set the priority of how quickly they should be applied. And if they run on dedicated hardware, such as an Internet of Things (IoT) device, the underlying operating systems will need patches as well. eBPF can either mitigate the attack itself or drastically reduce the blast area by blocking the specific action that the malicious user wants to take with the faulty software. The user still must patch, but it is not as urgent, and malicious actions can be blocked by eBPF in the meantime. Originally called the Berkeley Packet Filter, the technology first served as a network packet filter for the Berkeley Software Distribution (BSD). It has since been expanded into a virtual machine (VM) that can execute sandbox-secured code. 
Since its inclusion in the Linux kernel a decade ago, the Linux-based eBPF has found widespread adoption, particularly for observability, security, and compliance tools that benefit from its programmable in-line speed to analyze and filter packets without the need for cumbersome modules or dangerous kernel modifications. Cisco’s application Cisco has integrated eBPF into its Hypershield technology, available in its Cisco Nexus 9300 Series Smart Switches. It addresses the changing patterns in data center traffic. “Traditional security creates chokepoints. You route traffic through firewalls, IPS appliances, or virtual security functions. This made sense when your data center had clear boundaries and most traffic crossed them,” writes analyst Robb Boyd in a blog post. “But modern infrastructure doesn’t work that way anymore.” For one, a lot of network traffic used to run north-south, meaning between the server and the outside world. Today, especially with AI traffic and distributed Kubernetes deployments, a lot of traffic goes east-west, or across an internal network. Using eBPF, Hypershield adds an agent to each endpoint, such as VMs and Kubernetes pods, to get kernel-level visibility and control. “This agent sees everything: network packets, file operations, process behavior, system calls,” Boyd writes. eBPF as control plane The platform only hints at the possibilities down the road. One of Cisco’s goals with eBPF is to move away from centralized firewalls and towards distributed firewalls for each device and even each program. Patching an entire fleet of switches means each one must be rebooted individually, which is an expensive operation, one that would preferably be done during a scheduled maintenance window. Perhaps the reboots could even be spread out so that no downtime would be incurred at all. “You want to be able to pick your time of choice and not have the timeline be dictated by the vulnerability being disclosed,” Graf says. In fact, this is one of the reasons that Facebook/Meta got involved in eBPF. That company runs thousands of Linux servers, and to patch them all at once during a time of a critical vulnerability would be nearly impossible. “So they were very interested in essentially investing into eBPF to mitigate zero-day attacks where the entire Facebook server fleet was vulnerable,” he says. All attacks leverage an interface that the OS provides, either an API call or a system call. Think of eBPF as a miniature firewall, one located in working memory that can filter out specific actions. “eBPF can hook into all of these interfaces and to essentially be in the middle of whatever calls the interface and what uses the interface and can then filter out” any malicious activity, Graf says. This would work not only for OSes, but for any application on the network as well. Think of the severe ingress-nginx vulnerability unearthed last March (CVE-2025-1974). This vulnerability hit a lot of Kubernetes deployments, whose management teams had to figure out where they were using the Nginx software. An eBPF deployment within all the OSes could take care of the problem once: If you are running Nginx, apply this filter. eBPF’s next frontier: The laptop While eBPF may work to secure Linux servers, what about desktop computers? The ongoing work on bringing eBPF to Microsoft Windows is nearly completed, Graf says. This is an entirely new market for eBPF, he notes. Linux is dominant in the server market, but Windows rules the endpoint market for laptops, desktop computers, and small devices. 
“I think now we can apply eBPF for security purposes, not just for the workload and server side, but also for your laptop,” Graf says. eBPF excels at understanding programs as they run. It can operate at the operating system level without damaging it, he says. He points to how eBPF is already being used in Google’s Android-powered devices. If you want to know how much network bandwidth Android is using, eBPF is behind that. Developers running agents and models on their laptops will need to be protected, and here is where eBPF could come into play. Applications that run on your behalf under your user account, like AI agents, need a new form of security. Another challenge in the future will be connecting the identities of machines with those of users. Just because someone has your password shouldn’t mean that they get access to your company’s network. “It’s a network of agents and services that are connected together. So we have to carry the identity forward all the way to where you actually access the sensitive data,” Graf says. The post Cisco is using eBPF to rethink firewalls, vulnerability mitigation appeared first on The New Stack.
Read more →

How open standards enable zero trust on commodity hardware

Confidential computing has always held a certain promise. The idea that workloads could process sensitive data while remaining isolated even from the infrastructure that runs them has reshaped the way many enterprise security teams think about trust. For years, we have accepted that data should be encrypted at rest and in transit, but data in use has remained exposed to the platform beneath it. Confidential computing proposes to close that gap. What has slowed adoption is not a lack of interest but a reliance on specialized and expensive hardware. Trusted execution environments demand specific CPUs, constrained instance types, and operational trade-offs that place them out of reach for many real-world deployments. The result is a growing mismatch between the threat models enterprises care about and the tools they can practically deploy. At the same time, something important is happening in open source. A set of identity and isolation primitives is quietly maturing into an infrastructure layer that looks a lot like the public key infrastructure that underpins the modern web. Instead of encrypting sessions between browsers and servers, these systems establish cryptographic identities for workloads themselves. Let’s look at how those building blocks come together, why workload identity is becoming central to zero trust architectures, and how systems like Edera use open standards to deliver many of the benefits of confidential computing without requiring new hardware. SPIFFE and the meaning of workload identity To understand where this is going, it helps to define a few terms. Workload identity is the idea that software should be able to prove what it is and where it is running, independent of network location or static credentials. Workload attestation is the process of verifying those properties before granting identity. Zero trust is the assumption that no implicit trust exists based on network position, and that every interaction must be authenticated and authorized. Confidential computing, in its strictest sense, aims to ensure that workloads remain isolated and verifiable even from the host platform. SPIFFE, the Secure Production Identity Framework for Everyone, is a specification that addresses workload identity directly. It defines how workloads are identified, how those identities are represented, and how they can be verified across distributed systems. A SPIFFE ID is a structured identifier bound to a trust domain and a specific workload. It is not a secret and is not tied to an IP address or a long-lived credential. Instead, it becomes meaningful only when paired with a cryptographic document known as an SVID, or SPIFFE Verifiable Identity Document. An SVID binds a SPIFFE ID to a key pair and a signing authority. This allows workloads to authenticate to each other using short-lived credentials that can be rotated automatically. From the perspective of a developer or operator, this looks familiar. It mirrors the waycertificates work on the web, but the subject is a workload rather than a domain name. The important distinction is that SPIFFE does not dictate how trust is established. It defines the interface and the format, leaving attestation to the underlying platform. That flexibility is what makes it so powerful. SPIFFE can sit above cloud-provider metadata, operating-system signals, or, in our case, a hypervisor-rooted trust model. SPIRE as the runtime for trust SPIRE is the reference implementation of the SPIFFE specification. 
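As a purely illustrative aside before digging into SPIRE: a SPIFFE ID is just a URI in a fixed shape, so a small sketch like the one below can pull it apart. The trust domain and workload path shown are made up for the example; this is not an official SPIFFE library.

// Parse the SPIFFE ID format described above (illustrative only).
function parseSpiffeId(id: string): { trustDomain: string; path: string } {
  const url = new URL(id);
  if (url.protocol !== "spiffe:") {
    throw new Error("not a SPIFFE ID");
  }
  // The authority names the trust domain; the path names the workload.
  return { trustDomain: url.host, path: url.pathname };
}

console.log(parseSpiffeId("spiffe://prod.example.org/payments/api"));
// -> { trustDomain: "prod.example.org", path: "/payments/api" }

The ID itself carries no secret; it only becomes meaningful when bound to a key pair in an SVID, as described next.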
Where SPIFFE defines what workload identity looks like, SPIRE defines how it is issued and managed in practice. It introduces two main components: a SPIRE Server and SPIRE Agents. The SPIRE Server acts as the root of trust. It holds the signing keys for the trust domain and enforces registration policies that define which workloads are allowed to receive which identities. The SPIRE Agent runs on each node and performs two related tasks. First, it proves the identity of the node itself through node attestation. Then it performs workload attestation on behalf of processes running on that node. Node attestation determines whether a machine should be trusted to host workloads in the first place. Workload attestation answers whether a specific process meets the criteria to receive a given identity. Crucially, workloads never carry secrets with them. They request an identity from the local agent at runtime and receive an SVID only if attestation succeeds. Those identities are short-lived and automatically rotated, dramatically reducing the blast radius of compromise. This separation is what allows SPIRE to fit cleanly into zero trust models. Trust is established explicitly, continuously, and based on verifiable properties rather than assumptions about the environment. Combining zones and isolation Edera approaches isolation from a different starting point. Instead of sharing a kernel across workloads, Edera runs applications inside zones that behave like lightweight virtual machines. Each zone has its own kernel and is isolated by a type-1 hypervisor with a small trusted computing base. This removes the shared kernel from the trust boundary and eliminates an entire class of container escape attacks. In this model, zones become the natural unit of trust. A zone is not just a scheduling construct but a security boundary. That makes it an ideal foundation for workload identity. The challenge is proving to a remote party that a workload is actually running inside such a zone. This is where SPIFFE and SPIRE fit. By rooting node attestation in the hypervisor itself, Edera can use the hypervisor as the underlying platform authority. The hypervisor can vouch for the existence and integrity of zones, while standard workload attestation mechanisms operate inside those zones without modification. Key material and sensitive services like the SPIRE Server can themselves run inside hardened zones, further reducing exposure. The result is a system where workloads receive cryptographic identities only if they are running inside verified isolated environments. Data can be encrypted directly to those identities, and policies can be enforced based on where and how code is executing, not just who wrote it. This architecture delivers something subtle but important. It provides remote attestation of isolation properties without relying on specialized hardware enclaves. The guarantees come from strong isolation and verifiable identity rather than opaque hardware features. In practice, this covers a large set of real-world threat models that enterprises care about today. Why this matters now Enterprise security teams are increasingly forced to reason about workloads rather than hosts. Microservices, multitenant clusters, and AI systems that process sensitive data keep eroding traditional boundaries. At the same time, the cost and complexity of hardware-based confidential computing remain a barrier. Open standards like SPIFFE and implementations like SPIRE offer an incremental path forward. 
They allow organizations to adopt zero trust principles at the workload level, establish cryptographic identities, and build policy around verifiable execution contexts. Systems like Edera show how strong isolation and identity can work together to approximate the benefits of confidential computing using commodity infrastructure. This is not an argument against hardware enclaves. Those technologies will continue to matter for the most sensitive threat models. But it is an argument for paying attention to the broader evolution of workload identity. Just as it quietly became foundational to the web, workload identity is becoming foundational to modern distributed systems. Understanding how attestation, zones, zero trust, and identity intersect will be critical over the next few years. The pieces are already here. The opportunity now is to learn how they fit together and to build systems that can earn trust rather than assume it. The post How open standards enable zero trust on commodity hardware appeared first on The New Stack.
Read more →

A security checklist for your React and Next.js apps

Modern cloud native attacks don’t always rely on a single breakthrough exploit. Instead, threat actors chain together small assumptions, overlooked behaviors, and trusted components in ways defenders least expect. The recent React2Shell vulnerability is a perfect example of this, and the EtherRAT malware shows just how creative adversaries are. For teams that rely on React, the React2Shell vulnerability was a wake-up call. It doesn’t just affect React as a framework; it breaks assumptions many teams rely on in production. In December, it showed us how quickly attackers can use something subtle like server-side rendering (SSR) behaviors for server-side code execution and how difficult it is to spot once it’s live. If you run React or Next.js workloads in production, here’s what CVE-2025-55182 and CVE-2025-66478 actually break, what you should check immediately, and how to identify attackers hiding behind legitimate infrastructure. What React2Shell breaks If you’re unfamiliar, React2Shell is not just another vulnerability you can one-click patch away — the flaw is within the framework itself. React2Shell is a class of vulnerabilities that arise when React applications improperly handle user-controlled input during SSR. Exploitation allows server-side code execution, and the attacks began only hours after the vulnerability was published. Mitigation requires coordinated updates across React server components (RSC), Next.js, and related frameworks, in addition to an evaluation of application data flows. First, once React components render on the server, they no longer execute in a browser sandbox. Instead, they run inside the backend runtime. React is often treated as frontend code and therefore, it’s assumed the server is safe. An attacker can exploit this assumption and inject JavaScript that then runs on the server, not on the browser. At this point, the code runs with the same permissions as the application itself, potentially giving attackers access to cloud credentials, internal APIs, filesystems, and more. Second, client-side sanitization is not the same when rendering moves to server-side. Client-side input validation cannot be relied on to protect server-rendered execution paths. Patterns that are safe in the browser can become risky when evaluated during SSR. Inputs never intended to be executable can be evaluated as code when handled incorrectly by server-rendered components. Finally, server-rendered components are usually assumed to be safe because they originate from application logic rather than user input. React2Shell arises from implicit framework behavior and has little to do with obviously unsafe code. Risk increases in large codebases where SSR patterns are abstracted, reused, and left unchecked. Attackers exploit assumptions because, in this case, they can shift execution from the browser to the server. Once that boundary is crossed, the blast radius expands dramatically. Server-side execution enables credential access, lateral movement, and follow-on payload delivery. Detection requires understanding what the application is doing at runtime and how that behavior can be abused. What you need to check If you have React or Next.js workloads running in production, here’s your checklist: Inventory your environment Identify all services using RSCs, Next.js server components, or SSR. Don’t forget to check the admin panels and dashboards of all internal tools. Ensure framework and package versions are updated against advisory guidance. 
Audit data flows Is user-controlled input passed into server-rendered components? Are there dynamic rendering paths that evaluate data structures or serialized content? Has data from app logic been reviewed, or is it assumed safe? Review permissions Does this service need outbound internet access? Are credentials and permissions at the minimum requirements? Can containers write to disk or spawn child processes? What happens after exploitation React2Shell was being actively exploited by nation-state threat actors within hours of public disclosure. In one particular campaign investigated by the Sysdig Threat Research Team (TRT), the damage went far beyond smash-and-grab exploitation and financial motivation. A custom remote access trojan (RAT) dubbed EtherRAT was deployed in real-world React2Shell attacks. Instead of using traditional command-and-control (C2) infrastructure, EtherRAT uses something unconventional but resilient: The Ethereum blockchain. Commands are encoded into blockchain transactions and infected systems monitor the chain for instructions. EtherRAT payloads are delivered in stages, allowing the malware to pull down additional capabilities as needed. This approach offers several advantages for attackers: Resilience: Public blockchains are highly available and difficult to disrupt. Stealth: Blockchain traffic can appear legitimate and is increasingly common in enterprise environments, making it difficult to distinguish. Attribution challenges: There’s no central server to seize or sinkhole. This is not commodity malware opportunistically scanning the internet. It’s deliberately crafted and designed to blend into modern operational noise. The takeaway here is: You won’t always see “malware-like” behavior from vulnerability exploitation. EtherRAT indicates subtle runtime deviations in systems that otherwise look healthy, an issue easily overlooked. How to find hidden threats Detecting React2Shell abuse or other hidden threats requires observing what workloads are doing at runtime. You don’t need to know about specific threats to detect threats like these. You just need to know how your environment and applications normally behave. When identified, the following behaviors should be investigated when they’re unexpected or abnormal: Process-level Web server or js processes spawning shells Unexpected child processes Executions at runtime that don’t align with normal app startup behavior Network Outbound connections to unfamiliar external endpoints. Long-lived outbound connections with no relation to the application function. Blockchain-related traffic coming from web services that have no business requirement. File-system Writes to temporary directories from web-facing processes. Creation or execution of new binaries at runtime. What comes next Several broader trends emerge from these recent discoveries: The blurring of client and server boundaries. When JavaScript runs everywhere, blind assumptions become far more costly. Server-side JavaScript is server code. The weaponization of legitimate infrastructure. Blockchains, CI/CD systems, and cloud metadata services are all fair game. The limits of static security controls. You can’t scan your way out of logic flaws that only manifest during execution. So, what does “operating safely” look like in light of React2Shell and EtherRAT? Production behavior is a new security perimeter. Attackers are already operating comfortably inside it, and with clarity, defenders will catch up. There’s no blame or need to slow innovation. 
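To ground that advice, here is a deliberately simplified sketch — hypothetical code, not the actual React2Shell mechanics — of how input that is inert in the browser becomes server-side execution once an SSR path evaluates it, and what the data-only alternative looks like.

// Risky: evaluating user-controlled text during server-side rendering runs it
// in the backend runtime, with the application's own permissions.
export function renderGreetingUnsafe(template: string, name: string): string {
  // e.g. template = "`Hi ${name}`" -- but nothing stops "`${process.env.DB_PASSWORD}`"
  return new Function("name", "return " + template + ";")(name);
}

// Safer: treat user input strictly as data and never evaluate it.
export function renderGreetingSafe(name: string): string {
  return `Hi ${String(name)}`;
}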
Treat SSR code paths with the same scrutiny as backend logic and use runtime detections based on normal and irregular behaviors, not just known threats. The post A security checklist for your React and Next.js apps appeared first on The New Stack.
Read more →

Nvidia makes AI weather forecasting more accessible, no supercomputer needed

With very few exceptions, large-scale weather forecasting has been the domain of government agencies with access to massive supercomputers. But that is changing. Nvidia launched two open source weather forecasting models today: Earth-2 Medium Range and Earth-2 Nowcasting. In addition, it is launching a tool that will significantly speed up the generation of starting conditions for these models. Mike Pritchard, Nvidia’s director of climate simulation, tells The New Stack, “The stakes can’t be higher in weather.” “Worsening extreme weather, driven by climate change, is having impacts on all of us and nearly every aspect of modern life. Forecasting affects us all. It can drive improvements to agriculture, energy, aviation, and emergency response, but the science of forecasting is changing,” Pritchard says. AI has sparked a “scientific revolution in weather forecasting,” Pritchard argues, but researchers have struggled to move this work out of the lab and into practical solutions. “We need to lower the barrier to entry so developers can build tools in the open.” This isn’t Nvidia’s first foray into the weather forecasting business. As part of Earth-2, its effort to build a digital twin of Earth, it previously launched two other models. The first is Earth-2 CorrDiff, a model that takes continental-scale predictions and downscales them to high-res local ones up to 500 times faster than traditional methods. The second is Earth-2 FourCastNet3, a highly efficient global forecasting model that can run on a single Nvidia H100 GPU. Accurate forecasts aren’t just useful for deciding whether to take an umbrella or not. These models are critical infrastructure for airlines, insurers, energy providers, and agriculture. Nvidia’s new weather models Both of the previous models — and most other existing AI-based forecasting models — use specialized model architectures and do not use the transformer-based approach that is now the default for modern large language models (LLMs). For the new Medium Range and Nowcasting models, Nvidia adapted exactly this transformer architecture. Transformer-based architectures, after all, are backed by the performance and engineering tooling of virtually every other AI company. “Philosophically, scientifically, it’s a return to simplicity,” Pritchard says. “We’re moving away from hand-tailored niche AI architectures and leaning into the future of simple, scalable transformer architectures.” The Medium Range model, as its name implies, is meant to provide high-accuracy forecasts for up to 15 days in the future. The Nvidia Earth-2 Medium Range model in action. (Credit: Nvidia) Nvidia hasn’t provided The New Stack with detailed benchmarks yet, but Pritchard argues that the Medium Range model outperforms DeepMind’s GenCast, the current leader in this space, “across more than 70 weather variables,” including temperature, pressure, and humidity. The Nowcasting model is perhaps even more interesting, though: It generates country-scale forecasts at kilometer resolution — a very high resolution for any modern model. Most of the models that inform weather forecasts in Europe or North America have a resolution of two kilometers or more, while the U.S. National Oceanic and Atmospheric Administration’s (NOAA) GFS model, which is available for free and is often the default in free weather apps, has a resolution of 13 kilometers (though NOAA has also started implementing AI forecasts recently). 
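A rough back-of-the-envelope sketch shows why kilometer-scale output is such a jump from GFS-style 13-kilometer grids; the country-sized area below is approximate and only meant to illustrate the scaling.

// Approximate number of grid cells needed to cover an area at a given resolution.
function cellsForArea(areaKm2: number, resolutionKm: number): number {
  return Math.round(areaKm2 / (resolutionKm * resolutionKm));
}

const countryKm2 = 550_000; // roughly France-sized, for illustration
console.log(cellsForArea(countryKm2, 13)); // ~3,254 cells at a 13 km grid
console.log(cellsForArea(countryKm2, 1));  // ~550,000 cells at a 1 km grid
// Cell count grows with the square of the resolution improvement:
// 13x finer spacing means roughly 169x more cells to predict per time step.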
The Israeli Meteorological Service plans to use the Nowcasting model to generate high-resolution forecasts up to eight times daily going forward. The organization already uses Nvidia’s older CorrDiff model. Similarly, The Weather Company (the company behind weather.com) plans to use Nowcasting for localized severe-weather applications. No supercomputer needed For the Medium Range model, which comes in a few variants ranging from 2.4 billion parameters to 3.3 billion, the training was done on 32 80GB A100/H100 GPUs. But to run the model, you only need 26GB of GPU memory, and an A100 GPU can run a single time-step prediction that covers 6 or 12 hours. Depending on the model, it takes 140 seconds for the GenCast model, 94 and 88 seconds for the two other Medium Range variants (dubbed Atlas-SI and Atlas-EDM), and under four seconds for the Atlas-CRPS model (which has additional noise conditioning and is a bit larger, at 3.3 billion parameters). For the Nowcasting model, each 6km-resolution model requires only 5GB of GPU memory and can run in 33 seconds on a single H100 GPU at maximum precision. “We expect the inference speed to be greatly accelerated by techniques such as distillation and/or reduced precision,” an Nvidia spokesperson tells us. Data assimilation: The other 50% of the problem For weather forecasts, the starting data from which the model begins generating its forecast is crucial. That can be satellite imagery, radar data, sensor data from weather balloons, airplanes, and buoys. All of this data needs to be normalized and transformed so the models can work with it. Climate scientists call this process “assimilation.” To accelerate this hours-long process, Nvidia also launched the Global Data Assimilation model, which produces these initial snapshots of the global weather within seconds. “While the AI community and the research community have focused a lot on the prediction models over the past five years, this data assimilation task, this state estimation task, has remained largely unsolved by AI, yet it consumes roughly 50% of the total supercomputing loads of traditional weather [forecasting],” says Pritchard. The assimilation model is actually quite small, at 330M parameters. Using one H100 GPU, it can run the full inference pipeline in under a second, all while using less than 20GB of GPU memory. It still seems unlikely — but possible — that even these efficient models will allow hobbyists to start creating their own forecasts anytime soon. Simply acquiring and managing the starting data, after all, is a major data problem. But for an enterprise with the right use case and resources, this may just open the door to creating local forecasts without the need to access a supercomputing cluster. Update: We updated this post after publication to include the compute requirements for these models. The post Nvidia makes AI weather forecasting more accessible, no supercomputer needed appeared first on The New Stack.
Read more →

Meh

My thanks to Meh for sponsoring last week at DF. Meh puts up a new deal every day, and they do it with panache. As they say, “It’s actual, real, weird shit you didn’t know existed for half the price you would’ve guessed.” Don’t tell any of my other sponsors, but Meh is my favorite longtime DF sponsor. I love the way their orange graphics look against DF’s #4a525a background. And I always love their sponsored posts that go into the RSS feed at the start of the sponsorship week. I’ll just quote theirs from this week in full: Everything sucks. The whole world’s going to shit, especially our part of it, and it can feel like anything fun or silly is sticking your head in the sand. And yet. It doesn’t help to just be miserable. If you’re going to last, you’ve got to find your little moments of joy, or as a break from the misery. Buying our crap at Meh is not how you solve the world’s problems. We’re not that crass. But maybe a minute a day of reading our little write-up, and a couple minutes of catching up with the Meh community, of making a few new online friends, and yes, of occasionally picking up a weird gadget or strange snack you’ve never heard of is just a few minutes you get to take a break, not giving in to how bad everything else is. Of course we would say that. Of course we benefit from that. But it is also part of why we have a quirky write-up. Why we have a community. Why we’re selling whatever weird thing is over at Meh today. ★
Read more →

★ The iOS 26 Adoption Rate Is Not Bizarrely Low Compared to Previous Years

A few weeks ago there were a rash of stories claiming that iOS 26 is seeing bizarrely low adoption rates from iPhone users. The methodology behind these numbers is broken and the numbers are totally wrong. Those false numbers are so low, so jarringly different from previous years, that it boggles my mind that they didn’t raise a red flag for anyone who took a moment to consider them. The ball started rolling with this post from Ed Hardy at Cult of Mac on January 8, “iOS 26 Still Struggles to Gain Traction With iPhone Users”, which began: Only a tiny percentage of iPhone users have installed iOS 26, according to data from a web analytics service. The adoption rate is far less than previous iOS versions at this same point months after their releases. The data only reveals how few iPhone users run Apple’s latest operating system upgrade, not why they’ve chosen to avoid it. But the most likely candidate is the new Liquid Glass look of the update. [...] Roughly four months after launching in mid-September, only about 15% of iPhone users have some version of the new operating system installed. That’s according to data for January 2026 from StatCounter. Instead, most users hold onto previous versions. For comparison, in January 2025, about 63% of iPhone users had some iOS 18 version installed. So after roughly the same amount of time, the adoption rate of Apple [sic] newest OS was about four times higher. Those links point to Statcounter, a web analytics service. A lot of websites include Statcounter’s analytics tracker, and Statcounter’s tracker attempts to determine the version of the OS each visitor’s device is running. The problem is, starting with Safari 26 — the version that ships with iOS 26 — Safari changed how it reports its user agent string. From the WebKit blog, “WebKit Features in Safari 26.0”: Also, now in Safari on iOS, iPadOS, and visionOS 26 the user agent string no longer lists the current version of the operating system. Safari 18.6 on iOS has a UA string of: Mozilla/5.0 (iPhone; CPU iPhone OS 18_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/18.6 Mobile/15E148 Safari/604.1 And Safari 26.0 on iOS has a UA string of: Mozilla/5.0 (iPhone; CPU iPhone OS 18_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/26.0 Mobile/15E148 Safari/604.1 This matches the long-standing behavior on macOS, where the user agent string for Safari 26.0 is: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/26.0 Safari/605.1.15 It was back in 2017 when Safari on Mac first started freezing the Mac OS string. Now the behavior on iOS, iPadOS, and visionOS does the same in order to minimize compatibility issues. The WebKit and Safari version number portions of the string will continue to change with each release. In other words, Safari now reports, in its user agent string, that it’s running on iOS 18.6 when it is running on iOS 18.6, and reports that it’s running on iOS 18.6 when it’s running on iOS 26.0 or later. And it’s going to keep reporting that it’s running on iOS 18.6 forever, just like how Safari 26 on MacOS reports that it’s running on MacOS 10.15 Catalina, from 2019. Statcounter completely dropped the ball on this change, and it explains the entirety of this false narrative that iOS 26 adoption is incredibly low. (Statcounter has a “detect” page where you can see what browser and OS it thinks you’re using.) 
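To make Statcounter’s failure mode concrete, here is a minimal sketch in TypeScript of the two ways an analytics script could read that frozen user agent string. This is my own illustration, not Statcounter’s actual code:
// Safari 26 on iOS reports the OS token frozen at 18_6, per the WebKit blog post quoted above.
const ua = "Mozilla/5.0 (iPhone; CPU iPhone OS 18_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/26.0 Mobile/15E148 Safari/604.1";
// Naive approach: trust the OS token, which now reads 18_6 forever.
const osToken = ua.match(/iPhone OS (\d+_\d+)/)?.[1];        // "18_6" even on iOS 26
// The Safari "Version/" token, by contrast, still increments with each release.
const safariVersion = ua.match(/Version\/(\d+\.\d+)/)?.[1];  // "26.0"
console.log({ osToken, safariVersion });
Any tracker keying OS share off that first token will bucket every Safari 26 visitor as iOS 18.6, which is exactly the undercount described below.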
The reason they reported that 15 percent of iPhone users were using iOS 26 is probably because that’s the amount of web traffic Statcounter sees from iOS 26 web browsers that aren’t Safari (most of which, I’ll bet, are in-app browser views in social media apps). Nick Heer, at Pixel Envy, wrote a good piece delving into this saga. And then he posted a follow-up item pointing out that (a) Statcounter’s CEO has acknowledged their error and they’re fixing it; and (b) Wikimedia publishes network-wide stats that serve as a good baseline. The audience for Wikipedia is, effectively, the audience for the web itself. And Wikipedia’s stats show that while iOS 26 adoption, in January 2026, isn’t absurdly low (as Statcounter had been suggesting, erroneously, and writers like Ed Hardy at Cult of Mac and David Price at Macworld foolishly regurgitated, no matter how little sense it made that the numbers would be that low), they are in fact lower than those for iOS 18 a year ago and iOS 17 two years ago. Per Wikimedia:
iOS 26, January 2026: 50%
iOS 18, January 2025: 72%
iOS 17, January 2024: 65%
So, no, iOS 26 adoption isn’t at just 15 percent, which only a dope would believe, but it’s not as high as previous iOS versions in previous years at this point on the calendar. Something, obviously, is going on. David Smith, developer of popular apps like Widgetsmith and Pedometer++, on Mastodon: I noticed iOS 26 adoption had entered a ‘third wave’ of rapid adoption. So I made a graph of the relative adoption versus iOS 18 at this point in the release cycle. While lower than iOS 18 at this point for my apps (65% vs. 78%), the shape of this graph says to me that Apple is in full control of the adoption rate and can tune it to their plans. The coordinated surges are Apple dialing up automatic updates. If this surge were as long as previous ones, we’d hit the saturation point very soon. What’s going on, quite obviously, is that Apple itself is slow-rolling the automatic updates to iOS 26. For years now Apple has steered users, via default suggestions during device setup, to adopt settings to allow OS updates to happen automatically, including updates to major new versions. Apple tends not to push these automatic updates to major new versions of iOS until two months after the .0 release in September. This year that second wave was delayed by about two weeks, and there’s now a third wave starting midway through January. It’s a different pattern from previous years — but it’s a pattern Apple controls. A large majority of users of all Apple devices get major OS updates when, and only when, their devices automatically update. Apple has been slower to push those updates to iOS 26 than they have been for previous iOS updates in recent years. With good reason! iOS 26 is a more significant — and buggier — update than iOS 18 and 17 were. People like you, readers of Daring Fireball, may well be hesitant to update to iOS 26, or (like me) to MacOS 26, or to any of the version 26 OS updates, because you are aware of things (like UI changes) that you are loath to adopt. But the overwhelming majority of Apple users — especially iPhone users — just let their devices update automatically. They might like iOS 26’s changes, they might dislike them, or they might not care or even notice. But they just let their software updates happen automatically — and they will form the entirety of their opinions regarding iOS 26 after it’s running on their iPhones.
Read more →

The Value of Things

Comments
Read more →

Box64 Expands into RISC-V and LoongArch territory

Comments
Read more →

★ Tahoe Added a Finder Option to Resize Columns to Fit Filenames

The main reason I’m sticking with MacOS 15 Sequoia, refusing to install 26 Tahoe, is that there are so many severe UI regressions in Tahoe. The noisy, distracting, inconsistent icons prefixing menu item commands, ruining the Mac’s signature menu bar system. Indiscriminate transparency that renders so many menus, windows, and sidebars inscrutable and ugly. Windows with childish round corners that are hard to resize. The comically sad app icons. Why choose to suffer? But the thing that makes the decision to stay on 15 Sequoia a cinch is that I honestly struggle to think of any features in Tahoe that I’m missing out on. What is there to actually like about Tahoe? One small example is Apple’s Journal app. I’ve been using Journal ever since it debuted as an iPhone-only app in iOS 17.2 in December 2023. 785 entries and counting. With the version 26 OSes, Apple created versions of Journal for iPad and Mac (but not Vision Pro). Syncing works great via iCloud too. All things considered, I’d like to have a version of Journal on my main Mac. But I’m fine without it. I’ve been writing entries without a Mac app since 2023, so I’ll continue doing what I’ve been doing if I want to create or edit a Journal entry from my Mac: using iPhone Mirroring. That’s it. The Journal app is the one new feature Tahoe offers that I wish I had today. I’m not missing out on the latest version of Safari because Apple makes Safari 26 available for MacOS 15 Sequoia (and even 14 Sonoma). Some years, Apple adds new features to Apple Notes, and to get those features on every device, you need to update every device to that year’s new OS. This year I don’t think there are any features like that. Everything is perfectly cromulent running iOS 26 on my iPhone and iPad, but sticking with MacOS 15 Sequoia on my primary Mac. But now that we’ve been poking around at column view in the Tahoe Finder, Jeff Johnson has discovered another enticing new feature. On MacOS 26, the Finder has a new view option (accessed via View → Show View Options) to automatically resize columns to fit the longest visible filename. See Johnson’s post for screenshots of the new option in practice. [Update: Turns out, this auto-resizing feature has been a hidden preference setting in the Finder for a few years now.] Column view is one of the best UI innovations from NeXTStep, and if you think about it, has always been the primary metaphor for hierarchical navigation in iOS. It’s a good idea for the desktop that proved foundational for mobile. The iPhone Settings app is column view — one column at a time. It’s a way to organize a multi-screen app in a visual, spatial way even when limited to a 3.5-inch display. Thanks to Greg’s Browser, a terrific indie app, I’d been using column view on classic Mac OS since 1993, a few years before Apple even bought NeXT, let alone finally shipped Mac OS X (which was when column view first appeared in the Finder). One frustration inherent to column view is that it doesn’t work well with long filenames. It’s a waste of space to resize all columns to a width long enough to accommodate long filenames, but it’s frustrating when a long filename doesn’t fit in a regular-width column. This new feature in the Tahoe Finder attempts to finally solve this problem. I played around with it this afternoon and it’s ... OK. It feels like an early prototype for what could be a polished feature.
For example, it exacerbates some layering bugs in the Finder — if you attempt to rename a file or folder that is partially scrolled under the sidebar, the Tahoe Finder will just draw the rename editing field right on top of the sidebar, even though it belongs to the layer that is scrolled underneath. Here’s what it looks like when I rename a folder named “Example ƒ” to “How is this possible?”: On MacOS 15, if you attempt to rename an item that is scrolled under the sidebar in column view, the column containing that item snaps into place next to the sidebar, so it’s fully visible. That snapping into place just feels right. The way Tahoe works, where the column doesn’t move and the text editing field for the filename just gets drawn on top of the sidebar, feels gross, like I’m using a computer that is not a Macintosh. Amateur hour. I wish I could set this new column-resizing option only to grow columns to accommodate long filenames, and never to shrink columns when the visible items all have short filenames. But the way it currently works, it adjusts all columns to the width of the longest visible filename each column is displaying — narrowing some, and widening others. I want most columns to stay at the default width. With this new option enabled, it looks a bit higgledy-piggledy that every column is a different width. Also, it’s an obvious shortcoming that the feature only adjusts columns to the size of the longest currently visible filename. If you scroll down in a column and get to a filename that is too long to fit, nothing happens. It just doesn’t fit. Even a future polished version of this column view feature wouldn’t, in and of itself, be enough to tempt me to upgrade to Tahoe. After 30-some years of columns that don’t automatically adjust their widths, I can wait another year. But we don’t yet have a polished version of this feature. The unpolished version of the feature we have today only reiterates my belief that Tahoe is a mistake to be avoided. It’s a good idea though, and there aren’t even many of those in Tahoe.
Read more →

OmniOutliner 6

Ken Case, on The Omni Group blog: The features noted above already make for a great upgrade. But as I mentioned last year, one of the interesting problems we’ve been pondering is how best to link to documents in native apps. We’ve spent some time refining our solution to that problem, Omni Links, which are now shipping first in OmniOutliner 6. With Omni Links, we can link to content across all our devices, and we can share those links with other people and other apps. Omni Links support everything we said document links needed to have. Omni Links work across all of Apple’s computing platforms and can be shared with a team. They leverage existing solutions for syncing and sharing documents, such as iCloud Drive or shared Git repositories. They are easy to create, easy to use, and easy to share. Omni Links also power up Omni Automation, giving scripts and plug-ins a way to reference and update content in linked documents — documents that can be shared across all your team’s devices. There’s lots more in version 6, including a modernized UI, and many additions to Omni Automation, Omni’s scripting platform that works across both Mac and iOS — including really useful integration with Apple’s on-device Foundation Models, with, of course, comprehensive (and comprehensible) documentation. It’s Omni Links, though, that strikes me as the most interesting new feature. The two fundamental models for apps are library-based (like Apple Notes) and document-based (like TextEdit). Document-based apps create and open files from the file system. Library-based apps create items in a database, and the location of the database in the file system is an implementation detail the user shouldn’t worry about. OmniOutliner has always been document-based, and version 6 continues to be. There are advantages and disadvantages to both models, but one of the advantages to library-based apps is that they more easily allow the developer to create custom URL schemes to link to items in the app’s library. Omni Links is an ambitious solution to bring that to document-based apps. Omni Links let you copy URLs that link not just to an OmniOutliner document, but to any specific row within an OmniOutliner document. And you can paste those URLs into any app you want (like, say, Apple Notes or Things, or events in your calendar app). From the perspective of other apps, they’re just URLs that start with omnioutliner://. They’re not based on anything as simplistic as a file’s pathname. They’re a robust way to link to a unique document, or a specific row within that document. Create an Omni Link on your Mac, and that link will work on your iPhone or iPad too — or vice versa. This is a very complex problem to solve, but Omni Links delivers on the age-old promise of “It just works”, abstracting all the complexity. I’ve been using OmniOutliner for at least two decades now, and Omni Links strikes me as one of the best features they’ve ever added. It’s a way to connect your outlines, and the content within your outlines, to any app that accepts links. The other big change is that OmniOutliner 6 is now a single universal purchase giving you access to the same features on Mac, iPhone, iPad, and Vision. ★
Read more →

Lolgato 1.7

Free Mac utility by Zendit Oy: A macOS app that enhances control over Elgato lights, offering features beyond the standard Elgato Control Center software. Features: automatically turn lights on and off based on camera activity; turn lights off when locking your Mac; sync light temperature with macOS Night Shift. Lolgato also lets you set global hotkeys for toggling the lights and changing their brightness. I’ve had a pair of Elgato Key Lights down at my podcast recording desk for years now. Elgato’s shitty software drove me nuts. Nothing seemed to work, so I gave up on controlling my lights from software. I set the color temperature and brightness the way I wanted them (which you have to do via software) and after that, I just turned them off and on using the physical switches on the lights. I forget how I discovered Lolgato, but I installed it back on November 10. I connected Lolgato to my lights, and set it to turn them on whenever the Mac wakes up, and off whenever the Mac goes to sleep. It has worked perfectly for over two months. Perfect little utility. ★
Read more →

Playing the Percentages

Dr. Drang: For weeks — maybe months, time has been hard to judge this past year — Trump has been telling us that he’s worked out deals with pharmaceutical companies to lower their prices by several hundred percent. Commentators and comedians have pointed out that you can’t reduce prices more than 100% and pretty much left it at that, suggesting that Trump’s impossible numbers are due to ignorance. Don’t get me wrong. Trump’s ignorance is nearly limitless — but only nearly. I’ve always thought that he knew the right way to calculate a price drop; he did it the wrong way so he could quote a bigger number. And that came out in yesterday’s speech. Trump sophistry + math pedantry = Daring Fireball catnip. ★
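For anyone who wants the arithmetic spelled out, here is a tiny worked example with made-up prices, not figures from any actual deal. Computed the right way, against the old price, a drop can never exceed 100 percent; computed against the new price, you can quote “several hundred percent”:
// Hypothetical prices, purely to illustrate the two calculations.
const oldPrice = 400;
const newPrice = 100;
const correctDrop = ((oldPrice - newPrice) / oldPrice) * 100;  // 75%; can never exceed 100%
const inflatedDrop = ((oldPrice - newPrice) / newPrice) * 100; // 300%; wrong denominator, bigger number
console.log({ correctDrop, inflatedDrop });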
Read more →

MacOS 26 Tahoe Broke Column View in the Finder

Jeff Johnson: Finder has four view modes, represented by the four consecutive toolbar icons in the screenshot below, if you can even call that free-floating monstrosity a toolbar anymore: Icons, List, Columns, and Gallery. My preference is columns view, which I’ve been using for as long as I remember, going back to Mac OS X. At the bottom of each column is a resizing widget that you can use to change the width of the columns. Or rather, you could use it to change the width of the columns. On macOS Tahoe, the horizontal scroller covers the resizing widget and prevents it from being clicked! I joked last week that it would make more sense if we found out that the team behind redesigning the UI for MacOS 26 Tahoe was hired by Meta not a month ago, but an entire year ago, and secretly sabotaged their work to make the Mac look clownish and amateur. More and more I’m wondering if the joke’s on us and it actually happened that way. It’s like MacOS, once the crown jewel of computer human interface design, has been vandalized. ★
Read more →

Where to Sleep in LAX

Comments
Read more →

EmulatorJS

Comments
Read more →

Why Walmart Still Doesn’t Support Apple Pay

Chance Miller, writing at 9to5Mac: When you use Walmart Pay, it’s incredibly easy for Walmart to build that customer profile on you. When you use Scan and Go, all of that same information is handed over. When you use Apple Pay or other payment methods, it’s much harder for Walmart (and other retailers) to do this. Apple Pay’s privacy and security protections, like not sharing any information about your actual card with the retailer, makes this type of tracking trickier. This is why Walmart wants people to use Walmart Pay if they want to pay from their phone. If you check out with Walmart Pay or Scan and Go, everything is linked to your Walmart account. If you had the option to pay with Apple Pay, you’d share a lot less information with Walmart. Using Walmart Pay gives Walmart more information than a regular credit or debit card transaction does. When you use the same traditional credit card for multiple purchases over time, a retailer like Walmart can build a profile associated with that card number. Charles Duhigg, all the way back in 2012, reported a story for The New York Times about how Target used these profiles — which customers don’t even know about — to statistically determine when women are likely to be pregnant based on purchases like, say, cocoa-butter lotion and vitamin supplements. When you use an in-house payment app like Walmart Pay (or swipe a store’s “loyalty” card at the register), the store doesn’t have to do any guesswork to associate the transaction with your profile. Your Walmart Pay account is your profile. Using Apple Pay gives a retailer less — or at least no more — identifying information than a traditional card transaction. So if the future is paying via devices, Walmart wants that future to give them more information. I think the situation with Walmart and Apple Pay is a lot like Netflix and Apple TV integration. Most retailers, even large ones, support Apple Pay. Most streaming services, even large ones, support integration with Apple’s TV app. Walmart doesn’t support Apple Pay because they want to control the customer transaction directly, and they’re big enough, and their customers are loyal enough, that they can resist supporting Apple Pay. Netflix doesn’t support TV app integration because they want to control the customer viewing experience directly, and they’re big enough, and their customers are loyal enough, that they can resist supporting Apple’s TV app. Amazon — which is also very large, whose customers are also very loyal, and which absolutely loves collecting data — does not support Apple Pay either. See also: Michael Tsai. ★
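A toy sketch of the mechanics Miller is describing, with entirely hypothetical data: when every visit presents the same stable identifier (a card number, or your Walmart Pay account), purchases trivially join into one profile over time. Per Miller, Apple Pay’s point is that the retailer never sees your actual card number to use as that key:
// Hypothetical illustration of retailer profiling; the data and identifiers are made up.
type Purchase = { identifier: string; item: string };
const purchases: Purchase[] = [
  { identifier: "4111-1111-1111-1111", item: "cocoa-butter lotion" },
  { identifier: "4111-1111-1111-1111", item: "vitamin supplements" },
];
// The same identifier on every visit lets purchases accumulate into one profile.
const profiles = new Map<string, string[]>();
for (const p of purchases) {
  profiles.set(p.identifier, [...(profiles.get(p.identifier) ?? []), p.item]);
}
console.log(profiles); // one profile, keyed on the stable identifier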
Read more →

Trump Administration Shares Doctored Photo of Minnesota Activist After Her Arrest

Violet Jira, reporting for NOTUS: The White House communications team posted a digitally altered photo of Nekima Levy Armstrong, a Minnesota social justice activist, on Thursday that makes it appear that she was weeping during her arrest by federal agents. The image is highly realistic, bearing no watermark or other indicator that the image has been doctored. The change is only apparent when compared to a different version of the same image posted by the Department of Homeland Security earlier in the day. The White House, which has adopted a combative, flippant tone on its widely viewed social media pages, drew some backlash for the post online. In response, White House deputy communications director Kaelan Dorr called the image a “meme.” It’s not a meme. It’s propaganda — an altogether false image presented as an actual photograph. ★
Read more →

The Information: ‘With Google Deal, Apple’s Craig Federighi Plots a Cautious Course in AI’

Aaron “Homeboy” Tilley and Wayne Ma, reporting for The Information (paywalled, alas, and with a miserly gift-link policy): But there are also potential risks to making Federighi head of AI. Giving oversight of AI to him reflects Apple’s cautious approach to the technology. He is known at Apple as a penny-pincher who keeps a tight rein on salaries and hesitates to invest in risky projects when the payoff from them isn’t clear, according to people who have worked with him. He tends to scrutinize every detail of his team’s expenses, down to their budgets for bananas and other office snacks, those people said. Meanwhile, Apple’s rivals are pouring vast amounts of capital into AI, building data centers and paying fortunes to woo AI researchers. I have no idea what Federighi’s stance is on break-room bananas, but it seems a stretch to think it offers clues to Apple’s strategy on data centers. For years, lieutenants of Federighi would try to get him on board with AI. He often shot those efforts down, former Apple executives said. For example, he rejected proposals from his team to use AI to dynamically change the iPhone home screen, believing it would disorient users, who are used to knowing where their apps are located, said former Apple employees familiar with the proposal. Jesus H. Christ, thank god Federighi shot this down. I wouldn’t want good AI rearranging my home screen behind my back, let alone Apple Intelligence as we know it. ★
Read more →

The Information Says Apple Is Working on an AI Wearable Pin

Wayne Ma and Qianer Liu, reporting for The Information (paywalled, alas): Apple is developing an AI-powered wearable pin the size of an AirTag that is equipped with multiple cameras, a speaker, microphones and wireless charging, according to people with direct knowledge of the project. The device could be released as early as 2027, they said. Don’t make the mistake of thinking that because existing AI pins have sucked (and in one notable case, flopped in spectacular fashion), they’re all going to suck. Google Glass was an embarrassment but glasses are a great form factor. MP3 players used to suck too. Such a product would position Apple to compete more effectively with OpenAI, which is planning its own AI-powered devices, and Meta Platforms, which is already selling smart glasses that offer access to its AI assistant. It is very strange to put OpenAI’s upcoming io device(s) in the same sentence as Meta’s glasses, which are a real product you can buy today. None of these things are setting the world on fire though. ★
Read more →

Ternus Now Overseeing Design at Apple, Reports Gurman

Mark Gurman, reporting at Bloomberg: Apple Inc. has expanded the job of hardware chief John Ternus to include design work, solidifying his status as a leading contender to eventually succeed Chief Executive Officer Tim Cook. Cook, who has led Apple since 2011 and turned 65 in November, quietly tapped Ternus to manage the company’s design teams at the end of last year, according to people with knowledge of the matter. That widens Ternus’ role to add one of the company’s most critical functions. And on Twitter/X: Ternus is now the “executive sponsor” of Apple’s design team, representing the critical function on Apple’s executive team. The move was under-the-radar: on paper, the teams report to Tim Cook despite Ternus’s role. Here’s to hoping Ternus is as pissed as the rest of us are about MacOS 26 Tahoe. ★
Read more →

Jackass of the Week: Utah State Senate Majority Leader Kirk Cullimore

Bridger Beal-Cvetko and Daniel Woodruff, reporting for KSL News: SB138, sponsored by Cullimore, R-Sandy, would make Android, the world’s most popular mobile device operating system, an official state symbol, joining the ranks of the official state cooking pot (the dutch oven), the official state crustacean (the brine shrimp), and the official state mushroom (the porcini). “Someday, everybody with an iPhone will realize that the technology is better on Android,” Cullimore told reporters during a media availability on Wednesday, the second day of the legislative session. But, he added, “I’m the only one in my family — all my kids, my wife, they all have iPhones — but I’m holding strong.” [...] “I don’t expect this to really get out of committee,” he said. (Via Joe Rossignol.) ★
Read more →

Taegan Goddard: ‘There’s No Going Back’

Taegan Goddard, writing at Political Wire, in a post that pairs perfectly with Om Malik’s re: velocity bestowing authority: The new Democratic argument isn’t about restoring guardrails. It’s about moving fast — and using power unapologetically — to undo what Trump has done. New Jersey will inaugurate Mikie Sherrill as governor today, one of the party’s rising stars who steamrolled Republicans in November. She has promised to govern with urgency — leaning on emergency powers, acting decisively, and skipping the old incrementalism. This, she argues, is what voters now expect. She told The New Yorker that if Democrats don’t learn to work at Donald Trump’s pace, “we’re going to get played.” Rep. Alexandria Ocasio-Cortez is even more explicit: “In order for us to correct the abuses that are happening now, we have to act in the same capacities that Trump has given himself.” The only way to counter “move fast and break things” is to move fast and fix things. ★
Read more →

Om Malik: ‘Velocity Is the New Authority’

Om Malik: That’s why we get all our information as memes. The meme has become the metastory, the layer where meaning is carried. You don’t need to read the thing; you just need the gist, compressed and passed along in a sentence, an image, or a joke. It has taken the role of the headline. The machine accelerates this dynamic. It demands constant material; stop feeding it and the whole structure shakes. The point of the internet now is mostly to hook attention and push it toward commerce, to keep the engine running. Anyone can get their cut. [...] We built machines that prize acceleration and then act puzzled that everything feels rushed and slightly manic. Crackerjack essay. Malik is focused here on the ways we’ve changed media and how those changes to media have changed us — as a society, and as individuals. But I think it explains how the Trump 2.0 administration has been so effective (such that it can be said to be effective). They recognize that velocity is authority and are moving as fast as they can. It’s an adaptation to a new media age. ★
Read more →

‘Inside Trump’s Head-Spinning Greenland U-Turn’

The Wall Street Journal (gift link; News+ link): When President Trump arrived in the snow-covered Swiss Alps on Wednesday afternoon, European leaders were panicking that his efforts to acquire Greenland would trigger a trans-Atlantic conflagration. By the time the sun set, Trump had backed down. After a meeting with Rutte on Wednesday, Trump called off promised tariffs on European nations, contending that he had “formed the framework of a future deal” with respect to the largest island in the world. [...] During an hourlong speech at the World Economic Forum, the U.S. president said he wouldn’t deploy the military to take control of Greenland. It was a stark shift in tone for Trump, who just days earlier had declined to rule out using the military to secure ownership of Greenland and posted an image online of the territory with an American flag plastered across it. No need for panic. Alarm, yes. Panic, no. The TACO theory holds. Stand up to Trump and he’ll chicken out. ★
Read more →

The Scale of ICE Protests in Minnesota

Margaret Killjoy, in a thread on Bluesky (via Kottke): I came to Minneapolis to report on what’s going on, and one of the main questions I showed up with is “just what is the scale of the resistance?” After all, we’re all used to the news calling Portland a “war zone” or whatever when it’s just some protests in one part of town. [...] Half the street corners around here have people — from every walk of life, including republicans — standing guard to watch for suspicious vehicles, which are reported to a robust and entirely decentralized network that tracks ICE vehicles and mobilizes responders. I have been actively involved in protest movements for 24 years. I have never seen anything approaching this scale. Minneapolis is not accepting what’s happening here. ICE fucking murdered a woman for participating in this, and all that did is bring out more people, from more walks of life. It’s genuinely a leaderless (or leaderful) movement, decentralized in a way that the state is absolutely unequipped to handle. There are a few basic skills involved, and so people teach each other those skills, and people are collectively refining them. Apple’s “whatever you say, boss” compliance with the Trump administration’s “demand” back in October that they remove ICEBlock from the App Store — with no legal basis, nor any evidence backing the administration’s claims that the app was being used to put members of the ICE goon squads in danger — is looking more and more like a decision on the wrong side of popular opinion. And, ultimately, on the wrong side of history. ICEBlock was designed for exactly what these protestors are doing. ★
Read more →

Fragments: January 22

My colleagues here at Thoughtworks have announced AI/works™, a platform for our work using AI-enabled software development. The platform is in its early days, and is currently intended to support Thoughtworks consultants in their client work. I’m looking forward to sharing what we learn from using and further developing the platform in future months. ❄ ❄ ❄ ❄ ❄ Simon Couch examines the electricity consumption of using AI. He’s a heavy user: “usually programming for a few hours, and driving 2 or 3 Claude Code instances at a time”. He finds his usage of electricity is orders of magnitude more than typical estimates based on the “typical query”. On a median day, I estimate I consume 1,300 Wh through Claude Code—4,400 “typical queries” worth. But it’s still not a massive amount of power - similar to that of running a dishwasher. A caveat to this is that this is “napkin math” because we don’t have decent data about how these models use resources. I agree with him that we ought to. ❄ ❄ ❄ ❄ ❄ My namesake Chad Fowler (no relation) considers that the movement to agentic coding creates a similar shift in rigor and discipline as appeared in Extreme Programming, dynamic languages, and continuous deployment. In Extreme Programming’s case, this meant a lot of discipline around testing, continuous integration, and keeping the code-base healthy. My current view is that with AI-enabled development we need to be rigorous about evaluating the software, both for its observable behavior and its internal quality. The engineers who thrive in this environment will be the ones who relocate discipline rather than abandon it. They’ll treat generation as a capability that demands more precision in specification, not less. They’ll build evaluation systems that are harder to fool than the ones they replaced. They’ll refuse the temptation to mistake velocity for progress. ❄ ❄ ❄ ❄ ❄ There’s been much written about the dreadful events in Minnesota, and I’ve not felt I’ve had anything useful to add to them. But I do want to pass on an excellent post from Noah Smith that captures many of my thoughts. He points out that there is a “consistent record of brutality, aggression, dubious legality, and unprofessionalism” from ICE (and CBP) who seem to be turning into MAGA’s SD. Is this America now? A country where unaccountable and poorly trained government agents go door to door, arresting and beating people on pure suspicion, and shooting people who don’t obey their every order or who try to get away? “When a federal officer gives you instructions, you abide by them and then you get to keep your life” is a perfect description of an authoritarian police state. None of this is Constitutional, every bit of it is deeply antithetical to the American values we grew up taking for granted. My worries about these kinds of developments were what animated me to urge against voting for Trump in the 2016 election. Mostly those worries didn’t come to fruition because enough constitutional Republicans were in a position to stop them from happening, so even when Trump attempted a coup in 2020, he wasn’t able to get very far. But now those constitutional Republicans are absent or quiescent. I fear that what we’ve seen in Minneapolis will be a harbinger of worse to come. I also second John Gruber’s praise of bystander Caitlin Callenson: But then, after the murderous agent fired three shots — just 30 or 40 feet in front of Callenson — Callenson had the courage and conviction to stay with the scene and keep filming. 
Not to run away, but instead to follow the scene. To keep filming. To continue documenting with as best clarity as she could, what was unfolding. The recent activity in Venezuela reminds me that I’ve long felt that Trump is a Hugo Chávez figure - a charismatic populist who’s keen on wrecking institutions and norms. Trump is old, so won’t be with us for that much longer - but the question is: “who is Trump’s Maduro?” ❄ ❄ ❄ ❄ ❄ With all the drama at home, we shouldn’t ignore the terrible things that happened in Iran. The people there again suffered the consequences of an entrenched authoritarian police state.
Read more →

Build an agent into any app with the GitHub Copilot SDK

Building agentic workflows from scratch is hard. You have to manage context across turns, orchestrate tools and commands, route between models, integrate MCP servers, and think through permissions, safety boundaries, and failure modes. Even before you reach your actual product logic, you’ve already built a small platform. GitHub Copilot SDK (now in technical preview) removes that burden. It allows you to take the same Copilot agentic core that powers GitHub Copilot CLI and embed it in any application. This gives you programmatic access to the same production-tested execution loop that powers GitHub Copilot CLI. That means instead of wiring your own planner, tool loop, and runtime, you can embed that agentic loop directly into your application and build on top of it for any use case. You also get Copilot CLI’s support for multiple AI models, custom tool definitions, MCP server integration, GitHub authentication, and real-time streaming. How to get started We’re starting with support for Node.js, Python, Go, and .NET. You can use your existing GitHub Copilot subscription or bring your own key. The github/copilot-sdk repository includes: Setup instructions Starter examples SDK references for each supported language A good first step is to define a single task like updating files, running a command, or generating a structured output and letting Copilot plan and execute steps while your application supplies domain-specific tools and constraints. Here’s a short code snippet to preview how you can call the SDK in TypeScript: import { CopilotClient } from "@github/copilot-sdk"; const client = new CopilotClient(); await client.start(); const session = await client.createSession({ model: "gpt-5", }); await session.send({ prompt: "Hello, world!" }); Visit github/copilot-sdk to start building. What’s new in GitHub Copilot CLI Copilot CLI lets you plan projects or features, modify files, run commands, use custom agents, delegate tasks to the cloud, and more, all without leaving your terminal. Since we first introduced it, we’ve been expanding Copilot’s agentic workflows so it: Works the way you do with persistent memory, infinite sessions, and intelligent compaction. Helps you think with explore, plan, and review workflows where you can choose which model you want at each step. Executes on your behalf with custom agents, agent skills, full MCP support, and async task delegation. How does the SDK build on top of Copilot CLI? The SDK takes the agentic power of Copilot CLI (the planning, tool use, and multi-turn execution loop) and makes it available in your favorite programming language. This makes it possible to integrate Copilot into any environment. You can build GUIs that use AI workflows, create personal tools that level up your productivity, or run custom internal agents in your enterprise workflows. Our teams have already used it to build things like: YouTube chapter generators Custom GUIs for their agents Speech-to-command workflows to run apps on their desktops Games where you can compete with AI Summarizing tools And more! Think of the Copilot SDK as an execution platform that lets you reuse the same agentic loop behind the Copilot CLI, while GitHub handles authentication, model management, MCP servers, custom agents, and chat sessions plus streaming. That means you are in control of what gets built on top of those building blocks. Start building today! Visit the SDK repository to get started. The post Build an agent into any app with the GitHub Copilot SDK appeared first on The GitHub Blog.
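To connect that snippet to the “define a single task” advice above, here is a sketch that reuses only the calls shown in the post; the model name and prompt are placeholders of my own, not anything prescribed by the SDK:
import { CopilotClient } from "@github/copilot-sdk";
// Same calls as the snippet above, pointed at one concrete, well-scoped task.
const client = new CopilotClient();
await client.start();
const session = await client.createSession({ model: "gpt-5" });
await session.send({ prompt: "Read package.json and summarize the available npm scripts." });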
Read more →

[Sponsor] Meh

Everything sucks. The whole world’s going to shit, especially our part of it, and it can feel like anything fun or silly is sticking your head in the sand. And yet. It doesn’t help to just be miserable. If you’re going to last, you’ve got to find your little moments of joy, or as a break from the misery. Buying our crap at Meh is not how you solve the world’s problems. We’re not that crass. But maybe a minute a day of reading our little write-up, and a couple minutes of catching up with the Meh community, of making a few new online friends, and yes, of occasionally picking up a weird gadget or strange snack you’ve never heard of is just a few minutes you get to take a break, not giving in to how bad everything else is. Of course we would say that. Of course we benefit from that. But it is also part of why we have a quirky write-up. Why we have a community. Why we’re selling whatever weird thing is over at Meh today. ★
Read more →

Conversation: LLMs and the what/how loop

A conversation between Unmesh Joshi, Rebecca Parsons, and Martin Fowler on how LLMs help us shape the abstractions in our software. We view our challenge as building systems that survive change, requiring us to manage our cognitive load. We can do this by mapping the “what” of what we want our software to do into the “how” of programming languages. This “what” and “how” are built up in a feedback loop. TDD helps us operationalize that loop, and LLMs allow us to explore that loop in an informal and more fluid manner. more…
Read more →

A cheat sheet to slash commands in GitHub Copilot CLI

Do you ever feel like you’re spending more time moving between different tools than you are writing code? If you thrive in the terminal and want faster, more predictable ways to run tests, fix code, and manage context, Copilot CLI slash commands give you that control without breaking your flow. You can use slash commands to perform a variety of tasks like configuring which AI model to use or setting up an MCP server, or even sharing your session externally. Slash commands offer fast, repeatable actions without needing to craft a new prompt each time. TL;DR: See all the slash commands and what they do at the bottom of this post. 😉 What are slash commands? A slash command is a simple instruction, like /clear or /session, that tells Copilot exactly what you want to do. They are prefixed with a / and instantly trigger Copilot to carry out context-aware actions. To start using slash commands , open Copilot CLI and type / to see a list of available commands. How to use slash commands Type / in the Copilot CLI to see a list of available slash commands and their descriptions. You can also use /help to get more details about what each command does and how to use it. For instructions and examples, keep scrolling! Start here (two minutes) Open Copilot CLI Type /help to see available commands Run /clear to reset context Run /cwd to confirm Copilot is scoped to the right directory. You can jump to the sections below based on what you’re trying to do. Learn more in our docs > In addition to Copilot CLI, you can use slash commands across Copilot Chat and with agent mode, too. Why use slash commands? As developers, we want tools that work fast in the terminal. Slash commands in Copilot CLI do just that. Instead of writing a new prompt for each task, you use quick, explicit, and repeatable commands directly in your workflow. In practice, they help with: Speed and predictability: With slash commands, Copilot’s actions are more transparent and predictable. Unlike natural language prompts, which can be interpreted in different ways, slash commands always trigger the same response. This removes guesswork because you always know what you’re going to get, instantly. Productivity: Before slash commands, you might have copied and pasted code, written long prompts, or switched back and forth between tools. Now you can clean up errors, run tests, and get code explanations right from the CLI, without leaving your terminal. Clarity and security: Commands like /add-dir and /list-dirs give clear boundaries for file access and create an auditable trail, which is essential for teams working in sensitive environments. This eliminates uncertainty about what’s happening behind the scenes, reduces the risk of accidental data exposure, and helps teams maintain control in sensitive environments. Better accessibility: Slash commands fit seamlessly into keyboard-driven and accessible workflows. Commands like /help provide an instant overview of available actions, while /list-dirs or /list-files let users browse without navigating complex interfaces. These commands enable users who rely on keyboard shortcuts or assistive technologies to quickly discover and use Copilot features. Trust and compliance: Slash commands enhance trust by making every Copilot action explicit and traceable. For example, teams can use /add-dir to grant Copilot access to a specific directory. This ensures that sensitive files stay protected. With slash commands like /session or /usage, teams can manage tool access, monitor activity, and stay compliant. 
Custom workflows and extensibility: As support for slash commands expands, you can tailor Copilot to work with your own tasks and automations. Delegate pull requests, switch agents, or connect to CI/CD pipelines, all from the CLI, with commands like /delegate, /agent, and /mcp. Think of slash commands as explicit shortcuts for things you already do. There’s a lot you can do with Copilot CLI, and slash commands make the process easier. Useful Copilot CLI slash commands for your everyday workflow Below are the most commonly used slash commands, grouped by what you typically need to control in your workflows: context, scope, configuration, and collaboration. 💡 Tip: If you only remember three commands, start with /clear, /cwd, and /model. These give you immediate control over context, scope, and output quality. Session management commands /clear: Delete the current session’s conversation history. Copilot accumulates context as you work. This inherited context can muddy suggestions when you have too much of it, or when you’re trying to switch tasks. /clear lets you quickly wipe the slate when you’re multitasking or working between projects. When to use: Switching to a new task or repository Copilot responses are referencing old files or earlier conversations You want to avoid context bleed between projects /exit, /quit: Exit the CLI. The commands /exit and /quit provide a direct way to end your session and disconnect from Copilot, ensuring resource cleanup and a clear boundary for session-based work. When to use: Wrapping up your session Logging out of a shared terminal /session, /usage: Display session usage metrics about the current CLI session. These commands give visibility into the actions Copilot has performed during your session, helping with audits, troubleshooting, and resource tracking. When to use: Auditing team/individual Copilot CLI usage Reviewing model or tool usage during a session Debugging runs or model use When you run either the /session or /usage commands, Copilot shows output similar to the following, displaying usage metrics about your session: Session ID: 221b5571-3998-47e1-b57a-552cf9078947 Started: 11/24/2025, 11:18:54 AM Last Modified: 11/24/2025, 11:18:54 AM Duration: 50s Working Directory: /Users/jacklynlee31 Usage: Total usage est: 0 Premium requests Total duration (API): 0s Total duration (wall): 50s Total code changes: 0 lines added, 0 lines removed Hit Enter or Esc to continue Directory and file access commands /add-dir: Allow Copilot to access a directory. By limiting Copilot’s access to the files you choose, you can ensure responses are relevant to your current scope and increase security. When to use: Scoping Copilot to a specific repository or subdirectory Navigating large codebases with sensitive files /add-dir <directory> For example, here I am adding the Documents directory to the allowed list for file access: /add-dir /Users/jacklynlee31/Documents Copilot then gives me the following output: Added directory to allowed list: /Users/jacklynlee31/Documents /list-dirs: Show allowed directories. This command helps keep file access transparent. This can help with team compliance policies. When to use: Verifying Copilot’s scope Troubleshooting access issues Reviewing permissions before running commands /list-dirs After running the command, Copilot will show you the list of directories. For example: Allowed directories for file access: 1. /Users/jacklynlee31 2. /Users/jacklynlee31/Documents Total: 2 directories /cwd: Show or change the working directory. 
This keeps Copilot focused on the part of your codebase you’re actively working in. When to use: Navigating complex project trees Switching between repositories Narrowing context for better suggestions /cwd For example, after using the command, Copilot gave me the following output: Current working directory: /Users/jacklynlee31/Downloads When using /cwd [directory], you are able to switch to a different directory: /cwd /Users/jacklynlee31/Downloads Copilot will give you a similar output to show the new working directory path: Changed working directory to: /Users/jacklynlee31/Downloads Configuration commands /model: Select an AI model. Copilot supports multiple models, but you don’t need to overthink it. Start with the default model, then experiment when you notice differences in speed, reasoning depth, or cost. When to use: Comparing outputs Testing new or preview models Troubleshooting unexpected responses /model After running the command, Copilot will display an interactive model selection menu similar to the following: Choose the AI model to use for Copilot CLI. The selected model will be persisted and used for future sessions. ❯ 1. Claude Sonnet 4.5 (1x) (default) (current) 2. Claude Opus 4.5 (Preview) (1x) 3. Claude Haiku 4.5 (0.33x) 4. Claude Sonnet 4 (1x) 5. GPT-5.1 (1x) 6. GPT-5.1-Codex-Mini (0.33x) 7. GPT-5.1-Codex (1x) 8. GPT-5 (1x) 9. GPT-5-Mini (0x) 10. GPT-4.1 (0x) 11. Gemini 3 Pro (Preview) (1x) 12. Cancel (Esc) You can select a model from the list with the number or arrow keys and press Enter. You can also use /model [model] to directly change the AI model. /theme [show|set|list] [auto|dark|light]: Configure the terminal theme. Show: shows the current theme preference. Set: used to set the terminal theme to auto, dark, or light. List: shows a list of available themes. When to use: Improving readability Matching team or environment standards /theme set dark After setting the theme, Copilot will confirm your preference and prompt you to restart the CLI to apply the new theme: ● Theme preference set to: dark The new theme will be applied on the next restart of the CLI. /terminal-setup: Enable multiline inputs. This is especially helpful for complex instructions or multi step code changes. This command ensures your terminal is ready for advanced tasks and collaborative workflows. When to use: Writing longer prompts Performing large refactors or reviews Improving prompt formatting during large code edits /terminal-setup /reset-allowed-tools: Reset tool permissions. This command helps you quickly roll back the allowed tools set to a clean slate, removing obsolete or risky items. When to use: After team or role changes Cleaning up after demos or experiments /reset-allowed-tools After using the command, Copilot will show a confirmation: The list of allowed tools has been reset. External services commands /agent: Select a custom agent. Custom agents let you target specialized tasks or integrations. When to use: Switching agent configurations by repository/org/project Testing specialized or third-party agents /delegate <prompt>: Create an AI-generated pull request. This lets you automate changes and create pull requests without leaving the terminal. When to use: Applying changes across multiple repositories Kicking off reviewable work quickly For example, here I generated a pull request in my repository to add dark mode support: /delegate Add dark mode support /share [file|gist] [path]: Export your session. 
Documentation is critical—and this command lets you capture entire session histories to share or archive. When to use: Async handoffs Documenting decisions or experiments Attaching context to issues or pull requests /share file /Users/jacklynlee31/Desktop After sharing the file, Copilot will confirm that the session was shared successfully to your chosen location: ● Session shared successfully to: /Users/jacklynlee31/Desktop/copilot-session-221b5571-3998-47e1-b57a-552cf9078947.md /login, /logout: Log in or out of Copilot. When to use: Rotating credentials Switching accounts on a shared device /mcp [show|add|edit|delete|disable|enable]: Manage MCP configurations. Managing MCP server configuration directly from the terminal means you don’t have to switch between tools or interfaces. Show: show the list of available MCP servers. Add: add a new MCP server. Edit: edit an existing MCP server. Delete: delete a MCP server. Disable: disable a MCP server. Enable: enable a MCP server. /user [show|list|switch]: Manage what GitHub account you’re using. Multi-user and enterprise development often means switching between accounts. /user can help you with your role-based workflows and testing. When to use: Multi-user machines Managing service accounts vs. personal accounts Rotating between organizations /user show /user list /user switch /help: Show all available commands. When to use: Discovering new features Quick reference while using the CLI Bringing it all together With slash commands in Copilot CLI, you can make common workflow tasks fast and repeatable. You’re gaining explicit control over context, scope, and automation without leaving the terminal. The best way to experience this is to dive in and try slash commands yourself. Start with /clear, /cwd, and /help. Then layer in others as your workflows grow. As slash command capabilities grow, your feedback helps us shape what comes next. Use /feedback to share what’s working, and what isn’t. 
Quick reference
/clear: Clears session history/context. When to use: shift tasks, reset Copilot’s context, resolve confusion.
/exit, /quit: Exits the Copilot session. When to use: finish a session, reset the CLI.
/session, /usage: Shows current session and usage stats. When to use: audit activity, monitor Copilot CLI usage.
/add-dir <directory>: Adds allowed directory for file access. When to use: limit scope, improve security/auditing.
/list-dirs: Lists directories Copilot can access. When to use: confirm or manage file access permissions.
/cwd [directory]: Changes/outputs the working directory. When to use: navigate projects, limit Copilot context.
/model [model]: Changes Copilot AI model for the CLI. When to use: experiment, troubleshoot, optimize model behavior.
/theme [show|set|list]: Manage terminal output theme. When to use: customize for environment or team standards.
/reset-allowed-tools: Resets allowed external tools. When to use: remove tool permissions, reset for audits.
/agent: Selects a custom Copilot agent. When to use: using specialized agents by repo/org.
/delegate <prompt>: Delegates changes as a PR in a remote repository. When to use: automate changes, multi-repo workflows.
/share [file|gist]: Shares session as markdown or GitHub Gist. When to use: document sessions, async handoff, team sharing.
/login, /logout: Sign in/out of Copilot in the CLI. When to use: change user, rotate credentials.
/mcp [show|add|edit|...]: MCP server configuration management. When to use: update CI/CD proxy config, enterprise setups.
/user [show|list|switch]: GitHub user management. When to use: multi-user or team CLI management.
/help: Lists all CLI commands and shortcuts. When to use: onboarding, discoverability.
/feedback: Submit feedback about Copilot CLI. When to use: share suggestions or bug reports with GitHub.
Try Copilot slash commands in GitHub Copilot CLI and speed up your workflow. Install Copilot CLI or read the docs to get started. Additional resources: Copilot feature page, Copilot CLI, Copilot Chat Cookbook. The post A cheat sheet to slash commands in GitHub Copilot CLI appeared first on The GitHub Blog.
Read more →

AI-supported vulnerability triage with the GitHub Security Lab Taskflow Agent

Triaging security alerts is often very repetitive because false positives are caused by patterns that are obvious to a human auditor but difficult to encode as a formal code pattern. But large language models (LLMs) excel at matching the fuzzy patterns that traditional tools struggle with, so we at the GitHub Security Lab have been experimenting with using them to triage alerts. We are using our recently announced GitHub Security Lab Taskflow Agent AI framework to do this and are finding it to be very effective. 💡 Learn more about it and see how to activate the agent in our previous blog post. In this blog post, we’ll introduce these triage taskflows, showcase results, and share tips on how you can develop your own—for triage or other security research workflows. By using the taskflows described in this post, we quickly triaged a large number of code scanning alerts and discovered many (~30) real-world vulnerabilities since August, many of which have already been fixed and published. When triaging the alerts, the LLMs were only given tools to perform basic file fetching and searching. We have not used any static or dynamic code analysis tools other than to generate alerts from CodeQL. While this blog post showcases how we used LLM taskflows to triage CodeQL queries, the general process creates automation using LLMs and taskflows. Your process will be a good candidate for this if: You have a task that involves many repetitive steps, and each one has a clear and well-defined goal. Some of those steps involve looking for logic or semantics in code that are not easy for conventional programming to identify, but are fairly easy for a human auditor to identify. Trying to identify them often results in many monkey patching heuristics, badly written regexp, etc. (These are potential sweet spots for LLM automation!) If your project meets those criteria, then you can create taskflows to automate these sweet spots using LLMs, and use MCP servers to perform tasks that are well suited for conventional programming. Both the seclab-taskflow-agent and seclab-taskflows repos are open source, allowing anyone to develop LLM taskflows to perform similar tasks. At the end of this blog post, we’ll also give some development tips that we’ve found useful. Introduction to taskflows Taskflows are YAML files that describe a series of tasks that we want to do with an LLM. In this way, we can write prompts to complete different tasks and have tasks that depend on each other. The seclab-taskflow-agent framework takes care of running the tasks one after another and passing the results from one task to the next. For example, when auditing CodeQL alert results, we first want to fetch the code scanning results. Then, for each result, we may have a list of tasks that we need to check. For example, we may want to check if an alert can be reached by an untrusted attacker and whether there are authentication checks in place. These become a list of tasks we specify in a taskflow file. We use tasks instead of one big prompt because LLMs have limited context windows, and complex, multi-step tasks often are not completed properly. Some steps are frequently left out, so having a taskflow to organize the task avoids these problems. Even with LLMs that have larger context windows, we find that taskflows are useful to provide a way for us to control and debug the task, as well as to accomplish bigger and more complex tasks. The seclab-taskflow-agent can also perform a batch “for loop”-style task asynchronously. 
When we audit alerts, we often want to apply the same prompts and tasks to every alert, but with different alert details. The seclab-taskflow-agent lets us create templated prompts that iterate through the alerts and substitute the details specific to each alert when running the task.

Triaging taskflows: from a code scanning alert to a report

The GitHub Security Lab periodically runs a set of CodeQL queries against a selected set of open source repositories. The process of triaging these alerts is usually fairly repetitive, and for some alerts, the causes of false positives are fairly similar and can be spotted easily. For example, when triaging alerts for GitHub Actions, false positives often result from checks that have been put in place to make sure that only repo maintainers can trigger a vulnerable workflow, or from the vulnerable workflow being disabled in the configuration. These access control checks come in many different forms without an easily identifiable code pattern to match and are thus very difficult for a static analyzer like CodeQL to detect. However, a human auditor with general knowledge of code semantics can often identify them easily, so we expect an LLM to be able to identify these access control checks and remove false positives.

Over the course of a couple of months, we've tested our taskflows with a few CodeQL rules, using mostly Claude Sonnet 3.5, and have identified a number of real, exploitable vulnerabilities. The taskflows do not perform an "end-to-end" analysis, but rather produce a bug report with all the details and conclusions so that we can quickly verify the results. We did not instruct the LLM to validate the results by creating an exploit, nor did we provide any runtime environment for it to test its conclusions. The results, however, remain fairly accurate even without an automated validation step, and we were able to remove false positives in the CodeQL queries quickly. The rules were chosen based on our own experience of triaging these types of alerts and on whether the list of tasks could be formulated into clearly defined instructions for LLMs to consume.

General taskflow design

Taskflows generally consist of tasks that are divided into a few different stages. In the first stage, the tasks collect various bits of information relevant to the alert. This information is then passed to an auditing stage, where the LLM looks for common causes of false positives drawn from our own experience of triaging alerts. After the auditing stage, a bug report is generated using the information gathered. In the actual taskflows, the information gathering and audit stages are sometimes combined into a single task, or they may be separate tasks, depending on how complex the task is. To ensure that the generated report has sufficient information for a human auditor to make a decision, an extra step checks that the report has the correct formatting and contains the correct information. After that, a GitHub Issue is created, ready to be reviewed.

Creating a GitHub Issue not only makes it easy for us to review the results, but also provides a way to extend the analysis. After reviewing and checking the issues, we often find that there are causes of false positives that we missed during the auditing process. Also, if the agent determines that the alert is valid, but the human reviewer disagrees and finds that it's a false positive for a reason that was unknown to the agent so far, the human reviewer can document this as an alert dismissal reason or issue comment.
When the agent analyzes similar cases in the future, it will be aware of all the past analysis stored in those issues and alert dismissal reasons, incorporate this new intelligence into its knowledge base, and be more effective at detecting false positives.

Information collection

During this stage, we instruct the LLM (examples are provided in the Triage examples section below) to collect relevant information about the alert, taking into account the threat model and human knowledge of the alert in general. For example, in the case of GitHub Actions alerts, it will look at what permissions are set in the GitHub workflow file, what events trigger the GitHub workflow, whether the workflow is disabled, and so on. These generally involve independent tasks that follow simple, well-defined instructions to ensure the information collected is consistent. For example, checking whether a GitHub workflow is disabled involves making a GitHub API call via an MCP server. To ensure that the information collected is accurate and to reduce hallucination, we instruct the LLM to include precise references to the source code, with both file and line numbers, to back up the information it collected:

You should include the line number where the untrusted code is invoked, as well as the untrusted code or package manager that is invoked in the notes.

Each task then stores the information it collected in audit notes, which are a kind of running commentary on an alert. Once a task is completed, its notes are serialized to a database, and subsequent tasks append their own notes when they are done. In general, the information gathering tasks are independent of one another and do not need to read each other's notes. This helps each task focus on its own scope without being distracted by previously collected information. The end result is a "bag of information" in the form of notes associated with an alert that is then passed to the auditing tasks.

Audit issue

At this stage, the LLM goes through the information gathered and performs a list of specific checks to reject alert results that turn out to be false positives. For example, when triaging a GitHub Actions alert, we may have collected information about the events that trigger the vulnerable workflow. In the audit stage, we'll check whether these events can be triggered by an attacker and whether they run in a privileged context. After this stage, a lot of the false positives that are obvious to a human auditor will have been removed.

Decision-making and report generation

For alerts that have made it through the auditing stage, the next step is to create a bug report using the information gathered, as well as the reasoning for the decision at the audit stage. Again, in our prompt we are very precise about the format of the report and what information we need. In particular, we want it to be concise but also to include information that makes it easy for us to verify the results, with precise code references and code blocks. The generated report uses the information gathered in the notes from previous stages and only looks at the source code to fetch code snippets that are needed in the report. No further analysis is done at this stage. Again, the very strict and precise nature of the tasks reduces the amount of hallucination.
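As a rough illustration of the "bag of information" idea, and not the Security Lab's actual schema, an audit note backed by citations might be modeled like this (all type and field names are hypothetical):

```typescript
// Hypothetical sketch of the audit-notes idea described above: each
// information-gathering task appends findings backed by file:line citations,
// and the serialized bag of notes is what the auditing tasks consume.
interface AuditNote {
  task: string;        // e.g. "workflow_trigger_analysis"
  finding: string;     // e.g. "workflow is triggered by pull_request_target"
  citations: string[]; // e.g. [".github/workflows/build.yml:3"]
}

interface AlertNotes {
  alertId: number;
  repo: string;
  notes: AuditNote[];
}

// Append a task's notes and persist them so the next task can pick them up.
// The key-value store is injected; any simple database would do.
async function appendNotes(
  store: {
    load(key: string): Promise<string | null>;
    save(key: string, value: string): Promise<void>;
  },
  alertId: number,
  repo: string,
  newNotes: AuditNote[],
): Promise<void> {
  const key = `${repo}#${alertId}`;
  const existing = await store.load(key);
  const bag: AlertNotes = existing
    ? (JSON.parse(existing) as AlertNotes)
    : { alertId, repo, notes: [] };
  bag.notes.push(...newNotes);
  await store.save(key, JSON.stringify(bag));
}
```

Keeping each finding tied to explicit citations is what later lets a human (or a later task) verify the claim quickly.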
Report validation and issue creation

After the report is written, we instruct the LLM to check the report to ensure that all the relevant information is contained in it, as well as to check the consistency of that information:

Check that the report contains all the necessary information: - This criteria only applies if the workflow containing the alert is a reusable action AND has no high privileged trigger. You should check it with the relevant tools in the gh_actions toolbox. If that's not the case, ignore this criteria. In this case, check that the report contains a section that lists the vulnerable action users. If there isn't any vulnerable action users and there is no high privileged trigger, then mark the alert as invalid and using the alert_id and repo, then remove the memcache entry with the key {{ RESULT_key }}.

Missing or inconsistent information often indicates hallucinations or other causes of false positives (for example, not being able to track down an attacker-controlled input). In either case, we dismiss the report. If the report contains all the information and is consistent, then we open a GitHub Issue to track the alert.

Issue review and repo-specific knowledge

The GitHub Issue created in the previous step contains all the information needed to verify the issue, with code snippets and references to lines and files. This provides a kind of "checkpoint" and a summary of the information that we have, so that we can easily extend the analysis. In fact, after creating the issue, we often find that there are repo-specific permission checks or sanitizers that render the issue a false positive. We are able to incorporate this knowledge by creating taskflows that review these issues with repo-specific knowledge added in the prompts. One approach that we've experimented with is to collect dismissal reasons for alerts in a repo and instruct the LLM to take these dismissal reasons into account when reviewing the GitHub Issue. This allows us to remove false positives caused by reasons specific to a repo. In one such case, the LLM was able to identify an alert as a false positive after taking into account a custom check-run permission check that was recorded in the alert dismissal reasons.

Triage examples and results

In this section we'll give some examples of what these taskflows look like in practice. In particular, we'll show taskflows for triaging some GitHub Actions and JavaScript alerts.

GitHub Actions alerts

The specific Actions alerts that we triaged are checkout of untrusted code in a privileged context and code injection. The triaging of these queries shares a lot of similarities. For example, both involve checking the workflow triggering events, the permissions of the vulnerable workflow, and tracking workflow callers. In fact, the main differences involve local analysis of specific details of the vulnerabilities. For code injection, this involves whether the injected code has been sanitized, how the expression is evaluated, and whether the input is truly arbitrary (for example, a pull request ID is unlikely to cause a code injection issue). For untrusted checkout, this involves whether there is a valid code execution point after the checkout. Since many elements in these taskflows are the same, we'll use the code injection triage taskflow as an example. Note that because these taskflows have a lot in common, we made heavy use of reusable features in the seclab-taskflow-agent, such as prompts and reusable tasks.
When manually triaging GitHub Actions alerts for these rules, we commonly run into false positives because:

- The vulnerable workflow doesn't run in a privileged context. This is determined by the events that trigger the vulnerable workflow. For example, a workflow triggered by the pull_request_target event runs in a privileged context, while a workflow triggered by the pull_request event does not. This can usually be determined by simply looking at the workflow file.
- The vulnerable workflow is explicitly disabled in the repo. This can be verified easily by checking the workflow settings in the repo.
- The vulnerable workflow explicitly restricts permissions and does not use any secrets, in which case there is little privilege to gain.
- There are vulnerability-specific issues, such as invalid user input or a sanitizer in the case of code injection, or the absence of a valid code execution point in the case of untrusted checkout.
- The vulnerable workflow is a reusable workflow but is not reachable from any workflow that runs in a privileged context.

Very often, triaging these alerts involves many simple but tedious checks like the ones listed above, and an alert can be determined to be a false positive very quickly by one of the above criteria. We therefore model our triage taskflows on these criteria. So, our Actions triage taskflows consist of the following tasks during the information gathering and auditing stage:

- Workflow trigger analysis: This stage performs both information gathering and auditing. It first collects the events that trigger the vulnerable workflow, as well as the permissions and secrets that are used in the vulnerable workflow. It also checks whether the vulnerable workflow is disabled in the repo. All information is local to the vulnerable workflow itself and is stored in running notes, which are then serialized to a database entry. As the task is simple and involves only looking at the vulnerable workflow, preliminary auditing based on the workflow trigger is also performed to remove some obvious false positives.
- Code injection point analysis: This is another task that only analyzes the vulnerable workflow and combines information gathering and audit in a single task. This task collects information about the location of the code injection point and the user input that is injected. It also performs local auditing to check whether a user input is a valid injection risk and whether it has a sanitizer.
- Workflow user analysis: This performs a simple caller analysis that looks for the callers of the vulnerable workflow. As it can potentially retrieve and analyze a large number of files, this step is divided into two main tasks that perform information gathering and auditing separately. In the information gathering task, callers of the vulnerable workflow are retrieved and their trigger events, permissions, and use of secrets are recorded in the notes. This information is then used in the auditing task to determine whether the vulnerable workflow is reachable by an attacker.

Each of these tasks is applied to the alert, and at each step false positives are filtered out according to the criteria in the task. After the information gathering and audit stage, our notes will generally include information such as the events that trigger the vulnerable workflow, the permissions and secrets involved, and (in the case of a reusable workflow) the other workflows that use the vulnerable workflow, as well as their trigger events, permissions, and secrets. This information forms the basis for the bug report.
As a sanity check to ensure that the information collected so far is complete and consistent, the review_report task is used to check for missing or inconsistent information before a report is created. After that, the create_report task is used to create a bug report which will form the basis of a GitHub Issue. Before creating an issue, we double-check that the report contains the necessary information and conforms to the format that we required. Missing information or inconsistencies are likely the result of failed steps or hallucinations, and we reject those cases.

[Diagram: main components of the triage_actions_code_injection taskflow]

We then create GitHub Issues using the create_issue_actions taskflow. As mentioned before, the GitHub Issues created contain sufficient information and code references to verify the vulnerability quickly, as well as serving as a summary of the analysis so far, allowing us to continue further analysis from the issue.

[Screenshot: example of a GitHub Issue created by the taskflow]

In particular, we can use GitHub Issues and alert dismissal reasons as a means to incorporate repo-specific security measures and to further the analysis. To do so, we use the review_actions_injection_issues taskflow to first collect alert dismissal reasons from the repo. These dismissal reasons are then checked against the alert stated in the GitHub Issue. In this case, we simply use the issue as the starting point and instruct the LLM to audit the issue and check whether any of the alert dismissal reasons applies to the current issue. Since the issue contains all the relevant information and code references for the alert, the LLM is able to use the issue and the alert dismissal reasons to further the analysis and discover more false positives.

[Screenshot: an alert rejected based on the dismissal reasons]

[Diagram: main components of the issue creation and review taskflows]

JavaScript alerts

Similarly to triaging Actions alerts, we also triaged, to a lesser extent, code scanning alerts for the JavaScript/TypeScript languages. In the JavaScript world, we triaged code scanning alerts for the client-side cross-site scripting CodeQL rule (js/xss). The client-side cross-site scripting alerts have more variety with regard to their sources, sinks, and data flows compared to the GitHub Actions alerts. The prompts for analyzing those XSS vulnerabilities are focused on helping the person responsible for triage make an educated decision, not on making the decision for them. This is done by highlighting the aspects that seem to make a given alert exploitable by an attacker and, more importantly, what likely prevents the exploitation of a given potential issue. Other than that, the taskflows follow a similar scheme to the one described in the GitHub Actions alerts section. While triaging XSS alerts manually, we've often identified false positives due to these reasons:

- Custom or unrecognized sanitization functions (e.g., using regex) that the SAST tool cannot verify.
- Reported sources that are likely unreachable in practice (e.g., they would require an attacker to send a message directly from the webserver).
- Untrusted data flowing into potentially dangerous sinks, whose output is then only used in a non-exploitable way.
- The SAST tool not knowing the full context where the given untrusted data ends up.

Based on these false positives, the prompts in the relevant taskflow, or even in the active personality, were extended and adjusted.
If you encounter certain false positives in a project you are auditing, it makes sense to extend the prompt so that those false positives are correctly marked (and likewise if alerts for certain sources or sinks are not considered a vulnerability in that project). In the end, after executing the taskflows triage_js_ts_client_side_xss and create_issues_js_ts, an alert results in a GitHub Issue such as the following:

[Screenshot: example GitHub Issue created for a client-side XSS alert]

While this is a sample of an alert worth following up on (which turned out to be a true positive, being exploitable by using a javascript: URL), alerts that the taskflow agent decided were false positives get their issue labelled with "FP" (for false positive):

[Screenshot: an issue labelled "FP" for an alert the agent judged to be a false positive]

Taskflows development tips

In this section we share some of our experiences from working on these taskflows and what we think is useful when developing taskflows. We hope these tips will help others create their own.

Use of a database to store intermediate state

While developing a taskflow with multiple tasks, we sometimes encounter problems in tasks that run at a later stage. These can be simple software problems, such as API call failures, MCP server bugs, prompt-related problems, token problems, or quota problems. By keeping tasks small and storing the results of each task in a database, we avoided rerunning lengthy tasks when failures happen. When a task in a taskflow fails, we simply rerun the taskflow from the failed task and reuse the results from earlier tasks that are stored in the database. Apart from saving us time when a task fails, this also helps us isolate the effects of each task and tweak each task using the database created by the previous task as a starting point.

Breaking down complex tasks into smaller tasks

When we were developing the triage taskflows, the models that we used did not handle large contexts and complex tasks very well. When trying to perform complex and multiple tasks within the same context, we often ran into problems such as tasks being skipped or instructions not being followed. To counter that, we divided tasks into smaller, independent tasks, each starting with a fresh new context. This helped reduce the context window size and alleviated many of the problems that we had. One particular example is the use of templated repeat_prompt tasks, which loop over a list of tasks and start a new context for each of them. By doing this, instead of going through a list in the same prompt, we ensured that every single task was performed, while the context of each task was kept to a minimum. An added benefit is that we are able to tweak and debug the taskflows with more granularity. By having small tasks and storing the results of each task in a database, we can easily separate out part of a taskflow and run it on its own.

Delegate to MCP servers whenever possible

Initially, when checking and gathering information such as workflow triggers from the source code, we simply incorporated instructions in prompts, because we thought the LLM should be able to gather the information from the source code. While this worked most of the time, we also noticed some inconsistencies due to the non-deterministic nature of the LLM. For example, the LLM would sometimes record only a subset of the events that trigger the workflow, or it would sometimes make inconsistent conclusions about whether the trigger runs the workflow in a privileged context. Since this information gathering and these checks can easily be performed programmatically, we ended up creating tools in the MCP servers to gather the information and perform the checks.
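As a rough illustration of the kind of check that is trivial and deterministic in code (and not the Security Lab's actual tooling), extracting workflow triggers from a GitHub Actions file with the js-yaml parser might look like this; the set of "privileged" events is an illustrative subset:

```typescript
// Hypothetical MCP-style tool: parse a workflow file and report its triggers
// deterministically instead of asking the model to read the YAML.
import { load } from "js-yaml";

// Events that run workflows in a privileged context (illustrative subset).
const PRIVILEGED_TRIGGERS = new Set(["pull_request_target", "workflow_run", "issue_comment"]);

export function getWorkflowTriggers(workflowYaml: string): { triggers: string[]; privileged: boolean } {
  const doc = load(workflowYaml) as Record<string, unknown>;
  const on = doc["on"]; // the trigger section of a GitHub workflow file
  let triggers: string[] = [];
  if (typeof on === "string") triggers = [on];
  else if (Array.isArray(on)) triggers = on.map(String);
  else if (on && typeof on === "object") triggers = Object.keys(on);
  return {
    triggers,
    privileged: triggers.some((t) => PRIVILEGED_TRIGGERS.has(t)),
  };
}
```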
This led to a much more consistent outcome. By moving most of the tasks that can easily be done programmatically into MCP server tools, while leaving the more complex logical reasoning tasks, such as finding permission checks, to the LLM, we were able to leverage the power of LLMs while keeping the results consistent.

Reusable taskflows to apply tweaks across taskflows

As we were developing the triage taskflows, we realized that many tasks can be shared between different triage taskflows. To make sure that tweaks in one taskflow can be applied to the rest, and to reduce the amount of copy and paste, we needed ways to refactor the taskflows and extract reusable components. We added features like reusable tasks and prompts. Using these features allowed us to reuse components and apply changes consistently across different taskflows.

Configuring models across taskflows

As LLMs are constantly developing and new versions are released frequently, it soon became apparent that we needed a way to update model version numbers across taskflows. So we added a model configuration feature that allows us to change models across taskflows, which is useful when the model version needs updating or when we just want to experiment and rerun the taskflows with a different model.

Closing

In this post we've shown how we created taskflows for the seclab-taskflow-agent to triage code scanning alerts. By breaking down the triage into precise and specific tasks, we were able to automate many of the more repetitive tasks using LLMs. By setting out clear and precise criteria in the prompts and asking the LLM for precise answers that include code references, the LLM was able to perform the tasks as instructed while keeping the amount of hallucination to a minimum. This allows us to leverage the power of LLMs to triage alerts and greatly reduces the number of false positives, without the need to validate the alerts dynamically. As a result, we were able to discover ~30 real-world vulnerabilities from CodeQL alerts after running the triaging taskflows. The discussed taskflows are published in our repo, and we're looking forward to seeing what you're going to build using them! More recently, we've also done some further experiments in the area of AI-assisted code auditing and vulnerability hunting, so stay tuned for what's to come!

Get the guide to setting up the GitHub Security Lab Taskflow Agent >

Disclaimers: When we use these taskflows to report vulnerabilities, our researchers carefully review all generated output before sending the report. We strongly recommend you do the same. Note that running the taskflows can result in many tool calls, which can easily consume a large amount of quota. The taskflows may create GitHub Issues. Please be considerate and seek the repo owner's consent before running them on somebody else's repo.

The post AI-supported vulnerability triage with the GitHub Security Lab Taskflow Agent appeared first on The GitHub Blog.
Read more →

Context windows, Plan agent, and TDD: What I learned building a countdown app with GitHub Copilot

In our last Rubber Duck Thursdays stream of 2025, I wanted to build something celebratory. Something that captures what Rubber Duck Thursdays is all about: building together, learning from mistakes, and celebrating everyone who tunes in from across the world. Along the way, I picked up practical patterns for working with AI that you can apply to your own projects, whether you're building a countdown app or something entirely different: managing context windows to avoid cluttered conversations, using the Plan agent for requirement discovery, and catching edge cases through test-driven development with Copilot. And… why world maps are harder than they look. 👀 See the full stream below. 👇

Starting simple: The basic countdown

Countdown timers are a straightforward concept. Days count down to hours, minutes count down to seconds. But sometimes it's the simple ideas that allow us to be our most creative. I figured I'd use this as an opportunity to use Copilot in a spec- or requirements-driven approach, to build a countdown app that built anticipation and displayed fireworks as it turned to the new year.

💡 What is spec-driven development? Instead of coding first and writing docs later, spec-driven development, you guessed it, starts with a spec. This is a contract for how your code should behave and becomes the source of truth your tools and AI agents use to generate, test, and validate code. The result is less guesswork, fewer surprises, and higher-quality code. Get started with our open source Spec Kit >

Fortunately, software development is an iterative process, and this livestream embraced that fully. While some requirements were well-defined, others evolved in real time, shaped by suggestions from our livestream audience. Custom agents like the Plan agent helped bridge the gap, turning ambiguous ideas into structured plans I could act on.

So let's start at the very beginning: setting up the project. I generated a new workspace with GitHub Copilot, using a very specific prompt. The prompt explained that we're building a countdown app and that I wanted to use Vite, TypeScript, and Tailwind CSS v4. It also spelled out some of the requirements, including the dark theme, centred layout, large bold digits with subtle animation, and a default target of midnight on January 1, 2026, with some room for customization.

#new 1. Create a new workspace for a New Year countdown app using Vite, TypeScript, and Tailwind CSS v4.
**Setup requirements:**
- Use the @tailwindcss/vite plugin (Tailwind v4 style)
- Dark theme by default (zinc-900 background)
- Centered layout with the countdown as the hero element

**Countdown functionality:** Create a `countdown.ts` module with:
- A `CountdownTarget` type that has `{ name: string, date: Date }` so we can later customize what we're counting down to
- A `getTimeRemaining(target: Date)` function returning `{ days, hours, minutes, seconds, total }`
- A `formatTimeUnit(n: number)` helper that zero-pads to 2 digits
- Default target: midnight on January 1st of NEXT year (calculate dynamically from current date)

**Display:**
- Large, bold countdown digits (use tabular-nums for stable width)
- Labels under each unit (Days, Hours, Minutes, Seconds)
- Subtle animation when digits change (CSS transition)
- Below the countdown, show: "until [target.name]" (e.g., "until 2026")

**Architecture:**
- `src/countdown.ts` - pure logic, no DOM
- `src/main.ts` - sets up the interval and updates the DOM
- Use `requestAnimationFrame` or `setInterval` at 1 second intervals
- Export types so they're reusable

Keep it simple and clean—this is the foundation we'll build themes on top of.

What I love about the "generate new workspace" feature is that Copilot generated custom instruction files for me, automatically capturing my requirements, including the countdown app, Vite, TypeScript, and dark theme. It was all documented before writing a single line of code. Within minutes, I had a working countdown: days, hours, minutes, and seconds ticking down to 2026. While it worked, it wasn't visually exciting. In fairness, I hadn't specified any design or theme preferences in my initial prompt. So it was time to iterate and make it more interesting.

The community suggestion that steered our course

During the stream, viewers were joining from India, Nigeria, Italy, the United States (the list goes on!); developers from around the world, coming together to learn. One person in the chat made a suggestion that adjusted what we'd do next: what about time zones? It wasn't a requirement I'd expected to work on during the stream, so I didn't have a clear plan of how it would work. Maybe there could be a globe you spin to select time zones. Maybe a world map with a time travel theme. That's a lot of maybes. My requirements were vague, which was where I turned to the Plan agent.

Plan agent: The questions I hadn't thought to ask

I've been using the Plan agent more deliberately lately, especially when I feel that my requirements aren't fully defined. The Plan agent doesn't just create a plan based on my initial prompt; it asks clarifying questions that can reveal edge cases you may not have considered. I gave it my rough idea: interactive time zone selector, time travel theme, animate between zones, maybe a world map. The Plan agent came back with questions that made me think:

- Should the circular dial be primary with the world map as secondary, or vice versa? (Why it mattered: I hadn't decided the visual hierarchy.)
- What happens on mobile: dropdown fallback or touch-friendly scroll? (Why it mattered: I was only thinking of a desktop implementation for this initial version. Mobile could be a future requirement.)
- When a time zone passes midnight, show "already celebrating" with confetti, or a timer showing how long since midnight? (Why it mattered: I wanted the celebration, not a reverse countdown. I wasn't clear on my requirements.)
I wasn’t clear on my requirements.Would there be subtle audio feedback when spinning the dial, or visual only?Bringing audio into the app was scope creep, but it could be a future requirement. This is the beauty of working with AI in this way. The Plan agent makes you think, potentially asking a clarifying question and offering options A or B. But as you reflect, you realize the answer is somewhere in between. For example, in my second iteration of requirements, the plan asked whether fireworks should run continuously, burst once, or loop subtly. I replied that there’s probably a performance consideration, and we should opt for somewhere in the middle. We also asked the livestream viewers to vote on whether we should implement the component as a dial or map. Map won, so we pivoted to a world map as the primary selector with eight featured locations. Context window management: Just keep what you need Before implementing, I deliberately started a new chat session. The context from our previous conversation (workspace creation, basic countdown logic) wasn’t needed anymore. And any context that might have been useful was now included in our custom instructions file. When working with AI tools, that context window is precious. Bringing in irrelevant history clutters the conversation and dilutes focus. So I cleared it, bringing only what mattered: the new requirements, the Plan agent output (which I’d asked Copilot to write to a separate Markdown file), and fresh focus on time zones. I also reused some custom instruction files, custom agents, and prompt files from another personal project to help steer Copilot in the right direction, and incorporate specialized agents for relevant tasks. This included a UI Performance Specialist agent. 💡 Did you know? GitHub Copilot’s custom agents let you create specialised personas for different development tasks. The UI Performance Specialist agent that I built during the stream is just one example. You can create agents for security reviews, architecture planning, or any role-specific workflow. The awesome-copilot repository has a number of examples. Implementation: Modular, test-driven, and that map With the Plan agent’s work complete, I switched to my UI Performance Specialist agent and asked it to review the plan, suggesting deeper implementation details based on its expertise. Context is important here, so I didn’t create a new conversation. Instead, I continued the existing one. The agent came back with a detailed set of considerations: Frame time budgets for animations Map SVG size optimisation strategies Celebration particle limits (DOM element concerns) and cleanup considerations Animation property recommendations (transform/opacity only) Reduced motion support It looked good, but I added a couple of additional requirements. I asked the custom agent to make the implementation modular, to write the tests first based on expected behaviour, and once it had failing tests, to write the implementation. That’s right: test-driven development with Copilot. The TDD Cycle Copilot created test files for time zone utilities, city state management, and the countdown logic. All failing tests in a red state. Good (one of the few times where we want to see failing tests)! Then it implemented: Time zone utilities using the Intl.DateTimeFormat API City state with featured locations (New York, London, Tokyo, Sydney, etc.) localStorage persistence for selected time zones App state management With access to tools, the custom agent also executed tests in the terminal. 
Two test cases failed: the logic that determined whether the celebration was triggered correctly across the year rollover. The tests expected celebrations to be handled at midnight, along with the duration since the celebrations began. Since Copilot had access to the output, the custom agent caught the test failures, adjusted the time zone implementation, and the tests went green.

💡 Thought: This is exactly why TDD and thinking about code quality matters. Just like us developers, AI-assisted development can get things wrong. Tests help us catch bugs before users do. The year rollover edge case would have been embarrassing to discover on December 31, given that it was the core capability of the app!

But some bugs turn into features. I found one bug too funny to fix immediately. Let's talk about the world map.

The World map, maybe?

When I opened the app, the countdown worked. The time zone selector worked. The calculations were correct, and switching from New York to Tokyo showed the proper time difference. But the world map? It didn't quite render as expected. What appeared on screen was more abstract art than geography. But it really made me laugh on stream.

💡 Thought: I was ambitious specifying a world map without providing enough context. No SVG asset, no reference to an existing mapping library. Just "add a mini world map." A reminder that AI can get things wrong.

Could I have fixed it? Absolutely. But we were over an hour into the stream, and had more features to build. So I left it. The map was a perfect example of iterative development where things don't always go right the first time. (Can you tell that we build things live yet?)

Fireworks: Building anticipation toward midnight

A countdown on its own is functional, but fireworks add celebration and give some visual flare (see what I did there?). I switched back to the Plan agent and created a new chat thread (again, context window management), prompting Copilot to build out a plan:

- Use Fireworks.js for the effects
- Set the fireworks behaviour based on time remaining
- If the timer has more than 24 hours left, don't display fireworks, just show ambient stars
- If the timer has between 24 and 12 hours remaining, set off fireworks every 30 seconds
- Between one hour and 10 minutes remaining, the intensity of the fireworks should build
- And finally, in the last 10 seconds we should have continuous fireworks for maximum celebration

I also asked for a skyline silhouette at the bottom of the screen, a dark night sky gradient, and a theme controller. Plus, one critical testing requirement: "Add a query parameter so I can specify how many minutes away we are from midnight as an override for manual testing." While I enjoy streaming with our community, I'm not sure that everyone would have enjoyed hanging around until the turn of 2026 to see the results!

The Plan agent asked for further clarification on how to display the stars (either setting them as CSS, or setting them as low-intensity fireworks), as well as some considerations around performance. It also asked about toggle placement, which caught me out. I didn't remember asking for a toggle button and may have missed that in an iteration of the plan. After carefully reviewing the plan, I found that I had indeed originally requested an animation toggle for accessibility. This is why I like the Plan agent. It's rubber ducking with AI that has the context of your conversation, and it can check whether those requirements still make sense.
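Before moving on, here's a rough sketch of the kind of intensity schedule those requirements imply, plus the ?minutesToMidnight testing override. The thresholds follow the tiers listed above; the function names, the return shape, and how the gaps between tiers are filled are my own assumptions, not the stream's actual code:

```typescript
type FireworksMode = "stars-only" | "every-30s" | "building" | "continuous";

// Map time remaining onto the stream's stated tiers. Where the spec leaves a
// gap (e.g. between 12 hours and 1 hour), the nearest tier is reused.
export function fireworksMode(msRemaining: number): FireworksMode {
  const seconds = msRemaining / 1000;
  if (seconds <= 10) return "continuous";       // last 10 seconds: continuous fireworks
  if (seconds <= 60 * 60) return "building";    // final hour: intensity builds
  if (seconds <= 24 * 3600) return "every-30s"; // within a day: a burst every 30 seconds
  return "stars-only";                          // more than 24 hours out: ambient stars only
}

// Manual-testing override: ?minutesToMidnight=1 pretends midnight is a minute away.
export function effectiveMsRemaining(
  realMsRemaining: number,
  search: string = window.location.search,
): number {
  const override = new URLSearchParams(search).get("minutesToMidnight");
  return override !== null ? Number(override) * 60_000 : realMsRemaining;
}
```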
Once Copilot and I renegotiated the requirements, we used that familiar test-driven development approach. One test failed initially, as the JSDOM environment setup was missing. Copilot spotted the failure, identified the misconfigured test setup, and made the fix. After that, all tests went green. We now had an app with fireworks at different intensity levels, an animated starfield using CSS, a city skyline, reduced motion support, and a query parameter override.

Testing the Intensity Levels

I added ?minutesToMidnight=1 to the URL. Fireworks appeared with medium intensity, building excitement with increasing numbers of colors and particles across the sky. At "midnight," Happy New Year appeared with even more celebration. The intensity curve felt right: the buildup created anticipation and the finale delivered.

Reveal: What I built that morning

But I didn't stop there. Throughout the stream, I'd been teasing that I'd made another countdown app earlier that morning, something with a very relevant theme. Our viewers guessed another fireworks countdown, a confetti timer, and even an "elicitation-powered tic-tac-toe" (which, to be fair, we have built before). But as a GitHub stream, there was only one way that we could finish it off. We had to have a contribution graph themed countdown!

The countdown sat in the centre in front of an animated contribution graph. Each square flickered with green contributions appearing and disappearing across the grid in waves. And just like the fireworks theme, as the countdown ticked closer to zero, more squares lit up and the intensity built.

This stream was a celebration. A way to bring our community together across time zones, all of us building and counting down to the same moment in our own corners of the world. During the stream, someone asked about the best programming languages for landing jobs. My answer was the same as my approach to this project: find the thing that brings you joy, and then the right tools and languages just fall into place. I built this GitHub countdown theme because it brought me joy. Because I wanted to make something "GitHubby," and because I enjoy building visual experiences.

Since that stream, I've worked on bringing these two projects into a unified open source countdown app, Timestamp. It has a centralized theme orchestrator, allowing developers to plug into a common architecture and extend it with new themes. Every countdown is a URL, so it can be easily shared, and there are several countdown modes to choose from (local time, absolute moments, and timers). You can check out the live app and review the codebase. You're welcome to take a look at the repository, star it, fork it, and even contribute a new theme. I hope this inspires you to build that one project that has been on the backlog, and spend some time on the thing that brings you a little bit of joy.

What have we learned?

Context window management is a skill. Start new chat sessions when old context isn't needed. Keep conversations focused. It's context engineering, not just prompt engineering.

The Plan agent asks questions you may have forgotten. Use it when requirements are vague. Let it reveal edge cases through clarifying questions. Sometimes the answer to A or B is "somewhere in the middle."

Custom agents are specialised helpers. My UI Performance Specialist had expertise in frame budgets, animation properties, and accessibility. It gave implementation details while the Plan agent helped ask clarifying questions to determine the scope. Specialisation matters.
TDD with Copilot works. Write tests first. Let them fail. Implement to pass. Just like us developers, AI-assisted tools produce bugs. We need to use those same practices that we're used to for checking quality (builds, linters, and tests) to catch issues before users do.

Things won't always work the first time. That's okay. The world map didn't render as expected, and I left it that way until my significant refactor and rebuild of the countdown app. Authentic development means showing the messy middle, not just polished outcomes. We learn from unexpected results as much as from successes.

Scope ambitiously, implement iteratively. We went from a basic countdown, to time zones, to intense fireworks, to a separate contribution graph themed countdown. Rome wasn't built in a day, and you don't need everything on day one.

What will you build in 2026? Drop by the next Rubber Duck Thursdays stream at 10:30 a.m. UK time and 2:00 p.m. Eastern time, and let's build something that brings us joy: that project that hasn't quite reached the top of the "some day" list!

The post Context windows, Plan agent, and TDD: What I learned building a countdown app with GitHub Copilot appeared first on The GitHub Blog.
Read more →

★ Crazy People Do Crazy Things

Donald Trump, in a message (I wouldn't call it a letter) sent to Norwegian Prime Minister Jonas Gahr Støre, confirmed by several news organizations:

Dear Jonas: Considering your Country decided not to give me the Nobel Peace Prize for having stopped 8 Wars PLUS, I no longer feel an obligation to think purely of Peace, although it will always be predominant, but can now think about what is good and proper for the United States of America. Denmark cannot protect that land from Russia or China, and why do they have a "right of ownership" anyway? There are no written documents, it's only that a boat landed there hundreds of years ago, but we had boats landing there, also. I have done more for NATO than any other person since its founding, and now, NATO should do something for the United States. The World is not secure unless we have Complete and Total Control of Greenland. Thank you! President DJT

There's a simple explanation for this. Trump is in cognitive decline and it's accelerating from age-related dementia. He lives in an imaginary world that is increasingly cleaved from reality. (Norway, it should be pointed out, is not Denmark, the country of which Greenland is a part.)

Trump's Venezuela operation was brazenly illegal. But it wasn't crazy. Venezuela was not a U.S. ally. President Nicolas Maduro lost an election but stayed in power. Venezuela was producing military drones for the hostile regime in Iran, a self-declared enemy of the U.S., NATO, and Israel. Venezuela had a burgeoning alliance with China, the U.S.'s primary geopolitical rival. What Trump is threatening with Greenland is simply bonkers. Greenland is under no threat from China or Russia because it's part of NATO, and thus — ostensibly — under the full protection of the entire NATO alliance including and especially the United States. If China or Russia attempted to take Greenland it would trigger a world war led by the United States.

Compare and contrast with Ukraine and Taiwan. Ukraine, long before Vladimir Putin invaded, was known to be under threat of Russian invasion. Taiwan has long been known to be threatened by China. These threats have been in our geopolitical discourse for decades because the threats were real (and, unfortunately, came to pass in Ukraine). No one has ever talked about Greenland being under threat of takeover by Russia or China because there is no such threat. It's no more realistic than Russia taking over Alaska or China taking over Hawaii. It sounds nuts because it is nuts, and the threat only exists in Trump's disintegrating mind.

Eight of our NATO allies have made clear, through action, not mere words, their intention to defend Greenland. Trump, obviously angry that our ostensible allies won't just roll over and accede to his madness, is now petulantly turning to his favorite word, tariffs. If that's "the hard way", that's pathetic. Stand up to bullies and they usually fold.

The threat to Greenland, and thus to NATO — and thus, quite literally, to the entire world — is not that Trump authorized an illegal military operation in Venezuela, so he might do it in Greenland too. Again, what the U.S. did in Venezuela was obviously illegal, and probably stupid, but it wasn't crazy. Breaking up NATO and starting a war with Europe would be batshit crazy. The threat is that Trump is showing us, every day, that he is crazy. Crazy people do crazy things, and crazy cult leaders surround themselves with cultists. The rest of us need to stop sane-washing this. You cannot make sense out of nonsense.
If Trump declares that the U.S. is laying claim to all of the green cheese on the moon — say, to lower the price of dairy groceries — the news media should not respond with fact-finding articles with headlines like “How Much Cheese Is on the Moon?” They should respond with headlines like “How Many Marbles Are Left in Trump’s Dementia-Addled Head?” But threatening to take Greenland by military force is nuttier than laying claim to the moon’s cheese. Laying claim to non-existent green cheese wouldn’t trigger a shooting war that blows apart the most powerful alliance in military history.
Read more →

Building an agentic memory system for GitHub Copilot

Our vision is to evolve GitHub Copilot into an ecosystem of agents that collaborate across the entire development lifecycle from coding and code review to security, debugging, deployment, and maintenance. To unlock the full potential of multi-agent workflows, we need to move beyond isolated interactions—that start from scratch each session—and toward a cumulative knowledge base that grows with every use. Cross-agent memory allows agents to remember and learn from experiences across your development workflow, without relying on explicit user instructions. Each interaction teaches Copilot more about your codebase and conventions, making it increasingly effective over time.

For example, if Copilot coding agent learns how your repository handles database connections as it's fixing a security vulnerability, Copilot code review can then use that knowledge to spot inconsistent patterns in future pull requests. Or if Copilot code review notices that certain files must stay synchronized, in the future Copilot coding agent will automatically update them together when generating new code.

Where memory works today in GitHub Copilot (public preview)

Copilot's new memory system is available in public preview, starting with Copilot coding agent, Copilot CLI, and Copilot code review for all paid Copilot plans, with other agents to follow shortly (learn about how it works in our Docs). It's off by default and fully opt-in, so you decide when and where Copilot should start learning from your workflows. You can turn on memory in your GitHub Copilot settings. Learn how to enable memory in our Docs >

The challenge: What to remember and when to forget

Our agents continuously improve at extracting the context needed for specific tasks. The core challenge for memory systems isn't about information retrieval, but ensuring that any stored knowledge remains valid as code evolves across branches and time. In practice, this means a memory system must handle changes to code, abandoned branches, and conflicting observations—all while ensuring that agents only act on information that's relevant to the current task and code state. For example, a logging convention observed in one branch may later be modified, superseded, or never merged at all.

One option would be to implement an offline curation service to deduplicate, resolve conflicts, track branch status, and expire stale information. At GitHub's scale, however, such an approach would introduce significant engineering complexity and LLM costs, while still requiring mechanisms to reconcile changes at read time. We started by exploring a simpler, more efficient approach.

Our solution: just-in-time verification

Information retrieval is an asymmetrical problem: It's hard to solve, but easy to verify. By using real-time verification, we gain the power of pre-stored memories while avoiding the risk of outdated or misleading information. Instead of offline memory curation, we store memories with citations: references to specific code locations that support each fact. When an agent encounters a stored memory, it verifies the citations in real-time, validating that the information is accurate and relevant to the current branch before using it. This verification boils down to a small number of simple read operations, adding no significant latency to agent sessions in our testing.

Memory creation as a tool call

We implemented memory creation as a tool that agents can invoke when they discover something that's likely to have actionable implications for future tasks.
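As a rough illustration of the memory-with-citations idea (the post walks through a concrete example next), the record an agent stores and the just-in-time citation check might look something like the sketch below. The types, helper names, and the heuristic check are invented for illustration; the actual verification is carried out by the agent itself, prompted to read the cited locations:

```typescript
// Hypothetical shapes only, not GitHub's actual API.
interface Memory {
  subject: string;
  fact: string;
  citations: string[]; // e.g. "path/to/file.ts:12"
  reason: string;
}

// Cheap pre-check before the agent reads the cited code: every citation must
// point to an existing file, and the cited line should still be non-empty.
async function citationsStillResolve(
  memory: Memory,
  readFile: (path: string) => Promise<string | null>,
): Promise<boolean> {
  for (const citation of memory.citations) {
    const [path, lineStr] = citation.split(":");
    const contents = await readFile(path);
    if (contents === null) return false; // cited file no longer exists
    const line = contents.split("\n")[Number(lineStr) - 1];
    if (line === undefined || line.trim() === "") return false; // cited line gone
  }
  return true; // citations resolve; the agent can now read them and confirm the fact
}
```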
[Diagram: How Copilot agents store learnings worth remembering as they carry out their tasks.]

Consider this example: While reviewing a pull request from an experienced developer, Copilot code review discovers that API version tracking must stay synchronized across different parts of a codebase. It might encounter these three updates in the same pull request:

- In src/client/sdk/constants.ts: export const API_VERSION = "v2.1.4";
- In server/routes/api.go: const APIVersion = "v2.1.4"
- In docs/api-reference.md: Version: v2.1.4

In response, Copilot code review can invoke the memory storage tool to create a memory like this:

{
  subject: "API version synchronization",
  fact: "API version must match between client SDK, server routes, and documentation.",
  citations: ["src/client/sdk/constants.ts:12", "server/routes/api.go:8", "docs/api-reference.md:37"],
  reason: "If the API version is not kept properly synchronized, the integration can fail or exhibit subtle bugs. Remembering these locations will help ensure they are kept synchronized in future updates."
}

The result: The next time an agent updates the API version in any of these locations, it will see this memory and realize that it must update the other locations too, preventing a versioning mismatch that could break integrations. Similarly, if an inexperienced developer opens a pull request that updates only one of these locations, Copilot code review will flag the omission and suggest the missing updates, automatically transferring knowledge from a more experienced team member to a newer one. 💥

Memory usage

Retrieval

When an agent starts a new session, we retrieve the most recent memories for the target repository and include them in the prompt. Future implementations will enable additional retrieval techniques, such as a search tool and weighted prioritization.

[Diagram: How Copilot enriches agent prompts with memories from previous tasks.]

Verification

Before applying any memory, the agent is prompted to verify its accuracy and relevance by checking the cited code locations. If the code contradicts the memory, or if the citations are invalid (e.g., they point to nonexistent locations), the agent is encouraged to store a corrected version of the memory reflecting the new evidence. If the citations check out and the memory is deemed useful, the agent is encouraged to store it again in order to refresh its timestamp.

Privacy and security

It's important to note that memories are tightly scoped. Memories for a given repository can only be created in response to actions taken within that repository by contributors with write permissions, and can only be used in tasks on that same repository initiated by users with read permissions. Much like the source code itself, memories about a repository stay within that repository, ensuring privacy and security.

Cross-agent memory sharing

The full power of our memory system emerges as different Copilot agents learn from one another:

- Copilot code review discovers a logging convention while reviewing a pull request: "Log file names should follow pattern 'app-YYYYMMDD.log'. Use Winston for logging with format: timestamp, error code, user ID."
- Copilot coding agent is later assigned a task to implement a new microservice. It sees and verifies the memory and automatically applies the same logging format.
- Copilot CLI helps a developer debug an issue, efficiently retrieving the correct log file and finding the relevant timestamps based on the logging format learned by the code review agent.
Each agent contributes to and benefits from the shared knowledge base, allowing agents to reuse validated repository knowledge across tasks. As additional agents adopt memory—whether for development workflows, debugging, or security analysis—they'll contribute to and benefit from the same evolving understanding of your codebase.

Evaluation

Stress-testing agent resilience

Our biggest concern was the impact of outdated, incorrect, or even maliciously injected memories. To test the system's resilience, we deliberately seeded repositories with adversarial memories (facts that contradicted the codebase) with citations pointing to irrelevant or nonexistent code locations. Across all test cases, agents consistently verified citations, discovered contradictions, and updated incorrect memories. The memory pool self-healed as agents stored corrected versions based on their observations. The citation verification mechanism robustly prevented the risk of misleading memories.

Simulating a realistic memory pool

For each repository in our evaluation set, we ran agents on diverse historical tasks (predating our target evaluation tasks) and let them populate the memory database organically, using the "store_memory" tool we provided. To simulate worst-case conditions, we overrepresented memories from branches that were abandoned or closed without merging, ensuring realistically noisy memories. When we ran Copilot code review on the pull requests in our evaluation set, memory usage led to a 3% increase in precision and a 4% increase in recall.

Measuring impact on developers

The ultimate test of our memory system was its impact on real developers in their everyday workflows. We ran A/B tests on the first two Copilot agents to deploy memory, Copilot code review and Copilot coding agent, measuring the impact on key user metrics.

- Copilot coding agent: 7% increase in pull request merge rates (90% with memories vs. 83% without). This means developers are saving more time and getting the desired results more often when they assign tasks to Copilot.
- Copilot code review: 2% increase in positive feedback on comments (77% with memories vs. 75% without). This means automated code review is yielding improved quality assurance.

Both increases are highly statistically significant, with p-values < 0.00001. These results demonstrate that cross-agent memory delivers measurable value to developers in their daily workflows.

What's next

We've deployed repository-scoped memory storage and usage in Copilot CLI, Copilot coding agent, and Copilot code review on an opt-in basis. We're listening to user feedback and tracking performance metrics closely as we iterate and prepare for a wider rollout across more Copilot workflows. We're also exploring a range of approaches to tuning memory generation, curation, prioritization, and usage. Cross-agent memory reduces the need to re-establish context at the start of each task by allowing validated information to persist across agentic workflows. We're excited about the possibilities memory will unlock, and we're just getting started. We look forward to your feedback so we can ensure GitHub Copilot continues to evolve in ways that best support your needs. Happy coding!

Read our Docs to learn how to enable memory in Copilot >

The post Building an agentic memory system for GitHub Copilot appeared first on The GitHub Blog.
Read more →

Stop Picking Sides: Manage the Tension Between Adaptation and Optimization

Jim Highsmith notes that many teams have turned into tribes wedded exclusively to either adaptation or optimization. But he feels this misses the point that both of these are important, and we need to manage the tension between them. We can do this by thinking of two operating modes: explore (adaptation-dominant) and exploit (optimization-dominant). We tailor a team's operating model to a particular blend of the two, considering uncertainty, risk, cost of change, and an evidence threshold. We should be particularly careful at the points where there is a handoff between the two modes. more…
Read more →

My favorite musical discoveries of 2025

My favorite albums from last year. Balkan brass, an acoustic favorite of the 80s returns, Ethio-jazz, a Guatemalan singer-guitarist, jazz-rock/Indian classical fusion, and a unique male vocalist. more…
Read more →

Fragments: January 8

Anthropic report on how their AI is changing their own software development practice. Key points: most usage is for debugging and helping understand existing code; a notable increase in using it for implementing new features; developers use it for 59% of their work and report a 50% productivity increase; 14% of developers are “power users” reporting much greater gains; Claude helps developers work outside their core area; and there are concerns about changes to the profession, career evolution, and social dynamics. ❄ ❄ ❄ ❄ ❄ Much of the discussion about using LLMs for software development lacks details on workflow. Rather than just hearing people gush about how wonderful it is, I want to understand the gritty details. What kinds of interactions occur with the LLM? What decisions do the humans make? When reviewing LLM outputs, what are the humans looking for, and what corrections do they make? Obie Fernandez has written a post that goes into these kinds of details. Over the Christmas / New Year period he used Claude to build a knowledge distillation application that takes transcripts from Claude Code sessions, Slack discussions, GitHub PR threads, etc., turns them into an RDF graph database, and provides a web app with natural-language ways to query them. Not a proof of concept. Not a demo. The first cut of Nexus, a production-ready system with authentication, semantic search, an MCP server for agent access, webhook integrations for our primary SaaS platforms, comprehensive test coverage, deployed, integrated and ready for full-scale adoption at my company this coming Monday. Nearly 13,000 lines of code. The article is long, but worth the time to read it. An important feature of his workflow is relying on Test-Driven Development (a small illustrative sketch of the cycle appears at the end of this post): Here’s what made this sustainable rather than chaotic: TDD. Test-driven development. For most of the features, I insisted that Claude Code follow the red-green-refactor cycle with me. Write a failing test first. Make it pass with the simplest implementation. Then refactor while keeping tests green. This wasn’t just methodology purism. TDD served a critical function in AI-assisted development: it kept me in the loop. When you’re directing thousands of lines of code generation, you need a forcing function that makes you actually understand what’s being built. Tests are that forcing function. You can’t write a meaningful test for something you don’t understand. And you can’t verify that a test correctly captures intent without understanding the intent yourself. The account includes a major refactoring, and much evolution of the initial version of the tool. It’s also an interesting glimpse of how AI tooling may finally make RDF useful. ❄ ❄ ❄ ❄ ❄ When thinking about requirements for software, most discussions focus on prioritization. Some folks talk about buckets such as the MoSCoW set: Must, Should, Could, and Won’t. (The old joke being that, in MoSCoW, the cow is silent, because hardly any requirements end up in those buckets.) Jason Fried has a different set of buckets for interface design: Obvious, Easy, and Possible. This immediately resonates with me: a good way to think about how to allocate the cognitive costs for those who use a tool. ❄ ❄ ❄ ❄ ❄ Casey Newton explains how he followed up on an interesting story of dark patterns in food delivery, and found it to be a fake story, buttressed by AI image and document creation. On one hand, it clarifies the important role reporters play in exposing lies that get traction on the internet. 
But time taken to do this is time not spent on investigating real stories: For most of my career up until this point, the document shared with me by the whistleblower would have seemed highly credible in large part because it would have taken so long to put together. Who would take the time to put together a detailed, 18-page technical document about market dynamics just to troll a reporter? Who would go to the trouble of creating a fake badge? Today, though, the report can be generated within minutes, and the badge within seconds. And while no good reporter would ever have published a story based on a single document and an unknown source, plenty would take the time to investigate the document’s contents and see whether human sources would back it up. The internet has always been full of slop, and we have always needed to be wary of what we read there. AI now makes it easy to manufacture convincing-looking evidence, and this is never more dangerous than when it confirms strongly held beliefs and fears. ❄ ❄ ❄ ❄ ❄ Kent Beck: The descriptions of Spec-Driven development that I have seen emphasize writing the whole specification before implementation. This encodes the (to me bizarre) assumption that you aren’t going to learn anything during implementation that would change the specification. I’ve heard this story so many times told so many ways by well-meaning folks–if only we could get the specification “right”, the rest of this would be easy. Like him, I’ve heard that story as a constant background siren throughout my career in tech. But the learning loop of experimentation is essential to the model building that’s at the heart of any kind of worthwhile specification. As Unmesh puts it: Large Language Models give us great leverage—but they only work if we focus on learning and understanding. They make it easier to explore ideas, to set things up, to translate intent into code across many specialized languages. But the real capability—our ability to respond to change—comes not from how fast we can produce code, but from how deeply we understand the system we are shaping. When Kent defined Extreme Programming, he made feedback one of its four core values. It strikes me that the key to making full use of AI in software development is how to use it to accelerate the feedback loops. ❄ ❄ ❄ ❄ ❄ As I listen to people who are serious about AI-assisted programming, the crucial thing I hear is managing context. Programming-oriented tools are getting more sophisticated for that, but there are also efforts at providing simpler tools that allow customization. Carlos Villela recently recommended Pi, and its developer, Mario Zechner, has an interesting blog on its development. So what’s an old guy yelling at Claudes going to do? He’s going to write his own coding agent harness and give it a name that’s entirely un-Google-able, so there will never be any users. Which means there will also never be any issues on the GitHub issue tracker. How hard can it be? If I ever get the time to sit and really play with these tools, then something like Pi would be something I’d like to try out. Although as an addict to The One True Editor, I’m interested in some of the libraries that work with that, such as gptel. That would enable me to use Emacs’s inherent programmability to create my own command set to drive the interaction with LLMs. ❄ ❄ ❄ ❄ ❄ Outside of my professional work, I’ve been posting regularly about my boardgaming on the specialist site BoardGameGeek. 
However, its blogging environment doesn’t do a good job of providing an index to my posts, so I’ve created a list of my BGG posts on my own site. If you’re interested in my regular posts on boardgaming and you’re on BGG, you can subscribe to me there. If you’re not on BGG, you can subscribe to the blog’s RSS feed. I’ve also created a list of my favorite board games.
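Returning to the TDD point above: purely as an illustration (not Obie’s code), here is what one turn of the red-green-refactor cycle looks like in Python/pytest style, the kind of forcing function he describes.

# Illustrative only: one turn of red-green-refactor with a made-up slugify() example.
# Step 1 (red): write a failing test that pins down the intent.
def test_slugify_lowercases_and_joins_words():
    assert slugify("Agentic Memory System") == "agentic-memory-system"

# Step 2 (green): the simplest implementation that makes the test pass might be
#     return title.lower().replace(" ", "-")
# Step 3 (refactor): tidy up while the test stays green, e.g. collapse repeated spaces.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())

The point of the exercise is the loop itself: you cannot write the failing test without understanding the intent, whoever (or whatever) ends up writing the implementation.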
Read more →

Fragments: December 16

Gitanjali Venkatraman does wonderful illustrations of complex subjects (which is why I was so happy to work with her on our Expert Generalists article). She has now published the latest in her series of illustrated guides, tackling the complex topic of Mainframe Modernization. In it she illustrates the history and value of mainframes, why modernization is so tricky, and how to tackle the problem by breaking it down into tractable pieces. I love the clarity of her explanations, and smile frequently at her way of enhancing her words with her quirky pictures. ❄ ❄ ❄ ❄ ❄ Gergely Orosz on social media: Unpopular opinion: current code review tools just don’t make much sense for AI-generated code. When reviewing code I really want to know: the prompt made by the dev; what corrections the other dev made to the code; clear marking of AI-generated code not changed by a human. Some people pushed back saying they don’t (and shouldn’t) care whether it was written by a human, generated by an LLM, or copy-pasted from Stack Overflow. In my view it matters a lot - because of the second vital purpose of code review. When asked why we do code reviews, most people will answer the first vital purpose - quality control. We want to ensure bad code gets blocked before it hits mainline. We do this to avoid bugs and to avoid other quality issues, in particular comprehensibility and ease of change. But I hear the second vital purpose less often: code review is a mechanism to communicate and educate. If I’m submitting some sub-standard code, and it gets rejected, I want to know why so that I can improve my programming. Maybe I’m unaware of some library features, or maybe there are some project-specific standards I haven’t run into yet, or maybe my naming isn’t as clear as I thought it was. Whatever the reasons, I need to know in order to learn. And my employer needs me to learn, so I can be more effective. We need to know the writer of the code we review both so we can communicate our better practice to them, and so we know how to improve things. With a human, it’s a conversation, and perhaps some documentation if we realize we’ve needed to explain things repeatedly. But with an LLM it’s about how to modify its context, as well as humans learning how to better drive the LLM. ❄ ❄ ❄ ❄ ❄ Wondering why I’ve been making a lot of posts like this recently? I explain why I’ve been reviving the link blog. ❄ ❄ ❄ ❄ ❄ Simon Willison describes how he uses LLMs to build disposable but useful web apps. These are the characteristics I have found to be most productive in building tools of this nature: A single file: inline JavaScript and CSS in a single HTML file means the least hassle in hosting or distributing them, and crucially means you can copy and paste them out of an LLM response. Avoid React, or anything with a build step. The problem with React is that JSX requires a build step, which makes everything massively less convenient. I prompt “no react” and skip that whole rabbit hole entirely. Load dependencies from a CDN. The fewer dependencies the better, but if there’s a well known library that helps solve a problem I’m happy to load it from CDNjs or jsdelivr or similar. Keep them small. A few hundred lines means the maintainability of the code doesn’t matter too much: any good LLM can read them and understand what they’re doing, and rewriting them from scratch with help from an LLM takes just a few minutes. His repository includes all these tools, together with transcripts of the chats that got the LLMs to build them. 
❄ ❄ ❄ ❄ ❄ Obie Fernandez: while many engineers are underwhelmed by AI tools, some senior engineers are finding them really valuable. He feels that senior engineers have an oft-unspoken mindset, which, in conjunction with an LLM, enables the LLM to be much more valuable. Levels of abstraction and generalization problems get talked about a lot because they’re easy to name. But they’re far from the whole story. Other tools show up just as often in real work: A sense for blast radius. Knowing which changes are safe to make loudly and which should be quiet and contained. A feel for sequencing. Knowing when a technically correct change is still wrong because the system or the team isn’t ready for it yet. An instinct for reversibility. Preferring moves that keep options open, even if they look less elegant in the moment. An awareness of social cost. Recognizing when a clever solution will confuse more people than it helps. An allergy to false confidence. Spotting places where tests are green but the model is wrong. ❄ ❄ ❄ ❄ ❄ Emil Stenström built an HTML5 parser in Python using coding agents, using GitHub Copilot in agent mode with Claude 3.7 Sonnet. He automatically approved most commands. It took him “a couple of months on off-hours”, including at least one restart from scratch. The parser now passes all the tests in the html5lib test suite. After writing the parser, I still don’t know HTML5 properly. The agent wrote it for me. I guided it when it came to API design and corrected bad decisions at the high level, but it did ALL of the gruntwork and wrote all of the code. I handled all git commits myself, reviewing code as it went in. I didn’t understand all the algorithmic choices, but I understood when it didn’t do the right thing. Although he gives an overview of what happens, there’s not very much information on his workflow and how he interacted with the LLM. There’s certainly not enough detail here to try to replicate his approach. This is in contrast to Simon Willison (above), who has detailed links to his chat transcripts - although they are much smaller tools and I haven’t looked at them properly to see how useful they are. One thing that is clear, however, is the vital need for a comprehensive test suite. Much of his work is driven by having that suite as a clear guide for him and the LLM agents. JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn’t have written it this quickly without the agent. But “quickly” doesn’t mean “without thinking.” I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking. ❄ ❄ Then Simon Willison ported the library to JavaScript: Time elapsed from project idea to finished library: about 4 hours, during which I also bought and decorated a Christmas tree with family and watched the latest Knives Out movie. One of his lessons: If you can reduce a problem to a robust test suite you can set a coding agent loop loose on it with a high degree of confidence that it will eventually succeed. I called this designing the agentic loop a few months ago. I think it’s the key skill to unlocking the potential of LLMs for complex tasks. Our experience at Thoughtworks backs this up. We’ve been doing a fair bit of work recently in legacy modernization (mainframe and otherwise) using AI to migrate substantial software systems. Having a robust test suite is necessary (but not sufficient) to make this work. 
I hope to share my colleagues’ experiences on this in the coming months. But before I leave Willison’s post, I should highlight his final open questions on the legalities, ethics, and effectiveness of all this - they are well worth contemplating.
Read more →

Writing Fragments

If you’re a regular reader of my site, you’ll have noticed that in the last few months I’ve been making a number of “fragments” posts. Such a post is a short one with a bunch of little, unconnected segments. These are usually a reference to something I’ve found on the web, sometimes a small thought of my own. A few years ago, I wouldn’t have covered these topics with posts on my own site. Instead I would use Twitter, either retweeting someone else’s point, or just highlighting something I’d found. But since the Muskover, Twitter has effectively died. I’m not saying that due to any technical issues with the site, which has mostly just been fine, nor directly due to any of the policy changes there. The point is that lots of people have left, so that the audience I would have reached with Twitter is now fragmented. Some remain on X, but I see more activity on LinkedIn. There’s also Fediverse/Mastodon and Bluesky. What this means for short posts is that I can no longer just post in one place. When I announce new articles on martinfowler.com, I now announce on four social media sites (X, LinkedIn, Fediverse, and Bluesky). It makes sense to do this, but I don’t want to go through all this hassle for the kind of micro-post that Twitter served so well. So I’ve started to batch them up. When I see something interesting, I make a note. When I have enough notes, I post a fragments post. Initially I did this in a rather ad-hoc way, just using the same mechanisms I use for most articles, but last week I started to put some more deliberate mechanisms into the site. (If you’re observant, you’ll spot that in the URLs.) One benefit of all of this, at least in my book, is that it means my material is now fully visible in RSS. I’m probably showing my age, but I’m a big fan of RSS (or in my case, strictly Atom) feeds. I miss the feel of the heyday of the “blogosphere” before it got steamrolled by social media, and these fragment posts are, of course, just the same as the link blogs from that era. I still use my RSS reader every day to keep up with writers I like. (I’m pleased that Substack makes its content available via RSS.) It bothered me a bit that my micro-founts of Twitter knowledge weren’t visible on RSS, but I was too lazy to do something about it. Now I don’t need to - the fragments are available in my RSS feed.
Read more →

Fragments Dec 11

Why does AI write like… that (NYT, gift link). Sam Kriss delves into the quiet hum of AI writing. AI’s work is not compelling prose: it’s phantom text, ghostly scribblings, a spectre woven into our communal tapestry. ❄ ❄ ❄ ❄ ❄ Emily Bache has written a set of Test Desiderata, building on some earlier writing from Kent Beck. She lists the characteristics of good tests, and how they support her four “macro desiderata” - the properties of a sound test suite: predict success in production, fast to get feedback, support ongoing code design change, and low total cost of ownership. She also has a great list of other writers’ lists of good test characteristics. ❄ ❄ ❄ ❄ ❄ Daphne Keller explains that the EU’s fines on X aren’t about free speech. There are three charges against X, which all stem from a multi-year investigation that was launched in 2023. One is about verification — X’s blue checkmarks on user accounts — and two are about transparency. These charges have nothing to do with what content is on X, or what user speech the platform should or should not allow. ❄ ❄ ❄ ❄ ❄ Cory Doctorow: The Reverse-Centaur’s Guide to Criticizing AI. Start with what a reverse centaur is. In automation theory, a “centaur” is a person who is assisted by a machine. … And obviously, a reverse centaur is machine head on a human body, a person who is serving as a squishy meat appendage for an uncaring machine. Like an Amazon delivery driver… the van can’t drive itself and can’t get a parcel from the curb to your porch. The driver is a peripheral for a van, and the van drives the driver, at superhuman speed, demanding superhuman endurance.
Read more →

Fragments Dec 4

Rob Bowley summarizes a study from Carnegie Mellon looking at the impact of AI on a bunch of open-source software projects. Like any such study, we shouldn’t take its results as definitive, but there seems enough there to make it a handy data point. The key point is that the AI code probably reduced the quality of the code base - at least if static code analysis can be trusted to determine quality. And perhaps there are some worrying second-order effects: This study shows more than 800 popular GitHub projects with code quality degrading after adopting AI tools. It’s hard not to see a form of context collapse playing out in real time. If the public code that future models learn from is becoming more complex and less maintainable, there’s a real risk that newer models will reinforce and amplify those trends, producing even worse code over time. ❄ ❄ ❄ ❄ ❄ Rob’s post is typical of much of the thoughtful writing on AI. We can see its short-term benefits, but worry about its long-term impact. But on a much deeper note is this lovely story from Jim Highsmith. Jim has turned 0x50, and has spent the last decade fighting Parkinson’s disease. To help him battle it he has two AI-assisted allies. Between my neural implants and Byron’s digital guidance, I now collaborate with two adaptive systems: one for motion, one for thought. Neither replaces me. Both extend me. If you read anything on AI this week, make it this. It offers a positive harbinger for our future and opens my mind to a whole different perspective on the role of AI in it. ❄ ❄ ❄ ❄ ❄ Anthropic recently announced that it disrupted a Chinese state-sponsored operation abusing Claude Code. Jim Gumbley looks at the core lesson to learn from this: that we have to understand the serious risk of AI Jailbreaking. New AI tools are able to analyze your attack surface at the next level of granularity. As a business leader, that means you now have two options: wait for someone else to run AI-assisted vulnerability detection against your attack surface, or run it yourself first. ❄ ❄ ❄ ❄ ❄ There are plenty of claims that AI Vibe Coding can replace software developers, something that folks like me (perhaps with a bias) think unlikely. Gergely Orosz shared this tidbit: Talked with an exec at a tech company who is obsessed with AI and has been for 3 years. Not a developer but company makes software. Uses AI for everything, vibe codes ideas. Here’s the kicker: Has a team of several devs to implement his vibe coded prototypes to sg workable. I’d love to hear more about this (and similar stories). ❄ ❄ ❄ ❄ ❄ Nick Radcliffe writes about a month of using AI: I spent a solid month “pair programming” with Claude Code, trying to suspend disbelief and adopt a this-will-be-productive mindset. More specifically, I got Claude to write well over 99% of the code produced during the month. I found the experience infuriating, unpleasant, and stressful before even worrying about its energy impact. Ideally, I would prefer not to do it again for at least a year or two. The only problem with that is that it “worked”. He stresses that his approach is the “polar opposite” of Vibe Coding. The post is long, and rambles a bit, but is worthwhile because he talks in detail about his workflow and how he uses the tool. Such posts are important so we can learn the nitty-gritty of how our programming habits are changing. ❄ ❄ ❄ ❄ ❄ Along similar lines is a post by Brian Chambers on his workflow, which he calls Issue-Driven Development (and yes, I’m also sick of the “something-driven” phraseology). 
As with much of the better stuff I’ve heard about AI assisted work, it’s all about carefully managing the context window, ensuring the AI is focused on the right things and not distracted by textual squirrels.
Read more →

Fragments Nov 19

I’ve been on the road in Europe for the last couple of weeks, and while I was there Thoughtworks released volume 33 of our Technology Radar. Again it’s dominated by the AI wave, with lots of blips capturing our explorations of how to use LLMs and similar technology. “Agents” are the big thing these days but we’re also seeing growing movements in infrastructure orchestration, coding workflows - and the inevitable antipatterns. Many thanks to my colleagues for putting this together again. ❄ ❄ ❄ ❄ My trip to Europe started in Amsterdam, for a Thoughtworks event for a few of our clients there. Since I was in that lovely city, I got in touch with Gergely Orosz, host of The Pragmatic Engineer, and he arranged to record a podcast with me. No surprise that AI was front-and-center of the conversation, as I said it was the biggest shift I’d seen in programming during my career, comparable only to the shift to high-level languages, which even I am not old enough to have experienced. It was a fun chat and I really enjoyed myself. Gergely later joined me, James Lewis, and Giles Edwards-Alexander at the Thoughtworks event the next day. ❄ ❄ ❄ ❄ My travels also took me to Nuremberg, where I attended an internal conference for Siemens on the future of software architecture. When we think of technology, it’s easy to focus on the Faangs of Silicon Valley, but Siemens have a huge workforce of software developers working on heavy engineering systems like trains and factory automation. It was good to hear them talk about federated architectures, data mesh, and their use of AI. ❄ ❄ ❄ ❄ I’ve often used pseudo-graphs to help explain why high quality software is cheaper. This time, Kent Beck offers a unique perspective on this chart, dispensing with the temporal axis to help think in terms of optionality. ❄ ❄ ❄ ❄ And in another life, Edward has finally finished the great migration of the Heavy Cardboard studio and returns to the tubes with our first game in the new digs. (No surprise that it’s Age of Steam.)
Read more →

My Foreword to "Frictionless"

I find most writing on software productivity to be twaddle, but Nicole Forsgren and Abi Noda are notable exceptions. I had a chance to take a look at their new book, published today, and liked it so much I wrote a foreword. more…
Read more →

The Learning Loop and LLMs

Unmesh Joshi finds LLMs to be a useful tool, but explains why their help becomes illusory if we use them to shortcut the learning loop that's an essential part of our professional practice. more…
Read more →

Fragments Nov 3

I’m very concerned about the security dangers of LLM-enabled browsers, as it’s just too easy for them to contain the Lethal Trifecta. For up-to-date eyes on these issues, I follow the writings of the coiner of that phrase, Simon Willison. Here he examines a post on how OpenAI is thinking about these issues. My takeaways from all of this? It’s not done much to influence my overall skepticism of the entire category of browser agents, but it does at least demonstrate that OpenAI are keenly aware of the problems and are investing serious effort in finding the right mix of protections. ❄ ❄ ❄ ❄ Rob Bowley: Unsurprisingly, there are a lot of strong opinions on AI assisted coding. Some engineers swear by it. Others say it’s dangerous. And of course, as is the way with the internet, nuanced positions get flattened into simplistic camps where everyone’s either on one side or the other. A lot of the problem is that people aren’t arguing about the same thing. They’re reporting different experiences from different vantage points. His view is that beginners are very keen on AI coding but they don’t see the problems they are creating. Experienced folks do see this, but it takes a further level of experience to realize that when used well these tools are still valuable. Interestingly, I’ve regularly seen sceptical experienced engineers change their view once they’ve been shown how you can blend modern/XP practices with AI assisted coding. The upshot is that you have to be aware of the experience level of whoever is writing about this stuff - and that experience is not just in software development generally, but also in how to make use of LLMs. One thing that rings clearly from reading Simon Willison and Birgitta Böckeler is that effective use of LLMs is a skill that takes a while to develop. ❄ ❄ ❄ ❄ Charlie Brown and Garfield, like most comic strip characters, never changed over the decades. But Doonesbury’s cast aged, had children, and some have died (I miss Lacey). Garry Trudeau retired from writing daily strips a few years ago, but his reruns of older strips are one of the best things in the shabby remains of Twitter. A couple of weeks ago, he reran one of the most memorable strips in its whole run. The very first frame of Doonesbury introduced the character “B.D.”, a football jock never seen without his football helmet, or when on duty, his military helmet. This panel was the first time in over thirty years that B.D. was shown without a helmet; readers were so startled that they didn’t immediately notice that the earlier explosion had removed his leg. This set off a remarkable story arc about the travails of a wounded veteran. It’s my view that future generations will find Doonesbury to be a first-class work of literature, and a thoughtful perspective on contemporary America.
Read more →

Agentic AI and Security

Agentic AI systems are amazing, but introduce equally amazing security risks. Korny Sietsma explains that their core architecture opens up security issues through what Simon Willison named the “Lethal Trifecta”. Korny goes on to talk about how to mitigate this through removing legs of the trifecta and splitting complex tasks. more…
Read more →

Fragments and Links

Mathias Verraes writes about the relationship between Domains and Bounded Contexts in Domain-Driven Design. It’s a common myth that there should always be a 1:1 relationship between them, but although it’s sometimes the case, deeper modeling often exposes a more interesting structure. Gary Marcus: (NYT Gift Link) If the strengths of A.I. are truly to be harnessed, the tech industry should stop focusing so heavily on these one-size-fits-all tools and instead concentrate on narrow, specialized A.I. tools engineered for particular problems. Because, frankly, they’re often more effective. One of the truly annoying things about the US tax system is that we can’t easily file our tax returns electronically. In recent years an initiative called “Direct File” sought to fix that. Matt Bracken tells the story of how they developed a highly regarded system in 25 states, but it was canned by the Trump administration. He also explains how the creators of Direct File are working to prepare the ground for it to reappear. Security issues are only getting worse, but the US government agency for cybersecurity is having its staff reassigned to other duties. Detailed story in Bloomberg (paywalled) and an open (but more polemic) summary on Techdirt. Changes have hit particularly hard in CISA’s Capacity Building team, which writes emergency directives and oversees cybersecurity for the government’s highest value assets, the employees said. Defense and law enforcement are valuable things for a government to do, but here they seem to be walking away from a growing crisis.
Read more →

Understanding Spec-Driven-Development: Kiro, spec-kit, and Tessl

Birgitta Böckeler has been trying to understand one of the latest AI coding buzzwords: Spec-driven development (SDD). She looked at three of the tools that label themselves as SDD tools and tried to untangle what it means, as of now. more…
Read more →

Anchoring AI to a reference application

Service templates are a typical building block in the “golden paths” organisations build for their engineering teams, to make it easy to do the right thing. The templates are supposed to be the role models for all the services in the organisation, always representing the most up to date coding patterns and standards. One of the challenges with service templates, though, is that once a team has instantiated a service with one, it’s tedious to feed template updates back to those services. Birgitta Böckeler considers whether GenAI can help with that. more…
Read more →

To vibe or not to vibe

Birgitta Böckeler examines the risk assessment around when to use vibe coding, using three dimensions of risk: Probability, Impact, and Detectability more…
Read more →

Some thoughts on LLMs and Software Development

I’m about to head away from looking after this site for a few weeks (part vacation, part work stuff). As I contemplate some weeks away from the daily routine, I feel an urge to share some scattered thoughts about the state of LLMs and AI. ❄ ❄ ❄ ❄ I’ve seen a few early surveys on the effect AI is having on software development: is it really speeding folks up, and does it improve or wreck code quality? One of the big problems with these surveys is that they aren’t taking into account how people are using the LLMs. From what I can tell the vast majority of LLM usage is fancy auto-complete, often using Copilot. But those I know who get the most value from LLMs reckon that auto-complete isn’t very useful, preferring approaches that allow the LLM to directly read and edit source code files to carry out tasks. My concern is that surveys that ignore the different workflows of using LLMs will produce data that’s going to send people down the wrong paths. (Another complication is the varying capabilities of different models.) ❄ ❄ ❄ ❄ I’m often asked, “what is the future of programming?” Should people consider entering software development now? Will LLMs eliminate the need for junior engineers? Should senior engineers get out of the profession before it’s too late? My answer to all these questions is “I haven’t the foggiest”. Furthermore I think anyone who says they know what this future will be is talking from an inappropriate orifice. We are still figuring out how to use LLMs, and it will be some time before we have a decent idea of how to use them well, especially if they gain significant improvements. What I suggest is that people experiment with them. At the least, read about what others are doing, but pay attention to the details of their workflows. Preferably experiment yourself, and do share your experiences. ❄ ❄ ❇ ❄ I’m also asked: “is AI a bubble”? To which my answer is “OF COURSE IT’S A BUBBLE”. All major technological advances have come with economic bubbles, from canals and railroads to the internet. We know with near 100% certainty that this bubble will pop, causing lots of investments to fizzle to nothing. However what we don’t know is when it will pop, and thus how big the bubble will have grown, generating some real value in the process, before that happens. It could pop next month, or not for a couple of years. We also know that when the bubble pops, many firms will go bust, but not all. When the dot-com bubble burst, it killed pets.com, it killed Webvan… but it did not kill Amazon. ❄ ❄ ❄ ❄ I retired from public speaking a couple of years ago. But while I don’t miss the stress of giving talks, I do miss hanging out with my friends in the industry. So I’m looking forward to catching up with many of them at GOTO Copenhagen. I’ve been involved with the GOTO conference series since the 1990s (when it was called JAOO), and continue to be impressed with how they put together a fascinating program. ✢ ❄ ❄ ❄ My former colleague Rebecca Parsons has been saying for a long time that hallucinations aren’t a bug of LLMs, they are a feature. Indeed they are the feature. All an LLM does is produce hallucinations, it’s just that we find some of them useful. One of the consequences of this is that we should always consider asking the LLM the same question more than once, perhaps with some variation in the wording. Then we can compare answers, indeed perhaps ask the LLM to compare answers for us. The difference in the answers can be as useful as the answers themselves. 
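A minimal sketch of that ask-more-than-once idea, assuming a hypothetical ask_llm() helper (any real client library would do): ask the same question several times with slightly varied wording and look at the spread of answers before trusting any of them.

# Hypothetical sketch: ask_llm() stands in for whatever LLM client you actually use.
from collections import Counter

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM client of choice")

def ask_with_variation(question: str, rephrasings: list[str]) -> Counter:
    """Ask the same question several ways and tally the answers.
    Wide disagreement is itself useful information."""
    answers = [ask_llm(p) for p in [question, *rephrasings]]
    return Counter(a.strip().lower() for a in answers)

# e.g. ask_with_variation(
#     "How many regions does our deployment span?",
#     ["Count the deployment regions.", "In how many regions do we deploy?"],
# )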
Certainly if we ever ask a hallucination engine for a numeric answer, we should ask it at least three times, so we get some sense of the variation. Furthermore we shouldn’t ask an LLM to calculate an answer that we can calculate deterministically (yes, I’ve seen this). It is OK to ask an LLM to generate code to calculate an answer (but still do it more than once). ❄ ❄ ❄ ❄ Other forms of engineering have to take into account the variability of the world. A structural engineer builds in tolerance for all the factors she can’t measure. (I remember being told early in my career that the unique characteristic of digital electronics was that there was no concept of tolerances.) Process engineers consider that humans are executing tasks, and will sometimes be forgetful or careless. Software Engineering is unusual in that it works with deterministic machines. Maybe LLMs mark the point where we join our engineering peers in a world of non-determinism. ❄ ❄ ❄ ❄ I’ve often heard, with decent reason, an LLM compared to a junior colleague. But I find LLMs are quite happy to say “all tests green”, yet when I run them, there are failures. If that was a junior engineer’s behavior, how long would it be before H.R. was involved? ❄ ❄ ❄ ❄ LLMs create a huge increase in the attack surface of software systems. Simon Willison described the Lethal Trifecta for AI agents: an agent that combines access to your private data, exposure to untrusted content, and a way to externally communicate (“exfiltration”). That “untrusted content” can come in all sorts of ways: ask it to read a web page, and an attacker can easily put instructions on the website in 1pt white-on-white font to trick the gullible LLM into giving up that private data. This is particularly serious when it comes to agents acting in a browser. Read an attacker’s web page, and it could trick the agent to go to your bank account in another tab and “buy you a present” by transferring your balance to the kind attacker. Willison’s view is that “the entire concept of an agentic browser extension is fatally flawed and cannot be built safely”.
Read more →

From Black Box to Blueprint

A common enterprise problem: crucial legacy systems become “black boxes”—key to operations but opaque and risky to touch. Thiyagu Palanisamy and Chandirasekar Thiagarajan worked with a client to use AI-assisted reverse engineering to reconstruct functional specifications from UI elements, binaries, and data lineage to overcome analysis paralysis. They developed a methodical “multi-lens” approach—starting from visible artifacts, enriching incrementally, triangulating logic, and always preserving lineage. Human validation remains central to ensure accuracy and confidence in extracted functionality. This engagement revealed that turning a system from black box to blueprint empowers modernization decisions and accelerates migration efforts. more…
Read more →

Research, Review, Rebuild: Intelligent Modernisation with MCP and Strategic Prompting

The Bahmni open-source hospital management system began over nine years ago with a front end using AngularJS and an OpenMRS REST API. Rahul Ramesh wished to convert this to use a React + TypeScript front end with an HL7 FHIR API. In exploring how to do this modernization he used a structured prompting workflow of Research, Review, and Rebuild - together with Cline, Claude 3.5 Sonnet, Atlassian MCP server, and a filesystem MCP server. Changing a single control would normally take 3–6 days of manual effort, but with these tools it was completed in under an hour at a cost of under $2. more…
Read more →

Building your own CLI Coding Agent with Pydantic-AI

CLI coding agents are a fundamentally different tool from chatbots or autocomplete tools - they're agents that can read code, run tests, and update a codebase. Ben O'Mahony explains that while commercial tools are impressive, they don't understand the particular context of our environment and the eccentricities of our specific project. Instead we can build our own coding agent by assembling open source tools, using our specific development standards for testing, documentation production, code reasoning, and file system operations. more…
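As a rough sketch of the approach (assuming pydantic-ai's Agent and tool-registration interface; exact names and signatures may differ between versions, so treat this as illustrative rather than the article's code), a minimal coding agent registers a couple of project-specific tools and runs a prompt against them:

# A sketch only, assuming pydantic-ai's Agent API; check the library docs for current signatures.
import subprocess
from pathlib import Path
from pydantic_ai import Agent

agent = Agent(
    "openai:gpt-4o",  # any supported model identifier
    system_prompt="You are a coding assistant. Follow this project's testing and documentation standards.",
)

@agent.tool_plain
def read_file(path: str) -> str:
    """Return the contents of a file in the repository."""
    return Path(path).read_text(encoding="utf-8")

@agent.tool_plain
def run_tests() -> str:
    """Run the project's test suite and return its output."""
    proc = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return proc.stdout + proc.stderr

if __name__ == "__main__":
    # "tests/test_parser.py" is a made-up example path for illustration.
    result = agent.run_sync("Read tests/test_parser.py and tell me what is not yet covered.")
    print(result.output)  # attribute name may vary across pydantic-ai versions

The point is that the tools encode your own standards (which tests to run, which docs to generate), rather than relying on a commercial agent's defaults.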
Read more →

Chatting with Unmesh about building language with LLMs

A few weeks ago, Unmesh Joshi and I started having a conversation about how he likes to grow a language of abstractions when working with an LLM. We thought this was a conversation that others might find interesting so we turned it into an article. We talk about how programming is about both building and applying abstractions and how the LLM helps us in different ways with each activity. more…
Read more →

Bliki: Expansion Joints

Back in the days when I did live talks, one of my abilities was to finish on time, even if my talk time was cut at the last moment (perhaps due to the prior speaker running over). The key to my ability to do this was to use Expansion Joints - parts of the talk that I'd pre-planned so I could cover them quickly or slowly depending on how much time I had. The way I'd do this would be to plan for some topics to be optional. The talk would work if I skipped over them, but I could also witter on about them for five (or ten) minutes. Ideally, each of these topics would get one slide, usually with a bunch of key phrases on it - the headings of what I'd talk about should I be talking about it. When I got to the slide, I'd look at how time was going with the talk. If (as was usually the case) I was running short of time, I could cover the slide in about thirty seconds, saying something like: “in doing this, there's a bunch of things you need to consider, but they are out of scope for today's talk”. If, however, I did have time, I could then spend some time talking about them. The slide would be simple, and not provide much of a Visual Channel, but that wasn't so important, after all this material was optional in the first place. The single flex-slide was my favorite Expansion Joint, as it was easy to use. Sometimes however my optional topic required a proper visual channel, necessitating dedicated slides. My solution here was good control over slide handling. Presentation tools include the ability to skip over slides while I'm talking, and I made sure I practiced how to use them so I could skip a bunch of slides without the audience knowing. It's crucial here that it's invisible to the audience, I find it looks sloppy if anyone says “in the interests of time I'll skip over these slides”. To do this, however, I do need access to my laptop while presenting, venues that only provide a clicker while loading the slides on some other machine lack that control. That started to happen in my last couple of years, much to my annoyance. When creating talks, I was always worried that I would run out of things to say, even though experience told me I reliably crammed more stuff in than I could possibly cover. Expansion Joints helped with this, I could aggressively trim the core talk to less than I needed, and rely on the Expansion Joints to fill the gap. In practice I usually didn't need the Expansion Joints anyway, but their presence helped my confidence. Using Expansion Joints was particularly important for me as I never rehearsed my talks. I was always someone whose ability to present was driven by adrenaline. Talking to a rubber duck just didn't work, the duck was clearly every bit as bored as I was. Consequently the first time I gave a talk, I was hazy as to how long it would take. Yet with Expansion Joints in place, I was able to finish a talk right on time. Expansion Joints enabled me to give the same talk to different time slots. Sometimes I'd have thirty minutes, sometimes forty-five. With Expansion Joints, I didn't need to change my slides, particularly handy if a time cut (or more rarely a time increase) appeared at the last moment. (Although in my later years, I handled this by doing a Suite Of Talks.) Talks that encourage audience interaction need these because we can never predict how much time the interaction will use up. Sometimes we get a steady stream of questions, other times (particularly in Scandinavia, or upper-Midwest America) a lack of questions had me blasting through the agenda. 
Any such talk needed a double dose of this temporal ballast. Expansion Joints are at their most useful in later parts of the talk, as it's then that I have the most information on how much time I have. Earlier ones can still be handy, particularly if they come after an interactive section when I'd like to rebase my timing.
Further Reading
The name was coined by Neal Ford, Matthew McCullough, and Nathaniel Schutta in their excellent book Presentation Patterns.
Read more →

Team OKRs in Action

OKRs have become a popular way to connect strategy with execution in large organizations. But when they are set in a top‑down cascade, they often lose their meaning. Teams receive objectives they didn’t help create, and the result is weak commitment and little real change. Paulo Caroli describes how high‑performing teams can work in another way. They define their own objectives in an organization that uses a collaborative process to align the team’s OKRs with the broader strategy. With these Team OKRs in place, they create a shared purpose and become the base for a regular cycle of planning, check‑ins, and retrospectives. more…
Read more →

Impact Intelligence, addressing common objections

Sriram Narayan concludes his article on impact intelligence by addressing five common objections to this activity, including slowing down, lack of agility and collaboration, and the unpredictability of innovation. more…
Read more →

Quick but worthwhile links

Abi Noda observes: Just met with a 2000+ eng company. Their developers are saving 2+ hours per week thanks to Copilot. But they’re also losing: 3 hrs per week due to slow builds, 4 hrs per week on dev environment toil, and 2 hrs per week waiting for code reviews. AI is not a silver bullet. Nik Malykhin found it useful to get an AI assistant to write its own coding rules by analyzing his code, and then asking it to refine them as he worked with it: “the central paradox of using AI assistants effectively: to offload cognitive work to an AI, you must first do the meta-cognitive work of codifying your own development philosophy and collaboration style.” I agree with Charity Majors that there is a valuable distinction between disposable and durable code, and that makes a difference in how we use AI with it. The difference between disposable code and durable code is not about whether the code was generated by AI or written by a human, or even how difficult it was to write. The cost is defined by the standards you are building to, and the rest of the software development lifecycle: how well you expect to maintain it, extend it, migrate it, understand its behavior, or fix it when it breaks. This is the expensive part of software development, the type that requires deep expertise and familiarity with your language and environment. Disposable code is cheap because you don’t even try to maintain it. Jim Highsmith thinks that we should think of AI as Alternative Intelligence: It’s not fake intelligence, or artificial empathy, or HAL 9000 with manners. It’s something else. Something that thinks differently, not defectively. Rod Johnson asserts that we know that memory is important to AI systems, but we forget that Domain Models are an important form of memory: Event Sourcing provides perfect episodic memory by storing the complete history of domain changes as immutable events. Every decision, every state transition, every business event is preserved with full context. Repository patterns offer domain-focused memory interfaces that understand business concepts. A CustomerRepository knows how to retrieve customer information in ways that preserve business meaning, not just raw data. Bounded contexts from Domain-Driven Design partition memory into semantic boundaries, preventing the concept pollution that plagues pure vector-based approaches. Aggregates function as cohesive memory clusters with consistency boundaries—exactly what we need for reliable agent behavior.
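To illustrate the event-sourcing point (my sketch, not Rod Johnson's code): an event-sourced aggregate keeps its full history and rebuilds state by replaying it, which is exactly the "episodic memory" behaviour he's describing.

# Minimal illustration of event sourcing as episodic memory.
from dataclasses import dataclass

@dataclass(frozen=True)
class FundsDeposited:
    amount: int

@dataclass(frozen=True)
class FundsWithdrawn:
    amount: int

class Account:
    """State is never stored directly; it is always derived by replaying events."""

    def __init__(self) -> None:
        self.events: list = []   # the complete, immutable history of what happened

    def deposit(self, amount: int) -> None:
        self.events.append(FundsDeposited(amount))

    def withdraw(self, amount: int) -> None:
        self.events.append(FundsWithdrawn(amount))

    @property
    def balance(self) -> int:
        total = 0
        for event in self.events:   # replaying history reconstructs the current state
            if isinstance(event, FundsDeposited):
                total += event.amount
            elif isinstance(event, FundsWithdrawn):
                total -= event.amount
        return total

account = Account()
account.deposit(100)
account.withdraw(30)
assert account.balance == 70   # and account.events still holds every decision ever made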
Read more →

Actions to improve impact intelligence

Sriram Narayan continues his article on impact intelligence by outlining five actions that can be taken to improve impact intelligence: introduce robust demand management, pay down measurement debt, introduce impact validation, offer your CFO/COO an alternative to ROI, and equip your teams. more…
Read more →

Entertainment

Kennedy Center’s new programming head resigns days after hire was announced - The Washington Post

Kevin Couch, appointed the Kennedy Center’s senior vice president of artistic programming, resigned less than two weeks after his hire was announced.
Read more →

HBO Max Says ‘Heated Rivalry’ Is Huge — So Why Hasn’t It Registered With Nielsen? - hollywoodreporter.com

The answer may come down to how the ratings service classifies the show in its streaming ratings.
Read more →

Bruce Springsteen sings out against Trump in ‘Streets of Minneapolis’ - AP News

Bruce Springsteen has released a new song, “Streets of Minneapolis,” criticizing President Donald Trump's immigration enforcement. The song describes Minneapolis as “a city aflame” under “King Trump’s private army.” Springsteen says he wrote and recorded it o…
Read more →

Ray J Says He’s in His ‘Last Days’ Due to Deteriorating Heart Health: ‘I F-cked Up’ - Rolling Stone

Ray J said he has months to live due to a heart condition stemming from substance abuse and claimed his sister Brandy is covering his medical bills.
Read more →

Brandon Sanderson’s Literary Fantasy Universe ‘Cosmere’ Picked Up by Apple TV (Exclusive) - hollywoodreporter.com

It's an unprecedented deal for the author, whose 'Mistborn' series and 'The Stormlight Archive' are being eyed for film and television adaptation, respectively.
Read more →

Weekend Preview: SEND HELP and IRON LUNG Compete for Top Spot as MELANIA Surges in Limited Release - Boxoffice Pro

The Boxoffice Podium: Forecasting the Top 3 Movies at the Domestic Box Office, Week 5 | January 30 – February 1, 2026. 1. Send Help (20th Century Studios | NEW), Opening Weekend Range: $12M – $17M. 2. Iron Lung (Markiplier Studi…
Read more →

Saint Laurent Fall 2026: Menswear Doesn’t Get Much Kinkier Than This - GQ

Anthony Vaccarello’s latest show had everything: big bold suits, Connor Storrie, and patent leather gimp boots.
Read more →

Kim Kardashian Explains Why Larsa Pippen Friendship Faded, Plus See Larsa’s 2020 Explanation (Including Why the Kardashians Allegedly Unfollowed Her) - Just Jared

See more...
Read more →

Taylor Swift's private texts leaked: How does she really feel about Blake Lively drama? - The Economic Times

Taylor Swift feels her privacy has been impacted. Private text messages between Swift and Blake Lively were released. This happened during Lively's legal dispute. Swift is reportedly trying to stay away from the drama. The messages were shared publicly withou…
Read more →

Matt Lauer Accuser Details Alleged 2014 Rape and Why She Didn’t Call the Police: ‘I Was in Freaking Russia. Who Would I Call? Putin? The KGB? There Was Only NBC’ - Variety

Brooke Nevils, the NBC employee who accused Matt Lauer of rape, is publishing a new book that details Lauer's alleged sexual misconduct.
Read more →

Bruce Willis Is Unaware He Has Dementia but Still Recognizes His Family, Wife Reveals - hollywoodreporter.com

The 70-year-old actor "never connected the dots that he had this disease," his wife Emma Heming Willis recently said on a podcast.
Read more →

Ms. Shirley Raines, Beauty 2 The Streetz founder who cared for the homeless on Skid Row and in Nevada, dies at 58 - abc7.com

Shirley Raines, the beloved nonprofit founder and CEO who helped care for the homeless on L.A.'s Skid Row, has died, her organization said.
Read more →

Tom Morello announces surprise benefit concert at First Avenue - bringmethenews.com

Tom Morello, guitarist of Rage Against the Machine, has announced a last-minute benefit concert at First Avenue in Minneapolis.
Read more →

Sydney Sweeney’s SYRN Line Features 4 Categories and These SI Swimsuit Snapshots Embody Each One - Sports Illustrated Swimsuit

Seductress, romantic, playful and comfy energy are no stranger to the annual issue.
Read more →

404 - OutKick

Brittney Griner is using her new documentary to draw parallels between ICE enforcement in Minnesota and her Russian detention.
Read more →

Rob Schneider’s Wife Patricia Files for Divorce - TMZ

Rob Schneider's TV producer wife Patricia is cancelling their marriage ... because we've learned she filed for divorce.
Read more →

Hailey Bieber's Sister Reportedly Facing Prison Time Over Alleged Tampon Mishap - Yahoo

The update in Alaia Baldwin Aronow's case comes days after her younger sister Hailey Bieber shut down speculations about her marriage.
Read more →

CBS News Seeks Buyouts at ‘Evening News’ - Yahoo News Canada

No content available
Read more →

‘Bluey’ Defeats ‘Stranger Things,’ Everything Else to Retain Title as Most Streamed Show in 2025 - hollywoodreporter.com

Nielsen also crowns ‘Family Guy’ creator Seth MacFarlane as a “streaming icon” in its year-end tallies.
Read more →

World

The U.S. measles outbreaks. - Tangle News

A closer look at the rise in measles cases across the country.
Read more →

A Comprehensive Network for the Discovery and Characterization of Interstellar Objects Like… - Avi Loeb – Medium

Inspired by the unresolved anomalies displayed by the latest interstellar visitor 3I/ATLAS (as listed here), I co-authored a new paper with…
Read more →

Portland Fire reveals home and away jerseys for 2026 season - oregonlive.com

The 2026 Portland Fire jerseys are here.
Read more →

Sports

What happened on day four of F1's secretive 2026 shakedown

Aston Martin finally appeared on track in the latter stages of the fourth day of the Barcelona Formula 1 shakedown, while Mercedes kept collecting miles and McLaren's day was cut short with trouble. Fresh from turning over 90 laps apiece on Wednesday, Andrea Kimi Antonelli and George Russell traded turns again on Thursday with a combination of long runs and shorter outings to get on top of ... Keep reading
Read more →

Exclusive interview: Lowdon on why Cadillac F1 hired on values over ability

As Formula 1's first expansion team in a decade, Cadillac is very much up against it to take on F1's establishment when its veteran race winners Valtteri Bottas and Sergio Perez join the grid in March. Following an intense recruitment spree for its bases in Silverstone and North America, which the team says yielded over 140,000 job applications for around 600 positions, the organisation ...Keep reading
Read more →

How Williams benefits from F1 Barcelona shakedown - despite no running

It has been a very weird week in the world of Formula 1: cars have hit the track for the first time in 2026 but the general public, media included, has only been drip-fed information. In a world where everything is usually on constant demand, putting the Barcelona shakedown behind closed doors as teams gear up for this year’s regulation change certainly caused a stir. But alas, what can you ...Keep reading
Read more →

Alpine's innovative 2026 F1 rear wing explained

Alpine needs to turn a page in its history. Team boss Flavio Briatore insisted on swapping from Renault to Mercedes power, removing one long-standing excuse for its underperformance. The Enstone-based team finished last year at the bottom of the constructors’ championship and its engineers have responded by exploiting the new regulations with curiosity and creativity. The A526, overseen by ...Keep reading
Read more →

Mercedes impresses with first race sim at F1's Barcelona shakedown

Mercedes has delivered a strong start to the 2026 Formula 1 pre-season as Andrea Kimi Antonelli completed a full race simulation on day two. The Mercedes W17 ran reliably on both Monday and Wednesday, with George Russell having completed 92 laps by lunchtime yesterday before handing over the car to Antonelli. The youngster then reeled off 91 laps after the break, including what he said was a ...Keep reading
Read more →

What the new F1 rules mean for driver workload: ‘An element of subjectivity’

Almost every driver who has run so far during the Formula 1 shakedown in Barcelona came up with the same line after climbing out of the car: “It’s very different from what we’re used to.” That starts with the fact that the 2026 cars have considerably less downforce, and less downforce usually means more complaints from the person behind the wheel. The FIA, however, hopes that this ...Keep reading
Read more →

How Jose Mourinho's Benfica stunned Real Madrid to qualify for Champions League play-offs - BBC

Goalkeeper Anatoly Trubin's header gives Jose Mourinho an unforgettable moment as Benfica beat Real Madrid to stay alive in the Champions League.
Read more →

Todd Monken bringing George Warhop back as Browns offensive line coach - cleveland.com

George Warhop was hired as Browns offensive line coach.
Read more →

What They're Saying | NFL media reacts to Bills agreeing to terms for Joe Brady to become next head coach - buffalobills.com

Continuity is the standout point for Buffalo’s latest head coach decision.
Read more →

Bo Nix: I wasn’t predisposed to ankle injury - NBC Sports

Broncos head coach Sean Payton said this week that doctors found quarterback Bo Nix was "predisposed" to breaking his ankle while surgically repairing the injury he suffered against the Bills in the divisional round, but Nix said that wasn't the case on Wedne…
Read more →

Did the Lightning hit the weather jackpot for Sunday’s outdoor game? - tampabay.com

If you expected balmy, Florida “winter” weather to crash the party, rethink it. Rare cold temps (for Tampa) are likely.
Read more →

Patriots' Robert Kraft says Bill Belichick unequivocally deserves to be first-ballot Hall of Famer - AP News

Count New England Patriots team owner Robert Kraft among those shocked that Bill Belichick reportedly will not be selected to the Pro Football Hall of Fame in his first year of eligibility. In a statement to The Associated Press, Kraft said he believes Belich…
Read more →

Mariners broadcaster Rick Rizzs issues tearful farewell ahead of final season - The Seattle Times

Much like his broadcasts, Rick Rizzs' retirement news conference was filled with the earnest joy, passion and humility that has endeared him to generations of Mariners fans.
Read more →

Norris: "Surreal" to see number 1 on my car as McLaren kicks off F1 2026 testing

Lando Norris says driving around as the reigning Formula 1 world champion was a "surreal" feeling as he took to the track in Barcelona for a day of discovering his 2026 McLaren MCL40. McLaren delayed its start to pre-season testing until Wednesday, the third of five shakedown days at the Circuit de Barcelona-Catalunya. With teams able to pick and choose a maximum of three days to run at the ...Keep reading
Read more →

Joe Lacob's make-or-break Warriors moment has arrived - SFGATE

Joe Lacob's make-or-break Warriors moment has arrived, with Giannis Antetokounmpo now available for a trade.
Read more →

Alexander Volkanovski welcomes fight with top UFC lightweight contender - MMA Fighting

Alexander Volkanovski says he would be happy to fight Arman Tsarukyan someday.
Read more →

Ben: 'Mike McCarthy has heart for this team' - Steelers.com

Ben Roethlisberger shared his take on the Mike McCarthy hire on SNR's 'In the Locker Room'
Read more →

From The Desk of Allen Greene (Jan. 28, 2026) - pittsburghpanthers.com

Panther Nation,
Read more →

Giannis Antetokounmpo reportedly 'ready for a new home,' Milwaukee Bucks 'starting to listen' to trade offers - NBC Sports

While the Bucks and Antetokounmpo are apparently more interested in talking trade, it's still far more likely to happen in the offseason.
Read more →

Jets Hire Brian Duker as Defensive Coordinator - New York Jets

HC Aaron Glenn: ‘I’m Confident His Energy and Knowledge of the Game Will Help Elevate Our Players’
Read more →

What happened on day three of F1's secretive 2026 shakedown

There was a lot more action on the third day of Formula 1’s pre-season shakedown which, as is common knowledge by now, is taking place behind closed doors in Barcelona. Six out of 11 teams took to the track, including McLaren, which skipped the first two days but is planning to run continuously until Friday evening – each team is allowed three days of testing this week. Lando Norris was ...Keep reading
Read more →

Patrick Reed announces plan to return to PGA TOUR, eyes status for 2027 season - PGA Tour

Nine-time PGA TOUR winner Patrick Reed has announced plans to return to TOUR competition later this year as he looks to reinstate his membership for the 2027 season.
Read more →

Report: Todd Monken open to keeping Jim Schwartz as DC in 2026 - NBC Sports

The Browns are hiring Todd Monken to be their next head coach.
Read more →

Williams missed Barcelona F1 test due to production delays, denies significant weight issue

Williams Formula 1 team boss James Vowles says it is "incredibly painful" for his squad to miss out on the Barcelona shakedown test this week, but denies rumours the team's car will be significantly overweight. Last week Williams abandoned its plans to attend F1's first testing opportunity of the 2026 pre-season in Barcelona, losing three days of running in Spain. The reason given was "delays ...Keep reading
Read more →

It's Not a 'Moral Victory,' but Nebraska Made a Statement Against Michigan - Sports Illustrated

When you have only one loss on the year, it is inherently your worst loss of the season. However, on a Tuesday night in Ann Arbor, a shorthanded Nebraska made arguably the loudest statement of the night.
Read more →

Top 25 Mets Prospects for 2026: A.J. Ewing (6) - Amazin' Avenue

Next on our list is an outfielder.
Read more →

Why Neuville struggled in "most difficult" Rally Monte Carlo

Thierry Neuville has previously conquered Rally Monte Carlo twice, but a fundamental lack of confidence to push his Hyundai to the limit left the 2024 world champion on the back foot. Neuville had flagged even before last weekend’s season opener that he would be “lying a bit” if he said he felt confident behind the wheel of his updated Hyundai, admitting he was “missing the feeling he ...Keep reading
Read more →

Ferrari tested wet-weather active aerodynamics in F1 Barcelona shakedown

Although only two teams took to the track on the second day of Formula 1's five-day 'shakedown' behind closed doors at the Circuit de Barcelona-Catalunya, there was plenty of material for analysis. During the morning, as Ferrari focused on data collection and logging mileage on its first proper day of running the SF-26, Charles Leclerc was able to complete his first laps on a soaking track ...Keep reading
Read more →

Why Hadjar's Red Bull testing crash doesn’t mean a Gasly 2019 repeat

Seven years ago, Red Bull signed sophomore Formula 1 driver Pierre Gasly as Max Verstappen’s new team-mate following the Frenchman’s convincing rookie season, forgoing a more experienced option in Carlos Sainz – but Gasly crashed twice at Barcelona in pre-season testing and was demoted to Toro Rosso after a nightmare first half of the season. Seven years later, Red Bull signed sophomore ...Keep reading
Read more →

Ford: Horner deserves respect, but Mekies' engineering background an asset in F1

The initial filming day for Racing Bulls at Imola and the collective shakedown in Barcelona mark the first steps for Red Bull-Ford Powertrains on track. The partnership came about after Red Bull’s negotiations with Porsche – which wanted to be a partner “on equal footing” – broke down and Ford Performance director Mark Rushbrook saw his opportunity. By his own admission, he simply ...Keep reading
Read more →

Red Bull undecided on third Barcelona F1 test day after "very unfortunate" Hadjar crash

The Red Bull Formula 1 team is still evaluating its plans for a final day of Barcelona running after Isack Hadjar suffered a crash with the new RB22 on Tuesday. In tricky wet conditions, Hadjar spun backwards into the wall at Barcelona's final corner, damaging the rear of Red Bull's 2026 challenger in the process and ending the Frenchman's day. Given the limited information available from the ...Keep reading
Read more →

What happened on day two of F1's secret 2026 test

The second day of Formula 1’s secretive ‘shakedown week’ at the Circuit de Catalunya proved much quieter as only two teams took to the track. Ferrari, like McLaren, had signalled its intention to miss the opening day but the reigning constructors’ champion also did not appear on the second. Each team is permitted three days of running during the five-day ‘shakedown’, and these do ...Keep reading
Read more →

USA edging closer to WRC return in 2027

The World Rally Championship making a return to the USA next year is a step closer, with a candidate test rally planned later this year. The WRC has long held an ambition to return to the USA for the first time since the 1988 Olympus Rally, with the project a key part of its plan to grow the category. In 2024 the championship announced a “clear roadmap” to achieving a USA event in 2026 that ...Keep reading
Read more →

Hadjar crashes Red Bull’s new F1 car

Isack Hadjar has crashed Red Bull’s RB22 car on the second day of Formula 1’s first pre-season test at Barcelona. A five-day test is taking place behind closed doors at the Catalan track this week, as teams get to grips with new machinery meeting the overhauled chassis and engine regulations for 2026, with each outfit allowed to run on three of those days. Red Bull is the only squad ...Keep reading
Read more →

Why the FIA is so confident in unprecedented F1 2026 rule changes

After what seemed like an endless back and forth about Formula 1's much-vaunted 2026 rules, cars have finally hit the track in Barcelona, a prelude to the series' new era. With the five-day shakedown held behind firmly closed doors, and coverage limited to guerrilla reporting from a grassy knoll, the real answers of F1's new pecking order and its racing product are yet to follow, starting at ...Keep reading
Read more →

Mercedes' unique 2026 F1 front wing design revealed in Barcelona test

When new rules come into force in Formula 1, it's natural to see many different interpretations across the grid, especially on those components that define the car’s overall concept. From tail to tip, from the sidepods to the suspension, all the way to the front wing, changes have swept the 2026 cars in line with the new philosophy this season. The FIA has sought to limit the outwash effect ...Keep reading
Read more →

Hadjar surprised by Red Bull F1 engine: "More laps than expected"

The first serious test for Red Bull Ford Powertrains – following an earlier filming day for Racing Bulls at Imola – has gone largely according to plan. Liam Lawson caused a red flag at the start of the lunch break, but still completed 88 laps in his Racing Bulls. Max Verstappen was not behind the wheel on Monday, but saw his team-mate Isack Hadjar lap the Circuit de Barcelona-Catalunya 107 ...Keep reading
Read more →

Russell impressed by Red Bull and Haas: "It's not quite how it was in 2014!"

George Russell has been impressed by the amount of running completed by several rival teams, including Red Bull and Haas, during the opening day of Formula 1’s 2026 Barcelona shakedown. The first day of running at the Circuit de Barcelona-Catalunya proved productive for a number of teams, despite the major changes introduced for the new regulation cycle covering both chassis and power ...Keep reading
Read more →

Different, "but still a racing car" - Drivers share early verdict on F1's 2026 cars

Other than a few short filming runs, day one of Barcelona's five-day shakedown was the first real opportunity for Formula 1 drivers to put the brand-new 2026 generation of cars to the test. Seven of the 11 teams made it out on day one, with Williams forced to skip the week completely and Aston Martin scrambling to make it out for at least two of the three days allowed per team. No reliable ...Keep reading
Read more →

What happened behind closed doors on day one of F1’s secretive 2026 shakedown

Formula 1’s shakedown week began on a crisp but chilly day at the Circuit de Barcelona-Catalunya with Mercedes’ W17 emerging from the garages first in the hands of Andrea Kimi Antonelli, followed in short order by Audi’s Gabriel Bortoleto and Alpine’s Franco Colapinto. Both Bortoleto and Colapinto would later be delayed with technical issues but the closed-door policy of what is billed ...Keep reading
Read more →

Aston Martin to lose one F1 test day, intends to run in Barcelona on Thursday

The Aston Martin Formula 1 team has announced its "intention is to run Thursday and Friday" at Barcelona's 2026 shakedown, which means it will not take up at least one of its three days of running. F1 heads to the Circuit de Barcelona-Catalunya this weekend for a five-day pre-season test, dubbed the shakedown as official testing takes place in Bahrain in February with two three-day tests. At ...Keep reading
Read more →

More than 10 tuners show interest in WRC 2027 rules

More than 10 tuners have expressed interest in the World Rally Championship’s new technical regulations for 2027, according to the FIA. Next year the WRC will embark upon a new technical era that aims to increase the number of constructors competing in the pinnacle of rallying. The new technical regulations, which will span a 10-year period, are designed to be more affordable and flexible ...Keep reading
Read more →

Aston Martin set to skip first two days of F1 2026 Barcelona test

Aston Martin is set to skip the opening two days of the first 2026 Formula 1 pre-season test in Barcelona, Autosport understands. F1 is currently hosting a five-day shakedown at Circuit de Barcelona-Catalunya (26-30 January), ahead of further tests in Bahrain (11-13 and 18-20 February) before the forthcoming campaign. This season will introduce widespread regulation changes and as a ...Keep reading
Read more →

Audi signs Slater as first academy driver

Audi has signed reigning Formula Regional European champion Freddie Slater as its Driver Development Programme’s first member. Audi has taken over the Sauber Formula 1 outfit after completing its acquisition in 2024 and announced on Friday it was launching its own academy, managed by former F1 driver and three-time Le Mans 24 Hours winner Allan McNish – who triumphed with Audi on two ...Keep reading
Read more →

Red Bull reveals actual 2026 F1 car as Barcelona test begins

Red Bull has lifted the covers off its RB22 Formula 1 car for the 2026 season. The Milton Keynes-based outfit previously revealed its livery on 15 January at an event in Detroit, Michigan, where its new engine partner Ford is based, but its actual machinery remained to be seen. The first pre-season test gets under way on Monday at Barcelona, with all teams entitled to three days of running ...Keep reading
Read more →

McLaren reveals renders of new MCL40 F1 car

McLaren has unveiled the car tasked with mounting a successful Formula 1 title defence in 2026, as it unleashed renders of its new MCL40 in a testing livery ahead of the behind-closed-doors Barcelona test. The Woking squad claimed both world championship titles last year and had sewn up the constructors' championship as early as Singapore, well before Lando Norris clinched the drivers' title in ...Keep reading
Read more →

The factors that led to Solberg’s “crazy dream” Monte Carlo win

Before the weekend, Oliver Solberg had modest expectations: tipping a top five result as his goal for his first start as a full-time factory Toyota World Rally Championship driver. However, that quickly changed after he delivered a stunning drive to win what was regarded as the toughest Monte Carlo for a generation. Extreme wintry weather plagued the asphalt event, offering up incredibly ...Keep reading
Read more →

Explained: The diffuser opening on Mercedes’ and Ferrari’s 2026 F1 cars

Caution is always required when analysing Formula 1 launches – especially with the introduction of new regulations. A few years ago, Red Bull played games with its sidepod inlets by showing different designs at the launch in Milton Keynes and on renders. During the subsequent test days in Bahrain, the design was different again, which illustrates the steps teams take to stop rivals gaining more ...Keep reading
Read more →

WRC Monte Carlo: Solberg dominates ‘proper Monte’ to claim sensational win

Oliver Solberg outlined his World Rally Championship credentials with a stunning Rally Monte Carlo victory in one of the most challenging season openers in recent memory. Toyota’s new signing defied expectations in extreme snow and icy conditions to deliver an emphatic victory, beating his more experienced Toyota team-mates Elfyn Evans [+51.8s] and reigning nine-time world champion and ...Keep reading
Read more →

Why the first 2026 F1 test is really being held in secret – and what to expect

The human brain is hard-wired to greet changes and unexpected circumstances with a stress response: the amygdalae, biochemical arbiters of the fight-or-flight instinct, flag the change as a threat. This is just one deep-seated explanation for the orgy of negativity that has surrounded the 2026 Formula 1 regulations and their introduction, from drivers hating their first experiences of the '26 ...Keep reading
Read more →

Why WRC drivers hailed return of Monaco GP circuit stage

The return of World Rally Championship cars to Monaco’s famous Grand Prix circuit has proved a hit with drivers, who would like the initiative to become a more permanent fixture in the future. Monaco’s famous circuit echoed to the sound of WRC for the first time since 2008 as a shortened version of the Formula 1 track played host to a 2.65km super special stage for this year’s Monte Carlo ...Keep reading
Read more →

WRC Monte Carlo: Solberg survives scare with healthy lead intact

A wild off-road excursion failed to derail Oliver Solberg’s Rally Monte Carlo victory bid as wintry conditions wreaked havoc at the World Rally Championship curtain raiser. Solberg continued to defy expectations, ending Saturday with a 59.3s lead over Toyota’s Elfyn Evans. Reigning world champion Sebastien Ogier had threatened to shake up the order at the front, but his charge from third ...Keep reading
Read more →

Why there are "no excuses" for Alpine in F1 2026

As Flavio Briatore said at Alpine's 2026 launch in Barcelona, his team has no more excuses. But rather than dreading the added pressure, a character-building 2025 has meant the team has been counting the days until it could finally show what it can do. It was hard not to notice Alpine was in a buoyant mood as it kickstarted its new year on the MSC World Europa, with Pierre Gasly and Franco ...Keep reading
Read more →

Haas completes 2026 F1 car shakedown ahead of Barcelona test

Haas has become the latest team to shake down its 2026 Formula 1 car, as the grid makes its final preparations for the first official test under the new regulations. F1 sophomore Oliver Bearman turned the first laps in the Haas VF-26 at Ferrari’s Fiorano circuit on Saturday, running Pirelli’s special demonstration tyres as part of the team’s filming allocation. In a short social media ...Keep reading
Read more →

WRC Monte Carlo: Solberg in control, Evans holds off Ogier as conditions worsen

Oliver Solberg remains in control of Rally Monte Carlo with a lead of more than a minute, as the wintry conditions worsened at the World Rally Championship season opener on Saturday morning. Overnight snow showers meant crews faced conditions more akin to Rally Sweden than Monte Carlo, and despite initially losing time, Solberg fought back to restore his lead to 1m02.8s over Toyota team-mate ...Keep reading
Read more →

Briatore: Horner interested in Otro's Alpine F1 team stake

Flavio Briatore has said Christian Horner is one of the interested parties in buying Otro Capital’s stake in the Alpine Formula 1 team. Since being sacked as Red Bull F1 team boss last July, Horner's future has been the subject of numerous rumours and he has been linked to several teams. Horner had been in talks with Aston Martin and Haas in recent months, but the most serious possibility ...Keep reading
Read more →

Stella wants F1 to continue to openly communicate new regulations to fans

McLaren team principal Andrea Stella has urged Formula 1 to keep up its push to communicate the nuts and bolts of the 2026 regulations to the fans due to how different the racing is set to look. The forthcoming campaign will introduce what’s arguably the biggest rule change in F1 history: the car chassis is becoming lighter and smaller, while there’ll be a near 50-50 split between the ...Keep reading
Read more →

WRC Monte Carlo: Dominant Solberg exceeds Toyota’s expectations to lead Monte Carlo Rally

Oliver Solberg’s sensational run to lead Rally Monte Carlo by more than a minute has exceeded Toyota’s expectations for its new signing at the World Rally Championship season opener. Solberg starred in Thursday night’s three stages to take an impressive 44.2s lead into Saturday where he continued his stunning drive. The Swede delivered another masterclass in challenging snowy, icy and ...Keep reading
Read more →

FIA offers update on new WRC commercial rights holder search

The FIA expects to announce the new World Rally Championship commercial rights holder within the next “couple of months” with an agreement “very close”, according to FIA Deputy President for Sport Malcolm Wilson. The future promotion of the WRC has been a hot topic after it was first reported that the previous commercial rights holder WRC Promoter, owned by energy drinks giant Red Bull ...Keep reading
Read more →

JA on F1 podcast: Red Bull F1 team boss Mekies on why 2026 is a new dawn

This week we have a special episode with two conversations from the Autosport Business Exchange in London. ABX is a gathering of leaders from across motorsport, exploring relevant themes. It takes place in London, Monaco and New York every year. As we head into a new season with so much renewal and so many question marks, the theme for London this year was The Power Shift. We hear ...Keep reading
Read more →

Audi launches its own F1 young driver programme

In the same week it confirmed a five-year plan to win the world championship by 2030, Audi has announced a driver development programme which will scout and nurture young talent from karting through the single-seater ladder, and perhaps ultimately to Formula 1. The move places Audi in the mainstream of F1 teams, the majority of which operate similar schemes with varying degrees of structure ...Keep reading
Read more →

WRC Monte Carlo: Solberg continues domination despite puncture

Oliver Solberg survived a slow puncture to hold a healthy Rally Monte Carlo lead, as Toyota’s new World Rally Championship signing continued his domination of the event. Solberg, co-driven by Elliott Edmondson, chalked up two stage wins from Friday morning’s three tests that served up extremely challenging snow- and ice-covered roads. The son of 2003 world champion Petter Solberg headed to ...Keep reading
Read more →

Williams to miss Barcelona test as 2026 F1 car is late

Williams will not take part in next week’s Formula 1 pre-season test at Barcelona, the team has revealed. F1 will reconvene next week at the Catalan track as the world championship’s new era begins, with overhauled technical regulations featuring active aerodynamics and a near-50:50 split between combustion and electric power. F1 squads have been open about the scale of the challenge ...Keep reading
Read more →

Alpine launches livery for 2026 F1 season on a cruise ship

Alpine has revealed the livery for its 2026 car ahead of the upcoming Formula 1 season. The Enstone-based outfit hosted its season launch on a cruise ship off the Catalan coast near Barcelona, celebrating the team's partnership with MSC Cruises. The new design is not radically different from its predecessor, with Alpine's blue paired with title sponsor BWT's pink. Alpine has been ...Keep reading
Read more →

Controversial 2026 F1 engine loophole won't be resolved before Australian GP

Here we are and here we go: the status quo will remain as the 2026 Formula 1 season gets under way. Mercedes and Red Bull Powertrains will be able to race with power units which are understood to employ clever metallurgy to increase the compression ratio of the internal combustion engine beyond the permitted 16:1. The issue has been a matter of great intrigue since before news leaked out to ...Keep reading
Read more →

Ferrari reveals 2026 F1 car at Fiorano

Ferrari has become the latest Formula 1 team to unveil its 2026 car, the SF-26, ahead of a shakedown at its Fiorano test track. As detailed last month, the Italian team has stuck to its traditional launch plan of revealing its new car on the same day as its shakedown at Fiorano, with both Lewis Hamilton and Charles Leclerc on hand to complete the first laps of Ferrari’s 2026 ...Keep reading
Read more →

Fallows joins Racing Bulls after short-lived Aston Martin F1 stint

Racing Bulls has hired former Red Bull and Aston Martin engineer Dan Fallows as Formula 1 technical director. Fallows will report to chief technical officer Tim Goss and “take responsibility for the overall technical direction of the team, working across design, aerodynamics and performance”, the Faenza-based squad stated in a press release. Keep reading
Read more →

The challenges facing Alpine ahead of F1 2026

After finishing dead last in the 2025 constructors’ championship, Alpine will attempt to bounce back in the upcoming Formula 1 season. What can the French outfit do with a largely unchanged team alongside a new engine partner? Ahead of its season launch in Barcelona on Friday, let’s delve into its prospects. What's new at Alpine? The main change for 2026 at Alpine is the team’s ...Keep reading
Read more →

The challenges facing Ferrari ahead of F1 2026

The car launch season for the 2026 Formula 1 season is firmly under way with Red Bull, Racing Bulls, Haas, Audi and Mercedes all having revealed their liveries for this year. Up next is Ferrari on Friday after what was a highly disappointing 2025 campaign, as the Italian outfit slipped to fourth in the championship and failed to win a grand prix for the first time since 2021. But 2026 ...Keep reading
Read more →

WRC Monte Carlo: Solberg stuns to lead Evans as fog red flags SS3

Oliver Solberg made a stunning start to life as a full-time World Rally Championship Rally1 driver to emerge from treacherous wintry conditions with the Monte Carlo Rally lead. Solberg produced a masterclass on the challenging snow- and ice-covered mountain asphalt roads to reach service with a 44.2s lead over Toyota’s Elfyn Evans. After nominal times were awarded following the red flag in ...Keep reading
Read more →

McLaren likely won’t upgrade 2026 F1 car before Australian GP

“The car everyone will see in Barcelona won’t be the car that races in Australia. I think that will be across the board, because it's simply too early.” A few days ago, Haas team principal Ayao Komatsu was confident all Formula 1 cars would evolve significantly by the Australian Grand Prix – but McLaren differs. The team won’t officially launch its MCL40 until 9 February, long ...Keep reading
Read more →

Binotto fears Audi engine performance deficit to F1 rivals in 2026

Audi Formula 1 chief Mattia Binotto is expecting his team to have an inferior power unit compared to its more established rivals in the forthcoming 2026 campaign. The German marque will make its debut as both an F1 team and engine supplier this year, after completing a takeover of Sauber to become a full factory works outfit. It coincides with what is arguably the biggest rule change ...Keep reading
Read more →

Mercedes completes 2026 F1 car shakedown at Silverstone

Mercedes’ 2026 Formula 1 car has hit the track for the first time as the Brackley-based squad ramps up preparations for the series’ new technical era. Just hours after the official unveiling of the car on Thursday, George Russell and Andrea Kimi Antonelli put the W17 through its paces at Silverstone, running Pirelli’s grooved ‘demo’ tyres. Keep reading
Read more →

Can Lancia enjoy success on its anticipated WRC return?

Lancia is confident it can immediately be in a position to fight for victories and a championship title on its return to the World Rally Championship this year. The famous Italian brand will return to the WRC stages at this weekend’s season opener in Monte Carlo with its all-new Ypsilon HF Integrale Rally2 car to do battle in the championship’s second tier WRC2 category. Lancia’s ...Keep reading
Read more →

The headaches WRC crews must soothe ahead of a ‘proper Monte’

World Rally Championship crews are braced for a ‘proper, old school’ Rally Monte Carlo with snow and wintry conditions set to become a major factor at the 2026 season opener. In recent seasons, the annual WRC curtain raiser – held on the famous twisty mountain roads in the French Alps – has been largely run in dry conditions, devoid of the notorious snow and icy conditions synonymous ...Keep reading
Read more →

Norris shares 2026 F1 target in defiant message after 2025 championship title

Lando Norris has said his goal is to secure back-to-back Formula 1 world drivers' championships in 2026, after winning his first in 2025. The McLaren driver accepted the Autosport Champion award at the 2026 Autosport Awards and confirmed to the cheering crowd that his eyes are on the title once again in the upcoming season. "It's absolutely the goal. Yes, it's absolutely, absolutely the ...Keep reading
Read more →

FIA aims to "resolve" engine loophole controversy before start of F1 2026 season

The FIA has said it is keen to settle Formula 1's first major technical controversy before the 2026 era gets under way in Australia. Several manufacturers believe Mercedes and Red Bull Powertrains have come up with a trick to cleverly exploit F1's 2026 power unit regulations, which prescribe a compression ratio of 16:1, down from 18:1 last year. That compression ratio is measured when the ...Keep reading
Read more →

Microsoft switches F1 sponsorship from Alpine to Mercedes

Mercedes has signed a multi-year deal with Microsoft, starting from the 2026 season. Microsoft was a long-time partner of the rival Alpine squad, which it first sponsored in 2012 when the outfit was named Lotus, but has now switched allegiances as its Alpine deal ended following the 2025 campaign. The Microsoft logo will be displayed on the airbox and the front wing endplates of the newly ...Keep reading
Read more →

Mercedes reveals new-look F1 design for 2026

Mercedes has lifted the covers off the W17, its new Formula 1 car for the 2026 season. The new machinery meets F1’s new technical regulations on the chassis and engine sides, featuring active aerodynamics and a near-50:50 split between combustion and electric energy. The W17 sports a mostly unchanged black and silver design, with turquoise accents as a nod to long-standing sponsor ...Keep reading
Read more →

Why McLaren won't run on the first day of F1's Barcelona test

McLaren will not run on the opening day of Formula 1's behind-closed-doors test at Barcelona as it sought to maximise the development time of its new title defender - the MCL40. Audi, Cadillac, Racing Bulls, and now Alpine have given their new 2026 models track time in private shakedown events, aiming to gather early reference points for F1's week of running at Barcelona. This begins on the 26 ...Keep reading
Read more →

"It doesn't matter how good the car is": Stewart on what makes Norris a true champion

Three-time Formula 1 world champion and motorsport legend Sir Jackie Stewart has offered his verdict on Lando Norris after the Briton secured his first drivers' title in 2025. Speaking at the Autosport Awards, where the McLaren driver was celebrated by the organisation, Sir Jackie was asked how he rated Norris as a champion. “He’s a very stable young man,” he said. "First of ...Keep reading
Read more →

Ogier hungry for record 10th WRC title on eve of Rally Monte Carlo

Sebastien Ogier says a repeat of his 2025 World Rally Championship success will be difficult, but admits the motivation “to go for it” remains amid talk of a record-breaking 10th title. Barely hours after matching Sebastien Loeb as a fellow nine-time world champion in November, Ogier was already facing questions about the possibility of fighting for a 10th title in 2026. On the eve of ...Keep reading
Read more →

Why Hyundai is confident of challenging dominant Toyota in WRC 2026

After coming agonisingly close to a drivers' and manufacturers' double title success in 2024, Hyundai found itself resoundingly beaten by rivals Toyota in the WRC last season, winning just two rallies (Greece, Saudi Arabia) compared to Toyota’s tally of 12 victories. Hyundai's 2025 struggles can be pinpointed to a number of variables. The squad heavily invested in an ‘Evo’ version of its ...Keep reading
Read more →

Sesks set to make WRC return in 2026

Martins Sesks has announced plans to contest a partial campaign in the 2026 World Rally Championship with M-Sport-Ford. Sesks and co-driver Renars Francis are set to team up with the British squad for a third season, aiming to pilot a Ford Puma Rally1 in seven WRC events beginning with Rally Sweden (12-15 February) next month. Outings in Portugal, Greece, Estonia, Finland, Sardinia and Saudi ...Keep reading
Read more →

Hyundai unleashes refreshed 2026 WRC challenger

Hyundai has revealed its new-look i20 N Rally1 that it hopes will close the gap to rivals Toyota in the 2026 World Rally Championship. The Korean brand will sport a new livery on its car for the 2026 season that will be driven by 2024 champion Thierry Neuville and Adrien Fourmaux, while the third car will be shared across Dani Sordo, Esapekka Lappi and Hayden Paddon, who rejoins the team after ...Keep reading
Read more →

M-Sport reveals 2026 WRC Ford Puma

M-Sport-Ford has taken the covers off the final iteration of the current Ford Puma Rally1 car that will tackle the 2026 World Rally Championship. The British squad has once again opted for a livery change for the new season with the purple look, featuring Red Bull branding from 2025, replaced with a striking white, green and blue colour scheme. The change of livery has been partly ...Keep reading
Read more →

How five-time runner-up Evans plans to finally become WRC champion

Elfyn Evans says lessons have been learned and areas for improvement identified to become world rally champion after the agony of losing the title by four points to Sebastien Ogier last year. The Toyota driver heads into the 2026 season as a five-time runner-up and the most successful driver of the current World Rally Championship crop yet to secure a world title. Last year, Evans came the ...Keep reading
Read more →

How did Rovanpera's single-seater debut go?

As two-time World Rally champion Kalle Rovanpera tackles a new career in single-seaters, moving to Super Formula for 2026, he’s getting crucial experience in the Formula Regional Oceania Trophy. Formerly known as the Toyota Racing Series, the championship is typically used by young drivers to compete in the winter break, as it takes place in New Zealand’s summer. Previously won by Lance ...Keep reading
Read more →

Why M-Sport chose youth over experience for its 2026 WRC line-up

M-Sport favouring youth over experience when it comes to its World Rally Championship driver line-up is nothing new, having developed a reputation for being a perennial producer of star talent. The British squad has provided a valuable proving ground for WRC stars of the future, with many of those going on to win or challenge for world titles. Its most recent success stories being Ott Tanak and ...Keep reading
Read more →

Heavy snow forces Hyundai to postpone Neuville Monte Carlo test

Heavy snowfall has forced Hyundai to postpone Thierry Neuville’s pre-event test ahead of the World Rally Championship season opener in Monte Carlo. WRC teams have headed to the south of France this week to test in preparation for the annual asphalt curtain raiser to be held from 22-25 January. While Toyota began testing on Tuesday with Elfyn Evans and Oliver Solberg running in largely dry ...Keep reading
Read more →

Toyota unveils new look 2026 WRC challenger

Toyota has revealed a bold new look for its 2026 GR Yaris Rally1 cars that will contest this year’s World Rally Championship. The Japanese marque, which won last year’s title, has opted for a fresh new livery utilising the team’s red, black and white scheme with red now the predominant colour. For the past two seasons an all-black livery has adorned Toyota’s factory WRC entries ...Keep reading
Read more →

Toyota selects Solberg to score manufacturer points in WRC 2026 opener

New Toyota signing Oliver Solberg has been nominated to score manufacturer points at the opening round of the 2026 World Rally Championship season in Monte Carlo. Event organisers have today released the 66-car entry list for the annual WRC curtain raiser (22-25 January), which will feature 11 Rally1 entries and 27 Rally2 crews, while 25 of those cars are registered to score WRC2 ...Keep reading
Read more →

Ogier pays tribute to Tanak: He "pushed me the hardest"

Sebastien Ogier says Ott Tanak pushed him “harder than anyone else” as the nine-time world rally champion paid tribute to his rival and friend, who will take a sabbatical in 2026. Tanak shocked the rally world in November when the 2019 champion announced plans to take a break from full-time competition in the WRC next year, forsaking a 2026 seat at Hyundai in the process. Keep reading
Read more →

Toyota WRC team principal buys top WRC2 squad

Toyota World Rally Championship team principal Jari-Matti Latvala has purchased the WRC2 title-winning Printsport Racing outfit. JML-WRT Oy, a company created by the 18-time WRC rally winner, has acquired the Finnish rally team that has recently guided Sami Pajari and Oliver Solberg to WRC2 titles in 2024 and 2025 respectively. Both drivers have since graduated to Toyota’s Rally1 WRC ...Keep reading
Read more →

Why Hyundai expects to be stronger in WRC 2026

Hyundai will be "better prepared" and "stronger" in the World Rally Championship next year after a difficult 2025, according to team principal Cyril Abiteboul. The Korean marque was soundly beaten this season, scoring two victories, at Acropolis Rally Greece and in Saudi Arabia, while rivals Toyota chalked up a stunning 12 wins spread across drivers Sebastien Ogier, Elfyn Evans, Kalle Rovanpera ...Keep reading
Read more →

Munster set for Dakar Rally, WRC Monte Carlo double-header

Gregoire Munster will rejoin M-Sport-Ford for the opening round of the 2026 World Rally Championship next month off the back of competing as a co-driver at the Dakar Rally. After two full WRC seasons piloting a Ford Puma Rally1, Munster’s rallying future appeared uncertain after M-Sport announced Josh McErlean and new recruit Jon Armstrong as its drivers to contest the full 2026 ...Keep reading
Read more →

Toyota brings Corolla back to rallying with all-new rally car

Toyota will bring the Corolla name back to rallying with its newly-developed GR Corolla RC2 rally car, which will compete in the American Rally Association (ARA) National Championship next year. The Japanese brand initially showcased its GR Corolla rally car concept at the Tokyo Auto Salon in January this year. The car has since undergone further development led by Toyota’s World Rally ...Keep reading
Read more →

Autosport Top 50 of 2025: #34 Oliver Solberg

Oliver Solberg produced as close to a perfect season as is possible in 2025, while confirming his future World Rally Championship title contender credentials. Solberg and co-driver Elliott Edmondson delivered the shock of the season by claiming a deserved and stunning maiden outright WRC win in Estonia, in a one-off drive for Toyota. On top of that, Solberg secured five WRC2 victories to ...Keep reading
Read more →

Autosport Top 50 of 2025: #30 Kalle Rovanpera

For large parts of the season Kalle Rovanpera looked unusually lost as he struggled to understand and extract speed from the new Hankook tyres. But when he did gel with the rubber, he was untouchable, leading to dominant wins for Toyota in the Canary Islands, Finland and Central Europe. That run ignited a title bid that went to the final round, which had seemed unlikely in mid-season. This ...Keep reading
Read more →

Autosport Top 50 of 2025: #27 Ott Tanak

The fact that Ott Tanak was able to take the fight to Sebastien Ogier and Toyota in a Hyundai lacking the speed and reliability of its rivals was hugely impressive. He claimed 56 stage wins, only four shy of Ogier’s season-best tally. Tanak even led the title race after Rally Estonia, and in Greece managed to beat Ogier in a head-to-head to claim Hyundai’s only victory before Thierry ...Keep reading
Read more →

Autosport Top 50 of 2025: #24 Elfyn Evans

The history books will forever say 2025 was the year Elfyn Evans became a five-time WRC runner-up, but in truth this was his best campaign so far. The Toyota driver led the championship for most of the season after making a blistering start with wins in Sweden and Kenya. He was also the WRC’s most consistent driver, finishing every event inside the top six. Opening the road during the summer ...Keep reading
Read more →

Autosport Top 50 of 2025: #4 Sebastien Ogier

Sebastien Ogier produced arguably the best season of his WRC career to date to equal Sebastien Loeb as a nine-time world champion, and assert himself for many as the greatest rally driver of all time. The feat is even more impressive considering Ogier and co-driver Vincent Landais contested only 11 of the 14 rounds and were up against Elfyn Evans, Kalle Rovanpera, Ott Tanak and 2024 champion ...Keep reading
Read more →

FIA reveals first look at WRC 2027 cars

The FIA has offered up a first look at the World Rally Championship cars of the future, built under the new technical regulations that will come into force from 2027. The new technical regulations, which will span a 10-year period, are designed to be more affordable and flexible in a bid to attract new manufacturers and teams to the series. Cars will be built to a €345,000 cost cap, deliver ...Keep reading
Read more →

Does Sesks have a future with M-Sport in WRC?

M-Sport Ford is involved in ongoing discussions to add Martins Sesks to its World Rally Championship driver line-up for 2026. The British squad announced its full-time driver line-up for next year this week, which sees Josh McErlean joined by European Rally Championship title runner-up Jon Armstrong – who will make the leap to Rally1 machinery for the first time. The decision comes as part ...Keep reading
Read more →

FIA announces new constructor set to join WRC 2027

The FIA has revealed details of a new constructor that is developing a car to compete under the World Rally Championship’s new technical regulations in 2027. Founded by experienced motorsport engineer Lionel Hansen, former FIA rally director and Citroen WRC boss Yves Matton and Prospeed, Project Rally One represents the first project to be officially led by a tuner under the WRC’s new ...Keep reading
Read more →

M-Sport hands Armstrong WRC drive alongside McErlean for 2026

European Rally Championship title runner-up Jon Armstrong will join Josh McErlean as part of a new-look M-Sport Ford World Rally Championship driver line-up for 2026. The British squad’s decision to retain McErlean after an impressive maiden Rally1 season and add two-time ERC rally winner Armstrong to the line-up comes as part of an expanded collaboration with the Motorsport Ireland Rally ...Keep reading
Read more →

The WRC second shot born out of never giving up

It’s fair to say Hayden Paddon’s return to the World Rally Championship next year after an eight-year hiatus was a surprise for the rally world to digest. It was even a shock for Paddon, but it was one of those “good surprises” that underlines why it is important to never give up on dreams. That dream is now reality, as the 38-year-old finds himself preparing to contest next month’s ...Keep reading
Read more →

Lancia confirms driver line-up for 2026 WRC return

Lancia has unveiled its driver line-up ahead of its long-awaited return to the World Rally Championship next year. The Italian brand that won a record 10 WRC manufacturers’ titles named Yohan Rossel and Nikolay Gryazin as its drivers for its comeback, which will see the automaker contest the second tier WRC2 championship. Rossel and Gryazin will pilot Lancia’s all-new Ypsilon HF ...Keep reading
Read more →

FIA shares final details of 2027 WRC regulations

The FIA has today confirmed the final elements of the World Rally Championship's new technical regulations, which will come into force from 2027. The 2027 regulations, originally unveiled in December last year, are designed to be more affordable and flexible in a bid to attract new manufacturers and teams to the series. Cars will be built to a €345,000 cost cap, deliver approximately 300 ...Keep reading
Read more →

What Fourmaux and Solberg learned from 2026 Monte Carlo test

Preparations for the 2026 World Rally Championship began in earnest last weekend just a matter of days after Rally Saudi Arabia brought the curtain down on the 2025 campaign. With the opening round of the 2026 season in Monte Carlo a little more than a month away on 22-25 January, Toyota and Hyundai both fielded cars in last weekend’s Rallye National Hivernal du Devoluy in France. The ...Keep reading
Read more →

Paddon, Lappi and Sordo join Hyundai 2026 WRC line-up

World Rally Championship event winners Dani Sordo, Esapekka Lappi and Hayden Paddon will rejoin Hyundai to share its third car next year following the departure of Ott Tanak. The trio will join 2024 world champion Thierry Neuville and multiple podium finisher Adrien Fourmaux after the pair agreed contract extensions to pilot the team’s two other i20 N Rally1 entries on a full-time ...Keep reading
Read more →

Solberg steps up WRC 2026 preparations with Rally1 outing

Toyota's new signing Oliver Solberg will drive a GR Yaris Rally1 car on asphalt for the first time this week as he steps up his preparations for the 2026 World Rally Championship. A week on from Rally Saudi Arabia, where Solberg concluded a 2025 campaign that yielded a WRC2 title and a maiden outright WRC victory at Rally Estonia, the Swedish rally ace and co-driver Elliott Edmondson are back ...Keep reading
Read more →

Hyundai upgrading Rally2 car to “cover all bases” as 2027 WRC decision looms

Hyundai has started upgrading its Rally2 car as an option should it decide to continue its involvement in the World Rally Championship in 2027. The Korean manufacturer's long-term future in rallying’s top tier has been shrouded in uncertainty for months, with a call to contest next season - the final year of Rally1 regulations - only announced in August. Keep reading
Read more →

How Ogier matched Loeb's WRC record in demanding desert duel

At the finish of the most gruelling rally of the year there was a familiar smile on the face of Sebastien Ogier as he and co-driver Vincent Landais unfurled a Tricolore with a nine slapped bang in the centre. The inaugural Rally Saudi Arabia will forever be associated with a huge moment in World Rally Championship history, as Ogier matched the nine title record of the great Sebastien Loeb and ...Keep reading
Read more →

Was Rally Saudi Arabia too extreme for WRC?

World Rally Championship drivers feel Rally Saudi Arabia’s unique challenges deserve a place on the calendar, but the event is too extreme for the cars and tyres to host a title decider. Saudi Arabia made its WRC debut at the weekend after signing a 10-year deal with the championship to host the final round for at least five years of the deal. The rally, based out of Jeddah, delivered a ...Keep reading
Read more →

Rovanpera calls time on WRC career: ‘Rallying always came naturally to me’

Kalle Rovanpera says rallying will always hold a special place in his heart after closing the chapter on a record-breaking World Rally Championship career in favour of a move to single-seater racing. A fairytale finish to his WRC career eluded the Toyota driver, who missed out on securing a third world title, finishing seventh at the Rally Saudi Arabia season finale after suffering ...Keep reading
Read more →

No regrets: Tanak grateful to have chased WRC dream

Rally Saudi Arabia marked Ott Tanak’s final outing with Hyundai after the 2019 world champion announced plans to take a sabbatical earlier this month to reset and spend more time with his family. Tanak had been in the victory hunt in Saudi Arabia before a series of punctures on Friday ended any hopes of a podium to sign off his time in the WRC. While the 38-year-old ...Keep reading
Read more →

Weather

Current Weather Conditions

Current Conditions: Clouds - scattered clouds

Temperature: 31.55°F (Feels like: 24.96°F)

Wind: 7 mph

Humidity: 75%

Sunrise: 07:39, Sunset: 17:41

5-Day Weather Forecast

Thursday

Clouds: few clouds

High: 40.91°F, Low: 23.88°F

Friday

Clouds: broken clouds

High: 44.76°F, Low: 26.47°F

Saturday

Clouds: broken clouds

High: 49.84°F, Low: 37.53°F

Sunday

Clouds: overcast clouds

High: 54.23°F, Low: 40.69°F

Monday

Clouds: overcast clouds

High: 52.9°F, Low: 41.47°F

Last updated: 2026-01-29 20:30:22