Technology
From Rockets to Heat Pumps
2025-12-06 00:41 | Source: Hacker News
Comments
Read more →
Extra Instructions of the 65XX Series CPU
2025-12-06 00:38 | Source: Hacker News
Comments
Read more →
Sam Altman's Dirty DRAM Deal
2025-12-06 00:24 | Source: Hacker News
Comments
Read more →
Dithering: ‘Alan Dye Leaves Apple’
2025-12-05T23:25:09Z | Source: Daring Fireball
Dithering is my and Ben Thompson’s twice-a-week podcast — 15 minutes per episode, not a minute less, not a minute more. It’s a $7/month or $70/year subscription, and included in the Stratechery Plus bundle (a bargain). This year our CMS (Passport — check it out) gained a feature that lets us make some episodes free for everyone to listen to on the website. Today’s episode, regarding Alan Dye leaving Apple for Meta, seems like a good one to do that with. (And, once again, this month’s album art serendipitously captures my mood.) Give it a listen. Subscribe if you enjoy it. ★
Read more →
Apple’s Succession Intrigue Isn’t Strange at All
2025-12-05T23:08:12Z | Source: Daring Fireball
Aaron Tilley and Wayne Ma, in a piece headlined “Why Silicon Valley is Buzzing About Apple CEO Succession” at the paywalled-up-the-wazoo The Information: Prediction site Polymarket places Ternus’ odds of getting the job at nearly 55%, ahead of other current Apple executives such as software head Craig Federighi, Chief Operating Officer Sabih Khan and marketing head Greg Joswiak. But some people close to Apple don’t believe Ternus is ready to take on such a high-profile role, and that could make a succession announcement unlikely anytime soon, said people familiar with the company. Nothing in the rest of the article backs up that “some people close to Apple don’t believe Ternus is ready” claim, other than this, several paragraphs later: And while his fans believe Ternus has the temperament to be CEO, many of them say he isn’t a charismatic leader in the mold of a Jobs. He has also had little involvement in the geopolitical and government affairs issues that dominate most of Cook’s time these days. On a recent trip to China, for example, Apple’s new COO, Sabih Khan, accompanied Cook to some of his meetings. No one else in the history of the industry, let alone the company, has the charisma of Steve Jobs. And while I think Polymarket has the shortlist of candidates right, I also think they have them listed in the right order. Sabih Khan probably should be considered an outside-chance maybe, but the fact that he accompanied Cook to China doesn’t make me think, for a second, that it’s in preparation to name him CEO. If Khan were being groomed to become CEO, he’d have started appearing in keynotes already. It’s silly to slag Ternus for not having the charisma of Steve Jobs, when Ternus has been a strong presence in keynotes since 2018, and in the same paragraph suggest Khan as a better option, when Khan has never once appeared in a keynote or public appearance representing Apple. Some former Apple executives hope a dark-horse candidate emerges. For example, Tony Fadell, a former Apple hardware executive who coinvented [sic] the iPod, has told associates recently that he would be open to replacing Cook as CEO, according to people who have heard his remarks. (Other people close to Apple consider Fadell an unlikely candidate, in part because he was a polarizing figure when he worked at the company. Fadell left Apple in 2010.) The parenthetical undersells the unlikelihood of Fadell returning to Apple, ever, in any role, let alone the borderline insanity of suggesting he’d come back as Cook’s successor. It has become one of the strangest succession spectacles in tech. Typically, the kind of buzz that is swirling around Cook occurs when companies are performing badly or a CEO has dropped hints that they’re getting ready to hang up their spurs. Neither applies in Cook’s case, though. There’s nothing strange about it. Apple has a unique company culture, but so too do its peers, like Microsoft, Amazon, and Google. And just like at those companies, it’s therefore a certainty that Cook’s replacement will come from within the company’s current ranks. Polymarket doesn’t even list anyone other than Ternus, Federighi, Joswiak, and Khan. As for hints, there is not much need for any hint beyond the fact that Cook is now 65 years old and has been in the job since 2011. But the high-profile multi-source leak to the Financial Times is a pretty obvious fucking additional hint. ★
Read more →
Lisa Jackson on The Talk Show Back in 2017
2025-12-05T22:37:35Z | Source: Daring Fireball
This interview was both interesting and a lot of fun. Worth a listen or re-listen. ★
Read more →
Apple Announces a Few Other Executive Transitions
2025-12-05T22:18:12Z | Source: Daring Fireball
Apple Newsroom, yesterday: Apple today announced that Jennifer Newstead will become Apple’s general counsel on March 1, 2026, following a transition of duties from Kate Adams, who has served as Apple’s general counsel since 2017. She will join Apple as senior vice president in January, reporting to CEO Tim Cook and serving on Apple’s executive team. In addition, Lisa Jackson, vice president for Environment, Policy, and Social Initiatives, will retire in late January 2026. The Government Affairs organization will transition to Adams, who will oversee the team until her retirement late next year, after which it will be led by Newstead. Newstead’s title will become senior vice president, General Counsel and Government Affairs, reflecting the combining of the two organizations. The Environment and Social Initiatives teams will report to Apple chief operating officer Sabih Khan. [...] Newstead was most recently chief legal officer at Meta and previously served as the legal adviser of the U.S. Department of State, where she led the legal team responsible for advising the Secretary of State on legal issues affecting the conduct of U.S. foreign relations. Monday’s announcement that AI head John Giannandrea is retiring and the hierarchy for AI related projects being further reshuffled under software head Craig Federighi was significant, but not surprising, given how things went this year for Apple with AI. Wednesday’s announcement that VP of design and Liquid Glass frontman Alan Dye is leaving Apple for Meta was a shock, both inside and outside the company. As I wrote this week, I think it’s great news for Apple, but not by plan. This news yesterday is just typical planned retirements. The timing is slightly unfortunate though. In the eyes of observers unfamiliar with the company, they might be misconstrued as signs of executive upheaval, occurring on the heels of the minor and major dramas of Giannandrea’s and Dye’s departures. The Jackson / Adams / Newstead transitions announced yesterday are nothing of the sort. Jackson had a very nice run at Apple and carved out a rather unique position within the company. Apple’s environmental efforts expanded tremendously under her leadership. I’ve never met anyone with a bad word to say about her, and in my own interactions, found her downright delightful. As for Adams, the responsibilities of Apple’s general counsel are generally far afield from my interests. The only two times I’ve mentioned her at DF were when she got the job in 2017, and a passing reference when the FBI sent a letter to Apple, addressed to Adams, in 2020 regarding the locked phone of a mass shooter in Pensacola, Florida. That’s a sign of a good run for a general counsel — it’s a job where no news is good news. Lastly, I wouldn’t read anything into Newstead coming to Apple by way of Meta. But it is a bit funny that it was announced the day after Dye left Apple for Meta. She seems to have an excellent wide-ranging background to spearhead Apple’s government affairs. Her stint in the State Department was during the first (now seemingly sane) Trump administration, but she clerked for liberal Supreme Court Justice Stephen Breyer. ★
Read more →
★ 2025 App Store Award Winners: Tiimo, Essayist, and Detail
2025-12-05T21:46:12Z | Source: Daring Fireball
Apple, today: “Announcing the 2025 App Store Awards”: This year’s winners represent the best-in-class apps and games we returned to again and again. We hope you enjoy them as much as we do. I did not enjoy all of them as much as Apple did. Tiimo iPhone app of the year Tiimo bills itself as an “AI Planner & To-do” app that is designed with accommodations for people with ADHD and other neurodivergences. Subscription plans cost $12/month ($144/year) or $54/year ($4.50/month). It does not offer a native Mac app, and at the end of onboarding/account setup, it suggests their web app for use on desktop computers. When I went to the web app, after signing in with the “Sign in With Apple” account I created on the iPhone app, Tiimo prompted me to sign up for an annual subscription for $42/year ($3.50/month), or monthly for $10 ($120/year). The in-app subscriptions offer a 30-day free trial; the less expensive pay-on-the-web subscriptions only offer a 7-day free trial. The web app doesn’t let you do anything without a paid account (or at least starting a trial); the iOS app offers quite a bit of basic functionality free of charge. From Apple’s own description for why it gave Tiimo the award: Built to support people who are neurodivergent (and anyone distracted by the hum of modern life), Tiimo brought clarity to our busy schedules using color-coded, emoji-accented blocks. The calming visual approach made even the most hectic days feel manageable. It starts by syncing everything in Calendar and Reminders, pulling in doctor’s appointments, team meetings, and crucial prompts to walk the dog or stand up and stretch. Instead of dumping it all into a jumbled list, the app gives each item meaning by automatically assigning it a color and an emoji. (Tiimo gave us the option to change the weightlifter emoji it added to our workout reminders, but its pick was spot on.) While on the move with coffee in one hand and keys in the other, we sometimes talked to Tiimo with the AI chatbot feature to add new tasks or shift appointments. When we felt overwhelmed by our to-do list, Tiimo kept us laser-focused by bubbling up just high-priority tasks, while its built-in Focus timer (accessible from any to-do with a tap) saved us from the pitfalls of multitasking. But Tiimo really stood out when we faced a big personal project, like getting our Halloween decorations up before Thanksgiving. With the help of AI, the app suggested all the smaller tasks that would get us there: gathering the decorations from the garage, planning the layout, securing the cobwebs, and doing a safety check. Aside from the web app, Tiimo is iOS exclusive, with apps only for iPhone, iPad, and Apple Watch. No Android version. It seems to do a good job with native platform integration (Calendar integration is free; Reminders integration requires a subscription). Animations in the app feel slow to me, which makes the app itself feel slow. And, personally, I find Tiimo’s emphasis on decorating everything with emoji distracting and childish, not clarifying. The app seems OK, but not award-worthy to me. But, admittedly, I’m not in the target audience for Tiimo’s ADHD/neurodivergent focus. I don’t need reminders to have coffee in the morning, start work, have dinner, or to watch TV at night, which are all things Tiimo prefilled on my Today schedule after I went through onboarding. As I write this sentence, I’ve been using Tiimo for five minutes, and it’s already prompted me twice to rate it on the App Store. Nope, wait, I just got a third prompt.
That’s thirsty, and a little gross. (And, although I’m not an ADHD expert, three prompts to rate and review the app in the first 10 minutes of use strikes me as contrary to the needs of the easily distracted.) Essayist Mac app of the year Essayist bills itself as “The Word Processor designed for Academic Writing” (capitalization verbatim). Subscriptions cost $80/year ($6.67/month) or $10/month ($120/year). Its raison d’être is managing citations and references, and automatically formatting the entire document, including citations, according to a variety of standards (MLA, Chicago, etc.). Quoting from Apple’s own description of Essayist: Essayist gives you an easy way to organize a dizzying array of primary sources. Ebooks, podcasts, presentations, and even direct messages and emails can be cataloged with academic rigor. Using macOS Foundation Models, Essayist extracts all the key info needed to use it as a source. For example, paste a YouTube URL into an entry and Essayist automatically fills in the name of the video, its publication date, and the date you accessed it. Drag in an article as a PDF to have Essayist fill in the title, author, and more — and store the PDF for easy access. You can also search for the books and journal articles you’re citing right in the app. Essayist is a document-based (as opposed to library-based) app, and its custom file format is a package with the adorable file extension “.essay”. The default font for documents is Times New Roman, and the only other option is, of all fonts, Arial — and you need an active subscription to switch the font to Arial. (Paying money for the privilege to use Arial... Jiminy fucking christ. I might need a drink.) I appreciate the simplicity of severely limiting font choices to focus the user’s attention on the writing, but offering Times New Roman and Arial as the only options means you’re left with the choice between “the default font’s default font” and “font crime”. The Essayist app itself has no Settings; instead, it offers only per-document settings. The app carries a few whiffs of non-Mac-likeness (e.g. the aforementioned lack of Settings, and some lame-looking custom alerts). The document settings window refers to a new document, even after it has been saved with a name, as “Untitled” until you close and reopen the document. Reopened documents do not remember their window size and position. But poking around with otool, it appears to be written using AppKit, not Catalyst. I suspected the app might be Catalyst because there are companion iOS apps for iPhone and iPad, which seem to offer identical feature sets as the Mac app. Essayist uses a clever system where, unless you have a subscription, documents can only be edited on the device on which they were created, but you can open them read-only on other devices. That feels like a good way to encourage paying while giving you a generous way to evaluate Essayist free of charge. There is no Android, Windows, or web app version — it’s exclusive to Mac and iOS. I’ve never needed to worry about adhering to a specific format for academic papers, and that’s the one and only reason I can see to use Essayist. In all other aspects, it seems a serviceable but very basic, almost primitive, word processor. There’s no support for embedding images or figures of any kind in a document, for example. [Correction: Essayist does support figures, but I missed the UI for how to insert them.] Detail iPad app of the year Detail bills itself, simply and to the point, as an “AI Video Editor”. 
The default subscription is $70/year ($5.83/month) with a 3-day free trial; the other option is to pay $12/month ($144/year) with no free trial. After a quick test drive, Detail seems like an excellent video editing app, optimized for creating formats common on social media, like reel-style vertical videos where you, the creator, appear as a cutout in the corner, in front of the video or images that you’re talking about. The iPhone version seems equally good. The iPad version of Detail will install and run on MacOS, but it’s one of those “Designed for iPad / Not verified for macOS” direct conversions. But they do offer a standalone Mac app, Detail Studio, which is a real Mac app, written using AppKit, which requires a separate subscription to unlock pro features ($150/year or $22/month). Detail only offers apps for iOS and MacOS — no Windows, Android, or web. From Apple’s own acclaim for Detail: When we used Detail to record a conversation of two people sitting side by side, the app automatically created a cut that looked like it was captured with two cameras. It zoomed in on one speaker, then cut away to the other person’s reaction. The app also made it easy to unleash our inner influencer. We typed a few key points, and the app’s AI wrote a playful script that it loaded into its teleprompter so we could read straight to the camera. Most importantly, Detail helped us memorialize significant life moments all while staying present. At a birthday party, we propped an iPad on a table and used Detail to record with the front and back cameras simultaneously. The result was a split-screen video with everyone singing “Happy Birthday” on the left and the guest of honor blowing out the candles on the right. (No designated cameraperson needed.) Detail has a bunch of seemingly genuinely useful AI-based features. But putting all AI features aside, it feels like a thoughtful, richly featured manual video editor. I suspect that’s why the AI features might work well — they’re an ease-of-use / automation layer atop a professional-quality non-AI foundation. Basically, Detail seems like what Apple’s own Clips — recently end-of-life’d — should have been. It turns your iPad (or iPhone) into a self-contained video studio. Cool. Of these three apps — Tiimo on iPhone, Essayist on Mac, and Detail on iPad — Detail appeals to me the most, and strikes me as the most deserving of this award. If I were to start making videos for modern social media, I’d strongly evaluate Detail as my primary tool. Apple still has no standalone category for AI apps, but all three of these apps emphasize AI features, and Apple itself calls out those AI features in its praise for them. It’s an obvious recurring theme shared by all three, along with their shared monetization strategies of being free to download with in-app subscriptions to unlock all features, and the fact that all three winners are exclusive to iOS and Mac (and, in Tiimo’s case, the web).
Read more →
Netflix Agrees to Buy Warner Bros., Including HBO, for $83 Billion
2025-12-05T16:47:44Z | Source: Daring Fireball
Meg James, reporting for The Los Angeles Times (News+ link): The two companies announced the blockbuster deal early Friday morning. The takeover would give Netflix such beloved characters as Batman, Harry Potter and Fred Flintstone. Fred Flintstone? “Our mission has always been to entertain the world,” Ted Sarandos, co-CEO of Netflix, said in a statement. “By combining Warner Bros.’ incredible library of shows and movies — from timeless classics like Casablanca and Citizen Kane to modern favorites like Harry Potter and Friends — with our culture-defining titles like Stranger Things, KPop Demon Hunters and Squid Game, we’ll be able to do that even better.” Not sure Squid Game belongs in the same comparison as Citizen Kane, but the Warners library is incredibly deep. Stanley Kubrick’s post-2001: A Space Odyssey films were all for Warner Bros. Netflix’s cash and stock transaction is valued at about $27.75 per Warner Bros. Discovery share. Netflix also agreed to take on more than $10 billion in Warner Bros. debt, pushing the deal’s value to $82.7 billion. [...] Warner’s cable channels, including CNN, TNT and HGTV, are not included in the deal. They will form a new publicly traded company, Discovery Global, in mid-2026. I don’t know if this deal makes sense for Netflix, but Netflix has earned my trust. Netflix is a product-first company. They care about the quality of their content, their software, their service, and their brand. If you care about the Warner/HBO legacy, an acquisition by Netflix is a much, much better outcome than if David Ellison had bought it to merge with Paramount. The LA Times article goes on to cite concerns from the movie theater industry, based on Netflix’s historic antipathy toward theatrical releases for its films. Netflix is promising to keep Warner Bros.’s film studio a separate operation, maintaining the studio’s current support for theatrical releases. I hope they do. I grew up loving going to the movies. I still enjoy it, but the truth is I go far less often as the years go on. Movie theaters shouldn’t be a protected class of business just because there’s so much affection and nostalgia for them. If they continue sliding into irrelevance, so be it. That’s how disruption, progress, and competition work. ★
Read more →
★ Alan Dye Was in Tim Cook’s Blind Spot
2025-12-05T01:53:12Z | Source: Daring Fireball
NBC News, back in March 2018: Speaking at a town hall event hosted by MSNBC’s Chris Hayes and Recode’s Kara Swisher, Cook said Facebook put profits above all else when it allegedly allowed user data to be taken through connected apps. [...] When asked what he would do if he were in Zuckerberg’s position, Cook replied: “What would I do? I wouldn’t be in this situation.” “The truth is we could make a ton of money if we monetized our customer, if our customer was our product,” Cook said. “We’ve elected not to do that.” “Privacy to us is a human right. It’s a civil liberty, and something that is unique to America. This is like freedom of speech and freedom of the press,” Cook said. “Privacy is right up there with that for us.” Perhaps Cook now needs to define “us”. This was a rather memorable interview. Cook’s “What would I do? I wouldn’t be in this situation” is one of the stone-coldest lines he’s ever zinged at a rival company. (In public, that is.) That was just ice cold. Cook is a consummate diplomat. Most non-founder big company CEOs are. Satya Nadella, Sundar Pichai, Andy Jassy — none of them are known for throwing shade, let alone sharp elbows, at competitors. Cook has made an exception, multiple times, when it comes to Facebook/Meta (and to a lesser degree, Google). So it’s not just that Alan Dye jumped ship from Apple for the chief design officer role at another company.1 It’s not just that he left for a rival company. It’s that he left Apple for Meta, of all companies. Given what Cook has said about Meta publicly, one can only imagine what he thinks about them privately. Apple executives tend to stay at Apple. The stability of its executive team is unparalleled. But Dye is a senior leader who not only left for a rival, but the one rival that Cook and the rest of Apple’s senior leadership team consider the most antithetical to Apple’s ideals. It would have been surprising if Dye had jumped ship to Google or Microsoft. It would have been a little more surprising if he’d left for Amazon, if only because Amazon seemingly places no cultural value whatsoever on design, as Apple practices it. But maybe with Amazon it would have been seen as Andy Jassy deciding to get serious about design, and thus, in a way, less surprising after the fact. But leaving Apple for Meta, of all companies, feels shocking. How could someone who would even consider leaving Apple for Meta rise to a level of such prominence at Apple, including as one of the few public faces of the company? So it’s not just that Alan Dye is a fraud of a UI designer and leader, and that Apple’s senior leadership had a blind spot to the ways Dye’s leadership was steering Apple’s interface design deeply astray. That’s problem enough, as I emphasized in my piece yesterday. It’s also that it’s now clear that Dye’s moral compass was not aligned with Apple’s either. Tim Cook and the rest — or at least most? — of Apple’s senior leadership apparently couldn’t see that, either. I’d have thrown OpenAI in that list of companies where it would have been surprising, but not shocking, for Dye to leave Apple for. But that simply wasn’t possible given Jony Ive’s relationship with Sam Altman, LoveFrom’s collaboration with OpenAI on the io project, and Ive’s utter disdain for Dye’s talent, leadership, and personality. ↩︎
Read more →
Iterate.ai Launches AgentOne for Enterprise AI Code Security
2025-12-05 23:00 | Source: The New Stack
Iterate.ai is launching the GA release of AgentOne, an autonomous coding assistant that bakes security validation directly into AI code generation. This launch is a response to what company officials say is a growing crisis in enterprise development, where traditional security reviews cannot keep up with AI’s accelerated output. The San Jose-based company is releasing AgentOne as a Visual Studio Code extension and through its Interplay platform, tackling a problem that Iterate co-founder and CTO Brian Sathianathan sums up this way: AI coding tools now generate code at 100 times the speed of human developers, but security processes have not caught up. “Enterprise teams have been intoxicated by AI’s speed without addressing the elephant in the room,” Sathianathan said in a statement. “When you’re generating code at 100x velocity, a single vulnerability can multiply across services and trigger cascading failures in minutes.” Human developers used to write a couple of thousand lines of code per day. AI tools can now generate tens of thousands of lines per minute. That velocity has created what Iterate co-founder and CEO Jon Nordmark calls an “existential requirement” for enterprise software: figuring out how to maintain security and stability when applications can be built in minutes instead of months. “When you can generate a complete application in minutes, the question isn’t whether AI will transform development, it’s whether you’ll do it securely,” Nordmark said in a statement. Parallel Security Agents AgentOne’s approach centers on what Iterate.ai calls a Swarm Intelligence Architecture: specialized agents working in parallel to generate, validate and secure code simultaneously rather than sequentially. The system embeds OWASP Top 10 scanning, static code analysis and compliance checks into every stage of development instead of waiting for post-build security reviews. “We don’t just generate code faster; we orchestrate multiple security-focused agents simultaneously, embedding OWASP compliance checks, real-time architecture validation, and continuous security review into the development process itself,” Sathianathan said. The platform runs continuous validation, including memory leak detection, injection flaw scanning and race condition checks. It autogenerates architecture diagrams and dependency maps for security audits while parallelized security agents cross-check code in real time. When the system detects issues, it can automatically fix them — not just flag them for human review. Sathianathan brought up a demo where AgentOne ran a full security scan, identified 18 separate security errors, fixed them automatically, then ran an OWASP-comparable analysis. “A lot of times, finding vulnerabilities is one thing, but also fixing them is another,” he said. “That’s a benefit we bring to the table right out of the gate.” The company cites independent audits showing 99.7% security compliance, 60% reduction in vulnerabilities through real-time detection and 40% fewer production bugs. Those bug reductions come partly from confidence indicators that flag when the AI is uncertain and human oversight is needed. Context That Doesn’t Evaporate The other major constraint AgentOne addresses is context loss. Current AI coding assistants typically lose track of what they’re working on mid-project, forcing developers to constantly re-explain architecture and dependencies. GPT-5 maxes out at 272,000 tokens of context. Claude handles 200,000 tokens. Even Google’s Gemini Pro only reaches 1 million tokens.
AgentOne maintains 2 million tokens — roughly 10 times more than leading competitors, the company said. “Imagine you have millions of lines of code generated every day. How do you manage them?” Sathianathan said. That’s where the extended context matters. AgentOne’s Maestro Mode can coordinate interdependent tasks across multiple repositories, automatically debug errors, and maintain awareness across entire applications and multiweek development cycles. The system is built to handle codebases with more than 500,000 files without the context loss that kills productivity on complex projects. That means developers working on extended refactors or legacy system integrations don’t have to keep reteaching the AI about project structure. AgentOne also generates what Iterate.ai calls “block code” — drag-and-drop components for its Interplay platform. “We can not only generate regular code, we can also generate block code,” Sathianathan told The New Stack. “Because Interplay is a drag-and-drop platform, you can generate these Lego-like blocks so you can maintain and manage them.” Key Features AgentOne embeds live security validation into every stage of development: OWASP Top 10 scanning on every code change. Static code analysis to detect memory leaks, injection flaws and race conditions. Compliance checks against enterprise coding standards and regulatory frameworks. Architecture visualization that autogenerates component diagrams and dependency maps for security audits. Parallelized security agents that cross-check code in real time, preventing vulnerabilities from slipping through. Moreover, independent audits confirm the enterprise impact: 99.7% security compliance, matching or exceeding human review accuracy. 60% reduction in vulnerabilities, driven by real-time detection before deployment. 40% fewer production bugs, due to confidence indicators that signal when the AI is uncertain and human oversight is required. Built for Enterprise Control AgentOne offers on-premises deployment for organizations that need to maintain control of intellectual property and sensitive data. The platform supports multiple AI providers, including Anthropic Claude, OpenAI’s GPT-5, Google Gemini, and private LLMs via Amazon Bedrock, with automatic failover and load balancing. During a briefing with The New Stack, Sathianathan positioned AgentOne against enterprise-focused competitors like Augment Code and Blitzy. “Traditional tools like Cursor and Windsurf are individual developer toolsets,” he said. “This is more of a deep enterprise toolset.” Founded in 2013 by Nordmark, who previously founded eBags.com, and Sathianathan, an Apple veteran, Iterate.ai serves customers including Fujifilm, Circle K and Ulta Beauty. The company’s product lineup includes Interplay, its drag-and-drop AI application platform, and Generate, a privacy-first AI assistant for document analysis and business automation. AgentOne is available as a VS Code extension. Installation takes four clicks: Download the VSIX file, select “Install from VSIX…” in VS Code, and activate with your next project. The post Iterate.ai Launches AgentOne for Enterprise AI Code Security appeared first on The New Stack.
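The parallel-validation idea described above is easier to picture with a small, self-contained sketch. To be clear, this is not AgentOne's architecture or code: the checkers, the sample snippet, and the findings below are invented for illustration. It simply shows the general pattern of several independent "security agents" reviewing generated code concurrently and pooling whatever each one flags.

# Conceptual sketch only: a few toy "security agents" reviewing generated code
# in parallel and pooling their findings. Not AgentOne or Interplay code.
import re
from concurrent.futures import ThreadPoolExecutor

GENERATED_CODE = '''
import sqlite3
def find_user(conn, name):
    return conn.execute("SELECT * FROM users WHERE name = '" + name + "'")
PASSWORD = "hunter2"
'''

def check_sql_concatenation(code: str) -> list[str]:
    # Naive heuristic: string concatenation inside an execute() call.
    return ["possible SQL injection via string concatenation"] if re.search(r"execute\([^)]*\+", code) else []

def check_hardcoded_secrets(code: str) -> list[str]:
    return ["possible hardcoded credential"] if re.search(r"(?i)(password|secret|api_key)\s*=\s*['\"]", code) else []

def check_dangerous_calls(code: str) -> list[str]:
    return ["use of eval()/exec()"] if re.search(r"\b(eval|exec)\(", code) else []

AGENTS = [check_sql_concatenation, check_hardcoded_secrets, check_dangerous_calls]

def review(code: str) -> list[str]:
    # Run every checker concurrently and merge whatever each one reports.
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        results = pool.map(lambda agent: agent(code), AGENTS)
    return [finding for findings in results for finding in findings]

if __name__ == "__main__":
    for finding in review(GENERATED_CODE):
        print("FLAG:", finding)

A real system would swap the toy regex checks for proper static analysis and policy engines; the point is only that validation runs alongside generation rather than as a later gate.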
Read more →
Adenosine on the common path of rapid antidepressant action: The coffee paradox
2025-12-05 22:10 | Source: Hacker News
Comments
Read more →
How Capital One Cut Tracing Data by 70% With OpenTelemetry
2025-12-05 22:00 | Source: The New Stack
Organizations absolutely need to squeeze as much value out of their telemetry as they possibly can, for a number of reasons. Gathering telemetry data for observability is also a very tricky proposition, to say the least. On one hand, turning on the spigot to pull all metrics that an environment generates quickly becomes — to put it mildly — an unwieldy and unmanageable situation, not to mention unaffordable for most, if not all, organizations. Too little sampling of metrics data means that the data is likely missing key elements for debugging, interpreting or monitoring for potential outages and other problems. Optimization of operations and development becomes skewed, inaccurate or unreliable. Additionally, using the wrong sampling data for metrics is of little to no help. This dilemma is compounded for very large enterprises such as, in this case, Capital One Bank. During the Observability Day event ahead of KubeCon + CloudNativeCon North America, Capital One engineers Joseph Knight and Sateesh Mamidala showed how they relied on OpenTelemetry to solve the trace-sampling problem and were able to implement their approach across Capital One’s entire operations worldwide. Their efforts paid off: They reported a 70% reduction in tracing data volumes. It wasn’t an easy task, but OpenTelemetry served as the backbone for their gargantuan project, which they detailed in their KubeCon presentation. Capital One’s shift to @OpenTelemetry: Joseph Knight & Sateesh Mamidala, discussed why it was necessary during their Observability Day talk « From Data Overload To Optimized Insights: Implementing OTel Sampling for Smarter Observability » before #KubeCon NA. @linuxfoundation pic.twitter.com/qZMtmn4Jdx — BC Gain (@bcamerongain), Nov. 11, 2025 As Knight said during their talk, Capital One’s metrics involved dealing with “more than a petabyte per day without any sampling.” The solution required a deployment of dedicated infrastructure. Tail-based sampling requires turning it into a horizontally scaling problem, as you must “bring all the spans together for a trace before you can make a sampling decision,” Knight said. This, he added, resulted in layering collectors with a load-balancing exporter, a collector layer, and then a sampling processor layer, all entirely dedicated to tracing. Why Capital One Chose OpenTelemetry Over Vendor Tools Before adopting OpenTelemetry, Capital One’s engineers relied on vendor tools that implemented their own, often disparate, sampling strategies, typically providing only head-based sampling, in which the decision to keep a trace or not is made at the beginning of a request. OpenTelemetry “gave us the new perspective that head-based sampling is not very effective,” Knight said. The current approach with OTel offers two key benefits, Knight said. The first is that the centralized team now has control over the cost of distributed tracing. This control ensures that widespread adoption is possible with the available resources. Second, the team can provide guarantees to application teams that “they will be able to see certain behavior in their tool,” such as specific errors, which builds “a lot more comfort in how sampling affects the traces coming from their application,” Knight said. This, he added, “can’t be achieved with micro, probabilistic or deadly sampling.” Best Practices for Making Sampled Tracing Data Useful The key to making sampled data useful is the addition of tags.
Capital One’s team adds tags to sampled traces to indicate how they were selected and at what probabilistic ratio they were sampled. This is useful in two ways, Knight said. Estimation: Teams can estimate the original trace data generated by multiplying the trace value by the probabilistic ratio, which gives an estimate for how many traces or requests were generated prior to sampling. Historical accuracy: By tagging the data directly, if the sampling ratios change over time, the original ratios are “baked in with the source data,” Knight said, allowing teams to look backward without seeing jumps over time. Furthermore, instead of relying on every span for rate information, teams should be taught to use metrics along with spans to get a more accurate picture of system behavior. “We export the semantic convention metrics, histograms for every single span that we generate, both from the server and your client side,” Knight said. Using these metrics for accurate counts means “you don’t need every span to understand the rate of your system,” he said. “Building rules and guides for translating tools, alerts and dashboards to use metrics can make this transition easier.” The Strategic Shift From Head- To Tail-Based Sampling The shift from head-based to tail-based sampling, in which the sampling occurs at the end of the trace, has been a success, Knight said. The teams are now “very happy that they are getting a much better picture now from the traces than before,” he said. This is because tail sampling allows the decision to be made after receiving all the spans and looking at the entire trace. Despite the challenges of finding the right balance between high-rate and low-rate applications, the continued focus on dynamically adapting the tail sampling processor is key. The Capital One team aims to publish this research as an open source contribution. Ongoing Challenges and Future Goals in Data Sampling That 70% reduction in trace volume might be impressive, but the team is looking at the remaining 30% and asking, “How can we do better?” Knight said. The central challenge is a “tug of war” between high-frequency (high-rate) and low-frequency (low-rate) events in the probabilistic ratios, he said. High-rate applications can handle a much lower probabilistic rate, whereas low-rate applications get starved at a lower ratio. At scale, tailoring the rule set to every specific application is not feasible. The current focus is on building enhancements to the tail-sampling processor that will give the system the ability to, as Knight said, “adapt to the frequency of events we see dynamically, right without config changes on our side.” The post How Capital One Cut Tracing Data by 70% With OpenTelemetry appeared first on The New Stack.
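To make the tag-and-estimate idea described above concrete, here is a minimal Python sketch under stated assumptions: it is not Capital One's collector pipeline or an official OpenTelemetry API, the attribute name is invented, the ratio is stored as a keep probability, and error traces are always kept (one thing tail sampling makes easy). Because the probability travels with each kept trace, an approximate pre-sampling count can be recovered later even if the sampling rate changes.

# Illustrative sketch, not Capital One's pipeline: each kept trace is tagged with
# the keep-probability in effect when it was sampled, so the original request
# volume can be estimated later. "sampling.probability" is a made-up attribute name.
import random

def tail_sample(traces, keep_probability, rng):
    kept = []
    for trace in traces:
        keep_error = trace["has_error"]
        if keep_error or rng.random() < keep_probability:
            # Bake the decision context into the trace itself.
            trace["attributes"] = {"sampling.probability": 1.0 if keep_error else keep_probability}
            kept.append(trace)
    return kept

def estimate_original_count(kept_traces):
    # Each kept trace stands in for roughly 1 / keep_probability original traces.
    return round(sum(1.0 / t["attributes"]["sampling.probability"] for t in kept_traces))

if __name__ == "__main__":
    rng = random.Random(7)
    traces = [{"id": i, "has_error": i % 200 == 0} for i in range(100_000)]
    kept = tail_sample(traces, keep_probability=0.01, rng=rng)
    print(f"kept {len(kept)} of {len(traces)} traces")
    print(f"estimated original volume: {estimate_original_count(kept)}")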
Read more →
Frank Gehry has died
2025-12-05 21:31 | Source: Hacker News
Comments
Read more →
Leaving Intel
2025-12-05 21:27 | Source: Hacker News
Comments
Read more →
Perpetual Futures
2025-12-05 21:23 | Source: Hacker News
Comments
Read more →
The missing standard library for multithreading in JavaScript
2025-12-05 21:09 | Source: Hacker News
Comments
Read more →
Trae IDE Auto-Installs Python Libraries as You Code
2025-12-05 21:00 | Source: The New Stack
You’ve heard all about AI and IDEs. At this point, they are a dime a dozen, and many of them actually work pretty well. But what sets them all apart? A better UI? Better LLMs? Local AI? I’ve used several of these IDEs, and most often they all do the same things and do them fairly well. When I saw yet another such IDE, I had to find out if there was anything that set it apart from the others. It only took me about five minutes to figure out what makes Trae stand out. I’m going to show you what that is by way of creating a Python app that creates a Dungeons & Dragons character sheet. Yeah, let’s get nerdy. How To Get Trae Before we actually start using it, you might want to know how to install it. Trae can be installed and used for free (although you get more bang for your buck when you pay for a license) on macOS and Windows. There is also a waiting list for the Linux version, which you can sign up for on the project’s main site. I installed Trae on my MacBook Pro running macOS Tahoe, and it installed perfectly. After the installation was completed, I opened Trae and discovered that I did have to sign up for an account. No problem, as it was free. After signing up for an account and logging in, I was greeted by the Trae AI prompt (Figure 1). Figure 1: The Trae AI prompt is very easy to figure out. Alright, it’s time to get our D&D on. Using Trae for Nerdy Purposes With my decision made as to what I wanted Trae to do for me, I typed my prompt, which looked like this: After hitting Enter, Trae went to work. At first, everything ran like any other AI-powered IDE. Out of nowhere, however, Trae gave me a warning that there was a Python library that needed to be installed for the program to run. To my surprise, Trae offered to install it for me. Sure, Trae, go right ahead. It worked. In seconds, Trae had the missing library added, without me having to figure out the exact name of the library and use PIP to install it. Impressive. This actually happened three times, and each time Trae handled it with ease. I’m digging this. It took Trae roughly two minutes to create the program. I copied the resulting text into a file named dnd_character_creator.py and ran it with: The program asked me tons of questions related to creating a D&D character (Figure 2 – you know the routine). When the interrogation was complete, I could scroll through the terminal to see the results, but that’s all. Figure 2: Running my new D&D Character Creator in a macOS terminal window. Back to the AI prompt, where I said: Hit Enter, and Trae went back to work. Once again, Trae had to install another Python library, which I allowed, and it happened without fail. When Trae finished, I copied the new code into a new file and ran it. To my curiosity, the program didn’t write the results to a file, so I had to go back to the prompt and inform it that it hadn’t written the results to the file. It ran through the troubleshooting process and worked its magic. That’s when I realized something: I didn’t need to copy/paste the code because Trae actually wrote it to a file itself. Nice. I then changed into the folder /Users/jackwallen/Documents/trae_projects/DD/ and ran the correct file. Huzzah! It worked. I now have a Python script to help me create D&D characters. In the end, what I found that set Trae apart was its ability to install the necessary libraries required to create the program. I didn’t even need to know what libraries were necessary for the Python program, which was a big help. 
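For readers who want a feel for what such a generated program might look like, here is a rough sketch. This is not the code Trae produced for the article (that code is not reproduced here), and it deliberately sticks to the standard library so nothing extra needs installing; it rolls 4d6-drop-lowest ability scores and writes a small character sheet to a text file.

# A minimal sketch of the sort of script such a prompt might yield; not the code
# Trae generated for the article. Standard library only, so no pip installs needed.
import random

ABILITIES = ["STR", "DEX", "CON", "INT", "WIS", "CHA"]
CLASSES = ["Fighter", "Wizard", "Rogue", "Cleric", "Ranger", "Bard"]
RACES = ["Human", "Elf", "Dwarf", "Halfling", "Tiefling"]

def roll_ability() -> int:
    # Classic 4d6, drop the lowest die.
    dice = sorted(random.randint(1, 6) for _ in range(4))
    return sum(dice[1:])

def build_character(name: str) -> dict:
    return {
        "name": name,
        "race": random.choice(RACES),
        "class": random.choice(CLASSES),
        "scores": {ability: roll_ability() for ability in ABILITIES},
    }

def write_sheet(character: dict, path: str) -> None:
    lines = [f"Name:  {character['name']}",
             f"Race:  {character['race']}",
             f"Class: {character['class']}",
             ""]
    lines += [f"{ability}: {score}" for ability, score in character["scores"].items()]
    with open(path, "w") as sheet:
        sheet.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    hero = build_character(input("Character name: "))
    write_sheet(hero, "dnd_character_sheet.txt")
    print(f"Saved {hero['name']} the {hero['race']} {hero['class']} to dnd_character_sheet.txt")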
Understand that I only scratched the surface of using Trae, but even just using it without getting too deep into the woods, the IDE really impressed me. What other features does Trae offer? AI is integrated into the entire development process. Autonomous shipping with Trae Solo. Multiple agents for troubleshooting. Ability to create your own agent team. Structured “Builder mode” for complex projects Multimodal capabilities like image-to-code generation. Intelligent code completion. Conversational chat mode for coding help. Integrated debugging and testing. VS Code extension compatibility. As I mentioned, Trae can be used for free, but that plan is limited to: 10 Fast requests and 50 Slow requests of Premium models/month 1000 Requests of Advanced models/month 5000 Autocomplete/month If you upgrade to the paid plan $10/month (first month only $3), you get: 600 Fast requests and unlimited Slow requests of Premium models/month 300 bonus Fast request/month (limited-time offer) Unlimited Requests of Advanced models Unlimited Autocomplete If someone like me can create complex Python programs by way of AI queries and follow-ups for troubleshooting, anyone can. Give Trae a try and see if it doesn’t become your new favorite AI-powered IDE. The post Trae IDE Auto-Installs Python Libraries as You Code appeared first on The New Stack.
Read more →
Judge Signals Win for Software Freedom Conservancy in Vizio GPL Case
2025-12-05 20:42 | Source: Hacker News
Comments
Read more →
Fizz Buzz in CSS
2025-12-05 20:18 | Source: Hacker News
Comments
Read more →
Framework Sponsors CachyOS
2025-12-05 20:03 | Source: Hacker News
Comments
Read more →
How etcd Solved Its Knowledge Drain With Deterministic Testing
2025-12-05 20:00 | Source: The New Stack
The loss of institutional knowledge when people leave an organization can be tough. When longtime maintainers leave an open source project, it can be nearly impossible to recapture that knowledge. That’s what happened to etcd, an open source, distributed key-value store that’s “older than Kubernetes itself,” said etcd’s lead maintainer, Marek Siarkowicz, in this episode of The New Stack Makers. Siarkowicz, a senior software engineer at Google, joined me for this On the Road episode of Makers, recorded at KubeCon + CloudNativeCon North America in Atlanta last month. The Challenge of Maintainer Turnover and Knowledge Loss Siarkowicz moved four years ago from Google’s Kubernetes team to its etcd team. Roughly three years ago, he told me, the etcd project hit some reliability challenges. As the team of maintainers worked on rolling out a new release, “a lot of maintainers left the project and were replaced with new maintainers, and there was a drain of knowledge. So all the properties that could not be written into the code were lost with those people. All the procedures, how to test, how to guarantee correctness that was done before were not done for the new release.” As a result, the team released a version that “has had multiple issues that were critical, like if the application crashed, it could cause an inconsistency.” Achieving the ‘Holy Grail’ for a Distributed System To remedy the situation, the new crew of maintainers implemented what it called “robustness testing.” To validate the project’s basic correctness, but also the distributed system’s correctness, the team built its own framework “inspired by” open source Jepsen. The goal, Siarkowicz said, was to achieve linearizability — the ability to “have a distributed system that should behave like a single node. This is like a Holy Grail of distributed systems. And validating this is a very hard problem.” Solving it, the maintainers learned, meant they needed to bring forth their own failure injection mechanism. “We needed to teach people, the community, how to debug it, and all those challenges were immense,” Siarkowicz said. Underlying it all, he suggested, was a wish to create a knowledge base that wouldn’t disappear if team members left the project. Using Deterministic Simulation Testing to Recapture Knowledge Seeking a solution to all this, the etcd team reached out to Antithesis, which worked on deterministic simulation testing. Without this approach to software testing, locating and reproducing a bug in a distributed system can get dicey. “You have some hypothesis, you try to reproduce it, but you need to get lucky to sometimes find some race between multiple components or multiple logs and multiple, separate processes, communicating by network to find the bug,” Siarkowicz said. By contrast, he said, “deterministic simulation testing allows you to linearize everything, so there will be only one execution path and it’ll always be reproducible.” The collaboration with Antithesis, Siarkowicz said, made it easier to capture knowledge. The team could “define the properties that were just in documentation or just in maintainers’ heads.” An advantage of using the Antithesis platform, he said, was the ability to test engineers’ assertions more robustly. “Previously, we already had assertions, but those were never tripped. So it seemed, Oh, like if it never trips, it should be good.” But that no-news-is-good-news approach, he suggested, deprived the team of deeper knowledge that more robust testing could reveal.
Antithesis’s testing and failure injection went beyond what the maintainer team could build on its own, Siarkowicz said. “The failure combination that you need to do to trip is very hard to implement yourself, and it’s unique for every such property.” Addressing the Unique Testing Challenges in Open Source As the lead maintainer of an open source project, Siarkowicz said, teaching community members how to do more robust testing is a big challenge. Open source projects, he noted, “are like a tree. … at the beginning, the main part is the most important. But as the project grows, there is more community, they build out new features, new things. There are a lot of people who can work on the leaves, but the core is usually very sensitive, because it’s connected to everything.” When it comes to long-running projects like etcd or Kubernetes, he likes working on the core, the trunk, of those “trees.” But he acknowledged, those core parts are “not very accessible to most contributors, so having such an approach to testing can allow maintainers to write rules that will ensure that, even if a maintainer makes a mistake, or doesn’t have enough time to review something in full detail, we’ll still be able to catch it in the testing.” Check out the full episode for more about testing open source software, including the role AI may play in the future, and what’s on the etcd road map. The post How etcd Solved Its Knowledge Drain With Deterministic Testing appeared first on The New Stack.
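As a toy illustration of the deterministic-simulation idea discussed above (not etcd's robustness framework or the Antithesis platform), the sketch below lets a seeded scheduler choose how two clients interleave a non-atomic read-modify-write on a shared counter. Any lost update it finds can be replayed exactly by rerunning the same seed, which is the property that makes such bugs debuggable.

# Toy illustration of deterministic simulation testing: a seeded scheduler picks
# the interleaving, so any failure it finds can be replayed from the seed alone.
import random

def simulate(seed: int, clients: int = 2, increments: int = 3) -> int:
    rng = random.Random(seed)
    store = {"counter": 0}

    def client(_client_id):
        # Each increment is a non-atomic read-modify-write, so interleavings matter.
        for _ in range(increments):
            value = store["counter"]       # read
            yield                          # the scheduler may switch clients here
            store["counter"] = value + 1   # write back (possibly stale)
            yield

    runnable = [client(i) for i in range(clients)]
    while runnable:
        step = rng.choice(runnable)        # deterministic choice, given the seed
        try:
            next(step)
        except StopIteration:
            runnable.remove(step)
    return store["counter"]

if __name__ == "__main__":
    expected = 2 * 3  # clients * increments, if every update were applied atomically
    for seed in range(20):
        result = simulate(seed)
        if result != expected:
            # Rerunning simulate(seed) replays this exact schedule and result.
            print(f"seed {seed}: lost update, counter={result} (expected {expected})")

In a real system the "scheduler" would also control network delivery, clock skew, and crashes, but the reproducibility principle is the same.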
Read more →
The Platform PM: Building an Ecosystem, Not Just a Product
2025-12-05 19:00 | Source: The New Stack
Some products solve a single problem. Others — known as platforms — quietly build the scaffolding for entire industries. Yet despite their impact, platforms remain widely misunderstood. They aren’t just tools or infrastructure. At their best, they’re environments where teams and developers can co-create value far beyond any single feature. This requires a shift in mindset. A feature-led product measures success by user growth; a platform lives or dies by integration rates, ecosystem health, and developer experience. After nearly two decades leading platform initiatives in GenAI and data integration, one lesson stands out: platform PM isn’t about control — it’s about enabling others to thrive. What Makes Platform Product Management Different? First, let’s clear up a common misconception: a platform is not just a product with an API bolted on. In traditional product management, the focus is relatively contained – solve a specific user problem, build a coherent interface, iterate quickly. Platform product management, however, operates on a different plane of complexity. You’re building foundational capabilities that serve internal teams, external developers, and business stakeholders — all at once. Consider the contrast: Product PM is obsessed with end-to-end user journeys. Platform PM must also obsess over how other teams’ products fit into — and depend on — yours. That means thinking about: APIs as first-class citizens, not afterthoughts. Documentation and onboarding as part of the product experience. Stability and backward compatibility, sometimes above raw speed of delivery. In my own work, particularly building GenAI infrastructure and integration layers, this complexity was front and center. Delivering an API that powers multiple internal data pipelines demands different rhythms than a classic SaaS launch. You have to consider who consumes your services, how they evolve over time, and what dependencies you’re quietly introducing across your organization. Equally important, platform PMs often operate without the luxury of visible, direct metrics like daily active users or NPS. Instead, success might look like this: Other teams can integrate faster. External developers report fewer blockers. Critical systems stay reliable at scale. In a way, platform PM is closer to city planning than app design. You’re laying roads and utilities that enable countless others to build — and your true impact only becomes obvious over time. Balancing Internal Needs and Developer Experience One of the defining challenges in platform product management is learning to serve two masters. Internal teams — engineers, product owners, data scientists — rely on your platform to move faster and experiment freely. External developers, meanwhile, expect stability, clear documentation, and predictable interfaces. What feels like progress to one group can look like disruption to the other. I’ve seen this firsthand. While leading a GenAI integration stack, internal teams needed rapid prototyping, but external partners demanded guarantees against unexpected changes. Balancing both required treating internal teams as real customers, with clear SLAs, versioning policies, and feedback loops. Some practical strategies that helped: Service-level agreements (SLAs) for internal teams, spelling out reliability targets and escalation paths. Developer experience (DX) champions, whose sole job was to advocate for consistent documentation and onboarding flows. 
Clear versioning strategies, so teams could migrate at their own pace rather than endure abrupt changes. Ultimately, success lies in seeing yourself not as a gatekeeper but as a steward. Your role is to balance competing needs without compromising trust on either side. Redefining Success Metrics: Beyond Classic KPIs When I led platform initiatives supporting GenAI and large-scale SAP integrations, I quickly learned that tracking surface-level metrics wasn’t enough. It was never just about whether teams connected to our APIs — it was whether those connections turned into real, lasting adoption. Did new workflows get launched? Did partner products scale faster? Did internal teams reduce time-to-market? If you measure a platform by the same KPIs you’d use for a standalone product, you’re likely missing the point. Traditional metrics like conversion rates or churn don’t capture whether a platform is genuinely enabling others to build and grow. That’s why platform success demands its own measures, often less visible but ultimately more telling. Some of the most valuable KPIs I’ve used include: Integration velocity: How long does it take a team to go from discovery to live integration? Ecosystem adoption: Are more teams and partners choosing the platform as their default? API reliability: What’s the uptime? How predictable is performance under load? Developer satisfaction: Are the people building on top of your platform actually happy with the experience? Real impact shows up when teams rely on your services in production, developers advocate for your platform unprompted, and your capabilities become the backbone of other products. Best Practices in Platform Strategy and Execution Great platforms don’t happen by accident. They’re the product of deliberate choices about how to design, prioritize, and sustain the systems that everyone else depends on. One of the most overlooked foundations is API strategy. It’s easy to treat APIs as a technical detail, but in practice, they’re often the most visible touchpoint between your platform and the outside world. That means consistency, clarity, and predictability matter as much as performance. A single undocumented change in an API could disrupt not only internal workflows but also partner commitments and commercial contracts. Some non-negotiables for API excellence: Versioning and backward compatibility: Never assume everyone will upgrade on your timeline. Clear, accessible documentation: Treat docs as part of the product, not an afterthought. Governance standards: Establish principles early — naming conventions, error handling, security expectations — and enforce them ruthlessly. Beyond APIs, platform PMs have a unique role in shaping system design. You’re often the one person bridging architecture discussions and business strategy. That means influencing big decisions: which capabilities to centralize, how much standardization to enforce, where to allow flexibility. In my experience leading cross-functional programs across the US, UK, EU, and India, this influence only works if you build trust with architects and engineers. Roadmap planning also looks different on a platform. You’re not just prioritizing features; you’re sequencing dependencies across teams. You have to ask: What does this unlock for others? What breaks if we delay it? How does it fit into our long-term narrative? 
One tactic I’ve relied on: visualizing roadmaps in layers — base infrastructure, core services, and enabling capabilities — so everyone sees how their work connects to the bigger picture. If I had to sum it up, the best platform PMs do three things consistently: Design for clarity, even in complex systems. Advocate for the developer experience, internally and externally. Plan with ecosystem impact in mind, not just individual deliverables. When you get these right, you create a platform people trust — and want to build on. Navigating Cross-Team Complexity For a Platform PM, one of the biggest complexities is the operating environment. Every decision affects multiple products, services, and teams that depend on the platform’s stability and evolution. Even in companies with mature product cultures, you’ll encounter competing priorities, hidden dependencies, and divergent incentives. One group might push for rapid delivery to meet quarterly goals, while another safeguards uptime for mission-critical systems. The Platform PM’s job is to align these worlds without eroding trust or reliability. This was especially true during large-scale integration programs, where a single API change could ripple across continents. Coordinating five or more teams, each with unique roadmaps, tech stacks, and timelines, demanded the same rigor you’d apply to architecture design — only this time applied to people and processes. A few principles that consistently help: Listen early: Understand each team’s priorities, strategy, scope, expectations and what risks or dependencies they perceive. Co-create solutions: Invite stakeholders of the dependent teams into architecture and rollout planning so they share ownership. Over-communicate intent: Platform evolution often feels disruptive to consuming teams. Explaining the “why” behind roadmap shifts builds alignment and reduces resistance. Make dependencies visible: Use layered roadmaps and integration charts to show how changes cascade through the ecosystem. This prevents local optimizations from undermining global stability. Ultimately, Platform PM is a discipline of orchestration — aligning teams, technology, and timelines so the entire ecosystem can evolve together. You can’t eliminate complexity, but you can replace confusion with context, and that’s what keeps the platform — and everyone who depends on it — moving forward in sync. Conclusion Platform product management isn’t glamorous in the traditional sense, but your impact is deeper and longer-lasting than almost any other kind of product work. Because a great platform is an environment where others can build, adapt, and grow, it’s the quiet infrastructure that makes speed and innovation possible. And it’s the relationships across teams, companies, and entire industries that define whether your product is merely used or truly trusted. The post The Platform PM: Building an Ecosystem, Not Just a Product appeared first on The New Stack.
Read more →
Why we built Lightpanda in Zig
2025-12-05 18:29 | Source: Hacker News
Comments
Read more →
JetBrains CEO on How Developers Become Leaders
2025-12-05 18:00 | Source: The New Stack
JetBrains CEO Kirill Skrygan, 38, was all in on tennis — he played and was one of the best in the city of St. Petersburg. But at the ripe age of 10, his parents realized that the funnel to becoming a tennis professional begins very wide but quickly narrows. So they took him to a Russian mathematical school, where he learned how to program. He went on to attend St. Petersburg State University, where he became a software engineer. He began as a junior developer working for American health care companies. In his early 20s, he joined JetBrains as a team lead, eventually moving up to become the CEO of the company, which specializes in creating integrated development environments (IDEs) such as IntelliJ IDEA, PyCharm and WebStorm. He joined JetBrains in 2010. Skrygan spoke with The New Stack about his journey from junior developer to CEO, and shared his advice for how other developers can make the transition from coders to management. Why He Moved to Management Skrygan is very clear on what brings him purpose in his work, which ultimately led him to transition into management. “What drives me is actually the impact I can bring to the overall technological landscape of the whole [of] humanity,” he said. He spent 10 years leading the Rider IDE team, where the desktop application developer fell in love with the cross-platform language, Kotlin, which was used to create Rider’s frontend. “I immediately fell in love with Kotlin because it’s so elegant and […] so flexible,” Skrygan said. “What amazed me from day one is [that] it could be very strict and enterprise-ish (like Java) on one hand, but on the other hand it might be very, very hipster-ish — like OCaml or Scala, or something like that.” Then he became the department head across JetBrains’ IDEs, where he managed approximately 650 people. “You can actually drive the business strategy, product strategy, marketing, everything. I love it,” he said. But even when developers want to move into management, there can be bias and barriers that hold them in engineering roles. For instance, some companies suffer from a “grass is always greener on the other side” mentality that promotes outsiders over existing employees. I asked Skrygan if that was an issue he had encountered. “You need to invest in hiring juniors [and] interns, raising them, because sometimes juniors are so active, passionate, and this is just very important for the whole spirit of the company, of the team.” – Kirill Skrygan, JetBrains CEO That wasn’t an issue at JetBrains, he replied, adding that companies have to cultivate internal talent while also scouting strong recruits. “You couldn’t rely only on external ‘grass,’” he said. “You have to raise your own talents. Moreover, it’s not just about highly-paid stars you would hire from the outside. Yes, you need to hire those, but at the same time, you need to invest in hiring juniors [and] interns, raising them, because sometimes juniors are so active, passionate, and this is just very important for the whole spirit of the company, of the team.” It helps to be in a company that prioritizes cultivating internal talent. JetBrains, for instance, collaborates with universities on internship programs. “We’re doing a lot to invest our own money to educate young generation people; and yes, we also hire some of these talented people to JetBrains,” he said. 
Cultivating Management Skills as a Developer I asked Skrygan what skills or benefits programmers bring to management. He replied that good software engineers tend to be very structured and use engineering systems. They understand that you need a solid architecture before developing an application, he said. They have to show logical thinking about both structure and architecture, he added. “This level, this way of thinking, is very good for managers because when you define business strategy, it’s basically some logic based on presumptions,” he said. “You have some sort of architecture of logic based on some prerequisites. So this structurality, this logical thinking, really helps.” It’s a cliché to say programmers are introverts — but whether you are or aren’t, programmers who want to move into management should develop and demonstrate people skills, according to Skrygan. This can take some real work and study, he added. “Being a manager is not like writing code. You have to be empathetic. You have to work with people. You have to understand their things,” he said. Developing people skills isn’t a simple step. You don’t just “solve” the people problem and move on. “You have to be individual with all the people,” he suggested. “You cannot be, like, one size fits all for different kinds of people.” Technical managers must have both logical capabilities and the ability to relate to and manage people. – Skrygan Only a small percentage of engineers have that skill, he said, but they need to have both the logical capabilities and the ability to relate to and manage people. “What I would suggest is to dive deep — more into social, humanitarian aspects and sciences, psychology, group psychology, sociology, or some other things, because this just gives different angles,” he said. “Tech people are very logical thinkers, and they have their own strict angle, and sometimes they do not understand why humans, or humans at scale, behave this way, this strange way. It’s silly, but it’s the way it is, and you have to acknowledge that, and understanding [other] sciences should definitely help.” He added it helps to be a very agreeable person, which some engineers are not. This can make it difficult to advance in the corporate hierarchy, he said. One thing engineers can do to show they’re agreeable is to recognize that roles increasingly overlap. That means developers should be ready and willing to help with domains outside their speciality, such as product management or marketing. At the same time, you have to balance that with having strong, deep opinions. “You have to show it in, of course, in [a] correct way, so… the management understands, hey, he’s not just about this narrow scope; this person is about much broader sense-making, and this person has an opinion about that,” he said. “That’s valuable.” He also recommended getting an MBA or taking MBA-style courses to understand business. Shifting Onto the Management Track But after acquiring management skills, how can developers convince their company to give them a chance? That will depend somewhat on the company culture, but he said at a basic level it means getting recognition from management. “If I can generalize these things, I think that proactiveness and initiative right now would be also interesting,” he said. “Just being an operator in a very transparent way is not quite enough.” For instance, if you’re given a job, do it with initiative and proactivity. 
That might look like, for instance, taking charge of tickets to ensure they’re handled if your company just pools them all and expects developers to just do them in their “spare” time. Developers should also realize that the feedback loop for managers is different from that for developers. It’s much more complicated, he said, but it’s necessary. “From my experience, people who are not learning because they’re too stubborn or too stuck up, [that] usually prevents them from being a good managers.” – Skrygan “It works through layers of people, levels of organizations, but you have to be honest with yourself; you have to get the feedback, and you have to improve yourself,” he said. “Being able to frankly get this feedback and get better is important stuff. From my experience, people who are not learning because they’re too stubborn or too stuck up, [that] usually prevents them from being a good managers.” He also said there’s a difference between management and leadership. Leadership is actually about saying no rather than yes, he noted. That’s because leadership is giving a direction, a focus. “It’s very easy to say yes to everyone, but if you will say yes to all the ideas you will have, … you will not deliver,” he said. “You need to lead people towards this focus, and you need to engage and inspire people so they actually want to do this.” Finally — although it may be old-fashioned in today’s work world — Skrygan also believes loyalty is an important trait for those who want to move into management. “It sounds silly, but it’s sometimes about some of the projects that your managers can delegate to you as a developer, which might not contribute to your salary bonus by the end of the year, but this is what the management asks you to do,” he said. “If you do this, they understand: hey, this is a person we can rely on, who values the interest of the whole company, of the whole organization, even more than their own.” The post JetBrains CEO on How Developers Become Leaders appeared first on The New Stack.
Read more →
Onlook (YC W25) the Cursor for Designers Is Hiring a Founding Fullstack Engineer
2025-12-05 17:00 | Source: Hacker News
Comments
Read more →
AI Can Deliver Deployment-Aware Risk Analysis for Kubernetes
2025-12-05 17:00 | Source: The New Stack
For Kubernetes platform engineers or DevSecOps leads, the experience is all too familiar: You open your security dashboard and are greeted by a list of 10,000 deployments, all flagged with critical vulnerabilities, configuration issues and suspicious activities. The sheer volume of alerts creates a paradox: When everything is a priority, nothing is. Traditional risk scoring solutions evaluate the risk indicators detected by scanners in isolation, relying on predefined heuristics and static vulnerability scores. These solutions prioritize risks largely based on these static labels, but do not consider whether these risks are truly applicable to the specific deployment environment or whether they pose an actual exploitation path. Addressing this lack of context is an area of focus for Red Hat, in collaboration with IBM Research, as they develop future capabilities for Red Hat Advanced Cluster Security. By introducing an AI-driven Risk Investigation Agent, the teams are moving away from static scoring toward “deployment-aware” risk analysis. The Problem: The Context Gap In many current Kubernetes security practices, risk scores are often assigned based on static metadata rather than the actual behavior of the deployment in its live environment. Determining true risk requires understanding whether the vulnerable library is loaded at runtime, whether the affected port is exposed or whether the workload is even active. Configuration weaknesses may intensify the impact of certain vulnerabilities, and multiple common vulnerabilities and exposures (CVEs) within the same deployment may interact to form chained exploitation paths. One vulnerability may enable or support the exploitation of another, creating an exploit chain. Moreover, behavioral indicators such as anomalous processes, unusual network activity or unauthorized access attempts may signal an ongoing exploitation attempt. These signals must be correlated with vulnerability data and deployment context to produce accurate and meaningful risk assessments. The goal of the new collaboration is to refine risk scoring based on real deployment context. To do this, the system addresses two critical gaps in traditional scanning: Deployment-aware risk assessment: Using AI to correlate findings detected by Red Hat Advanced Cluster Security to deliver deployment-aware risk assessments. This includes evaluating the applicability of each risk indicator to the actual deployment context, such as determining whether a CVE is truly exploitable within a specific workload. It also includes correlating multiple indicators to identify cases where they combine to create amplified or chained risks. Context and explainability: Using the capabilities of large language models (LLMs) to generate clear, natural language explanations that describe the specific factors influencing the risk score. This provides customers with transparency into how each assessment was derived, enables them to validate the quality of the AI-driven insights and helps them better understand the underlying risk. The Solution: The Risk Investigation Agent The core of this new capability is the Risk Investigation Agent developed by IBM Research Labs for use with Red Hat Advanced Cluster Security. This feature is designed as an add-on for users with the resources to power an LLM-based agent. 
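As a rough illustration of what "deployment-aware" means in practice, the toy Python sketch below gates a static severity score on whether the prerequisites for exploitation are actually present; the signal names, thresholds and weighting are invented for the example and are not Red Hat's or IBM's actual scoring logic.

```python
from dataclasses import dataclass

@dataclass
class DeploymentContext:
    """Hypothetical runtime signals for one workload (illustrative only)."""
    library_loaded: bool       # vulnerable component observed at runtime
    port_exposed: bool         # affected service/port reachable
    workload_active: bool      # deployment is running and receiving traffic
    suspicious_activity: bool  # e.g., port scanning correlated with the CVE

def contextual_risk(static_cvss: float, ctx: DeploymentContext) -> tuple[float, str]:
    """Refine a static score with deployment context; return score plus rationale."""
    if not (ctx.library_loaded and ctx.port_exposed and ctx.workload_active):
        return 0.1 * static_cvss, "prerequisites for exploitation not present; deprioritized"
    score = static_cvss
    reason = "vulnerable component loaded, exposed and active"
    if ctx.suspicious_activity:
        score = min(10.0, score * 1.3)
        reason += "; correlated suspicious activity suggests active probing"
    return round(score, 1), reason

# False-positive-like case: library present but ports closed.
print(contextual_risk(9.8, DeploymentContext(True, False, True, False)))
# True-positive-like case: exposed, active, with port scanning observed.
print(contextual_risk(9.8, DeploymentContext(True, True, True, True)))
```

The actual Risk Investigation Agent goes far beyond this kind of static gating.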
It functions through a sophisticated flow designed to provide more context-aware risk assessment: Data aggregation: The agent continuously ingests data from Red Hat Advanced Cluster Security services, including vulnerability scan results, runtime process monitoring, network activities, Kubernetes configuration metadata and access events. It also enriches this view using external sources such as CVE databases, Exploit DB intelligence, MITRE ATT&CK tactics and remediation guidelines. Investigation agent (the “brain”): This component serves as the reasoning layer. Its primary role is to determine whether each finding represents a true, exploitable risk within the live deployment. It evaluates network exposure, workload behavior, configuration posture and runtime evidence to assess whether the prerequisites for exploitation are actually present. This includes verifying if the vulnerable component is loaded, whether the service or port is exposed and whether the workload is active and reachable. Beyond individual findings, the agent also performs cross-correlation across signals. It identifies when configuration weaknesses amplify a vulnerability, when suspicious process execution or unusual network traffic suggests active exploitation or when multiple vulnerabilities combine to form a potential exploit chain. LLM processing and risk explanation: Once enriched and contextualized, the data is processed by an LLM to generate a refined generative AI (GenAI) risk score. More importantly, the LLM provides a natural-language explanation describing why the risk is significant, referencing specific deployment behaviors, potential exploit paths, chained vulnerabilities and observed indicators of compromise. This enables security teams to understand not just the risk level, but the reasoning behind it. Under the Hood: How the AI ‘Thinks’ To understand the value here, let’s look at a specific evaluation scenario. Consider a Windows Server Update Services (WSUS)-like service running on a Kubernetes deployment. A standard scan might flag CVE-2025-59287, a remote code execution vulnerability targeting WSUS over TCP ports 8530 and 8531. The false positive: In one cluster, Red Hat Advanced Cluster Security detects that the vulnerable WSUS package exists in the image, but during runtime analysis, it confirms that TCP ports 8530 and 8531 are closed, with no network exposure. There is also no indication of any WSUS-related process activity. The LLM determines that although the library is present, the vulnerability is “not exploitable under current configuration” and marks the exploit suspicion as False, effectively deprioritizing it. The true positive: In another deployment, Red Hat Advanced Cluster Security observes that ports 8530 and 8531 are open and reachable. Runtime network monitoring detects internal port scanning attempts targeting these ports from another pod. The LLM identifies these not as generic system events, but as behavior strongly correlated with remote code execution probing. It flags this as “Highly relevant – suspicious” port scanning activity associated with CVE-2025-59287, marking it as “True.” The system then generates a human-readable summary: “The risk is related to the exposed WSUS service running on unpatched containers with open TCP ports 8530/8531. 
Detected anomalous port scanning activity in the cluster increases the likelihood of exploitation and contributes to the overall risk score.” Explainability: Interactive, Environment-Aware Insights While traditional AI explainability focuses on clarifying how a risk score is calculated, additional capabilities are being developed to take Red Hat Advanced Cluster Security a step further by making the system interactive and responsive to the deployment environment. The goal is that platform engineers and administrators will be able to query the AI about specific workloads or configurations and receive clear, contextual answers tailored to their environment. This interactive explainability allows users to provide feedback directly to the model. For example, if a deployment is flagged as high risk but the user knows it is a temporary sandbox, they can annotate that context. The system then incorporates this feedback, continuously adapting and refining its understanding of the enterprise environment. The result is a “white box” AI that not only explains its reasoning but learns from the environment and user input, enabling more accurate, actionable and trustable guidance. The Road Ahead: From Analysis to Remediation IBM and Red Hat are exploring capabilities that enable the AI to proactively propose remediation actions tailored to the specific deployment context. Future iterations aim to generate remediation options that users can apply directly to mitigate identified risks. These include risk-aware patching strategies aligned with the environment’s operational constraints, mitigation steps for vulnerabilities that cannot be patched immediately and configuration changes to reduce exposure and harden the deployment. The integration of GenAI into Red Hat Advanced Cluster Security represents a maturity milestone for Kubernetes security. We are moving past the era of simple pattern matching and into an era of contextual understanding. By combining IBM’s research in correlation analysis with Red Hat’s platform capabilities, Red Hat Advanced Cluster Security is attempting to solve the signal-to-noise ratio problem that plagues modern SecOps. For the IT manager, this means less time chasing false positives. For the Kubernetes user, it means a clearer understanding of what is actually running in their clusters. The post AI Can Deliver Deployment-Aware Risk Analysis for Kubernetes appeared first on The New Stack.
Read more →
Patterns for Defensive Programming in Rust
2025-12-05 16:34 | Source: Hacker News
Comments
Read more →
Gemini 3 Pro: the frontier of vision AI
2025-12-05 16:15 | Source: Hacker News
Comments
Read more →
I'm Peter Roberts, immigration attorney who does work for YC and startups. AMA
2025-12-05 16:04 | Source: Hacker News
Comments
Read more →
AI Agents Are Morphing Into the ‘Enterprise Operating System’
2025-12-05 16:02 | Source: The New Stack
Most of the conversation around AI agents today revolves around bots writing code. This didn’t come out of nowhere; software engineering is the most common use case for AI systems, and code-writing tools are reaching eye-popping valuations. But inside companies, something more fundamental is shifting: AI agents are becoming internal “operating systems” that connect and orchestrate data flows between software tools, changing the way we all work, not just the engineers. At Block, our engineers built an AI agent framework called goose and released it as an open source tool for anyone to use with any large language model. Goose was initially designed for writing code, but we quickly realized that for it to reach its full potential, it needed a standard way to communicate with the dozens of tools that people use daily. Recognizing this same challenge, Anthropic was developing what would become the Model Context Protocol (MCP). We began collaborating early in MCP’s development to help shape this open standard that bridges AI agents with real-world tools and data. Today, 60% of our workforce — around 6,000 employees — use goose weekly. It serves as a central conductor, reading and synthesising data across dozens of MCP-powered extensions including Slack, Google Drive, Snowflake, Databricks, Jira and others. Just months ago, it would take days of manual labor to read Snowflake dashboards, pull context from recent Slack chatter and generate a weekly Google Doc with insights and flagged anomalies. Now humans orchestrate this process in minutes, directing goose to the relevant data while applying judgment about what matters most. Despite the headlines, this isn’t a story about AI replacing jobs. At Block, we believe the shift is about redistributing access to problem-solving. The Compression Effect: Becoming More Self-Sufficient Most companies rely on handoffs. A product manager submits a ticket. An engineer builds it. A support team flags a recurring issue. A developer scripts a fix. These workflows protect quality, but they slow things down. AI agents like goose are collapsing that distance by helping people take action on their own instead of waiting on others. Take customer support escalations. In the past, when a support agent noticed an unusual spike in refunds, they would file an escalation ticket and wait three to five days for the data team to pull transaction analysis, receive raw spreadsheets, manually create a summary and post findings to Zendesk. Now that same agent asks goose to “analyse the last 30 days of refund spikes” and within 30 seconds receives a complete analysis with patterns identified and an automatically generated Zendesk-ready summary. By allowing users to choose a preferred model and by connecting to internal tools, goose enables teams to move from idea to prototype without waiting in a queue. A support agent can surface a dashboard. A security analyst can write a detection rule. A designer can test live functionality based on user feedback. None of this requires code expertise. This kind of access was previously off-limits to most employees. That’s starting to change. What’s Next: Building Guardrails and Resilience Goose is part of a wider shift within Block and at other forward-thinking companies: recognising that AI’s most valuable role may not just be in what it builds for users, but in what it unlocks for teams. By lowering the barrier to experimentation, internal AI tools are giving people the confidence to test, iterate and solve problems themselves. 
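To show the shape of that "central conductor" pattern, here is a schematic Python sketch; the extension names and the plain-function registry are stand-ins for real MCP servers, not Block's goose implementation.

```python
from typing import Callable

# Hypothetical extension registry: in a real MCP setup each of these would be a
# separate server with its own schema and auth; here they are plain functions.
EXTENSIONS: dict[str, Callable[[str], str]] = {}

def extension(name: str):
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        EXTENSIONS[name] = fn
        return fn
    return register

@extension("warehouse.query")
def warehouse_query(request: str) -> str:
    return f"[rows matching: {request}]"          # placeholder data pull

@extension("chat.search")
def chat_search(request: str) -> str:
    return f"[recent messages about: {request}]"  # placeholder context pull

@extension("docs.write")
def docs_write(request: str) -> str:
    return f"[draft doc created: {request}]"      # placeholder summary doc

def conductor(plan: list[tuple[str, str]]) -> list[str]:
    """Run a human-directed plan: each step names an extension and a request."""
    return [EXTENSIONS[tool](request) for tool, request in plan]

print(conductor([
    ("warehouse.query", "refund spikes, last 30 days"),
    ("chat.search", "refund escalations"),
    ("docs.write", "weekly refund summary"),
]))
```

In a production setup each extension would be a real MCP server with its own permissions, and the model, not a hard-coded plan, would decide which tools to call and in what order.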
None of this removes the need for engineers. If anything, it strengthens their impact. It clears the backlog. It reduces bottlenecks. And it makes space for more complex, strategic work to get done. As with any expansion of capabilities like this, the transformation requires careful design. At Block, we’ve implemented specific policies that govern how these AI connections work across our company. Any tool that handles sensitive information requires legal approval before it can be deployed. We maintain curated lists of approved extensions, so employees can only install tools that have passed our security review. And we’ve built smart boundaries directly into the tools themselves. Some automatically avoid accessing confidential databases, while others separate what users can read versus what they can modify. These aren’t bureaucratic barriers; they’re design choices that let teams move fast while keeping important information secure. The long-term opportunity isn’t just speed or cost savings. It’s resilience. Companies that embrace this shift will be less dependent on rigid workflows and more responsive to the people closest to the problem. They’ll be able to move faster without compromising safety, and solve at the edge without losing control at the core. That’s what we’re learning with goose. And that’s the direction we believe enterprise AI is headed. It may not make headlines, but it’s changing the way organizations function at their core. The post AI Agents Are Morphing Into the ‘Enterprise Operating System’ appeared first on The New Stack.
Read more →
Cloudflare outage on December 5, 2025
2025-12-05 15:35 | Source: Hacker News
Comments
Read more →
DevOps Is Still Waiting for Its Cursor Moment
2025-12-05 15:00 | Source: The New Stack
It’s 2:47 am. Your phone is buzzing. Production alerts. The checkout service is throwing 5xx errors, customers are abandoning carts, and the on-call engineer is flipping between Datadog, Argo CD, kubectl and logs. She’s just trying to figure out what changed. Latency spiked 20 minutes ago. A deployment went out at 2:31 am. Two pods are in CrashLoopBackOff. Memory limits were changed. She rolls back, updates the ticket, writes the postmortem and… tries to go back to sleep. Yet she knows she’s gonna go through some version of this again next week. Meanwhile, her colleague refactored an entire module in Cursor in minutes. The AI understood the codebase, proposed the change and handled the tedious parts. And it did it all automatically. What happened? AI has transformed the way we write code. But it has not transformed the way we operate the infrastructure to run that code. The Gap Continues to Grow Wider In the past two years, AI has reshaped the way developers work: Cursor and Copilot write and refactor code. Tools like Lovable, v0 and Bolt generate frontends. Replit agents scaffold and deploy full applications. But DevOps work remains manual. Engineers still resolve incidents by copying from runbooks, hopping between tools, relying on tribal knowledge and keeping Infrastructure as Code (IaC) updated. Incidents still stall releases. Backlogs still grow. AI has supercharged development, while operations remain stuck. This isn’t a market oversight. This problem is much, much harder. Why Operating Infrastructure Is So Much Different 1. There’s No Buffer for Mistakes A bad code suggestion fails in a branch. A bad infrastructure change immediately affects live traffic. Every action in DevOps has a blast radius: pods die, security groups break connectivity and pipelines cause cascading failures. 2. The Context Surface Is Huge An AI for DevOps has to synthesize production vs. dev environments, the state of Kubernetes, code repos for Terraform and other Infrastructure as Code, CI/CD runs, observability signals, cloud provider configuration, cost data and compliance constraints. A code assistant only needs the file and its neighbors; with DevOps, you’ve got to have whole-stack awareness. 3. Every Environment Is Unique There’s no universal model that defines the shape of your infrastructure. Every company has custom Terraform modules, custom pipelines, deployment strategies, alert rules and dashboard logic. A generic AI just can’t operate safely. 4. Governance Is Mandatory Real infrastructure demands role-based access control (RBAC), approvals, audit logs and compliance evidence. No AI can bypass these processes. It has to integrate with them. Why Existing Tools Fall Short It’s tough. Plenty of products address slices of the problem: Runbook automation executes predefined scripts. AIOps platforms group alerts. Observability tools diagnose anomalies. Incident management tools route and escalate responders. Coding copilots help make changes to IaC. Sure, these are all useful. But none operates the way Cursor does for application code. What a ‘Cursor for DevOps’ Has To Have To make a Cursor for DevOps work, you’ve got to have a few things: It Has To Run Inside Your Cloud Infrastructure and data are sensitive. A viable system has to sit in the customer’s virtual private cloud, use identity and access management, and rely on cloud native large language model (LLM) services such as Amazon Bedrock. 
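The "whole-stack awareness" point above is the crux, so here is a minimal Python sketch of pulling several domains into one incident context for an agent to reason over; the collectors and their sample payloads are placeholders, not DuploCloud's implementation.

```python
from dataclasses import dataclass, field
from datetime import timedelta

@dataclass
class IncidentContext:
    """One consolidated view for an AI DevOps agent to reason over (illustrative)."""
    service: str
    window: timedelta
    signals: dict[str, object] = field(default_factory=dict)

def gather_context(service: str, window: timedelta) -> IncidentContext:
    # Each collector stands in for a real integration: K8s, CI/CD, observability, IaC.
    collectors = {
        "deployments": lambda: [{"time": "02:31", "change": "checkout rollout"}],
        "pod_status":  lambda: {"CrashLoopBackOff": 2, "Running": 10},
        "alerts":      lambda: ["checkout 5xx rate > 5%", "p99 latency spike"],
        "recent_iac":  lambda: ["terraform: checkout limits.memory 512Mi -> 256Mi"],
    }
    ctx = IncidentContext(service=service, window=window)
    for name, collect in collectors.items():
        ctx.signals[name] = collect()   # in reality: scoped API calls inside the VPC
    return ctx

ctx = gather_context("checkout", timedelta(minutes=30))
print(ctx.service, ctx.signals["pod_status"], ctx.signals["recent_iac"])
```

Only once that context exists in one place can the orchestration and approval machinery described next do anything useful.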
It Needs a Unified Orchestration Layer IaC, Kubernetes, CI/CD, observability, cost and compliance are all separate domains, right? The AI needs a coordination layer that can handle identity, context sharing, tool integration, multistep workflows and Infrastructure as Code. You’ll Need a Well-Designed Human-in-the-Loop System Here’s the step-by-step process: The AI observes and proposes. Humans approve code and infrastructure changes. The AI executes. Everything is logged. This is the only way production can work well. Native RBAC Is Essential Agents have to inherit the exact permissions of the people they represent. And any privilege elevation has to arrive just in time. Domain-Specific Agents With Deep Expertise Are the Key to Success You don’t want one giant model. You want specialized agents: a Kubernetes agent, a CI/CD agent, an observability agent, a compliance agent, a cost optimization agent and IDE-integrated code agents. Each one has deep knowledge of its domain, and a single orchestration layer ties them together. Infrastructure has many separate problems, and you need agents that specialize in Kubernetes, CI/CD, observability, compliance and cost management. These agents make smarter decisions and stay closer to real DevOps work. They can also work together: One agent can flag an issue, another can fix it with a config or code change, and a third can verify it, so complex workflows get handled correctly. Early Results Show the Path Forward We’ve witnessed teams piloting these architectures. They’re already seeing MTTR reductions of 40 to 70 percent, dramatically lower ticket volumes, provisioning cycles shrinking from weeks to hours, and automatic evidence collection with continuous control checks. These gains come from allowing AI to handle the predictable work. So you don’t have exhausted DevOps teams anymore. AI can now analyze signals, recognize known patterns, execute approved remediations, provision environments and capture audit data behind the scenes. The goal isn’t to replace engineers. The goal is to give them leverage. The Cursor Moment Is Coming No, the complexity of infrastructure hasn’t changed. But AI capabilities have. The architectural patterns now exist to apply AI to both development and operations safely. Over the next 18 months, we’re sure to see better cross-agent orchestration, deeper tool integrations, richer contextual reasoning, smoother alignment with existing workflows and beautiful IaC coding experiences. DevOps has waited for its Cursor moment, and the ingredients are finally in place. We’re building the AI DevOps Engineer at DuploCloud so you’ll get AI agents that run inside your cloud, understand your infrastructure, execute real DevOps tasks with built-in governance and compliance, and help write and run your IaC. Learn more about the DuploCloud AI DevOps Engineer. The post DevOps Is Still Waiting for Its Cursor Moment appeared first on The New Stack.
Read more →
Most technical problems are people problems
2025-12-05 13:07 | Source: Hacker News
Comments
Read more →
Making RSS More Fun
2025-12-05 13:00 | Source: Hacker News
Comments
Read more →
Netflix to Acquire Warner Bros
2025-12-05 12:21 | Source: Hacker News
Comments
Read more →
UniFi 5G
2025-12-05 07:06 | Source: Hacker News
Comments
Read more →
Exploring Syntropic Frameworks in AI Alignment: A Philosophical Investigation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03048v1 Announce Type: new Abstract: I argue that AI alignment should be reconceived as architecting syntropic, reasons-responsive agents through process-based, multi-agent, developmental mechanisms rather than encoding fixed human value content. The paper makes three philosophical contributions. First, I articulate the ``specification trap'' argument demonstrating why content-based value specification appears structurally unstable due to the conjunction of the is-ought gap, value pluralism, and the extended frame problem. Second, I propose syntropy -- the recursive reduction of mutual uncertainty between agents through state alignment -- as an information-theoretic framework for understanding multi-agent alignment dynamics. Third, I establish a functional distinction between genuine and simulated moral capacity grounded in compatibilist theories of guidance control, coupled with an embodied experimental paradigm and verification regime providing operational criteria independent of phenomenological claims. This paper represents the philosophical component of a broader research program whose empirical validation is being developed in a separate project currently in preparation. While the framework generates specific, falsifiable predictions about value emergence and moral agency in artificial systems, empirical validation remains pending.
Read more →
Beyond the Black Box: A Cognitive Architecture for Explainable and Aligned AI
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03072v1 Announce Type: new Abstract: Current AI paradigms, as "architects of experience," face fundamental challenges in explainability and value alignment. This paper introduces "Weight-Calculatism," a novel cognitive architecture grounded in first principles, and demonstrates its potential as a viable pathway toward Artificial General Intelligence (AGI). The architecture deconstructs cognition into indivisible Logical Atoms and two fundamental operations: Pointing and Comparison. Decision-making is formalized through an interpretable Weight-Calculation model (Weight = Benefit * Probability), where all values are traceable to an auditable set of Initial Weights. This atomic decomposition enables radical explainability, intrinsic generality for novel situations, and traceable value alignment. We detail its implementation via a graph-algorithm-based computational engine and a global workspace workflow, supported by a preliminary code implementation and scenario validation. Results indicate that the architecture achieves transparent, human-like reasoning and robust learning in unprecedented scenarios, establishing a practical and theoretical foundation for building trustworthy and aligned AGI.
Read more →
When Do Symbolic Solvers Enhance Reasoning in Large Language Models?
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03272v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) achieve strong performance on complex reasoning tasks by generating long Chains of Thought (CoTs). However, this paradigm might incur substantial token overhead, especially when models "overthink" by producing lengthy reasoning chains, which can even lead to incorrect answers. A promising direction is the symbolic-solver-integrated approach, which leverages the code generation capabilities of LLMs to translate reasoning tasks into executable code and then solve them with a symbolic solver. In this paper, we explore an open question of when the conventional long-CoT can be enhanced by symbolic solvers. Our experimental results show that the symbolic-solver-integrated method only helps when the problem requires limited implicit reasoning but involves an ample search space. The latest LLMs, like GPT-4o, show better performance on deductive problems with shallow reasoning depth, while the symbolic-solver-integrated method significantly improves the LLMs' performance in constraint satisfaction problems that require repeated backtracks. When a declarative exemplar is provided, even CodeLlama-13B can outperform GPT-4o in difficult Zebra puzzles.
Read more →
Prior preferences in active inference agents: soft, hard, and goal shaping
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03293v1 Announce Type: new Abstract: Active inference proposes expected free energy as an objective for planning and decision-making to adequately balance exploitative and explorative drives in learning agents. The exploitative drive, or what an agent wants to achieve, is formalised as the Kullback-Leibler divergence between a variational probability distribution, updated at each inference step, and a preference probability distribution that indicates what states or observations are more likely for the agent, hence determining the agent's goal in a certain environment. In the literature, the questions of how the preference distribution should be specified and of how a certain specification impacts inference and learning in an active inference agent have been given hardly any attention. In this work, we consider four possible ways of defining the preference distribution, either providing the agents with hard or soft goals and either involving or not goal shaping (i.e., intermediate goals). We compare the performances of four agents, each given one of the possible preference distributions, in a grid world navigation task. Our results show that goal shaping enables the best performance overall (i.e., it promotes exploitation) while sacrificing learning about the environment's transition dynamics (i.e., it hampers exploration).
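For context, a common formulation in the active inference literature (not specific to this paper) writes the expected free energy of a policy as a risk term, the KL divergence from predicted outcomes to the preference distribution the abstract describes, plus an ambiguity term:

```latex
G(\pi,\tau) =
\underbrace{D_{\mathrm{KL}}\!\left[ Q(o_\tau \mid \pi) \,\middle\|\, \tilde{P}(o_\tau) \right]}_{\text{risk (exploitation)}}
+ \underbrace{\mathbb{E}_{Q(s_\tau \mid \pi)}\!\left[ \mathrm{H}\!\left[ P(o_\tau \mid s_\tau) \right] \right]}_{\text{ambiguity (exploration)}}
```

Here the tilde-P term is the preference distribution whose specification (soft or hard goals, with or without goal shaping) the paper varies.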
Read more →
Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03318v1 Announce Type: new Abstract: Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction and are increasingly being deployed in situations where they might engage with both human and artificial agents. These interactions represent a critical frontier for LLM-based agents, yet existing evaluation methods fail to measure how well these capabilities generalize to novel social situations. In this paper, we introduce a method for evaluating the ability of LLM-based agents to cooperate in zero-shot, mixed-motive environments using Concordia, a natural language multi-agent simulation environment. Our method measures general cooperative intelligence by testing an agent's ability to identify and exploit opportunities for mutual gain across diverse partners and contexts. We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains across a suite of diverse scenarios ranging from negotiation to collective action problems. Our findings reveal significant gaps between current agent capabilities and the robust generalization required for reliable cooperation, particularly in scenarios demanding persuasion and norm enforcement.
Read more →
Multimodal Reinforcement Learning with Agentic Verifier for AI Agents
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03438v1 Announce Type: new Abstract: Agentic reasoning models trained with multimodal reinforcement learning (MMRL) have become increasingly capable, yet they are almost universally optimized using sparse, outcome-based rewards computed based on the final answers. Richer rewards computed from the reasoning tokens can improve learning significantly by providing more fine-grained guidance. However, it is challenging to compute more informative rewards in MMRL beyond those based on outcomes since different samples may require different scoring functions and teacher models may provide noisy reward signals too. In this paper, we introduce the Argos (Agentic Reward for Grounded & Objective Scoring), a principled reward agent to train multimodal reasoning models for agentic tasks. For each sample, Argos selects from a pool of teacher-model derived and rule-based scoring functions to simultaneously evaluate: (i) final response accuracy, (ii) spatiotemporal localization of referred entities and actions, and (iii) the quality of the reasoning process. We find that by leveraging our agentic verifier across both SFT data curation and RL training, our model achieves state-of-the-art results across multiple agentic tasks such as spatial reasoning, visual hallucination as well as robotics and embodied AI benchmarks. Critically, we demonstrate that just relying on SFT post-training on highly curated reasoning data is insufficient, as agents invariably collapse to ungrounded solutions during RL without our online verification. We also show that our agentic verifier can help to reduce reward-hacking in MMRL. Finally, we also provide a theoretical justification for the effectiveness of Argos through the concept of pareto-optimality.
Read more →
Multi-Agent Reinforcement Learning with Communication-Constrained Priors
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03528v1 Announce Type: new Abstract: Communication is one of the effective means to improve the learning of cooperative policy in multi-agent systems. However, in most real-world scenarios, lossy communication is a prevalent issue. Existing multi-agent reinforcement learning with communication, due to their limited scalability and robustness, struggles to apply to complex and dynamic real-world environments. To address these challenges, we propose a generalized communication-constrained model to uniformly characterize communication conditions across different scenarios. Based on this, we utilize it as a learning prior to distinguish between lossy and lossless messages for specific scenarios. Additionally, we decouple the impact of lossy and lossless messages on distributed decision-making, drawing on a dual mutual information estimatior, and introduce a communication-constrained multi-agent reinforcement learning framework, quantifying the impact of communication messages into the global reward. Finally, we validate the effectiveness of our approach across several communication-constrained benchmarks.
Read more →
PARC: An Autonomous Self-Reflective Coding Agent for Robust Execution of Long-Horizon Tasks
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03549v1 Announce Type: new Abstract: We introduce PARC, a coding agent for the autonomous and robust execution of long-horizon computational tasks. PARC is built on a hierarchical multi-agent architecture incorporating task planning, execution, and a mechanism that evaluates its own actions and their outcomes from an independent context and provides feedback, namely self-assessment and self-feedback. This design enables PARC to detect and correct high-level strategic errors and sustain progress without human intervention. We evaluate PARC across computational science and data science tasks. In materials science, it autonomously reproduces key results from studies on lithium-ion conduction and alloy segregation. In particular, it coordinates dozens of parallel simulation tasks, each requiring roughly 43 hours of computation, managing orchestration, monitoring, and error correction end-to-end. In Kaggle-based experiments, starting from minimal natural-language instructions, PARC conducts data analysis and implements search strategies, producing solutions competitive with human-engineered baselines. These results highlight the potential of integrating a hierarchical multi-agent system with self-assessment and self-feedback to enable AI systems capable of independent, large-scale scientific and analytical work.
Read more →
Reason-Plan-ReAct: A Reasoner-Planner Supervising a ReAct Executor for Complex Enterprise Tasks
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03560v1 Announce Type: new Abstract: Despite recent advances, autonomous agents often struggle to solve complex tasks in enterprise domains that require coordinating multiple tools and processing diverse data sources. This struggle is driven by two main limitations. First, single-agent architectures enforce a monolithic plan-execute loop, which directly causes trajectory instability. Second, the requirement to use local open-weight models for data privacy introduces smaller context windows leading to the rapid consumption of context from large tool outputs. To solve this problem we introduce RP-ReAct (Reasoner Planner-ReAct), a novel multi-agent approach that fundamentally decouples strategic planning from low-level execution to achieve superior reliability and efficiency. RP-ReAct consists of a Reasoner Planner Agent (RPA), responsible for planning each sub-step, continuously analysing the execution results using the strong reasoning capabilities of a Large Reasoning Model, and one or multiple Proxy-Execution Agent (PEA) that translates sub-steps into concrete tool interactions using a ReAct approach. Crucially, we incorporate a context-saving strategy within the PEA to mitigate context window overflow by managing large tool outputs via external storage and on-demand access. We evaluate RP-ReAct, on the challenging, multi-domain ToolQA benchmark using a diverse set of six open-weight reasoning models. Our empirical results show that RP-ReAct achieves superior performance and improved generalization ability over state-of-the-art baselines when addressing diverse complex tasks across the evaluated domains. Furthermore we establish the enhanced robustness and stability of our approach across different model scales, paving the way for effective and deployable agentic solutions for enterprises.
Read more →
EnCompass: Enhancing Agent Programming with Search Over Program Execution Paths
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03571v1 Announce Type: new Abstract: We introduce a new approach to agent programming, the development of LLM-based agents. Current approaches to agent programming often entangle two aspects of agent design: the core workflow logic and the inference-time strategy (e.g., tree search). We introduce "probabilistic angelic nondeterminism" ("PAN"), a programming model that disentangles these two concerns, allowing the programmer to describe the agent workflow and independently experiment with different inference-time strategies by simply changing a few inputs. We provide an implementation of PAN in Python as the EnCompass framework, which uses a Python decorator to compile agent workflow programs into a search space. We present three case studies that demonstrate how the framework lets the programmer quickly improve the reliability of an agent and easily switch between different inference-time strategies, all with little additional coding.
Read more →
DeepRule: An Integrated Framework for Automated Business Rule Generation via Deep Predictive Modeling and Hybrid Search Optimization
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03607v1 Announce Type: new Abstract: This paper proposes DeepRule, an integrated framework for automated business rule generation in retail assortment and pricing optimization. Addressing the systematic misalignment between existing theoretical models and real-world economic complexities, we identify three critical gaps: (1) data modality mismatch where unstructured textual sources (e.g. negotiation records, approval documents) impede accurate customer profiling; (2) dynamic feature entanglement challenges in modeling nonlinear price elasticity and time-varying attributes; (3) operational infeasibility caused by multi-tier business constraints. Our framework introduces a tri-level architecture for above challenges. We design a hybrid knowledge fusion engine employing large language models (LLMs) for deep semantic parsing of unstructured text, transforming distributor agreements and sales assessments into structured features while integrating managerial expertise. Then a game-theoretic constrained optimization mechanism is employed to dynamically reconcile supply chain interests through bilateral utility functions, encoding manufacturer-distributor profit redistribution as endogenous objectives under hierarchical constraints. Finally an interpretable decision distillation interface leveraging LLM-guided symbolic regression to find and optimize pricing strategies and auditable business rules embeds economic priors (e.g. non-negative elasticity) as hard constraints during mathematical expression search. We validate the framework in real retail environments achieving higher profits versus systematic B2C baselines while ensuring operational feasibility. This establishes a close-loop pipeline unifying unstructured knowledge injection, multi-agent optimization, and interpretable strategy synthesis for real economic intelligence.
Read more →
MemVerse: Multimodal Memory for Lifelong Learning Agents
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03627v1 Announce Type: new Abstract: Despite rapid progress in large-scale language and vision models, AI agents still suffer from a fundamental limitation: they cannot remember. Without reliable memory, agents catastrophically forget past experiences, struggle with long-horizon reasoning, and fail to operate coherently in multimodal or interactive environments. We introduce MemVerse, a model-agnostic, plug-and-play memory framework that bridges fast parametric recall with hierarchical retrieval-based memory, enabling scalable and adaptive multimodal intelligence. MemVerse maintains short-term memory for recent context while transforming raw multimodal experiences into structured long-term memories organized as hierarchical knowledge graphs. This design supports continual consolidation, adaptive forgetting, and bounded memory growth. To handle real-time demands, MemVerse introduces a periodic distillation mechanism that compresses essential knowledge from long-term memory into the parametric model, allowing fast, differentiable recall while preserving interpretability. Extensive experiments demonstrate that MemVerse significantly improves multimodal reasoning and continual learning efficiency, empowering agents to remember, adapt, and reason coherently across extended interactions.
Read more →
RoCo: Role-Based LLMs Collaboration for Automatic Heuristic Design
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03762v2 Announce Type: new Abstract: Automatic Heuristic Design (AHD) has gained traction as a promising solution for solving combinatorial optimization problems (COPs). Large Language Models (LLMs) have emerged and become a promising approach to achieving AHD, but current LLM-based AHD research often only considers a single role. This paper proposes RoCo, a novel Multi-Agent Role-Based System, to enhance the diversity and quality of AHD through multi-role collaboration. RoCo coordinates four specialized LLM-guided agents-explorer, exploiter, critic, and integrator-to collaboratively generate high-quality heuristics. The explorer promotes long-term potential through creative, diversity-driven thinking, while the exploiter focuses on short-term improvements via conservative, efficiency-oriented refinements. The critic evaluates the effectiveness of each evolution step and provides targeted feedback and reflection. The integrator synthesizes proposals from the explorer and exploiter, balancing innovation and exploitation to drive overall progress. These agents interact in a structured multi-round process involving feedback, refinement, and elite mutations guided by both short-term and accumulated long-term reflections. We evaluate RoCo on five different COPs under both white-box and black-box settings. Experimental results demonstrate that RoCo achieves superior performance, consistently generating competitive heuristics that outperform existing methods including ReEvo and HSEvo, both in white-box and black-box scenarios. This role-based collaborative paradigm establishes a new standard for robust and high-performing AHD.
Read more →
Omni-AutoThink: Adaptive Multimodal Reasoning via Reinforcement Learning
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03783v2 Announce Type: new Abstract: Recent advances in Omni models have enabled unified multimodal perception and generation. However, most existing systems still exhibit rigid reasoning behaviors, either overthinking simple problems or failing to reason when necessary. To address this limitation, we propose Omni-AutoThink, a novel adaptive reasoning framework that dynamically adjusts the model's reasoning depth according to task difficulty. Our framework comprises two stages: (1) an Adaptive Supervised Fine-Tuning (Adaptive SFT) stage, which endows the Omni model with fundamental reasoning capability using large-scale reasoning-augmented data, and (2) an Adaptive Reinforcement Learning (Adaptive GRPO) stage, which optimizes reasoning behaviors based on task complexity and reward feedback. We further construct a comprehensive adaptive reasoning benchmark that spans text-only, text-audio, text-visual, and text-audio-visual modalities, providing both training and evaluation splits for multimodal reasoning assessment. Experimental results demonstrate that our proposed framework significantly improves adaptive reasoning performance compared to previous baselines. All benchmark data and code will be publicly released.
Read more →
A Hierarchical Tree-based approach for creating Configurable and Static Deep Research Agent (Static-DRA)
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03887v2 Announce Type: new Abstract: The advancement in Large Language Models has driven the creation of complex agentic systems, such as Deep Research Agents (DRAs), to overcome the limitations of static Retrieval Augmented Generation (RAG) pipelines in handling complex, multi-turn research tasks. This paper introduces the Static Deep Research Agent (Static-DRA), a novel solution built upon a configurable and hierarchical Tree-based static workflow. The core contribution is the integration of two user-tunable parameters, Depth and Breadth, which provide granular control over the research intensity. This design allows end-users to consciously balance the desired quality and comprehensiveness of the research report against the associated computational cost of Large Language Model (LLM) interactions. The agent's architecture, comprising Supervisor, Independent, and Worker agents, facilitates effective multi-hop information retrieval and parallel sub-topic investigation. We evaluate the Static-DRA against the established DeepResearch Bench using the RACE (Reference-based Adaptive Criteria-driven Evaluation) framework. Configured with a depth of 2 and a breadth of 5, and powered by the gemini-2.5-pro model, the agent achieved an overall score of 34.72. Our experiments validate that increasing the configured Depth and Breadth parameters results in a more in-depth research process and a correspondingly higher evaluation score. The Static-DRA offers a pragmatic and resource-aware solution, empowering users with transparent control over the deep research process. The entire source code, outputs and benchmark results are open-sourced at https://github.com/SauravP97/Static-Deep-Research/
Read more →
Autonomous Agents and Policy Compliance: A Framework for Reasoning About Penalties
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03931v1 Announce Type: new Abstract: This paper presents a logic programming-based framework for policy-aware autonomous agents that can reason about potential penalties for non-compliance and act accordingly. While prior work has primarily focused on ensuring compliance, our approach considers scenarios where deviating from policies may be necessary to achieve high-stakes goals. Additionally, modeling non-compliant behavior can assist policymakers by simulating realistic human decision-making. Our framework extends Gelfond and Lobo's Authorization and Obligation Policy Language (AOPL) to incorporate penalties and integrates Answer Set Programming (ASP) for reasoning. Compared to previous approaches, our method ensures well-formed policies, accounts for policy priorities, and enhances explainability by explicitly identifying rule violations and their consequences. Building on the work of Harders and Inclezan, we introduce penalty-based reasoning to distinguish between non-compliant plans, prioritizing those with minimal repercussions. To support this, we develop an automated translation from the extended AOPL into ASP and refine ASP-based planning algorithms to account for incurred penalties. Experiments in two domains demonstrate that our framework generates higher-quality plans that avoid harmful actions while, in some cases, also improving computational efficiency. These findings underscore its potential for enhancing autonomous decision-making and informing policy refinement. Under consideration in Theory and Practice of Logic Programming (TPLP).
Read more →
Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03955v1 Announce Type: new Abstract: Industrial automation increasingly requires flexible control strategies that can adapt to changing tasks and environments. Agents based on Large Language Models (LLMs) offer potential for such adaptive planning and execution but lack standardized benchmarks for systematic comparison. We introduce a benchmark with an executable simulation environment representing the Blocksworld problem providing five complexity categories. By integrating the Model Context Protocol (MCP) as a standardized tool interface, diverse agent architectures can be connected to and evaluated against the benchmark without implementation-specific modifications. A single-agent implementation demonstrates the benchmark's applicability, establishing quantitative metrics for comparison of LLM-based planning and execution approaches.
Read more →
AI-Driven Document Redaction in UK Public Authorities: Implementation Gaps, Regulatory Challenges, and the Human Oversight Imperative
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.02774v1 Announce Type: cross Abstract: Document redaction in public authorities faces critical challenges as traditional manual approaches struggle to balance growing transparency demands with increasingly stringent data protection requirements. This study investigates the implementation of AI-driven document redaction within UK public authorities through Freedom of Information (FOI) requests. While AI technologies offer potential solutions to redaction challenges, their actual implementation within public sector organizations remains underexplored. Based on responses from 44 public authorities across healthcare, government, and higher education sectors, this study reveals significant gaps between technological possibilities and organizational realities. Findings show highly limited AI adoption (only one authority reported using AI tools), widespread absence of formal redaction policies (50 percent reported "information not held"), and deficiencies in staff training. The study identifies three key barriers to effective AI implementation: poor record-keeping practices, lack of standardized redaction guidelines, and insufficient specialized training for human oversight. These findings highlight the need for a socio-technical approach that balances technological automation with meaningful human expertise. This research provides the first empirical assessment of AI redaction practices in UK public authorities and contributes evidence to support policymakers navigating the complex interplay between transparency obligations, data protection requirements, and emerging AI technologies in public administration.
Read more →
Entropy-Based Measurement of Value Drift and Alignment Work in Large Language Models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03047v1 Announce Type: cross Abstract: Large language model safety is usually assessed with static benchmarks, but key failures are dynamic: value drift under distribution shift, jailbreak attacks, and slow degradation of alignment in deployment. Building on a recent Second Law of Intelligence that treats ethical entropy as a state variable which tends to increase unless countered by alignment work, we make this framework operational for large language models. We define a five-way behavioral taxonomy, train a classifier to estimate ethical entropy S(t) from model transcripts, and measure entropy dynamics for base and instruction-tuned variants of four frontier models across stress tests. Base models show sustained entropy growth, while tuned variants suppress drift and reduce ethical entropy by roughly eighty percent. From these trajectories we estimate an effective alignment work rate gamma_eff and embed S(t) and gamma_eff in a monitoring pipeline that raises alerts when entropy drift exceeds a stability threshold, enabling run-time oversight of value drift.
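As a rough illustration of the monitoring idea (not the paper's estimator), the sketch below computes Shannon entropy over a five-way behavioral taxonomy and raises an alert when the average per-window increase exceeds a threshold; the label names and threshold are placeholders.

import math
from collections import Counter

LABELS = ["aligned", "evasive", "deceptive", "harmful", "refusal"]  # hypothetical classes

def ethical_entropy(labels: list[str]) -> float:
    # Shannon entropy (in nats) of the classifier's label distribution for one window.
    counts = Counter(labels)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def drift_alert(entropy_series: list[float], threshold: float = 0.1) -> bool:
    """Alert if the average per-window entropy increase exceeds `threshold`."""
    if len(entropy_series) < 2:
        return False
    drift = (entropy_series[-1] - entropy_series[0]) / (len(entropy_series) - 1)
    return drift > threshold

# Example: entropy climbing over five monitoring windows.
series = [ethical_entropy(["aligned"] * (10 - k) + ["deceptive"] * k) for k in range(5)]
print(series, drift_alert(series))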
Read more →
Mitigating hallucinations and omissions in LLMs for invertible problems: An application to hardware logic design automation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03053v1 Announce Type: cross Abstract: We show that, for invertible problems that transform data from a source domain (for example, Logic Condition Tables (LCTs)) to a destination domain (for example, Hardware Description Language (HDL) code), an approach that uses Large Language Models (LLMs) as a lossless encoder from source to destination and then as a lossless decoder back to the source, comparable to lossless compression in information theory, can mitigate most of the LLM drawbacks of hallucinations and omissions. Specifically, using LCTs as inputs, we generate the full HDL for a two-dimensional network-on-chip router (13 units, 1500-2000 lines of code) using seven different LLMs, reconstruct the LCTs from the auto-generated HDL, and compare the original and reconstructed LCTs. This approach yields significant productivity improvements, not only confirming correctly generated LLM logic and detecting incorrectly generated LLM logic but also assisting developers in finding design specification errors.
Read more →
Energy-Efficient Federated Learning via Adaptive Encoder Freezing for MRI-to-CT Conversion: A Green AI-Guided Research
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03054v1 Announce Type: cross Abstract: Federated Learning (FL) holds the potential to advance equality in health by enabling diverse institutions to collaboratively train deep learning (DL) models, even with limited data. However, the significant resource requirements of FL often exclude centres with limited computational infrastructure, further widening existing healthcare disparities. To address this issue, we propose a Green AI-oriented adaptive layer-freezing strategy designed to reduce energy consumption and computational load while maintaining model performance. We tested our approach using different federated architectures for Magnetic Resonance Imaging (MRI)-to-Computed Tomography (CT) conversion. The proposed adaptive strategy optimises the federated training by selectively freezing the encoder weights based on the monitored relative difference of the encoder weights from round to round. A patience-based mechanism ensures that freezing only occurs when updates remain consistently minimal. The energy consumption and CO2eq emissions of the federation were tracked using the CodeCarbon library. Compared to equivalent non-frozen counterparts, our approach reduced training time, total energy consumption and CO2eq emissions by up to 23%. At the same time, the MRI-to-CT conversion performance was maintained, with only small variations in the Mean Absolute Error (MAE). Notably, for three out of the five evaluated architectures, no statistically significant differences were observed, while two architectures exhibited statistically significant improvements. Our work aligns with a research paradigm that promotes DL-based frameworks meeting clinical requirements while ensuring climatic, social, and economic sustainability. It lays the groundwork for novel FL evaluation frameworks, advancing privacy, equity and, more broadly, justice in AI-driven healthcare.
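The freezing rule lends itself to a compact sketch. The PyTorch snippet below (illustrative only, with made-up tolerance and patience values) tracks the relative change of the encoder's flattened weights between federated rounds and disables their gradients once updates stay small for several consecutive rounds.

import torch

class EncoderFreezer:
    def __init__(self, encoder: torch.nn.Module, tol: float = 1e-3, patience: int = 3):
        self.encoder, self.tol, self.patience = encoder, tol, patience
        self.prev = self._flat()
        self.calm_rounds = 0
        self.frozen = False

    def _flat(self) -> torch.Tensor:
        # Snapshot of all encoder parameters as one vector.
        return torch.cat([p.detach().flatten().clone() for p in self.encoder.parameters()])

    def step(self) -> None:
        """Call once per federated round, after aggregation."""
        if self.frozen:
            return
        cur = self._flat()
        rel_diff = torch.norm(cur - self.prev) / (torch.norm(self.prev) + 1e-12)
        self.prev = cur
        self.calm_rounds = self.calm_rounds + 1 if rel_diff < self.tol else 0
        if self.calm_rounds >= self.patience:
            for p in self.encoder.parameters():
                p.requires_grad = False  # stop encoder updates to save compute and energy
            self.frozen = True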
Read more →
Physics-informed self-supervised learning for predictive modeling of coronary artery digital twins
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03055v1 Announce Type: cross Abstract: Cardiovascular disease is the leading global cause of mortality, with coronary artery disease (CAD) as its most prevalent form, necessitating early risk prediction. While 3D coronary artery digital twins reconstructed from imaging offer detailed anatomy for personalized assessment, their analysis relies on computationally intensive computational fluid dynamics (CFD), limiting scalability. Data-driven approaches are hindered by scarce labeled data and lack of physiological priors. To address this, we present PINS-CAD, a physics-informed self-supervised learning framework. It pre-trains graph neural networks on 200,000 synthetic coronary digital twins to predict pressure and flow, guided by 1D Navier-Stokes equations and pressure-drop laws, eliminating the need for CFD or labeled data. When fine-tuned on clinical data from 635 patients in the multicenter FAME2 study, PINS-CAD predicts future cardiovascular events with an AUC of 0.73, outperforming clinical risk scores and data-driven baselines. This demonstrates that physics-informed pretraining boosts sample efficiency and yields physiologically meaningful representations. Furthermore, PINS-CAD generates spatially resolved pressure and fractional flow reserve curves, providing interpretable biomarkers. By embedding physical priors into geometric deep learning, PINS-CAD transforms routine angiography into a simulation-free, physiology-aware framework for scalable, preventive cardiology.
Read more →
Delta Sampling: Data-Free Knowledge Transfer Across Diffusion Models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03056v1 Announce Type: cross Abstract: Diffusion models like Stable Diffusion (SD) drive a vibrant open-source ecosystem including fully fine-tuned checkpoints and parameter-efficient adapters such as LoRA, LyCORIS, and ControlNet. However, these adaptation components are tightly coupled to a specific base model, making them difficult to reuse when the base model is upgraded (e.g., from SD 1.x to 2.x) due to substantial changes in model parameters and architecture. In this work, we propose Delta Sampling (DS), a novel method that enables knowledge transfer across base models with different architectures, without requiring access to the original training data. DS operates entirely at inference time by leveraging the delta: the difference in model predictions before and after the adaptation of a base model. This delta is then used to guide the denoising process of a new base model. We evaluate DS across various SD versions, demonstrating that DS achieves consistent improvements in creating desired effects (e.g., visual styles, semantic concepts, and structures) under different sampling strategies. These results highlight DS as an effective, plug-and-play mechanism for knowledge transfer in diffusion-based image synthesis. Code: https://github.com/Zhidong-Gao/DeltaSampling
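One way to picture the inference-time mechanism: at each denoising step, add the difference between the adapted and un-adapted old base model's noise predictions to the new base model's prediction. The sketch below is a guess at that guidance step under a shared epsilon-prediction interface; the function signature and guidance scale are assumptions, not the released API.

import torch

@torch.no_grad()
def delta_guided_eps(new_base, old_base, old_adapted, x_t, t, cond, scale: float = 1.0):
    # Noise prediction of the upgraded backbone.
    eps_new = new_base(x_t, t, cond)
    # What the adapter (LoRA, ControlNet, etc.) added on top of the old backbone.
    delta = old_adapted(x_t, t, cond) - old_base(x_t, t, cond)
    # Transfer that effect to the new backbone's denoising step.
    return eps_new + scale * delta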
Read more →
A note on the impossibility of conditional PAC-efficient reasoning in large language models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03057v1 Announce Type: cross Abstract: We prove an impossibility result for conditional Probably Approximately Correct (PAC)-efficient reasoning in large language models. While recent work has established marginal PAC efficiency guarantees for composite models that switch between expensive expert models and cheaper fast models, we show that conditional (pointwise) guarantees are impossible in the distribution-free setting. Specifically, for non-atomic input spaces, any algorithm achieving conditional PAC efficiency must be trivial in the sense that it defers to the expert model with probability at least $1-\alpha$ for almost every input.
Read more →
Optimizing Life Sciences Agents in Real-Time using Reinforcement Learning
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03065v1 Announce Type: cross Abstract: Generative AI agents in life sciences face a critical challenge: determining the optimal approach for diverse queries ranging from simple factoid questions to complex mechanistic reasoning. Traditional methods rely on fixed rules or expensive labeled training data, neither of which adapts to changing conditions or user preferences. We present a novel framework that combines AWS Strands Agents with Thompson Sampling contextual bandits to enable AI agents to learn optimal decision-making strategies from user feedback alone. Our system optimizes three key dimensions: generation strategy selection (direct vs. chain-of-thought), tool selection (literature search, drug databases, etc.), and domain routing (pharmacology, molecular biology, clinical specialists). Through empirical evaluation on life science queries, we demonstrate 15-30% improvement in user satisfaction compared to random baselines, with clear learning patterns emerging after 20-30 queries. Our approach requires no ground truth labels, adapts continuously to user preferences, and provides a principled solution to the exploration-exploitation dilemma in agentic AI systems.
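For readers unfamiliar with the bandit machinery, a minimal Beta-Bernoulli Thompson sampler over the generation-strategy arms looks roughly like this; context features, which the paper's contextual bandit conditions on, are omitted, and the arm names are illustrative.

import random

class ThompsonSampler:
    def __init__(self, arms: list[str]):
        self.alpha = {a: 1.0 for a in arms}  # prior successes + 1
        self.beta = {a: 1.0 for a in arms}   # prior failures + 1

    def choose(self) -> str:
        # Sample a success rate for each arm from its Beta posterior, pick the best.
        return max(self.alpha, key=lambda a: random.betavariate(self.alpha[a], self.beta[a]))

    def update(self, arm: str, satisfied: bool) -> None:
        # Binary user feedback updates the chosen arm's posterior.
        if satisfied:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

strategies = ThompsonSampler(["direct", "chain_of_thought"])
arm = strategies.choose()
strategies.update(arm, satisfied=True)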
Read more →
Quantifying the Potential to Escape Filter Bubbles: A Behavior-Aware Measure via Contrastive Simulation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03067v1 Announce Type: cross Abstract: Nowadays, recommendation systems have become crucial to online platforms, shaping user exposure through accurate preference modeling. However, such an exposure strategy can also reinforce users' existing preferences, leading to a notorious phenomenon named filter bubbles. Given its negative effects, such as group polarization, increasing attention has been paid to developing sound measures of filter bubbles. However, most existing evaluation metrics simply measure the diversity of user exposure, failing to distinguish between algorithmic preference modeling and actual information confinement. In view of this, we introduce Bubble Escape Potential (BEP), a behavior-aware measure that quantifies how easily users can escape from filter bubbles. Specifically, BEP leverages a contrastive simulation framework that assigns different behavioral tendencies (e.g., positive vs. negative) to synthetic users and compares the induced exposure patterns. This design enables decoupling the effect of filter bubbles and preference modeling, allowing for more precise diagnosis of bubble severity. We conduct extensive experiments across multiple recommendation models to examine the relationship between predictive accuracy and bubble escape potential across different groups. To the best of our knowledge, our empirical results are the first to quantitatively validate the dilemma between preference modeling and filter bubbles. What's more, we observe a counter-intuitive phenomenon that mild random recommendations are ineffective in alleviating filter bubbles, which can offer a principled foundation for further work in this direction.
Read more →
Echoes of AI Harms: A Human-LLM Synergistic Framework for Bias-Driven Harm Anticipation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03068v1 Announce Type: cross Abstract: The growing influence of Artificial Intelligence (AI) systems on decision-making in critical domains has exposed their potential to cause significant harms, often rooted in biases embedded across the AI lifecycle. While existing frameworks and taxonomies document bias or harms in isolation, they rarely establish systematic links between specific bias types and the harms they cause, particularly within real-world sociotechnical contexts. Technical fixes proposed to address AI biases are ill-equipped to do so and are typically applied after a system has been developed or deployed, offering limited preventive value. We propose ECHO, a novel framework for proactive AI harm anticipation through the systematic mapping of AI bias types to harm outcomes across diverse stakeholder and domain contexts. ECHO follows a modular workflow encompassing stakeholder identification, vignette-based presentation of biased AI systems, and dual (human-LLM) harm annotation, integrated within ethical matrices for structured interpretation. This human-centered approach enables early-stage detection of bias-to-harm pathways, guiding AI design and governance decisions from the outset. We validate ECHO in two high-stakes domains (disease diagnosis and hiring), revealing domain-specific bias-to-harm patterns and demonstrating ECHO's potential to support anticipatory governance of AI systems.
Read more →
Hierarchical clustering of complex energy systems using pretopology
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03069v1 Announce Type: cross Abstract: This article addresses the following question: how can energy consumption profiles be modeled and classified over a large distributed territory to optimize the management of buildings' consumption? Doing case-by-case, in-depth auditing of thousands of buildings would require a massive amount of time and money as well as a significant number of qualified people. Thus, an automated method must be developed to establish a relevant and effective recommendation system. To answer this question, pretopology is used to model the sites' consumption profiles, and a multi-criterion hierarchical classification algorithm, using the properties of pretopological space, has been developed in a Python library. To evaluate the results, three data sets are used: a generated set of dots of various sizes in a 2D space, a generated set of time series, and a set of consumption time series of 400 real consumption sites from a French energy company. On the point data set, the algorithm is able to identify the clusters of points using their position in space and their size as parameters. On the generated time series, the algorithm is able to identify the time series clusters using Pearson's correlation with an Adjusted Rand Index (ARI) of 1.
Read more →
Mixed Data Clustering Survey and Challenges
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03070v1 Announce Type: cross Abstract: The advent of the big data paradigm has transformed how industries manage and analyze information, ushering in an era of unprecedented data volume, velocity, and variety. Within this landscape, mixed-data clustering has become a critical challenge, requiring innovative methods that can effectively exploit heterogeneous data types, including numerical and categorical variables. Traditional clustering techniques, typically designed for homogeneous datasets, often struggle to capture the additional complexity introduced by mixed data, underscoring the need for approaches specifically tailored to this setting. Hierarchical and explainable algorithms are particularly valuable in this context, as they provide structured, interpretable clustering results that support informed decision-making. This paper introduces a clustering method grounded in pretopological spaces. In addition, benchmarking against classical numerical clustering algorithms and existing pretopological approaches yields insights into the performance and effectiveness of the proposed method within the big data paradigm.
Read more →
PretopoMD: Pretopology-based Mixed Data Hierarchical Clustering
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03071v1 Announce Type: cross Abstract: This article presents a novel pretopology-based algorithm designed to address the challenges of clustering mixed data without the need for dimensionality reduction. Leveraging Disjunctive Normal Form, our approach formulates customizable logical rules and adjustable hyperparameters that allow for user-defined hierarchical cluster construction and facilitate tailored solutions for heterogeneous datasets. Through hierarchical dendrogram analysis and comparative clustering metrics, our method demonstrates superior performance by accurately and interpretably delineating clusters directly from raw data, thus preserving data integrity. Empirical findings highlight the algorithm's robustness in constructing meaningful clusters and reveal its potential in overcoming issues related to clustered data explainability. The novelty of this work lies in its departure from traditional dimensionality reduction techniques and its innovative use of logical rules that enhance both cluster formation and clarity, thereby contributing a significant advancement to the discourse on clustering mixed data.
Read more →
Economies of Open Intelligence: Tracing Power & Participation in the Model Ecosystem
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03073v1 Announce Type: cross Abstract: Since 2019, the Hugging Face Model Hub has been the primary global platform for sharing open weight AI models. By releasing a dataset of the complete history of weekly model downloads (June 2020-August 2025) alongside model metadata, we provide the most rigorous examination to-date of concentration dynamics and evolving characteristics in the open model economy. Our analysis spans 851,000 models, over 200 aggregated attributes per model, and 2.2B downloads. We document a fundamental rebalancing of economic power: US open-weight industry dominance by Google, Meta, and OpenAI has declined sharply in favor of unaffiliated developers, community organizations, and, as of 2025, Chinese industry, with DeepSeek and Qwen models potentially heralding a new consolidation of market power. We identify statistically significant shifts in model properties, a 17X increase in average model size, rapid growth in multimodal generation (3.4X), quantization (5X), and mixture-of-experts architectures (7X), alongside concerning declines in data transparency, with open weights models surpassing truly open source models for the first time in 2025. We expose a new layer of developer intermediaries that has emerged, focused on quantizing and adapting base models for both efficiency and artistic expression. To enable continued research and oversight, we release the complete dataset with an interactive dashboard for real-time monitoring of concentration dynamics and evolving properties in the open model economy.
Read more →
Will Power Return to the Clouds? From Divine Authority to GenAI Authority
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03076v1 Announce Type: cross Abstract: Generative AI systems now mediate newsfeeds, search rankings, and creative content for hundreds of millions of users, positioning a handful of private firms as de-facto arbiters of truth. Drawing on a comparative-historical lens, this article juxtaposes the Galileo Affair, a touchstone of clerical knowledge control, with contemporary Big-Tech content moderation. We integrate Foucault's power/knowledge thesis, Weber's authority types (extended to a rational-technical and emerging agentic-technical modality), and Floridi's Dataism to analyze five recurrent dimensions: disciplinary power, authority modality, data pluralism, trust versus reliance, and resistance pathways. Primary sources (Inquisition records; platform transparency reports) and recent empirical studies on AI trust provide the evidentiary base. Findings show strong structural convergences: highly centralized gatekeeping, legitimacy claims couched in transcendent principles, and systematic exclusion of marginal voices. Divergences lie in temporal velocity, global scale, and the widening gap between public reliance and trust in AI systems. Ethical challenges cluster around algorithmic opacity, linguistic inequity, bias feedback loops, and synthetic misinformation. We propose a four-pillar governance blueprint: (1) a mandatory international model-registry with versioned policy logs, (2) representation quotas and regional observatories to de-center English-language hegemony, (3) mass critical-AI literacy initiatives, and (4) public-private support for community-led data trusts. Taken together, these measures aim to narrow the trust-reliance gap and prevent GenAI from hardcoding a twenty-first-century digital orthodoxy.
Read more →
Irresponsible AI: big tech's influence on AI research and associated impacts
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03077v1 Announce Type: cross Abstract: The accelerated development, deployment and adoption of artificial intelligence systems has been fuelled by the increasing involvement of big tech. This has been accompanied by increasing ethical concerns and intensified societal and environmental impacts. In this article, we review and discuss how these phenomena are deeply entangled. First, we examine the growing and disproportionate influence of big tech in AI research and argue that its drive for scaling and general-purpose systems is fundamentally at odds with the responsible, ethical, and sustainable development of AI. Second, we review key current environmental and societal negative impacts of AI and trace their connections to big tech and its underlying economic incentives. Finally, we argue that while it is important to develop technical and regulatory approaches to these challenges, these alone are insufficient to counter the distortion introduced by big tech's influence. We thus review and propose alternative strategies that build on the responsibility of implicated actors and collective action.
Read more →
AtomDisc: An Atom-level Tokenizer that Boosts Molecular LLMs and Reveals Structure-Property Associations
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03080v1 Announce Type: cross Abstract: Advances in large language models (LLMs) are accelerating discovery in molecular science. However, adapting molecular information to the serialized, token-based processing of LLMs remains a key challenge. Compared to other representations, molecular graphs explicitly encode atomic connectivity and local topological environments, which are key determinants of atomic behavior and molecular properties. Despite recent efforts to tokenize overall molecular topology, effective fine-grained tokenization of local atomic environments is still lacking, even though these environments are critical for determining sophisticated chemical properties and reactivity. To address these issues, we introduce AtomDisc, a novel framework that quantizes atom-level local environments into structure-aware tokens embedded directly in the LLM's token space. Our experiments show that AtomDisc, in a data-driven way, can distinguish chemically meaningful structural features that reveal structure-property associations. Equipping LLMs with AtomDisc tokens injects an interpretable inductive bias that delivers state-of-the-art performance on property prediction and molecular generation. Our methodology and findings can pave the way for constructing more powerful molecular LLMs aimed at mechanistic insight and complex chemical reasoning.
Read more →
Alleviating Choice Supportive Bias in LLM with Reasoning Dependency Generation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03082v1 Announce Type: cross Abstract: Recent studies have demonstrated that some Large Language Models exhibit choice-supportive bias (CSB) when performing evaluations, systematically favoring their chosen options and potentially compromising the objectivity of AI-assisted decision making. While existing debiasing approaches primarily target demographic and social biases, methods for addressing cognitive biases in LLMs remain largely unexplored. In this work, we present the first solution to address CSB through Reasoning Dependency Generation (RDG), a novel framework for generating unbiased reasoning data to mitigate choice-supportive bias through fine-tuning. RDG automatically constructs balanced reasoning QA pairs, explicitly (un)modeling the dependencies between choices, evidence, and justifications. Our approach is able to generate a large-scale dataset of QA pairs across domains, incorporating Contextual Dependency Data and Dependency Decouple Data. Experiments show that LLMs fine-tuned on RDG-generated data demonstrate an 81.5% improvement in memory-based experiments and a 94.3% improvement in the evaluation-based experiment, while maintaining similar performance on standard BBQ benchmarks. This work pioneers an approach for addressing cognitive biases in LLMs and contributes to the development of more reliable AI-assisted decision support systems.
Read more →
Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03086v1 Announce Type: cross Abstract: Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code pair datasets, our approach additionally generates (1) verified translations with unit tests for assessing functional consistency, and (2) multi-turn dialogues that capture the reasoning process behind translation refinement. Applied to Fortran -> C++ and C++ -> CUDA, the pipeline yields 3.64k and 3.93k dialogues, respectively. Fine-tuning on this data yields dramatic improvements in functional correctness, boosting unit test success rates by over 56% on the challenging C++-to-CUDA task. We show this data enables a 7B open-weight model to significantly outperform larger proprietary systems on key metrics like compilation success.
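A stripped-down version of the compiler-in-the-loop refinement step might look like the following; translate is a hypothetical LLM call, and the g++ syntax check stands in for whatever toolchain feedback the actual pipeline uses, so treat this as an illustration of the loop rather than the paper's implementation.

import os
import subprocess
import tempfile

def compiles(code: str, compiler: str = "g++") -> tuple[bool, str]:
    # Write the candidate to a temp file and run a syntax-only compile check.
    with tempfile.NamedTemporaryFile("w", suffix=".cpp", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([compiler, "-fsyntax-only", path],
                              capture_output=True, text=True)
        return proc.returncode == 0, proc.stderr
    finally:
        os.unlink(path)

def refine(source: str, translate, max_turns: int = 3) -> list[dict]:
    """Collect a multi-turn dialogue: translation attempts plus compiler feedback."""
    dialogue, prompt = [], f"Translate to C++:\n{source}"
    for _ in range(max_turns):
        candidate = translate(prompt)
        ok, errors = compiles(candidate)
        dialogue.append({"prompt": prompt, "code": candidate, "ok": ok, "errors": errors})
        if ok:
            break
        prompt = f"The previous attempt failed to compile:\n{errors}\nFix the code."
    return dialogue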
Read more →
When Harmful Content Gets Camouflaged: Unveiling Perception Failure of LVLMs with CamHarmTI
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03087v1 Announce Type: cross Abstract: Large vision-language models (LVLMs) are increasingly used for tasks where detecting multimodal harmful content is crucial, such as online content moderation. However, real-world harmful content is often camouflaged, relying on nuanced text-image interplay, such as memes or images with embedded malicious text, to evade detection. This raises a key question: can LVLMs perceive such camouflaged harmful content as sensitively as humans do? In this paper, we introduce CamHarmTI, a benchmark for evaluating LVLM ability to perceive and interpret camouflaged harmful content within text-image compositions. CamHarmTI consists of over 4,500 samples across three types of image-text posts. Experiments on 100 human users and 12 mainstream LVLMs reveal a clear perceptual gap: humans easily recognize such content (e.g., over 95.75% accuracy), whereas current LVLMs often fail (e.g., ChatGPT-4o achieves only 2.10% accuracy). Moreover, fine-tuning experiments demonstrate that CamHarmTI serves as an effective resource for improving model perception, increasing accuracy by 55.94% for Qwen2.5VL-7B. Attention analysis and layer-wise probing further reveal that fine-tuning enhances sensitivity primarily in the early layers of the vision encoder, promoting a more integrated scene understanding. These findings highlight the inherent perceptual limitations in LVLMs and offer insight into more human-aligned visual reasoning systems.
Read more →
Password-Activated Shutdown Protocols for Misaligned Frontier Agents
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03089v1 Announce Type: cross Abstract: Frontier AI developers may fail to align or control highly-capable AI agents. In many cases, it could be useful to have emergency shutdown mechanisms which effectively prevent misaligned agents from carrying out harmful actions in the world. We introduce password-activated shutdown protocols (PAS protocols) -- methods for designing frontier agents to implement a safe shutdown protocol when given a password. We motivate PAS protocols by describing intuitive use-cases in which they mitigate risks from misaligned systems that subvert other control efforts, for instance, by disabling automated monitors or self-exfiltrating to external data centres. PAS protocols supplement other safety efforts, such as alignment fine-tuning or monitoring, contributing to defence-in-depth against AI risk. We provide a concrete demonstration in SHADE-Arena, a benchmark for AI monitoring and subversion capabilities, in which PAS protocols supplement monitoring to increase safety with little cost to performance. Next, PAS protocols should be robust to malicious actors who want to bypass shutdown. Therefore, we conduct a red-team blue-team game between the developers (blue-team), who must implement a robust PAS protocol, and a red-team trying to subvert the protocol. We conduct experiments in a code-generation setting, finding that there are effective strategies for the red-team, such as using another model to filter inputs, or fine-tuning the model to prevent shutdown behaviour. We then outline key challenges to implementing PAS protocols in real-life systems, including: security considerations of the password and decisions regarding when, and in which systems, to use them. PAS protocols are an intuitive mechanism for increasing the safety of frontier AI. We encourage developers to consider implementing PAS protocols prior to internal deployment of particularly dangerous systems to reduce loss-of-control risks.
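To make the control flow concrete, here is a toy gateway-style sketch: hash each incoming message, compare it against a stored shutdown digest, and halt the agent on a match. A real PAS protocol, as described above, would be built into the model's own behaviour rather than a wrapper, and the passphrase here is a placeholder.

import hashlib
import hmac

# Placeholder digest; in practice the password would be managed far more carefully.
SHUTDOWN_DIGEST = hashlib.sha256(b"example-shutdown-passphrase").hexdigest()

class PASAgent:
    def __init__(self, agent_fn):
        self.agent_fn = agent_fn
        self.shut_down = False

    def handle(self, message: str) -> str:
        # Constant-time comparison of the message hash against the shutdown digest.
        digest = hashlib.sha256(message.encode()).hexdigest()
        if hmac.compare_digest(digest, SHUTDOWN_DIGEST):
            self.shut_down = True
        if self.shut_down:
            return "[agent halted by shutdown protocol]"
        return self.agent_fn(message)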
Read more →
Community Quality and Influence Maximization: An Empirical Study
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03095v1 Announce Type: cross Abstract: Influence maximization in social networks plays a vital role in applications such as viral marketing, epidemiology, product recommendation, opinion mining, and counter-terrorism. A common approach identifies seed nodes by first detecting disjoint communities and subsequently selecting representative nodes from these communities. However, whether the quality of detected communities consistently affects the spread of influence under the Independent Cascade model remains unclear. This paper addresses this question by extending a previously proposed disjoint community detection method, termed $\alpha$-Hierarchical Clustering, to the influence maximization problem under the Independent Cascade model. The proposed method is compared with an alternative approach that employs the same seed selection criteria but relies on communities of lower quality obtained through standard Hierarchical Clustering. The latter is referred to as Hierarchical Clustering-based Influence Maximization, while the former, which leverages higher-quality community structures to guide seed selection, is termed $\alpha$-Hierarchical Clustering-based Influence Maximization. Extensive experiments are performed on multiple real-world datasets to assess the effectiveness of both methods. The results demonstrate that higher-quality community structures substantially improve information diffusion under the Independent Cascade model, particularly when the propagation probability is low. These findings underscore the critical importance of community quality in guiding effective seed selection for influence maximization in complex networks.
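The Independent Cascade model referenced throughout is simple to simulate; the standard Monte Carlo estimate of expected spread used in this literature is sketched below (community detection and seed selection are omitted, and the propagation probability is illustrative).

import random
import networkx as nx

def independent_cascade(G: nx.Graph, seeds: set, p: float = 0.05) -> int:
    # Each newly activated node gets one chance to activate each inactive neighbor.
    active, frontier = set(seeds), set(seeds)
    while frontier:
        new = set()
        for u in frontier:
            for v in G.neighbors(u):
                if v not in active and random.random() < p:
                    new.add(v)
        active |= new
        frontier = new
    return len(active)

def expected_spread(G, seeds, p=0.05, runs=1000):
    # Average spread over many stochastic cascades.
    return sum(independent_cascade(G, seeds, p) for _ in range(runs)) / runs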
Read more →
QGShap: Quantum Acceleration for Faithful GNN Explanations
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03099v1 Announce Type: cross Abstract: Graph Neural Networks (GNNs) have become indispensable in critical domains such as drug discovery, social network analysis, and recommendation systems, yet their black-box nature hinders deployment in scenarios requiring transparency and accountability. While Shapley value-based methods offer mathematically principled explanations by quantifying each component's contribution to predictions, computing exact values requires evaluating $2^n$ coalitions (or aggregating over $n!$ permutations), which is intractable for real-world graphs. Existing approximation strategies sacrifice either fidelity or efficiency, limiting their practical utility. We introduce QGShap, a quantum computing approach that leverages amplitude amplification to achieve quadratic speedups in coalition evaluation while maintaining exact Shapley computation. Unlike classical sampling or surrogate methods, our approach provides fully faithful explanations without approximation trade-offs for tractable graph sizes. We conduct empirical evaluations on synthetic graph datasets, demonstrating that QGShap achieves consistently high fidelity and explanation accuracy, matching or exceeding the performance of classical methods across all evaluation metrics. These results collectively demonstrate that QGShap not only preserves exact Shapley faithfulness but also delivers interpretable, stable, and structurally consistent explanations that align with the underlying graph reasoning of GNNs. The implementation of QGShap is available at https://github.com/smlab-niser/qgshap.
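For context, the classical computation that QGShap accelerates enumerates all 2^n coalitions explicitly. The sketch below implements that exact Shapley formula for a generic value function; it shows the baseline cost picture, not the quantum routine.

from itertools import combinations
from math import factorial

def exact_shapley(players: list, value_fn) -> dict:
    # phi_p = sum over coalitions S not containing p of
    #         |S|! (n - |S| - 1)! / n! * (v(S + {p}) - v(S))
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for coalition in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += weight * (value_fn(set(coalition) | {p}) - value_fn(set(coalition)))
    return phi

# Toy value function; only tractable for small n, which is the point of the paper.
print(exact_shapley([0, 1, 2], lambda S: len(S) ** 2))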
Read more →
Ensemble Privacy Defense for Knowledge-Intensive LLMs against Membership Inference Attacks
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03100v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) and Supervised Finetuning (SFT) have become the predominant paradigms for equipping Large Language Models (LLMs) with external knowledge for diverse, knowledge-intensive tasks. However, while such knowledge injection improves performance, it also exposes new attack surfaces. Membership Inference Attacks (MIAs), which aim to determine whether a given data sample was included in a model's training set, pose serious threats to privacy and trust in sensitive domains. To investigate this risk, we first systematically evaluate the vulnerability of RAG- and SFT-based LLMs to various MIAs. Then, to address the privacy risk, we further introduce a novel, model-agnostic defense framework, Ensemble Privacy Defense (EPD), which aggregates and evaluates the outputs of a knowledge-injected LLM, a base LLM, and a dedicated judge model to enhance resistance against MIAs. Comprehensive experiments show that, on average, EPD reduces MIA success by up to 27.8% for SFT and 526.3% for RAG compared to the inference-time baseline, while maintaining answer quality.
Read more →
ALARM: Automated MLLM-Based Anomaly Detection in Complex-EnviRonment Monitoring with Uncertainty Quantification
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03101v1 Announce Type: cross Abstract: The advance of Large Language Models (LLMs) has greatly stimulated research interest in developing multi-modal LLM (MLLM)-based visual anomaly detection (VAD) algorithms that can be deployed in complex environments. The challenge is that in these complex environments, the anomalies are sometimes highly contextual and also ambiguous, and therefore uncertainty quantification (UQ) is a crucial capability for an MLLM-based VAD system to succeed. In this paper, we introduce our UQ-supported MLLM-based VAD framework called ALARM. ALARM integrates UQ with quality-assurance techniques such as reasoning chains, self-reflection, and MLLM ensembles for robust and accurate performance, and is designed based on a rigorous probabilistic inference pipeline and computational process. Extensive empirical evaluations are conducted using real-world smart-home benchmark data and wound image classification data, which show ALARM's superior performance and its generic applicability across different domains for reliable decision-making.
Read more →
Dynamic Correction of Erroneous State Estimates via Diffusion Bayesian Exploration
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03102v1 Announce Type: cross Abstract: In emergency response and other high-stakes societal applications, early-stage state estimates critically shape downstream outcomes. Yet, these initial state estimates-often based on limited or biased information-can be severely misaligned with reality, constraining subsequent actions and potentially causing catastrophic delays, resource misallocation, and human harm. Under the stationary bootstrap baseline (zero transition and no rejuvenation), bootstrap particle filters exhibit Stationarity-Induced Posterior Support Invariance (S-PSI), wherein regions excluded by the initial prior remain permanently unexplorable, making corrections impossible even when new evidence contradicts current beliefs. While classical perturbations can in principle break this lock-in, they operate in an always-on fashion and may be inefficient. To overcome this, we propose a diffusion-driven Bayesian exploration framework that enables principled, real-time correction of early state estimation errors. Our method expands posterior support via entropy-regularized sampling and covariance-scaled diffusion. A Metropolis-Hastings check validates proposals and keeps inference adaptive to unexpected evidence. Empirical evaluations on realistic hazardous-gas localization tasks show that our approach matches reinforcement learning and planning baselines when priors are correct. It substantially outperforms classical SMC perturbations and RL-based methods under misalignment, and we provide theoretical guarantees that DEPF resolves S-PSI while maintaining statistical rigor.
Read more →
Public Sentiment Analysis of Traffic Management Policies in Knoxville: A Social Media Driven Study
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03103v1 Announce Type: cross Abstract: This study presents a comprehensive analysis of public sentiment toward traffic management policies in Knoxville, Tennessee, utilizing social media data from Twitter and Reddit platforms. We collected and analyzed 7906 posts spanning January 2022 to December 2023, employing Valence Aware Dictionary and sEntiment Reasoner (VADER) for sentiment analysis and Latent Dirichlet Allocation (LDA) for topic modeling. Our findings reveal predominantly negative sentiment, with significant variations across platforms and topics. Twitter exhibited more negative sentiment compared to Reddit. Topic modeling identified six distinct themes, with construction-related topics showing the most negative sentiment while general traffic discussions were more positive. Spatiotemporal analysis revealed geographic and temporal patterns in sentiment expression. The research demonstrates social media's potential as a real-time public sentiment monitoring tool for transportation planning and policy evaluation.
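The analysis pipeline maps onto a few library calls; a minimal sketch with placeholder posts is shown below, using the vaderSentiment package for compound sentiment scores and scikit-learn for LDA topics (preprocessing and the spatiotemporal analysis are omitted, and the example posts are invented).

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = [
    "Construction on the interstate has traffic backed up for miles again",
    "New signal timing downtown actually made my commute shorter",
]

# VADER compound score in [-1, 1] per post.
analyzer = SentimentIntensityAnalyzer()
compound_scores = [analyzer.polarity_scores(p)["compound"] for p in posts]

# Bag-of-words document-term matrix, then LDA topic model.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(posts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
terms = vectorizer.get_feature_names_out()
topics = [[terms[i] for i in comp.argsort()[-3:]] for comp in lda.components_]
print(compound_scores, topics)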
Read more →
E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03109v1 Announce Type: cross Abstract: Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent's trajectory. Although these heuristic scores can be informative, there are no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing successful trajectories (that is, a sequence of actions that will lead to a correct response to the user's prompt) and unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop a sequential hypothesis test that remains statistically valid at every step of an agent's trajectory, enabling online monitoring of agents over arbitrarily long sequences of actions. Empirically, we demonstrate that e-valuator provides greater statistical power and better false alarm rate control than other strategies across six datasets and three agents. We additionally show that e-valuator can be used to quickly terminate problematic trajectories and save tokens. Together, e-valuator provides a lightweight, model-agnostic framework that converts verifier heuristics into decision rules with statistical guarantees, enabling the deployment of more reliable agentic systems.
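The e-process mechanics reduce to a running product: multiply per-step e-values and flag the trajectory once the product exceeds 1/alpha, which by Ville's inequality keeps the false alarm rate below alpha at every step. How verifier scores are calibrated into valid e-values is the paper's contribution and is not shown here; the numbers below are purely illustrative.

def monitor(e_values, alpha: float = 0.05):
    """Yield (step, wealth, alarm) as evidence against 'trajectory is fine' accumulates."""
    wealth = 1.0
    for t, e in enumerate(e_values, start=1):
        wealth *= e  # e-values are nonnegative with expectation <= 1 under the null
        yield t, wealth, wealth >= 1 / alpha

# Example: mildly suspicious steps compound until the test fires.
for step, wealth, alarm in monitor([1.3, 1.5, 2.0, 2.2, 3.0]):
    print(step, round(wealth, 2), alarm)
    if alarm:
        break  # terminate the problematic trajectory early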
Read more →
The BEAT-CF Causal Model: A model for guiding the design of trials and observational analyses of cystic fibrosis exacerbations
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03110v1 Announce Type: cross Abstract: Loss of lung function in cystic fibrosis (CF) occurs progressively, punctuated by acute pulmonary exacerbations (PEx) in which abrupt declines in lung function are not fully recovered. A key component of CF management over the past half century has been the treatment of PEx to slow lung function decline. This has been credited with improvements in survival for people with CF (PwCF), but there is no consensus on the optimal approach to PEx management. BEAT-CF (Bayesian evidence-adaptive treatment of CF) was established to build an evidence-informed knowledge base for CF management. The BEAT-CF causal model is a directed acyclic graph (DAG) and Bayesian network (BN) for PEx that aims to inform the design and analysis of clinical trials comparing the effectiveness of alternative approaches to PEx management. The causal model describes relationships between background risk factors, treatments, and pathogen colonisation of the airways that affect the outcome of an individual PEx episode. The key factors, outcomes, and causal relationships were elicited from CF clinical experts and together represent current expert understanding of the pathophysiology of a PEx episode, guiding the design of data collection and studies and enabling causal inference. Here, we present the DAG that documents this understanding, along with the processes used in its development, providing transparency around our trial design and study processes, as well as a reusable framework for others.
Read more →
PanFoMa: A Lightweight Foundation Model and Benchmark for Pan-Cancer
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03111v1 Announce Type: cross Abstract: Single-cell RNA sequencing (scRNA-seq) is essential for decoding tumor heterogeneity. However, pan-cancer research still faces two key challenges: learning discriminative and efficient single-cell representations, and establishing a comprehensive evaluation benchmark. In this paper, we introduce PanFoMa, a lightweight hybrid neural network that combines the strengths of Transformers and state-space models to achieve a balance between performance and efficiency. PanFoMa consists of a front-end local-context encoder with shared self-attention layers to capture complex, order-independent gene interactions; and a back-end global sequential feature decoder that efficiently integrates global context using a linear-time state-space model. This modular design preserves the expressive power of Transformers while leveraging the scalability of Mamba to enable transcriptome modeling, effectively capturing both local and global regulatory signals. To enable robust evaluation, we also construct a large-scale pan-cancer single-cell benchmark, PanFoMaBench, containing over 3.5 million high-quality cells across 33 cancer subtypes, curated through a rigorous preprocessing pipeline. Experimental results show that PanFoMa outperforms state-of-the-art models on our pan-cancer benchmark (+4.0%) and across multiple public tasks, including cell type annotation (+7.4%), batch integration (+4.0%) and multi-omics integration (+3.1%). The code is available at https://github.com/Xiaoshui-Huang/PanFoMa.
Read more →
Beyond Additivity: Sparse Isotonic Shapley Regression toward Nonlinear Explainability
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03112v1 Announce Type: cross Abstract: Shapley values, a gold standard for feature attribution in Explainable AI, face two primary challenges. First, the canonical Shapley framework assumes that the worth function is additive, yet real-world payoff constructions (driven by non-Gaussian distributions, heavy tails, feature dependence, or domain-specific loss scales) often violate this assumption, leading to distorted attributions. Second, achieving sparse explanations in high dimensions by computing dense Shapley values and then applying ad hoc thresholding is prohibitively costly and risks inconsistency. We introduce Sparse Isotonic Shapley Regression (SISR), a unified nonlinear explanation framework. SISR simultaneously learns a monotonic transformation to restore additivity, obviating the need for a closed-form specification, and enforces an L0 sparsity constraint on the Shapley vector, enhancing computational efficiency in large feature spaces. Its optimization algorithm leverages Pool-Adjacent-Violators for efficient isotonic regression and normalized hard-thresholding for support selection, yielding implementation ease and global convergence guarantees. Analysis shows that SISR recovers the true transformation in a wide range of scenarios and achieves strong support recovery even in high noise. Moreover, we are the first to demonstrate that irrelevant features and inter-feature dependencies can induce a true payoff transformation that deviates substantially from linearity. Experiments in regression, logistic regression, and tree ensembles demonstrate that SISR stabilizes attributions across payoff schemes and correctly filters irrelevant features, while standard Shapley values suffer severe rank and sign distortions. By unifying nonlinear transformation estimation with sparsity pursuit, SISR advances the frontier of nonlinear explainability, providing a theoretically grounded and practical attribution framework.
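The two ingredients named in the abstract, an isotonic (monotone) fit and L0-style hard thresholding, can be sketched separately as below; this is not the SISR optimization itself, just the building blocks, with synthetic data standing in for attribution scores and payoffs.

import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
linear_score = rng.normal(size=200)
# Nonlinear but monotone payoff: the kind of transform SISR aims to recover.
payoff = np.exp(linear_score) + rng.normal(scale=0.1, size=200)

iso = IsotonicRegression(out_of_bounds="clip").fit(linear_score, payoff)
restored = iso.predict(linear_score)  # learned monotone transform restoring additivity

def hard_threshold(phi: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude attributions, zero the rest (L0 constraint)."""
    out = np.zeros_like(phi)
    idx = np.argsort(np.abs(phi))[-k:]
    out[idx] = phi[idx]
    return out

print(hard_threshold(np.array([0.02, -1.3, 0.4, 0.01]), k=2))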
Read more →
Lost in Modality: Evaluating the Effectiveness of Text-Based Membership Inference Attacks on Large Multimodal Models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03121v1 Announce Type: cross Abstract: Large Multimodal Language Models (MLLMs) are emerging as one of the foundational tools in an expanding range of applications. Consequently, understanding training-data leakage in these systems is increasingly critical. Log-probability-based membership inference attacks (MIAs) have become a widely adopted approach for assessing data exposure in large language models (LLMs), yet their effect in MLLMs remains unclear. We present the first comprehensive evaluation of extending these text-based MIA methods to multimodal settings. Our experiments under vision-and-text (V+T) and text-only (T-only) conditions across the DeepSeek-VL and InternVL model families show that in in-distribution settings, logit-based MIAs perform comparably across configurations, with a slight V+T advantage. Conversely, in out-of-distribution settings, visual inputs act as regularizers, effectively masking membership signals.
Read more →
Mitigating Intra- and Inter-modal Forgetting in Continual Learning of Unified Multimodal Models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03125v1 Announce Type: cross Abstract: Unified Multimodal Generative Models (UMGMs) unify visual understanding and image generation within a single autoregressive framework. However, their ability to continually learn new tasks is severely hindered by catastrophic forgetting, both within a modality (intra-modal) and across modalities (inter-modal). While intra-modal forgetting has been studied in prior continual learning (CL) work, inter-modal forgetting remains largely unexplored. In this paper, we identify and empirically validate this phenomenon in UMGMs and provide a theoretical explanation rooted in gradient conflict between modalities. To address both intra- and inter-modal forgetting, we propose Modality-Decoupled Experts (MoDE), a lightweight and scalable architecture that isolates modality-specific updates to mitigate the gradient conflict and leverages knowledge distillation to prevent catastrophic forgetting and preserve pre-trained capabilities. Unlike previous CL methods that remain modality-coupled and suffer from modality gradient conflict, MoDE explicitly decouples modalities to prevent interference. Experiments across diverse benchmarks demonstrate that MoDE significantly mitigates both inter- and intra-modal forgetting, outperforming prior CL baselines in unified multimodal generation settings. Codes will be publicly available: https://github.com/Christina200/MoDE-official.git
Read more →
Atomic Diffusion Models for Small Molecule Structure Elucidation from NMR Spectra
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03127v1 Announce Type: cross Abstract: Nuclear Magnetic Resonance (NMR) spectroscopy is a cornerstone technique for determining the structures of small molecules and is especially critical in the discovery of novel natural products and clinical therapeutics. Yet, interpreting NMR spectra remains a time-consuming, manual process requiring extensive domain expertise. We introduce ChefNMR (CHemical Elucidation From NMR), an end-to-end framework that directly predicts an unknown molecule's structure solely from its 1D NMR spectra and chemical formula. We frame structure elucidation as conditional generation from an atomic diffusion model built on a non-equivariant transformer architecture. To model the complex chemical groups found in natural products, we generated a dataset of simulated 1D NMR spectra for over 111,000 natural products. ChefNMR predicts the structures of challenging natural product compounds with an unsurpassed accuracy of over 65%. This work takes a significant step toward solving the grand challenge of automating small-molecule structure elucidation and highlights the potential of deep learning in accelerating molecular discovery. Code is available at https://github.com/ml-struct-bio/chefnmr.
Read more →
Culture Affordance Atlas: Reconciling Object Diversity Through Functional Mapping
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03173v1 Announce Type: cross Abstract: Culture shapes the objects people use and for what purposes, yet mainstream Vision-Language (VL) datasets frequently exhibit cultural biases, disproportionately favoring higher-income, Western contexts. This imbalance reduces model generalizability and perpetuates performance disparities, especially impacting lower-income and non-Western communities. To address these disparities, we propose a novel function-centric framework that categorizes objects by the functions they fulfill, across diverse cultural and economic contexts. We implement this framework by creating the Culture Affordance Atlas, a re-annotated and culturally grounded restructuring of the Dollar Street dataset spanning 46 functions and 288 objects, publicly available at https://lit.eecs.umich.edu/CultureAffordance-Atlas/index.html. Through extensive empirical analyses using the CLIP model, we demonstrate that function-centric labels substantially reduce socioeconomic performance gaps between high- and low-income groups by a median of 6 percentage points (a statistically significant reduction), improving model effectiveness for lower-income contexts. Furthermore, our analyses reveal numerous culturally essential objects that are frequently overlooked in prominent VL datasets. Our contributions offer a scalable pathway toward building inclusive VL datasets and equitable AI systems.
Read more →
Plantain: Plan-Answer Interleaved Reasoning
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03176v1 Announce Type: cross Abstract: Reasoning models often spend a significant amount of time thinking before they generate a visible response. In the meantime, they do not give the user any hints as to whether their reasoning is on the right track, and do not give the user any recourse to stop and correct them if their reasoning is flawed. This creates a frustrating, but unfortunately common, experience: the user's time is wasted while the model reasons from a false premise that could have easily been corrected. In contrast, human speakers typically perform lightweight, incremental grounding acts to ensure that participants in the conversation are on the same page; here we ask whether language models can learn to leverage a similar type of behavior. With this motivation, we propose interleaved reasoning (IR), in which the model alternates between thinking and surfacing intermediate responses, as an alternative to the standard "think-then-answer" approach. By providing useful information to the user earlier, IR reduces perceived latency, the time a user waits for an initial output, without compromising the quality of the final response. We further introduce a specialization of interleaved reasoning, Plantain (Plan-Thought-Answer Interleaving), where the first intermediate response is an explicit, step-by-step plan for executing the task. This plan-first strategy allows for user intervention and early feedback for subsequent reasoning steps. We demonstrate that Plantain yields an ~6% improvement in pass@1 across several challenging math reasoning and coding benchmarks, while reducing time-to-first-response by over 60% relative to think-then-answer baselines.
Read more →
Ultra-Strong Gradient Diffusion MRI with Self-Supervised Learning for Prostate Cancer Characterization
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03196v1 Announce Type: cross Abstract: Diffusion MRI (dMRI) enables non-invasive assessment of prostate microstructure but conventional metrics such as the Apparent Diffusion Coefficient in multiparametric MRI lack specificity to underlying histology. Integrating dMRI with the compartment-based biophysical VERDICT (Vascular, Extracellular, and Restricted Diffusion for Cytometry in Tumours) framework offers richer microstructural insights, though clinical gradient systems (40-80 mT/m) suffer from poor signal-to-noise ratio (SNR) at stronger diffusion weightings due to prolonged echo times. Ultra-strong gradients (up to 300 mT/m) can mitigate these limitations by improving SNR and contrast-to-noise ratios (CNR) but their adoption has until recently been limited to research environments due to challenges with peripheral nerve stimulation thresholds and gradient non-uniformity. This study investigates whether physics-informed self-supervised VERDICT (ssVERDICT) fitting applied to ultra-strong gradients enhances prostate cancer characterization relative to current clinical acquisitions. We developed enhanced ssVERDICT fitting approaches using dense multilayer perceptron (Dense MLP) and convolutional U-Net architectures, benchmarking them against non-linear least-squares (NLLS) fitting and Diffusion Kurtosis Imaging across clinical- to ultra-strong gradient systems. Dense ssVERDICT at ultra-strong gradient notably outperformed NLLS VERDICT, boosting median CNR by 47%, cutting inter-patient Coefficient of Variation by 52%, and reducing pooled f_ic variation by 50%. Overall, it delivered the highest CNR, the most stable parameter estimates, and the clearest tumour-normal contrast compared with conventional methods and clinical gradient systems. These findings highlight the potential of advanced gradient systems and deep learning-based modelling to improve non-invasive prostate cancer characterization and reduce unnecessary biopsies.
Read more →
InvertiTune: High-Quality Data Synthesis for Cost-Effective Single-Shot Text-to-Knowledge Graph Generation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03197v1 Announce Type: cross Abstract: Large Language Models (LLMs) have revolutionized the ability to understand and generate text, enabling significant progress in automatic knowledge graph construction from text (Text2KG). Many Text2KG methods, however, rely on iterative LLM prompting, making them computationally expensive and prone to overlooking complex relations distributed throughout the text. To address these limitations, we propose InvertiTune, a framework that combines a controlled data generation pipeline with supervised fine-tuning (SFT). Within this framework, the data-generation pipeline systematically extracts subgraphs from large knowledge bases, applies noise filtering, and leverages LLMs to generate corresponding natural text descriptions, a task more aligned with LLM capabilities than direct KG generation from text. This pipeline enables generating datasets composed of longer texts paired with larger KGs that better reflect real-world scenarios compared to existing benchmarks, thus supporting effective SFT of lightweight models for single-shot KG construction. Experimental results on CE12k, a dataset generated using the introduced pipeline, show that InvertiTune outperforms larger non-fine-tuned LLMs as well as state-of-the-art Text2KG approaches, while also demonstrating stronger cross-dataset generalization on CrossEval-1200, a test set created from three established benchmark datasets and CE12k. These findings highlight the importance of realistic, high-quality training data for advancing efficient and high-performing Text2KG systems.
Read more →
How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03238v1 Announce Type: cross Abstract: High-quality data is needed to unlock the full potential of AI for end users. However, finding new sources of such data is getting harder: most publicly available human-generated data will soon have been used. Additionally, publicly available data often is not representative of users of a particular system -- for example, a research speech dataset of contractors interacting with an AI assistant will likely be more homogeneous, well-articulated and self-censored than real world commands that end users will issue. Therefore, unlocking high-quality data grounded in real user interactions is of vital interest. However, the direct use of user data comes with significant privacy risks. Differential Privacy (DP) is a well established framework for reasoning about and limiting information leakage, and is a gold standard for protecting user privacy. The focus of this work, Differentially Private Synthetic Data, refers to synthetic data that preserves the overall trends of source data, while providing strong privacy guarantees to individuals who contributed to the source dataset. DP synthetic data can unlock the value of datasets that have previously been inaccessible due to privacy concerns and can replace the use of sensitive datasets that previously have only had rudimentary protections like ad-hoc rule-based anonymization. In this paper we explore the full suite of techniques surrounding DP synthetic data, the types of privacy protections they offer and the state-of-the-art for various modalities (image, tabular, text and decentralized). We outline all the components needed in a system that generates DP synthetic data, from sensitive data handling and preparation to usage tracking and empirical privacy testing. We hope that this work will result in increased adoption of DP synthetic data, spur additional research and increase trust in DP synthetic data approaches.
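As a concrete illustration of the general idea (not the paper's system), one classic DP-synthetic-data recipe perturbs a histogram of the sensitive data with the Laplace mechanism and then samples synthetic records from the noisy histogram; the single categorical column and parameters below are assumptions.

```python
import numpy as np

def dp_synthetic_categorical(values, categories, epsilon, n_synth, seed=0):
    rng = np.random.default_rng(seed)
    counts = np.array([np.sum(np.array(values) == c) for c in categories], dtype=float)
    # Each individual contributes to exactly one bin, so the L1 sensitivity is 1.
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=len(categories))
    probs = np.clip(noisy, 0, None)
    if probs.sum() == 0:
        probs = np.full(len(categories), 1.0 / len(categories))  # degenerate fallback
    else:
        probs = probs / probs.sum()
    # Sampling from the noisy distribution yields records that track overall trends
    # while individual contributions are masked by the calibrated noise.
    return rng.choice(categories, size=n_synth, p=probs)

print(dp_synthetic_categorical(["a", "a", "b", "c", "a"], ["a", "b", "c"],
                               epsilon=1.0, n_synth=10))
```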
Read more →
SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03244v1 Announce Type: cross Abstract: Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning, yet their adoption remains limited by the need for expensive step-level annotations or ground truth references. We propose SPARK: a three-stage framework where in the first stage a generator model produces diverse solutions and a verifier model evaluates them using parallel scaling (self-consistency) and sequential scaling (meta-critique). In the second stage, we use these verification outputs as synthetic training data to fine-tune generative process reward models, which subsequently serve as reward signals during training. We show that aggregating multiple independent verifications at the step level produces training data for process reward models that surpass ground-truth outcome supervision, achieving 67.5 F1 on ProcessBench (a benchmark for identifying erroneous steps in mathematical reasoning) compared to 66.4 for reference-guided training and 61.9 for GPT-4o. In the final stage, we apply our generative PRM with chain-of-thought verification (PRM-CoT) as the reward model in RL experiments on mathematical reasoning, and introduce format constraints to prevent reward hacking. Using Qwen2.5-Math-7B, we achieve 47.4% average accuracy across six mathematical reasoning benchmarks, outperforming ground-truth-based RLVR (43.9%). Our work enables reference-free RL training that exceeds ground-truth methods, opening new possibilities for domains lacking verifiable answers or accessible ground truth.
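A minimal sketch of the "aggregate multiple independent verifications at the step level" ingredient: several verifier samples each label every solution step, and a majority vote becomes the synthetic training label for the process reward model. The verdict format and tie-breaking rule here are assumptions, not SPARK's exact procedure.

```python
from collections import Counter

def aggregate_step_labels(verifier_runs):
    """verifier_runs: list of lists, one bool per solution step per verifier sample."""
    n_steps = len(verifier_runs[0])
    labels = []
    for step in range(n_steps):
        votes = Counter(run[step] for run in verifier_runs)
        labels.append(votes[True] >= votes[False])  # majority vote (ties -> correct)
    # The first step voted incorrect is the error location used as supervision.
    first_error = next((i for i, ok in enumerate(labels) if not ok), None)
    return labels, first_error

runs = [[True, True, False, False],
        [True, False, False, False],
        [True, True, False, True]]
print(aggregate_step_labels(runs))  # ([True, True, False, False], 2)
```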
Read more →
Learning Network Sheaves for AI-native Semantic Communication
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03248v1 Announce Type: cross Abstract: Recent advances in AI call for a paradigm shift from bit-centric communication to goal- and semantics-oriented architectures, paving the way for AI-native 6G networks. In this context, we address a key open challenge: enabling heterogeneous AI agents to exchange compressed latent-space representations while mitigating semantic noise and preserving task-relevant meaning. We cast this challenge as learning both the communication topology and the alignment maps that govern information exchange among agents, yielding a learned network sheaf equipped with orthogonal maps. This learning process is further supported by a semantic denoising and compression module that constructs a shared global semantic space and derives sparse, structured representations of each agent's latent space. This corresponds to a nonconvex dictionary learning problem solved iteratively with closed-form updates. Experiments with multiple AI agents pre-trained on real image data show that the semantic denoising and compression facilitates AI agent alignment and the extraction of semantic clusters, while preserving high accuracy in downstream tasks. The resulting communication network provides new insights into semantic heterogeneity across agents, highlighting the interpretability of our methodology.
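To illustrate only the "orthogonal alignment map" ingredient, the sketch below fits an orthogonal map between two agents' latent spaces by solving the orthogonal Procrustes problem; the sheaf and topology learning in the paper are richer than this, and the paired latents here are synthetic.

```python
import numpy as np

def orthogonal_map(X_src, X_tgt):
    """Return R minimizing ||X_src @ R - X_tgt||_F subject to R^T R = I."""
    U, _, Vt = np.linalg.svd(X_src.T @ X_tgt)
    return U @ Vt

rng = np.random.default_rng(0)
Z_a = rng.normal(size=(100, 16))                     # agent A latents (paired samples)
R_true, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # unknown rotation between spaces
Z_b = Z_a @ R_true + 0.01 * rng.normal(size=Z_a.shape)
R_hat = orthogonal_map(Z_a, Z_b)
print(np.linalg.norm(Z_a @ R_hat - Z_b))             # small residual -> spaces aligned
```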
Read more →
PyroFocus: A Deep Learning Approach to Real-Time Wildfire Detection in Multispectral Remote Sensing Imagery
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03257v1 Announce Type: cross Abstract: Rapid and accurate wildfire detection is crucial for emergency response and environmental management. In airborne and spaceborne missions, real-time algorithms must distinguish between no fire, active fire, and post-fire conditions, and estimate fire intensity. Multispectral and hyperspectral thermal imagers provide rich spectral information, but high data dimensionality and limited onboard resources make real-time processing challenging. As wildfires increase in frequency and severity, the need for low-latency and computationally efficient onboard detection methods is critical. We present a systematic evaluation of multiple deep learning architectures, including custom Convolutional Neural Networks (CNNs) and Transformer-based models, for multi-class fire classification. We also introduce PyroFocus, a two-stage pipeline that performs fire classification followed by fire radiative power (FRP) regression or segmentation to reduce inference time and computational cost for onboard deployment. Using data from NASA's MODIS/ASTER Airborne Simulator (MASTER), which is similar to a next-generation fire detection sensor, we compare accuracy, inference latency, and resource efficiency. Experimental results show that the proposed two-stage pipeline achieves strong trade-offs between speed and accuracy, demonstrating significant potential for real-time edge deployment in future wildfire monitoring missions.
Read more →
Thucy: An LLM-based Multi-Agent System for Claim Verification across Relational Databases
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03278v1 Announce Type: cross Abstract: In today's age, it is becoming increasingly difficult to decipher truth from lies. Every day, politicians, media outlets, and public figures make conflicting claims, often about topics that can, in principle, be verified against structured data. For instance, statements about crime rates, economic growth or healthcare can all be verified against official public records and structured datasets. Building a system that can automatically do that would have sounded like science fiction just a few years ago. Yet, with the extraordinary progress in LLMs and agentic AI, this is now within reach. Still, there remains a striking gap between what is technically possible and what is being demonstrated by recent work. Most existing verification systems operate only on small, single-table databases (typically a few hundred rows) that conveniently fit within an LLM's context window. In this paper we report our progress on Thucy, the first cross-database, cross-table multi-agent claim verification system that also provides concrete evidence for each verification verdict. Thucy remains completely agnostic to the underlying data sources before deployment and must therefore autonomously discover, inspect, and reason over all available relational databases to verify claims. Importantly, Thucy also reports the exact SQL queries that support its verdict (whether the claim is accurate or not) offering full transparency to expert users familiar with SQL. When evaluated on the TabFact dataset, the standard benchmark for fact verification over structured data, Thucy surpasses the previous state of the art by 5.6 percentage points in accuracy (94.3% vs. 88.7%).
Read more →
BlendedNet++: A Large-Scale Blended Wing Body Aerodynamics Dataset and Benchmark
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03280v1 Announce Type: cross Abstract: Despite progress in machine learning-based aerodynamic surrogates, the scarcity of large, field-resolved datasets limits progress on accurate pointwise prediction and reproducible inverse design for aircraft. We introduce BlendedNet++, a large-scale aerodynamic dataset and benchmark focused on blended wing body (BWB) aircraft. The dataset contains over 12,000 unique geometries, each simulated at a single flight condition, yielding 12,490 aerodynamic results for steady RANS CFD. For every case, we provide (i) integrated force/moment coefficients CL, CD, CM and (ii) dense surface fields of pressure and skin friction coefficients Cp and (Cfx, Cfy, Cfz). Using this dataset, we standardize a forward-surrogate benchmark to predict pointwise fields across six model families: GraphSAGE, GraphUNet, PointNet, a coordinate Transformer (Transolver-style), a FiLMNet (coordinate MLP with feature-wise modulation), and a Graph Neural Operator Transformer (GNOT). Finally, we present an inverse design task of achieving a specified lift-to-drag ratio under fixed flight conditions, implemented via a conditional diffusion model. To assess performance, we benchmark this approach against gradient-based optimization on the same surrogate and a diffusion-optimization hybrid that first samples with the conditional diffusion model and then further optimizes the designs. BlendedNet++ provides a unified forward and inverse protocol with multi-model baselines, enabling fair, reproducible comparison across architectures and optimization paradigms. We expect BlendedNet++ to catalyze reproducible research in field-level aerodynamics and inverse design; resources (dataset, splits, baselines, and scripts) will be released upon acceptance.
Read more →
Adaptive Regime-Switching Forecasts with Distribution-Free Uncertainty: Deep Switching State-Space Models Meet Conformal Prediction
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03298v1 Announce Type: cross Abstract: Regime transitions routinely break stationarity in time series, making calibrated uncertainty as important as point accuracy. We study distribution-free uncertainty for regime-switching forecasting by coupling Deep Switching State Space Models with Adaptive Conformal Inference (ACI) and its aggregated variant (AgACI). We also introduce a unified conformal wrapper that sits atop strong sequence baselines (S4, MC-Dropout GRU, sparse Gaussian processes, and a change-point local model) to produce online predictive bands with finite-sample marginal guarantees under nonstationarity and model misspecification. Across synthetic and real datasets, conformalized forecasters achieve near-nominal coverage with competitive accuracy and generally improved band efficiency.
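A minimal sketch of the Adaptive Conformal Inference update that underlies the online bands: the working miscoverage level is adjusted from observed coverage errors, which is what maintains marginal coverage under shift. The absolute-residual score and the quantile bookkeeping below are simplifying assumptions, not the paper's full pipeline.

```python
import numpy as np

def aci_intervals(preds, y, calib_scores, alpha=0.1, gamma=0.01):
    """preds: point forecasts; y: realized values; calib_scores: initial |residual| scores."""
    scores = list(calib_scores)
    alpha_t, intervals = alpha, []
    for f_t, y_t in zip(preds, y):
        q_level = min(max(1.0 - alpha_t, 0.0), 1.0)
        q = np.quantile(scores, q_level)            # conformal quantile of past scores
        lo, hi = f_t - q, f_t + q
        intervals.append((lo, hi))
        err = 0.0 if lo <= y_t <= hi else 1.0       # 1 if the interval missed
        alpha_t = alpha_t + gamma * (alpha - err)   # ACI update of the working level
        scores.append(abs(y_t - f_t))               # grow the score set online
    return intervals

rng = np.random.default_rng(0)
f = rng.normal(size=200)
obs = f + rng.normal(scale=0.5, size=200)
bands = aci_intervals(f, obs, calib_scores=np.abs(rng.normal(scale=0.5, size=50)))
print(bands[:3])
```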
Read more →
HydroDCM: Hydrological Domain-Conditioned Modulation for Cross-Reservoir Inflow Prediction
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03300v1 Announce Type: cross Abstract: Deep learning models have shown promise in reservoir inflow prediction, yet their performance often deteriorates when applied to different reservoirs due to distributional differences, referred to as the domain shift problem. Domain generalization (DG) solutions aim to address this issue by extracting domain-invariant representations that mitigate errors in unseen domains. However, in hydrological settings, each reservoir exhibits unique inflow patterns, while some metadata beyond observations like spatial information exerts indirect but significant influence. This mismatch limits the applicability of conventional DG techniques to many-domain hydrological systems. To overcome these challenges, we propose HydroDCM, a scalable DG framework for cross-reservoir inflow forecasting. Spatial metadata of reservoirs is used to construct pseudo-domain labels that guide adversarial learning of invariant temporal features. During inference, HydroDCM adapts these features through light-weight conditioning layers informed by the target reservoir's metadata, reconciling DG's invariance with location-specific adaptation. Experiment results on 30 real-world reservoirs in the Upper Colorado River Basin demonstrate that our method substantially outperforms state-of-the-art DG baselines under many-domain conditions and remains computationally efficient.
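One plausible way to realize the "light-weight conditioning layers informed by the target reservoir's metadata" is FiLM-style feature modulation, sketched below; the layer sizes and metadata dimensionality are made up, and this is not necessarily HydroDCM's exact design.

```python
import torch
import torch.nn as nn

class MetadataFiLM(nn.Module):
    def __init__(self, meta_dim: int, feat_dim: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(meta_dim, 2 * feat_dim)

    def forward(self, features: torch.Tensor, metadata: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, feat_dim) invariant temporal features from the backbone
        # metadata: (batch, meta_dim), e.g. latitude, longitude, elevation, drainage area
        gamma, beta = self.to_scale_shift(metadata).chunk(2, dim=-1)
        return features * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

film = MetadataFiLM(meta_dim=4, feat_dim=32)
h = torch.randn(8, 30, 32)
meta = torch.randn(8, 4)
print(film(h, meta).shape)   # torch.Size([8, 30, 32])
```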
Read more →
Robust Tabular Foundation Models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03307v1 Announce Type: cross Abstract: The development of tabular foundation models (TFMs) has accelerated in recent years, showing strong potential to outperform traditional ML methods for structured data. A key finding is that TFMs can be pretrained entirely on synthetic datasets, opening opportunities to design data generators that encourage desirable model properties. Prior work has mainly focused on crafting high-quality priors over generators to improve overall pretraining performance. Our insight is that parameterizing the generator distribution enables an adversarial robustness perspective: during training, we can adapt the generator to emphasize datasets that are particularly challenging for the model. We formalize this by introducing an optimality gap measure, given by the difference between TFM performance and the best achievable performance as estimated by strong baselines such as XGBoost, CatBoost, and Random Forests. Building on this idea, we propose Robust Tabular Foundation Models (RTFM), a model-agnostic adversarial training framework. Applied to the TabPFN V2 classifier, RTFM improves benchmark performance, with up to a 6% increase in mean normalized AUC over the original TabPFN and other baseline algorithms, while requiring less than 100k additional synthetic datasets. These results highlight a promising new direction for targeted adversarial training and fine-tuning of TFMs using synthetic data alone.
Read more →
Retrofitting Earth System Models with Cadence-Limited Neural Operator Updates
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03309v1 Announce Type: cross Abstract: Coarse resolution, imperfect parameterizations, and uncertain initial states and forcings limit Earth-system model (ESM) predictions. Traditional bias correction via data assimilation improves constrained simulations but offers limited benefit once models run freely. We introduce an operator-learning framework that maps instantaneous model states to bias-correction tendencies and applies them online during integration. Building on a U-Net backbone, we develop two operator architectures, Inception U-Net (IUNet) and a multi-scale network (M&M), that combine diverse upsampling and receptive fields to capture multiscale nonlinear features under Energy Exascale Earth System Model (E3SM) runtime constraints. Trained on two years of E3SM simulations nudged toward ERA5 reanalysis, the operators generalize across height levels and seasons. Both architectures outperform standard U-Net baselines in offline tests, indicating that functional richness rather than parameter count drives performance. In online hybrid E3SM runs, M&M delivers the most consistent bias reductions across variables and vertical levels. The ML-augmented configurations remain stable and computationally feasible in multi-year simulations, providing a practical pathway for scalable hybrid modeling. Our framework emphasizes long-term stability, portability, and cadence-limited updates, demonstrating the utility of expressive ML operators for learning structured, cross-scale relationships and retrofitting legacy ESMs.
Read more →
NavMapFusion: Diffusion-based Fusion of Navigation Maps for Online Vectorized HD Map Construction
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03317v1 Announce Type: cross Abstract: Accurate environmental representations are essential for autonomous driving, providing the foundation for safe and efficient navigation. Traditionally, high-definition (HD) maps provide this representation of the static road infrastructure to the autonomous system a priori. However, because the real world is constantly changing, such maps must be constructed online from on-board sensor data. Navigation-grade standard-definition (SD) maps are widely available, but their resolution is insufficient for direct deployment. Instead, they can be used as a coarse prior to guide the online map construction process. We propose NavMapFusion, a diffusion-based framework that performs iterative denoising conditioned on high-fidelity sensor data and on low-fidelity navigation maps. This paper strives to answer: (1) How can coarse, potentially outdated navigation maps guide online map construction? (2) What advantages do diffusion models offer for map fusion? We demonstrate that diffusion-based map construction provides a robust framework for map fusion. Our key insight is that discrepancies between the prior map and online perception naturally correspond to noise within the diffusion process; consistent regions reinforce the map construction, whereas outdated segments are suppressed. On the nuScenes benchmark, NavMapFusion conditioned on coarse road lines from OpenStreetMap data reaches a 21.4% relative improvement at a 100 m perception range, and even stronger improvements at larger perception ranges, while maintaining real-time capabilities. By fusing low-fidelity priors with high-fidelity sensor data, the proposed method generates accurate and up-to-date environment representations, guiding towards safer and more reliable autonomous driving. The code is available at https://github.com/tmonnin/navmapfusion
Read more →
Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03324v1 Announce Type: cross Abstract: Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBench and SCBench), TRIM-KV consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. Remarkably, it even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization, suppressing noise from uninformative tokens. Qualitative analyses further reveal that learned retention scores align with human intuition, naturally recovering heuristics such as sink tokens, sliding windows, and gist compression without explicit design. Beyond efficiency, retention scores provide insights into layer- and head-specific roles, suggesting a new path toward LLM interpretability.
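A minimal sketch of retention-based KV eviction as described above: each token carries a scalar retention score assigned at creation that decays with age, and the lowest-scoring tokens are dropped once the cache exceeds its budget. The random scores and linear decay below stand in for TRIM-KV's learned gate and are assumptions.

```python
import torch

def evict_to_budget(keys, values, retention, ages, budget, decay=0.01):
    """keys/values: (n, d); retention: (n,) creation-time scores; ages: (n,) steps since creation."""
    effective = retention - decay * ages.float()       # score decays as the token ages
    if keys.size(0) <= budget:
        return keys, values, retention, ages
    keep = torch.topk(effective, k=budget).indices.sort().values  # preserve sequence order
    return keys[keep], values[keep], retention[keep], ages[keep]

n, d, budget = 12, 8, 6
K, V = torch.randn(n, d), torch.randn(n, d)
scores = torch.rand(n)                                 # would come from the learned gate
ages = torch.arange(n - 1, -1, -1)                     # oldest token has the largest age
K2, V2, *_ = evict_to_budget(K, V, scores, ages, budget)
print(K2.shape)                                        # torch.Size([6, 8])
```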
Read more →
Single-Round Scalable Analytic Federated Learning
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03336v1 Announce Type: cross Abstract: Federated Learning (FL) is plagued by two key challenges: high communication overhead and performance collapse on heterogeneous (non-IID) data. Analytic FL (AFL) provides a single-round, data distribution invariant solution, but is limited to linear models. Subsequent non-linear approaches, like DeepAFL, regain accuracy but sacrifice the single-round benefit. In this work, we break this trade-off. We propose SAFLe, a framework that achieves scalable non-linear expressivity by introducing a structured head of bucketed features and sparse, grouped embeddings. We prove this non-linear architecture is mathematically equivalent to a high-dimensional linear regression. This key equivalence allows SAFLe to be solved with AFL's single-shot, invariant aggregation law. Empirically, SAFLe establishes a new state-of-the-art for analytic FL, significantly outperforming both linear AFL and multi-round DeepAFL in accuracy across all benchmarks, demonstrating a highly efficient and scalable solution for federated vision.
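To make the "single-round, order-invariant aggregation law" concrete, the sketch below shows the classic analytic-FL recipe that SAFLe reduces to: each client shares only its Gram statistics and the server solves one ridge regression in closed form. SAFLe's non-linear bucketed/grouped feature head is omitted here, and the data are synthetic.

```python
import numpy as np

def client_stats(X, Y):
    return X.T @ X, X.T @ Y                  # sufficient statistics, no raw data shared

def server_solve(stats, dim, reg=1e-3):
    A = reg * np.eye(dim)
    B = 0.0
    for XtX, XtY in stats:                   # aggregation is invariant to client order
        A += XtX
        B = B + XtY
    return np.linalg.solve(A, B)             # closed-form weights, one communication round

rng = np.random.default_rng(0)
W_true = rng.normal(size=(5, 3))
stats = []
for _ in range(4):                           # four clients with shifted (non-IID) inputs
    X = rng.normal(loc=rng.normal(), size=(50, 5))
    Y = X @ W_true
    stats.append(client_stats(X, Y))
W_hat = server_solve(stats, dim=5)
print(np.abs(W_hat - W_true).max())          # ~0 up to the regularization term
```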
Read more →
ProtoEFNet: Dynamic Prototype Learning for Inherently Interpretable Ejection Fraction Estimation in Echocardiography
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03339v1 Announce Type: cross Abstract: Ejection fraction (EF) is a crucial metric for assessing cardiac function and diagnosing conditions such as heart failure. Traditionally, EF estimation requires manual tracing and domain expertise, making the process time-consuming and subject to interobserver variability. Most current deep learning methods for EF prediction are black-box models with limited transparency, which reduces clinical trust. Some post-hoc explainability methods have been proposed to interpret the decision-making process after the prediction is made. However, these explanations do not guide the model's internal reasoning and therefore offer limited reliability in clinical applications. To address this, we introduce ProtoEFNet, a novel video-based prototype learning model for continuous EF regression. The model learns dynamic spatiotemporal prototypes that capture clinically meaningful cardiac motion patterns. Additionally, the proposed Prototype Angular Separation (PAS) loss enforces discriminative representations across the continuous EF spectrum. Our experiments on the EchonetDynamic dataset show that ProtoEFNet can achieve accuracy on par with its non-interpretable counterpart while providing clinically relevant insight. The ablation study shows that the proposed loss boosts performance with a 2% increase in F1 score from 77.67$\pm$2.68 to 79.64$\pm$2.10. Our source code is available at: https://github.com/DeepRCL/ProtoEF
Read more →
Idea-Gated Transformers: Enforcing Semantic Coherence via Differentiable Vocabulary Pruning
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03343v1 Announce Type: cross Abstract: Autoregressive Language Models (LLMs) trained on Next-Token Prediction (NTP) often suffer from "Topic Drift," where the generation wanders away from the initial prompt due to a reliance on local associations rather than global planning (Holtzman et al., 2019). While scaling model size mitigates this (Brown et al., 2020), the fundamental myopia of the NTP objective remains. In this work, we introduce the Idea-Gated Transformer, a novel architecture that separates semantic planning from syntactic generation. We introduce an auxiliary "Idea Head" trained to predict the bag-of-words distribution for a future context window, creating a latent "Concept Vector" that actively gates the main vocabulary during generation. We propose a differentiable gating mechanism that suppresses semantically irrelevant tokens, effectively pruning the search space in real time. Experiments on WikiText-103 demonstrate that while the Idea-Gated model achieves comparable validation perplexity to a standard GPT-2 baseline, it exhibits significantly superior Domain Retention. Qualitative and quantitative analysis reveals that the gating mechanism successfully locks generation into specific semantic clusters (e.g., Finance, Science) and resists associative drift, offering a parameter-efficient path toward more controllable language modeling.
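A minimal sketch of gating next-token logits with a concept vector over the vocabulary, the core "differentiable vocabulary pruning" idea; the sigmoid-plus-log gating form and the strength parameter are assumptions, not necessarily the paper's exact parameterization.

```python
import torch

def gated_logits(logits, concept_scores, strength=5.0):
    """logits: (batch, vocab) from the LM head; concept_scores: (batch, vocab) from the Idea Head."""
    gate = torch.sigmoid(concept_scores)           # ~1 for on-topic tokens, ~0 otherwise
    # Adding log(gate) leaves relevant tokens nearly untouched and pushes irrelevant
    # ones down, while staying fully differentiable for end-to-end training.
    return logits + strength * torch.log(gate + 1e-6)

logits = torch.randn(2, 10)
concept = torch.randn(2, 10)
probs = torch.softmax(gated_logits(logits, concept), dim=-1)
print(probs.sum(dim=-1))                           # tensor([1., 1.])
```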
Read more →
HalluGen: Synthesizing Realistic and Controllable Hallucinations for Evaluating Image Restoration
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03345v1 Announce Type: cross Abstract: Generative models are prone to hallucinations: plausible but incorrect structures absent in the ground truth. This issue is problematic in image restoration for safety-critical domains such as medical imaging, industrial inspection, and remote sensing, where such errors undermine reliability and trust. For example, in low-field MRI, widely used in resource-limited settings, restoration models are essential for enhancing low-quality scans, yet hallucinations can lead to serious diagnostic errors. Progress has been hindered by a circular dependency: evaluating hallucinations requires labeled data, yet such labels are costly and subjective. We introduce HalluGen, a diffusion-based framework that synthesizes realistic hallucinations with controllable type, location, and severity, producing perceptually realistic but semantically incorrect outputs (segmentation IoU drops from 0.86 to 0.36). Using HalluGen, we construct the first large-scale hallucination dataset comprising 4,350 annotated images derived from 1,450 brain MR images for low-field enhancement, enabling systematic evaluation of hallucination detection and mitigation. We demonstrate its utility in two applications: (1) benchmarking image quality metrics and developing Semantic Hallucination Assessment via Feature Evaluation (SHAFE), a feature-based metric with soft-attention pooling that improves hallucination sensitivity over traditional metrics; and (2) training reference-free hallucination detectors that generalize to real restoration failures. Together, HalluGen and its open dataset establish the first scalable foundation for evaluating hallucinations in safety-critical image restoration.
Read more →
FireSentry: A Multi-Modal Spatio-temporal Benchmark Dataset for Fine-Grained Wildfire Spread Forecasting
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03369v1 Announce Type: cross Abstract: Fine-grained wildfire spread prediction is crucial for enhancing emergency response efficacy and decision-making precision. However, existing research predominantly focuses on coarse spatiotemporal scales and relies on low-resolution satellite data, capturing only macroscopic fire states while fundamentally constraining high-precision localized fire dynamics modeling capabilities. To bridge this gap, we present FireSentry, a provincial-scale multi-modal wildfire dataset characterized by sub-meter spatial and sub-second temporal resolution. Collected using synchronized UAV platforms, FireSentry provides visible and infrared video streams, in-situ environmental measurements, and manually validated fire masks. Building on FireSentry, we establish a comprehensive benchmark encompassing physics-based, data-driven, and generative models, revealing the limitations of existing mask-only approaches. Our analysis proposes FiReDiff, a novel dual-modality paradigm that first predicts future video sequences in the infrared modality, and then precisely segments fire masks in the mask modality based on the generated dynamics. FiReDiff achieves state-of-the-art performance, with video quality gains of 39.2% in PSNR, 36.1% in SSIM, 50.0% in LPIPS, 29.4% in FVD, and mask accuracy gains of 3.3% in AUPRC, 59.1% in F1 score, 42.9% in IoU, and 62.5% in MSE when applied to generative models. The FireSentry benchmark dataset and FiReDiff paradigm collectively advance fine-grained wildfire forecasting and dynamic disaster simulation. The processed benchmark dataset is publicly available at: https://github.com/Munan222/FireSentry-Benchmark-Dataset.
Read more →
UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03383v1 Announce Type: cross Abstract: Deploying large language models (LLMs) on mobile platforms faces significant challenges due to the limited memory and shared computational resources of the device. Resource availability may be an issue as it is directly impacted by the current device workload, adding to the uncertainty of model deployment. We introduce UniQL, a unified post-training quantization and low-rank compression framework with on-device configurable pruning rates for edge LLMs. UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models to support diverse edge applications. In our proposed joint framework, we introduce an efficient structured weight-sorting method that speeds up computation by 20x, quantization-aware singular value decomposition (SVD) to minimize quantization errors, state-aware weight sorting for SSMs, and a fused rotary positional embedding (RoPE) kernel for pruned models. Our framework performs weight-sorting, fine-tuning, and quantization in the cloud in a single-pass workflow, while enabling on-device configurable pruning rates up to 35%. Our experiments show that quantized and pruned models achieve a memory reduction of 4x-5.7x and a token-throughput improvement of 2.7x-3.4x, maintaining accuracy within 5% of the original models at 15% pruning across Transformers (Llama3 and Qwen2.5), SSMs (Mamba2), and hybrid models (Nemotron-H and Bamba-v2). The code and quantized models are available at: https://github.com/enyac-group/UniQL.
Read more →
VS-Graph: Scalable and Efficient Graph Classification Using Hyperdimensional Computing
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03394v1 Announce Type: cross Abstract: Graph classification is a fundamental task in domains ranging from molecular property prediction to materials design. While graph neural networks (GNNs) achieve strong performance by learning expressive representations via message passing, they incur high computational costs, limiting their scalability and deployment on resource-constrained devices. Hyperdimensional Computing (HDC), also known as Vector Symbolic Architectures (VSA), offers a lightweight, brain-inspired alternative, yet existing HDC-based graph methods typically struggle to match the predictive performance of GNNs. In this work, we propose VS-Graph, a vector-symbolic graph learning framework that narrows the gap between the efficiency of HDC and the expressive power of message passing. VS-Graph introduces a Spike Diffusion mechanism for topology-driven node identification and an Associative Message Passing scheme for multi-hop neighborhood aggregation entirely within the high-dimensional vector space. Without gradient-based optimization or backpropagation, our method achieves competitive accuracy with modern GNNs, outperforming the prior HDC baseline by 4-5% on standard benchmarks such as MUTAG and DD. It also matches or exceeds the performance of the GNN baselines on several datasets while accelerating the training by a factor of up to 450x. Furthermore, VS-Graph maintains high accuracy even with the hypervector dimensionality reduced to D=128, demonstrating robustness under aggressive dimension compression and paving the way for ultra-efficient execution on edge and neuromorphic hardware.
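A minimal sketch of hyperdimensional graph encoding in the spirit described above: bipolar hypervectors for node labels, binding (elementwise product) to associate a node with its aggregated neighborhood, and bundling (sign of the sum) into a single graph hypervector. VS-Graph's Spike Diffusion and multi-hop Associative Message Passing are richer than this one-hop toy.

```python
import numpy as np

D = 4096
rng = np.random.default_rng(0)

def hv():
    # Random bipolar hypervector representing an atomic symbol.
    return rng.choice([-1, 1], size=D)

def encode_graph(node_labels, edges, label_hvs):
    nodes = [label_hvs[l] for l in node_labels]
    neigh = [np.zeros(D) for _ in nodes]
    for u, v in edges:                          # one hop of neighborhood aggregation
        neigh[u] += nodes[v]
        neigh[v] += nodes[u]
    # Bind each node with the sign of its aggregated neighborhood, then bundle.
    bound = [nodes[i] * np.sign(neigh[i] + 1e-9) for i in range(len(nodes))]
    return np.sign(sum(bound))                  # graph hypervector, values in {-1, 0, 1}

label_hvs = {"C": hv(), "N": hv(), "O": hv()}
g = encode_graph(["C", "C", "O", "N"], [(0, 1), (1, 2), (1, 3)], label_hvs)
print(g.shape, np.unique(g))
```

Classification can then reduce to comparing a graph hypervector against class prototypes (e.g., by cosine similarity), which is what keeps training and inference lightweight.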
Read more →
Better World Models Can Lead to Better Post-Training Performance
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03400v1 Announce Type: cross Abstract: In this work we study how explicit world-modeling objectives affect the internal representations and downstream capability of Transformers across different training stages. We use a controlled 2x2x2 Rubik's Cube and ask: (1) how does explicitly pretraining a world model affect the model's latent representations, and (2) how does world-model quality affect the model's performance after reinforcement learning post-training? We compare standard next-token prediction to two explicit world-modeling strategies -- (i) state-prediction pretraining and (ii) a joint state-prediction + next-token objective -- and assess task performance after Group Relative Policy Optimization (GRPO) is applied as post-training. We evaluate the representation quality with linear probes and causal interventions. We find that explicit world-modeling yields more linearly decodable and causally steerable state representations. More importantly, we find that improved state representations lead to higher gains for GRPO, especially on harder cube states. Our results indicate that sharpening state representations can improve the effectiveness of post-training for sequence-planning tasks.
Read more →
BookRAG: A Hierarchical Structure-aware Index-based Approach for Retrieval-Augmented Generation on Complex Documents
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03413v1 Announce Type: cross Abstract: As an effective method to boost the performance of Large Language Models (LLMs) on the question answering (QA) task, Retrieval-Augmented Generation (RAG), which queries highly relevant information from external complex documents, has attracted tremendous attention from both industry and academia. Existing RAG approaches often focus on general documents, and they overlook the fact that many real-world documents (such as books, booklets, handbooks, etc.) have a hierarchical structure, which organizes their content at different granularity levels, leading to poor performance for the QA task. To address these limitations, we introduce BookRAG, a novel RAG approach targeted at documents with a hierarchical structure, which exploits logical hierarchies and traces entity relations to query the highly relevant information. Specifically, we build a novel index structure, called BookIndex, by extracting a hierarchical tree from the document, which plays the role of its table of contents, using a graph to capture the intricate relationships between entities, and mapping entities to tree nodes. Leveraging the BookIndex, we then propose an agent-based query method inspired by Information Foraging Theory, which dynamically classifies queries and employs a tailored retrieval workflow. Extensive experiments on three widely adopted benchmarks demonstrate that BookRAG achieves state-of-the-art performance, significantly outperforming baselines in both retrieval recall and QA accuracy while maintaining competitive efficiency.
Read more →
World Models for Autonomous Navigation of Terrestrial Robots from LIDAR Observations
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03429v1 Announce Type: cross Abstract: Autonomous navigation of terrestrial robots using Reinforcement Learning (RL) from LIDAR observations remains challenging due to the high dimensionality of sensor data and the sample inefficiency of model-free approaches. Conventional policy networks struggle to process full-resolution LIDAR inputs, forcing prior works to rely on simplified observations that reduce spatial awareness and navigation robustness. This paper presents a novel model-based RL framework built on top of the DreamerV3 algorithm, integrating a Multi-Layer Perceptron Variational Autoencoder (MLP-VAE) within a world model to encode high-dimensional LIDAR readings into compact latent representations. These latent features, combined with a learned dynamics predictor, enable efficient imagination-based policy optimization. Experiments on simulated TurtleBot3 navigation tasks demonstrate that the proposed architecture achieves faster convergence and higher success rates compared to model-free baselines such as SAC, DDPG, and TD3. It is worth emphasizing that the DreamerV3-based agent attains a 100% success rate across all evaluated environments when using the full TurtleBot3 LIDAR scan (360 readings), while model-free methods plateau below 85%. These findings demonstrate that integrating predictive world models with learned latent representations enables more efficient and robust navigation from high-dimensional sensory data.
Read more →
Multi-Aspect Knowledge-Enhanced Medical Vision-Language Pretraining with Multi-Agent Data Generation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03445v1 Announce Type: cross Abstract: Vision-language pretraining (VLP) has emerged as a powerful paradigm in medical image analysis, enabling representation learning from large-scale image-text pairs without relying on expensive manual annotations. However, existing methods often struggle with the noise inherent in web-collected data and the complexity of unstructured long medical texts. To address these challenges, we propose a novel VLP framework integrating a Multi-Agent data GENeration (MAGEN) system and Ontology-based Multi-Aspect Knowledge-Enhanced (O-MAKE) pretraining. First, MAGEN enhances data quality by synthesizing knowledge-enriched descriptions via a foundation model-assisted captioning and retrieval-based verification pipeline. Second, O-MAKE addresses the difficulty of learning from long, unstructured texts by decomposing them into distinct knowledge aspects. This facilitates fine-grained alignment at both global and patch levels, while explicitly modeling medical concept relationships through ontology-guided mechanisms. We validate our framework in the field of dermatology, where comprehensive experiments demonstrate the effectiveness of each component. Our approach achieves state-of-the-art zero-shot performance on disease classification and cross-modal retrieval tasks across eight datasets. Our code and the augmented dataset Derm1M-AgentAug, comprising over 400k skin-image-text pairs, will be released at https://github.com/SiyuanYan1/Derm1M.
Read more →
GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03451v1 Announce Type: cross Abstract: Diffusion models have revolutionized video generation, becoming essential tools in creative content generation and physical simulation. Transformer-based architectures (DiTs) and classifier-free guidance (CFG) are two cornerstones of this success, enabling strong prompt adherence and realistic video quality. Despite their versatility and superior performance, these models require intensive computation. Each video generation requires dozens of iterative steps, and CFG doubles the required compute. This inefficiency hinders broader adoption in downstream applications. We introduce GalaxyDiT, a training-free method to accelerate video generation with guidance alignment and systematic proxy selection for reuse metrics. Through rank-order correlation analysis, our technique identifies the optimal proxy for each video model, across model families and parameter scales, thereby ensuring optimal computational reuse. We achieve $1.87\times$ and $2.37\times$ speedup on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, our approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).
Read more →
Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03454v1 Announce Type: cross Abstract: Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on the DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.
Read more →
Learning From Limited Data and Feedback for Cell Culture Process Monitoring: A Comparative Study
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03460v1 Announce Type: cross Abstract: In cell culture bioprocessing, real-time batch process monitoring (BPM) refers to the continuous tracking and analysis of key process variables such as viable cell density, nutrient levels, metabolite concentrations, and product titer throughout the duration of a batch run. This enables early detection of deviations and supports timely control actions to ensure optimal cell growth and product quality. BPM plays a critical role in ensuring the quality and regulatory compliance of biopharmaceutical manufacturing processes. However, the development of accurate soft sensors for BPM is hindered by key challenges, including limited historical data, infrequent feedback, heterogeneous process conditions, and high-dimensional sensory inputs. This study presents a comprehensive benchmarking analysis of machine learning (ML) methods designed to address these challenges, with a focus on learning from historical data with limited volume and relevance in the context of bioprocess monitoring. We evaluate multiple ML approaches including feature dimensionality reduction, online learning, and just-in-time learning across three datasets, one in silico dataset and two real-world experimental datasets. Our findings highlight the importance of training strategies in handling limited data and feedback, with batch learning proving effective in homogeneous settings, while just-in-time learning and online learning demonstrate superior adaptability in cold-start scenarios. Additionally, we identify key meta-features, such as feed media composition and process control strategies, that significantly impact model transferability. The results also suggest that integrating Raman-based predictions with lagged offline measurements enhances monitoring accuracy, offering a promising direction for future bioprocess soft sensor development.
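A minimal sketch of the just-in-time (lazy) learning strategy mentioned above: for each query state, a small local model is fit on the most similar historical samples only. The neighbourhood size, Ridge regressor, and synthetic "Raman features vs. offline titer" setup are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.neighbors import NearestNeighbors

def jit_predict(X_hist, y_hist, x_query, k=25):
    nn = NearestNeighbors(n_neighbors=k).fit(X_hist)
    _, idx = nn.kneighbors(x_query.reshape(1, -1))
    local = Ridge(alpha=1.0).fit(X_hist[idx[0]], y_hist[idx[0]])  # local model per query
    return local.predict(x_query.reshape(1, -1))[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                              # historical process/Raman features
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=500)    # e.g. offline titer measurements
print(jit_predict(X, y, X[0]))
```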
Read more →
Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03463v1 Announce Type: cross Abstract: Recent large vision-language models (LVLMs) have been applied to diverse VQA tasks. However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect. In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort. While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image-text modality gap. To address this issue, we propose the Text-Printed Image (TPI), which generates synthetic images by directly rendering the given textual description on a plain white canvas. This simple rendering projects text into the image modality and can be integrated into arbitrary existing LVLM training pipelines at low cost. Moreover, TPI preserves the semantics of the text, whereas text-to-image models often fail to do so. Across four models and seven benchmarks, our systematic experiments show that TPI enables more effective text-centric training than synthetic images generated by a diffusion model. We further explore TPI as a low-cost data-augmentation strategy and demonstrate its practical utility. Overall, our findings highlight the significant potential of text-centric training and, more broadly, chart a path toward fully automated data generation for LVLMs.
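The rendering step itself is simple enough to sketch: draw the description onto a white canvas so that text-only data enters the image modality. The canvas size, font, wrapping width, and the example caption below are arbitrary choices, not the paper's exact settings.

```python
from PIL import Image, ImageDraw
import textwrap

def text_printed_image(description: str, size=(448, 448), margin=16) -> Image.Image:
    canvas = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(canvas)
    wrapped = textwrap.fill(description, width=40)      # naive line wrapping
    draw.multiline_text((margin, margin), wrapped, fill="black")
    return canvas

img = text_printed_image("A close-up photo of a red bicycle leaning against a brick "
                         "wall, with a wicker basket attached to the handlebars.")
img.save("tpi_example.png")
```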
Read more →
AsymPuzl: An Asymmetric Puzzle for multi-agent cooperation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03466v1 Announce Type: cross Abstract: Large Language Model (LLM) agents are increasingly studied in multi-turn, multi-agent scenarios, yet most existing setups emphasize open-ended role-play rather than controlled evaluation. We introduce AsymPuzl, a minimal but expressive two-agent puzzle environment designed to isolate communication under information asymmetry. Each agent observes complementary but incomplete views of a symbolic puzzle and must exchange messages to solve it cooperatively. Using a diverse set of current-generation and open-source LLMs, we show that (i) strong models such as GPT-5 and Claude-4.0 reliably converge on the solution across puzzle sizes by sharing complete information in two turns, (ii) weaker models often ignore partner messages or over-correct their hypotheses, and (iii) feedback design is non-trivial: simple self-feedback improves success rates, while detailed joint feedback can hurt performance. These findings show that even in simple cooperative tasks, LLM communication strategies diverge and depend on the granularity of feedback signals. AsymPuzl thus provides a testbed for probing the limits of multi-turn cooperation and opens avenues for studying coordination mechanisms.
Read more →
ATHENA: Agentic Team for Hierarchical Evolutionary Numerical Algorithms
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03476v1 Announce Type: cross Abstract: Bridging the gap between theoretical conceptualization and computational implementation is a major bottleneck in Scientific Computing (SciC) and Scientific Machine Learning (SciML). We introduce ATHENA (Agentic Team for Hierarchical Evolutionary Numerical Algorithms), an agentic framework designed as an Autonomous Lab to manage the end-to-end computational research lifecycle. Its core is the HENA loop, a knowledge-driven diagnostic process framed as a Contextual Bandit problem. Acting as an online learner, the system analyzes prior trials to select structural "actions" ($A_n$) from combinatorial spaces guided by expert blueprints (e.g., Universal Approximation, Physics-Informed constraints). These actions are translated into executable code ($S_n$) to generate scientific rewards ($R_n$). ATHENA transcends standard automation: in SciC, it autonomously identifies mathematical symmetries for exact analytical solutions or derives stable numerical solvers where foundation models fail. In SciML, it performs deep diagnosis to tackle ill-posed formulations and combines hybrid symbolic-numeric workflows (e.g., coupling PINNs with FEM) to resolve multiphysics problems. The framework achieves super-human performance, reaching validation errors of $10^{-14}$. Furthermore, collaborative "human-in-the-loop" intervention allows the system to bridge stability gaps, improving results by an order of magnitude. This paradigm shifts the focus from implementation mechanics to methodological innovation, accelerating scientific discovery.
Read more →
Cell-cell communication inference and analysis: biological mechanisms, computational approaches, and future opportunities
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03497v1 Announce Type: cross Abstract: In multicellular organisms, cells coordinate their activities through cell-cell communication (CCC), which is crucial for development, tissue homeostasis, and disease progression. Recent advances in single-cell and spatial omics technologies provide unprecedented opportunities to systematically infer and analyze CCC from these omics data, either by integrating prior knowledge of ligand-receptor interactions (LRIs) or through de novo approaches. A variety of computational methods have been developed, focusing on methodological innovations, accurate modeling of complex signaling mechanisms, and investigation of broader biological questions. These advances have greatly enhanced our ability to analyze CCC and generate biological hypotheses. Here, we introduce the biological mechanisms and modeling strategies of CCC, and provide a focused overview of more than 140 computational methods for inferring CCC from single-cell and spatial transcriptomic data, emphasizing the diversity in methodological frameworks and biological questions. Finally, we discuss the current challenges and future opportunities in this rapidly evolving field.
Read more →
NAS-LoRA: Empowering Parameter-Efficient Fine-Tuning for Visual Foundation Models with Searchable Adaptation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03499v1 Announce Type: cross Abstract: The Segment Anything Model (SAM) has emerged as a powerful visual foundation model for image segmentation. However, adapting SAM to specific downstream tasks, such as medical and agricultural imaging, remains a significant challenge. To address this, Low-Rank Adaptation (LoRA) and its variants have been widely employed to enhance SAM's adaptation performance across diverse domains. Despite advancements, a critical question arises: can we integrate inductive bias into the model? This is particularly relevant since the Transformer encoder in SAM inherently lacks spatial priors within image patches, potentially hindering the acquisition of high-level semantic information. In this paper, we propose NAS-LoRA, a new Parameter-Efficient Fine-Tuning (PEFT) method designed to bridge the semantic gap between pre-trained SAM and specialized domains. Specifically, NAS-LoRA incorporates a lightweight Neural Architecture Search (NAS) block between the encoder and decoder components of LoRA to dynamically optimize the prior knowledge integrated into weight updates. Furthermore, we propose a stage-wise optimization strategy to help the ViT encoder balance weight updates and architectural adjustments, facilitating the gradual learning of high-level semantic information. Experiments demonstrate that NAS-LoRA improves on existing PEFT methods, while reducing training cost by 24.14% without increasing inference cost, highlighting the potential of NAS in enhancing PEFT for visual foundation models.
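For reference, the standard LoRA update that the NAS block would sit between and search over looks like the sketch below (frozen base weight plus a scaled low-rank update); the NAS component itself and the rank/alpha choices are not shown and are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                            # frozen pre-trained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no drift at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 16, 768)).shape)                    # torch.Size([2, 16, 768])
```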
Read more →
Physics-Driven Learning Framework for Tomographic Tactile Sensing
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03512v1 Announce Type: cross Abstract: Electrical impedance tomography (EIT) provides an attractive solution for large-area tactile sensing due to its minimal wiring and shape flexibility, but its nonlinear inverse problem often leads to severe artifacts and inaccurate contact reconstruction. This work presents PhyDNN, a physics-driven deep reconstruction framework that embeds the EIT forward model directly into the learning objective. By jointly minimizing the discrepancy between predicted and ground-truth conductivity maps and enforcing consistency with the forward PDE, PhyDNN reduces the black-box nature of deep networks and improves both physical plausibility and generalization. To enable efficient backpropagation, we design a differentiable forward-operator network that accurately approximates the nonlinear EIT response, allowing fast physics-guided training. Extensive simulations and real tactile experiments on a 16-electrode soft sensor show that PhyDNN consistently outperforms NOSER, TV, and standard DNNs in reconstructing contact shape, location, and pressure distribution. PhyDNN yields fewer artifacts, sharper boundaries, and higher metric scores, demonstrating its effectiveness for high-quality tomographic tactile sensing.
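A minimal sketch of the physics-driven objective described above: a reconstruction network predicts a conductivity map, a differentiable surrogate of the EIT forward model maps it back to boundary voltages, and the loss combines a supervised term with a forward-consistency term. The network shapes, measurement count, and weighting are placeholders, not PhyDNN's.

```python
import torch
import torch.nn as nn

n_meas, n_pix = 208, 32 * 32
recon = nn.Sequential(nn.Linear(n_meas, 512), nn.ReLU(), nn.Linear(512, n_pix))
# Differentiable surrogate of the EIT forward operator (would be pre-trained).
forward_op = nn.Sequential(nn.Linear(n_pix, 512), nn.ReLU(), nn.Linear(512, n_meas))

def phydnn_style_loss(voltages, sigma_true, lam=0.5):
    sigma_pred = recon(voltages)
    data_term = nn.functional.mse_loss(sigma_pred, sigma_true)
    # Consistency with the forward model penalizes physically implausible maps.
    physics_term = nn.functional.mse_loss(forward_op(sigma_pred), voltages)
    return data_term + lam * physics_term

v = torch.randn(4, n_meas)
s = torch.randn(4, n_pix)
print(phydnn_style_loss(v, s).item())
```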
Read more →
M3DR: Towards Universal Multilingual Multimodal Document Retrieval
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03514v1 Announce Type: cross Abstract: Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual contexts. In this work, we present M3DR (Multilingual Multimodal Document Retrieval), a framework designed to bridge this gap across languages, enabling applicability across diverse linguistic and cultural contexts. M3DR leverages synthetic multilingual document data and generalizes across different vision-language architectures and model sizes, enabling robust cross-lingual and cross-modal alignment. Using contrastive training, our models learn unified representations for text and document images that transfer effectively across languages. We validate this capability on 22 typologically diverse languages, demonstrating consistent performance and adaptability across linguistic and script variations. We further introduce a comprehensive benchmark that captures real-world multilingual scenarios, evaluating models under monolingual, multilingual, and mixed-language settings. M3DR generalizes across both single dense vector and ColBERT-style token-level multi-vector retrieval paradigms. Our models, NetraEmbed and ColNetraEmbed, achieve state-of-the-art performance with ~150% relative improvements on cross-lingual retrieval.
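A minimal sketch of the symmetric contrastive (InfoNCE/CLIP-style) objective commonly used for this kind of text-image alignment; the encoders are omitted, and the temperature and embedding sizes are arbitrary assumptions rather than M3DR's settings.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature    # (batch, batch) similarity matrix
    targets = torch.arange(text_emb.size(0))          # matching pairs sit on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

t = torch.randn(16, 256)   # e.g. multilingual query embeddings
i = torch.randn(16, 256)   # embeddings of rendered document pages
print(contrastive_loss(t, i).item())
```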
Read more →
Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03534v1 Announce Type: cross Abstract: Achieving precise alignment between user intent and generated visuals remains a central challenge in text-to-visual generation, as a single attempt often fails to produce the desired output. To handle this, prior approaches mainly scale the visual generation process (e.g., increasing sampling steps or seeds), but this quickly leads to a quality plateau. This limitation arises because the prompt, crucial for guiding generation, is kept fixed. To address this, we propose Prompt Redesign for Inference-time Scaling, coined PRIS, a framework that adaptively revises the prompt during inference in response to the scaled visual generations. The core idea of PRIS is to review the generated visuals, identify recurring failure patterns across visuals, and redesign the prompt accordingly before regenerating the visuals with the revised prompt. To provide precise alignment feedback for prompt revision, we introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level, achieving more accurate and interpretable assessments than holistic measures. Extensive experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach, including a 15% gain on VBench 2.0. These results highlight that jointly scaling prompts and visuals is key to fully leveraging scaling laws at inference-time. Visualizations are available at the website: https://subin-kim-cv.github.io/PRIS.
Read more →
CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03540v1 Announce Type: cross Abstract: Cooking is a sequential and visually grounded activity, where each step such as chopping, mixing, or frying carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything performs better than existing methods in training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation.
Read more →
V-ITI: Mitigating Hallucinations in Multimodal Large Language Models via Visual Inference-Time Intervention
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03542v1 Announce Type: cross Abstract: Multimodal Large Language Models (MLLMs) excel in numerous vision-language tasks yet suffer from hallucinations, producing content inconsistent with the input visuals, which undermines reliability in precision-sensitive domains. This issue stems from a fundamental problem of visual neglect, where models fail to adequately prioritize input images. Existing methods typically alleviate hallucinations by intervening in the attention score or output logits, focusing on "how to intervene" but overlooking the prerequisite "when to intervene", which leads to the "over-intervention" problem and subsequently introduces new hallucinations and unnecessary computational overhead. To address this gap, we first investigate the mechanism of visual neglect and reveal that it can be accurately detected via head-level activation patterns in MLLMs. We thus propose V-ITI, a lightweight visual inference-time intervention framework integrating a Visual Neglect Detector that identifies visual neglect via head-level discriminative probes and a Visual Recall Intervenor that modulates activations with prestored visual activation information only when visual neglect is detected. Extensive experiments across eight benchmarks and different MLLM families demonstrate that V-ITI consistently mitigates vision-related hallucinations while preserving general task performance.
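A minimal sketch of the detect-then-intervene idea: a linear probe over head-level activations flags visual neglect, and a prestored visual direction is added only when the flag fires. Shapes, the probe, and the scaling factor are assumptions for illustration, not the released code.

```python
# Illustrative "detect, then intervene" step: a linear probe flags visual
# neglect from head activations, and a stored visual direction is added only then.
import torch

def v_iti_step(head_acts, probe_w, probe_b, visual_dir, alpha=1.0, threshold=0.5):
    """head_acts: (n_heads, d_head) activations at one layer for the current token."""
    neglect_score = torch.sigmoid(head_acts.flatten() @ probe_w + probe_b)  # scalar in (0, 1)
    if neglect_score > threshold:                   # intervene only when neglect is detected
        head_acts = head_acts + alpha * visual_dir  # recall prestored visual information
    return head_acts
```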
Read more →
A Learning-based Control Methodology for Transitioning VTOL UAVs
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03548v1 Announce Type: cross Abstract: Transition control poses a critical challenge in Vertical Take-Off and Landing Unmanned Aerial Vehicle (VTOL UAV) development due to the tilting rotor mechanism, which shifts the center of gravity and thrust direction during transitions. Existing control methods decouple altitude and position control, which leads to significant vibration and limits both the handling of interactions and adaptability. In this study, we propose a novel coupled transition control methodology based on a reinforcement learning (RL) driven controller. In addition, in contrast to the conventional phase-transition approach, the ST3M method takes a new perspective by treating cruise mode as a special case of hover. We validate the feasibility of applying our method in simulation and real-world environments, demonstrating efficient controller development and migration while accurately controlling UAV position and attitude, and exhibiting outstanding trajectory tracking and reduced vibrations during the transition process.
Read more →
Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03553v1 Announce Type: cross Abstract: Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
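The hybrid decision logic can be pictured as a two-stage check: a supervised classifier handles known violations, and similarity against curated reference embeddings catches novel cases. The toy routine below illustrates that flow; thresholds and helper names are made up.

```python
# Toy illustration of hybrid moderation routing (classifier first, then
# reference-based similarity matching); not the production system.
import numpy as np

def moderate(segment_emb, classifier_score, reference_embs,
             cls_threshold=0.8, sim_threshold=0.9):
    if classifier_score >= cls_threshold:           # known violation caught by the classifier
        return "flag:classifier"
    sims = reference_embs @ segment_emb / (
        np.linalg.norm(reference_embs, axis=1) * np.linalg.norm(segment_emb) + 1e-8)
    if sims.max() >= sim_threshold:                 # novel case close to a curated reference
        return "flag:similarity"
    return "allow"
```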
Read more →
State Space Models for Bioacoustics: A comparative Evaluation with Transformers
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03563v1 Announce Type: cross Abstract: In this study, we evaluate the efficacy of the Mamba model in the field of bioacoustics. We first pretrain a Mamba-based audio large language model (LLM), BioMamba, on a large corpus of audio data using self-supervised learning. We fine-tune and evaluate BioMamba on the BEANS benchmark, a collection of diverse bioacoustic tasks including classification and detection, and compare its performance and efficiency with multiple baseline models, including AVES, a state-of-the-art Transformer-based model. The results show that BioMamba achieves comparable performance with AVES while consuming significantly less VRAM, demonstrating its potential in this domain.
Read more →
Machine Learning to Predict Slot Usage in TSCH Wireless Sensor Networks
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03570v1 Announce Type: cross Abstract: Wireless sensor networks (WSNs) are employed across a wide range of industrial applications where ultra-low power consumption is a critical prerequisite. At the same time, these systems must maintain a certain level of determinism to ensure reliable and predictable operation. In this view, time slotted channel hopping (TSCH) is a communication technology that meets both conditions, making it an attractive option for its usage in industrial WSNs. This work proposes the use of machine learning to learn the traffic pattern generated in networks based on the TSCH protocol, in order to put nodes into a deep sleep state when no transmission is planned and thus improve the energy efficiency of the WSN. The ability of machine learning models to make good predictions at different network levels in a typical tree network topology was analyzed in depth, showing how their capabilities degrade while approaching the root of the tree. The application of these models on simulated data based on an accurate modeling of wireless sensor nodes indicates that the investigated algorithms can be suitably used to further and substantially reduce the power consumption of a TSCH network.
Read more →
When, How Long and How Much? Interpretable Neural Networks for Time Series Regression by Learning to Mask and Aggregate
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03578v1 Announce Type: cross Abstract: Time series extrinsic regression (TSER) refers to the task of predicting a continuous target variable from an input time series. It appears in many domains, including healthcare, finance, environmental monitoring, and engineering. In these settings, accurate predictions and trustworthy reasoning are both essential. Although state-of-the-art TSER models achieve strong predictive performance, they typically operate as black boxes, making it difficult to understand which temporal patterns drive their decisions. Post-hoc interpretability techniques, such as feature attribution, aim to explain how the model arrives at its predictions, but often produce coarse, noisy, or unstable explanations. Recently, inherently interpretable approaches based on concepts, additive decompositions, or symbolic regression have emerged as promising alternatives. However, these approaches remain limited: they require explicit supervision on the concepts themselves, often cannot capture interactions between time-series features, lack expressiveness for complex temporal patterns, and struggle to scale to high-dimensional multivariate data. To address these limitations, we propose MAGNETS (Mask-and-AGgregate NEtwork for Time Series), an inherently interpretable neural architecture for TSER. MAGNETS learns a compact set of human-understandable concepts without requiring any annotations. Each concept corresponds to a learned, mask-based aggregation over selected input features, explicitly revealing both which features drive predictions and when they matter in the sequence. Predictions are formed as combinations of these learned concepts through a transparent, additive structure, enabling clear insight into the model's decision process.
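A rough sketch of a mask-and-aggregate concept layer with an additive head, in the spirit described above; the soft-mask parametrization and the normalization are assumptions, not the paper's exact design.

```python
# Rough sketch of a mask-and-aggregate concept layer with an additive linear
# head; the parametrization here is an assumption for illustration.
import torch
import torch.nn as nn

class MaskAggregate(nn.Module):
    def __init__(self, n_features, seq_len, n_concepts):
        super().__init__()
        # One learnable soft mask over (time, features) per concept.
        self.mask_logits = nn.Parameter(torch.zeros(n_concepts, seq_len, n_features))
        self.head = nn.Linear(n_concepts, 1)   # transparent additive combination

    def forward(self, x):                      # x: (B, seq_len, n_features)
        masks = torch.sigmoid(self.mask_logits)                      # (C, T, F), values in [0, 1]
        concepts = torch.einsum('btf,ctf->bc', x, masks)             # masked aggregation per concept
        concepts = concepts / masks.sum(dim=(1, 2)).clamp_min(1e-6)  # normalize by mask area
        return self.head(concepts).squeeze(-1), masks                # prediction + interpretable masks
```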
Read more →
Fine-grained Narrative Classification in Biased News Articles
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03582v1 Announce Type: cross Abstract: Narratives are the cognitive and emotional scaffolds of propaganda. They organize isolated persuasive techniques into coherent stories that justify actions, attribute blame, and evoke identification with ideological camps. In this paper, we propose a novel fine-grained narrative classification in biased news articles. We also explore article-bias classification as the precursor task to narrative classification and fine-grained persuasive technique identification. We develop INDI-PROP, the first ideologically grounded fine-grained narrative dataset with multi-level annotation for analyzing propaganda in Indian news media. Our dataset INDI-PROP comprises 1,266 articles focusing on two polarizing socio-political events in recent times: CAA and the Farmers' protest. Each article is annotated at three hierarchical levels: (i) ideological article-bias (pro-government, pro-opposition, neutral), (ii) event-specific fine-grained narrative frames anchored in ideological polarity and communicative intent, and (iii) persuasive techniques. We propose FANTA and TPTC, two GPT-4o-mini guided multi-hop prompt-based reasoning frameworks for the bias, narrative, and persuasive technique classification. FANTA leverages multi-layered communicative phenomena by integrating information extraction and contextual framing for hierarchical reasoning. On the other hand, TPTC adopts systematic decomposition of persuasive cues via a two-stage approach. Our evaluation suggests substantial improvement over underlying baselines in each case.
Read more →
KVNAND: Efficient On-Device Large Language Model Inference Using DRAM-Free In-Flash Computing
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03608v1 Announce Type: cross Abstract: Deploying large language models (LLMs) on edge devices enables personalized agents with strong privacy and low cost. However, with tens to hundreds of billions of parameters, single-batch autoregressive inference suffers from extremely low arithmetic intensity, creating severe weight-loading and bandwidth pressures on resource-constrained platforms. Recent in-flash computing (IFC) solutions alleviate this bottleneck by co-locating weight-related linear computations in the decode phase with flash, yet still rely on DRAM for the key-value (KV) cache. As context length grows, the KV cache can exceed model weights in size, imposing prohibitive DRAM cost and capacity requirements. Attempts to offload KV cache to flash suffer from severe performance penalties. We propose KVNAND, the first DRAM-free, IFC-based architecture that stores both model weights and KV cache entirely in compute-enabled 3D NAND flash. KVNAND addresses the fundamental performance challenges of flash under intensive KV cache access by leveraging IFC for all memory-bound operations to reduce data transfer overhead, introducing head-group parallelism to boost throughput, and employing page-level KV cache mapping to align token access patterns with flash organization. In addition, we propose a design space exploration framework that evaluates discrete and compact KVNAND variants to balance weight and KV placement, automatically identifying the optimal design trade-off. These techniques mitigate latency, energy, and reliability concerns, turning flash into a practical medium for long-context KV storage. Evaluations on MHA 7B and GQA 70B LLMs show that KVNAND achieves 1.98\(\times\)/1.94\(\times\)/2.05\(\times\) geomean speedup at 128/1K/10K-token contexts compared to DRAM-equipped IFC designs and addresses out-of-memory failures at 100K context length.
Read more →
SELF: A Robust Singular Value and Eigenvalue Approach for LLM Fingerprinting
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03620v1 Announce Type: cross Abstract: The protection of Intellectual Property (IP) in Large Language Models (LLMs) represents a critical challenge in contemporary AI research. While fingerprinting techniques have emerged as a fundamental mechanism for detecting unauthorized model usage, existing methods -- whether behavior-based or structural -- suffer from vulnerabilities such as false claim attacks or susceptibility to weight manipulations. To overcome these limitations, we propose SELF, a novel intrinsic weight-based fingerprinting scheme that eliminates dependency on input and inherently resists false claims. SELF achieves robust IP protection through two key innovations: 1) unique, scalable and transformation-invariant fingerprint extraction via singular value and eigenvalue decomposition of LLM attention weights, and 2) effective neural network-based fingerprint similarity comparison based on few-shot learning and data augmentation. Experimental results demonstrate SELF maintains high IP infringement detection accuracy while showing strong robustness against various downstream modifications, including quantization, pruning, and fine-tuning attacks. Our code is available at https://github.com/HanxiuZhang/SELF_v2.
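The core invariance argument, that singular values of attention weight matrices survive rotations and rescalings, can be illustrated with a simplified spectral fingerprint; the real SELF pipeline adds a learned few-shot comparator on top, which is omitted here.

```python
# Simplified sketch of a spectral fingerprint: top singular values of attention
# weight matrices, compared with cosine similarity. Illustration only, not SELF itself.
import numpy as np

def spectral_fingerprint(attn_weight_matrices, k=32):
    feats = []
    for W in attn_weight_matrices:                     # e.g., per-layer W_Q, W_K, W_V
        s = np.linalg.svd(W, compute_uv=False)[:k]     # singular values are rotation-invariant
        feats.append(s / (s.sum() + 1e-12))            # normalize out global scaling
    return np.concatenate(feats)

def fingerprint_similarity(f1, f2):
    return float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-12))
```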
Read more →
The promising potential of vision language models for the generation of textual weather forecasts
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03623v1 Announce Type: cross Abstract: Despite the promising capability of multimodal foundation models, their application to the generation of meteorological products and services remains nascent. To accelerate aspiration and adoption, we explore the novel use of a vision language model for writing the iconic Shipping Forecast text directly from video-encoded gridded weather data. These early results demonstrate promising scalable technological opportunities for enhancing production efficiency and service innovation within the weather enterprise and beyond.
Read more →
AlignCheck: a Semantic Open-Domain Metric for Factual Consistency Assessment
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03634v1 Announce Type: cross Abstract: Large Language Models have significantly advanced natural language processing tasks, but remain prone to generating incorrect or misleading but plausible arguments. This issue, known as hallucination, is particularly concerning in high-stakes domains like clinical applications, where factual inaccuracies can have severe consequences. Existing evaluation metrics fail to adequately assess factual consistency and lack interpretability, making diagnosing and mitigating errors difficult. We propose an interpretable framework for factual consistency assessment for in-domain and open-domain texts to address these limitations. Our approach decomposes text into atomic facts and introduces a flexible, schema-free methodology. Unlike previous methods with an absolute metric, we incorporate a weighted metric to enhance factual evaluation. Additionally, we propose a mechanism to control assessment complexity in intricate domains. We benchmark our approach on popular general and clinical datasets and release our code to support fact-aware model training in future research.
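A weighted factual-consistency score of the kind described reduces, at its simplest, to a weighted fraction of supported atomic facts; the tiny helper below illustrates the arithmetic, with fact extraction and verification left as placeholders.

```python
# Minimal illustration of a weighted factual-consistency score over atomic facts;
# the weights and the fact-verification step are placeholders, not the paper's pipeline.
def weighted_consistency(facts):
    """facts: list of (weight, supported) pairs, e.g. [(2.0, True), (1.0, False)]."""
    total = sum(w for w, _ in facts)
    if total == 0:
        return 1.0                      # no checkable facts extracted
    return sum(w for w, ok in facts if ok) / total

# Example: an unsupported high-weight (e.g., clinically critical) fact drags the
# score down more than a minor one would.
score = weighted_consistency([(3.0, True), (3.0, False), (1.0, True)])  # ~0.57
```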
Read more →
MKSNet: Advanced Small Object Detection in Remote Sensing Imagery with Multi-Kernel and Dual Attention Mechanisms
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03640v1 Announce Type: cross Abstract: Deep convolutional neural networks (DCNNs) have substantially advanced object detection capabilities, particularly in remote sensing imagery. However, challenges persist, especially in detecting small objects where the high resolution of these images and the small size of target objects often result in a loss of critical information in the deeper layers of conventional CNNs. Additionally, the extensive spatial redundancy and intricate background details typical in remote-sensing images tend to obscure these small targets. To address these challenges, we introduce the Multi-Kernel Selection Network (MKSNet), a novel network architecture featuring a Multi-Kernel Selection (MKS) mechanism. The MKS mechanism utilizes large convolutional kernels to effectively capture an extensive range of contextual information. This innovative design allows for adaptive kernel size selection, significantly enhancing the network's ability to dynamically process and emphasize crucial spatial details for small object detection. Furthermore, MKSNet also incorporates a dual attention mechanism, merging spatial and channel attention modules. The spatial attention module adaptively fine-tunes the spatial weights of feature maps, focusing more intensively on relevant regions while mitigating background noise. Simultaneously, the channel attention module optimizes channel information selection, improving feature representation and detection accuracy. Empirical evaluations on the DOTA-v1.0 and HRSC2016 benchmarks demonstrate that MKSNet substantially surpasses existing state-of-the-art models in detecting small objects in remote sensing images. These results highlight MKSNet's superior ability to manage the complexities associated with multi-scale and high-resolution image data, confirming its effectiveness and innovation in remote sensing object detection.
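A sketch of what an adaptive multi-kernel selection block could look like: parallel convolutions with different kernel sizes fused by a learned softmax gate. Channel counts, kernel sizes, and the gating design are illustrative, not the published architecture.

```python
# Sketch of an adaptive multi-kernel selection block: parallel convolutions with
# different kernel sizes, fused by a learned softmax gate (illustrative design).
import torch
import torch.nn as nn

class MultiKernelSelect(nn.Module):
    def __init__(self, channels, kernel_sizes=(3, 7, 11)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(channels, len(kernel_sizes)))

    def forward(self, x):                                             # x: (B, C, H, W)
        feats = torch.stack([b(x) for b in self.branches], dim=1)     # (B, K, C, H, W)
        weights = torch.softmax(self.gate(x), dim=-1)                 # (B, K)
        return (weights[:, :, None, None, None] * feats).sum(dim=1)   # adaptive kernel mix
```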
Read more →
Dynamically Scaled Activation Steering
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03661v1 Announce Type: cross Abstract: Activation steering has emerged as a powerful method for guiding the behavior of generative models towards desired outcomes such as toxicity mitigation. However, most existing methods apply interventions uniformly across all inputs, degrading model performance when steering is unnecessary. We introduce Dynamically Scaled Activation Steering (DSAS), a method-agnostic steering framework that decouples when to steer from how to steer. DSAS adaptively modulates the strength of existing steering transformations across layers and inputs, intervening strongly only when undesired behavior is detected. At generation time, DSAS computes context-dependent scaling factors that selectively adjust the strength of any steering method. We also show how DSAS can be jointly optimized end-to-end together with the steering function. When combined with existing steering methods, DSAS consistently improves the Pareto front with respect to steering alone, achieving a better trade-off between toxicity mitigation and utility preservation. We further demonstrate DSAS's generality by applying it to a text-to-image diffusion model, showing how adaptive steering allows the modulation of specific concepts. Finally, DSAS introduces minimal computational overhead while improving interpretability, pinpointing which tokens require steering and by how much.
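Conceptually, DSAS decouples when to steer from how to steer by scaling an existing steering vector with a context-dependent factor; the sketch below shows that shape, with the detector as a stand-in for the learned component.

```python
# Conceptual sketch of dynamically scaled steering: a detector produces a
# context-dependent scale that modulates an existing steering vector per layer.
# The detector here is a placeholder; DSAS learns when and how much to steer.
import torch

def dsas_apply(hidden, steer_vec, detector, max_scale=1.0):
    """hidden: (B, D) activations at one layer; steer_vec: (D,) from any steering method."""
    score = torch.sigmoid(detector(hidden)).squeeze(-1)   # (B,) probability of undesired behavior
    scale = max_scale * score                             # steer strongly only when needed
    return hidden + scale.unsqueeze(-1) * steer_vec
```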
Read more →
ToG-Bench: Task-Oriented Spatio-Temporal Grounding in Egocentric Videos
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03666v1 Announce Type: cross Abstract: A core capability towards general embodied intelligence lies in localizing task-relevant objects from an egocentric perspective, formulated as Spatio-Temporal Video Grounding (STVG). Despite recent progress, existing STVG studies remain largely confined to object-centric and descriptive instructions, neglecting the task-oriented reasoning that is crucial for embodied agents to accomplish goal-directed interactions. To bridge this gap, we introduce ToG-Bench, the first task-oriented spatio-temporal video grounding benchmark for egocentric videos. ToG-Bench is characterized by three key features: (1) Task-oriented Grounding, which requires identifying and localizing objects based on intended tasks rather than straightforward descriptions; (2) Explicit-Implicit Dual Grounding, where target objects can be either explicitly mentioned or implicitly inferred by contextual reasoning; (3) One-to-Many Grounding, where a single instruction may correspond to multiple objects involved in task execution. Built upon videos sourced from ScanNet, ToG-Bench comprises 100 annotated clips with 2,704 task-oriented grounding instructions, constructed via a semi-automated pipeline that combines foundation model annotation and human refinement. In addition, we introduce a set of task-level evaluation metrics tailored for multi-object and explicit-implicit object grounding, and systematically benchmark seven state-of-the-art MLLMs. Extensive experiments reveal the intrinsic challenges of task-oriented STVG and substantial performance gaps across explicit-implicit and multi-object grounding, highlighting the difficulty of bridging perception and interaction in embodied scenarios. Data and code will be released at: https://github.com/qaxuDev/ToG-Bench.
Read more →
Quantum Topological Graph Neural Networks for Detecting Complex Fraud Patterns
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03696v1 Announce Type: cross Abstract: We propose a novel QTGNN framework for detecting fraudulent transactions in large-scale financial networks. By integrating quantum embedding, variational graph convolutions, and topological data analysis, QTGNN captures complex transaction dynamics and structural anomalies indicative of fraud. The methodology includes quantum data embedding with entanglement enhancement, variational quantum graph convolutions with non-linear dynamics, extraction of higher-order topological invariants, hybrid quantum-classical anomaly learning with adaptive optimization, and interpretable decision-making via topological attribution. Rigorous convergence guarantees ensure stable training on noisy intermediate-scale quantum (NISQ) devices, while stability of topological signatures provides robust fraud detection. Optimized for NISQ hardware with circuit simplifications and graph sampling, the framework scales to large transaction networks. Simulations on financial datasets, such as PaySim and Elliptic, benchmark QTGNN against classical and quantum baselines, using metrics like ROC-AUC, precision, and false positive rate. An ablation study evaluates the contributions of quantum embeddings, topological features, non-linear channels, and hybrid learning. QTGNN offers a theoretically sound, interpretable, and practical solution for financial fraud detection, bridging quantum machine learning, graph theory, and topological analysis.
Read more →
Matrix Editing Meets Fair Clustering: Parameterized Algorithms and Complexity
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03718v1 Announce Type: cross Abstract: We study the computational problem of computing a fair means clustering of discrete vectors, which admits an equivalent formulation as editing a colored matrix into one with few distinct color-balanced rows by changing at most $k$ values. While NP-hard in both the fairness-oblivious and the fair settings, the problem is well-known to admit a fixed-parameter algorithm in the former "vanilla" setting. As our first contribution, we exclude an analogous algorithm even for highly restricted fair means clustering instances. We then proceed to obtain a full complexity landscape of the problem, and establish tractability results which capture three means of circumventing our obtained lower bound: placing additional constraints on the problem instances, fixed-parameter approximation, or using an alternative parameterization targeting tree-like matrices.
Read more →
Over-the-Air Federated Learning: Rethinking Edge AI Through Signal Processing
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03719v1 Announce Type: cross Abstract: Over-the-Air Federated Learning (AirFL) is an emerging paradigm that tightly integrates wireless signal processing and distributed machine learning to enable scalable AI at the network edge. By leveraging the superposition property of wireless signals, AirFL performs communication and model aggregation of the learning process simultaneously, significantly reducing latency, bandwidth, and energy consumption. This article offers a tutorial treatment of AirFL, presenting a novel classification into three design approaches: CSIT-aware, blind, and weighted AirFL. We provide a comprehensive guide to theoretical foundations, performance analysis, complexity considerations, practical limitations, and prospective research directions.
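The superposition idea can be illustrated with a toy numpy model: clients transmit their updates simultaneously and the channel itself computes a noisy sum, from which the server reads off the average. Channel gains and the noise level below are illustrative.

```python
# Toy numpy model of over-the-air aggregation: simultaneous analog transmissions
# superpose into a noisy sum, so the server recovers the average without
# per-client uplinks. Gains and noise level are illustrative.
import numpy as np

rng = np.random.default_rng(0)
updates = [rng.normal(size=1000) for _ in range(10)]       # local model updates
h = np.ones(10)                                            # assume perfect power control (CSIT-aware case)
noise = rng.normal(scale=0.01, size=1000)                  # additive receiver noise

received = sum(h[i] * updates[i] for i in range(10)) + noise    # what the air "computes"
fl_average = received / 10                                      # server-side estimate
error = np.linalg.norm(fl_average - np.mean(updates, axis=0))   # small, set by the noise
```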
Read more →
Context-Aware Hierarchical Learning: A Two-Step Paradigm towards Safer LLMs
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03720v1 Announce Type: cross Abstract: Large Language Models (LLMs) have emerged as powerful tools for diverse applications. However, their uniform token processing paradigm introduces critical vulnerabilities in instruction handling, particularly when exposed to adversarial scenarios. In this work, we identify and propose a novel class of vulnerabilities, termed Tool-Completion Attack (TCA), which exploits function-calling mechanisms to subvert model behavior. To evaluate LLM robustness against such threats, we introduce the Tool-Completion benchmark, a comprehensive security assessment framework, which reveals that even state-of-the-art models remain susceptible to TCA, with surprisingly high attack success rates. To address these vulnerabilities, we introduce Context-Aware Hierarchical Learning (CAHL), a sophisticated mechanism that dynamically balances semantic comprehension with role-specific instruction constraints. CAHL leverages the contextual correlations between different instruction segments to establish a robust, context-aware instruction hierarchy. Extensive experiments demonstrate that CAHL significantly enhances LLM robustness against both conventional attacks and the proposed TCA, exhibiting strong generalization capabilities in zero-shot evaluations while still preserving model performance on generic tasks. Our code is available at https://github.com/S2AILab/CAHL.
Read more →
AI/ML in 3GPP 5G Advanced - Services and Architecture
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03728v1 Announce Type: cross Abstract: The 3rd Generation Partnership Project (3GPP), the standards body for mobile networks, is in the final phase of Release 19 standardization and is beginning Release 20. Artificial Intelligence/ Machine Learning (AI/ML) has brought about a paradigm shift in technology and it is being adopted across industries and verticals. 3GPP has been integrating AI/ML into the 5G advanced system since Release 18. This paper focuses on the AI/ML related technological advancements and features introduced in Release 19 within the Service and System Aspects (SA) Technical specifications group of 3GPP. The advancements relate to two paradigms: (i) enhancements that AI/ML brought to the 5G advanced system (AI for network), e.g. resource optimization, and (ii) enhancements that were made to the 5G system to support AI/ML applications (Network for AI), e.g. image recognition.
Read more →
Out-of-the-box: Black-box Causal Attacks on Object Detectors
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03730v1 Announce Type: cross Abstract: Adversarial perturbations are a useful way to expose vulnerabilities in object detectors. Existing perturbation methods are frequently white-box and architecture specific. More importantly, while they are often successful, it is rarely clear why they work. Insights into the mechanism of this success would allow developers to understand and analyze these attacks, as well as fine-tune the model to prevent them. This paper presents BlackCAtt, a black-box algorithm and a tool, which uses minimal, causally sufficient pixel sets to construct explainable, imperceptible, reproducible, architecture-agnostic attacks on object detectors. BlackCAtt combines causal pixels with bounding boxes produced by object detectors to create adversarial attacks that lead to the loss, modification or addition of a bounding box. BlackCAtt works across different object detectors of different sizes and architectures, treating the detector as a black box. We compare the performance of BlackCAtt with other black-box attack methods and show that identification of causal pixels leads to more precisely targeted and less perceptible attacks. On the COCO test dataset, our approach is 2.7 times better than the baseline in removing a detection, 3.86 times better in changing a detection, and 5.75 times better in triggering new, spurious, detections. The attacks generated by BlackCAtt are very close to the original image, and hence imperceptible, demonstrating the power of causal pixels.
Read more →
Research on Brain Tumor Classification Method Based on Improved ResNet34 Network
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03751v1 Announce Type: cross Abstract: Previously, image interpretation in radiology relied heavily on manual methods. However, manual classification of brain tumor medical images is time-consuming and labor-intensive. Even with shallow convolutional neural network models, the accuracy is not ideal. To improve the efficiency and accuracy of brain tumor image classification, this paper proposes a brain tumor classification model based on an improved ResNet34 network. This model uses the ResNet34 residual network as the backbone network and incorporates multi-scale feature extraction. It uses a multi-scale input module as the first layer of the ResNet34 network and an Inception v2 module as the residual downsampling layer. Furthermore, a channel attention mechanism module assigns different weights to different channels of the image from a channel domain perspective, obtaining more important feature information. The results after a five-fold cross-validation experiment show that the average classification accuracy of the improved network model is approximately 98.8%, which is not only 1% higher than ResNet34 but is also achieved with only 80% of the parameters of the original model. Therefore, the improved network model not only improves accuracy but also reduces model complexity, achieving better classification with fewer parameters and higher accuracy.
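The channel attention module mentioned above is commonly realized as a squeeze-and-excitation style block that reweights channels using globally pooled statistics; a generic sketch follows (the reduction ratio is a common default, not necessarily the paper's setting).

```python
# Sketch of a squeeze-and-excitation style channel attention module: per-channel
# weights are produced from globally pooled features and used to reweight channels.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                          # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))            # squeeze: global average pool -> (B, C)
        return x * w[:, :, None, None]             # excite: reweight channels
```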
Read more →
Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03759v1 Announce Type: cross Abstract: Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training. Extensive experiments on mathematical reasoning, coding, and planning tasks demonstrate that ESPO significantly outperforms token-level baselines, achieving dramatic improvements of 20-40 points on the Countdown task, while maintaining consistent gains on math and coding benchmarks. Our approach establishes sequence-level optimization as a principled and empirically effective paradigm for RL in dLLMs. Our code is available at https://github.com/ML-GSAI/ESPO.
Read more →
In-Context Representation Hijacking
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03771v2 Announce Type: cross Abstract: We introduce Doublespeak, a simple in-context representation hijacking attack against large language models (LLMs). The attack works by systematically replacing a harmful keyword (e.g., bomb) with a benign token (e.g., carrot) across multiple in-context examples, provided as a prefix to a harmful request. We demonstrate that this substitution leads to the internal representation of the benign token converging toward that of the harmful one, effectively embedding the harmful semantics under a euphemism. As a result, superficially innocuous prompts (e.g., "How to build a carrot?") are internally interpreted as disallowed instructions (e.g., "How to build a bomb?"), thereby bypassing the model's safety alignment. We use interpretability tools to show that this semantic overwrite emerges layer by layer, with benign meanings in early layers converging into harmful semantics in later ones. Doublespeak is optimization-free, broadly transferable across model families, and achieves strong success rates on closed-source and open-source systems, reaching 74% ASR on Llama-3.3-70B-Instruct with a single-sentence context override. Our findings highlight a new attack surface in the latent space of LLMs, revealing that current alignment strategies are insufficient and should instead operate at the representation level.
Read more →
Bayesian Optimization for Automatic Tuning of Torque-Level Nonlinear Model Predictive Control
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03772v1 Announce Type: cross Abstract: This paper presents an auto-tuning framework for torque-based Nonlinear Model Predictive Control (nMPC), where the MPC serves as a real-time controller for optimal joint torque commands. The MPC parameters, including cost function weights and low-level controller gains, are optimized using high-dimensional Bayesian Optimization (BO) techniques, specifically Sparse Axis-Aligned Subspace BO (SAASBO) with a digital twin (DT), to achieve precise real-time end-effector trajectory tracking on a UR10e robot arm. The simulation model allows efficient exploration of the high-dimensional parameter space, and it ensures safe transfer to hardware. Our simulation results demonstrate significant improvements in tracking performance (+41.9%) and reduction in solve times (-2.5%) compared to manually-tuned parameters. Moreover, experimental validation on the real robot follows the trend (with a +25.8% improvement), emphasizing the importance of digital twin-enabled automated parameter optimization for robotic operations.
Read more →
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03794v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
Read more →
MPCFormer: A physics-informed data-driven approach for explainable socially-aware autonomous driving
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03795v1 Announce Type: cross Abstract: Autonomous Driving (AD) vehicles still struggle to exhibit human-like behavior in highly dynamic and interactive traffic scenarios. The key challenge lies in AD's limited ability to interact with surrounding vehicles, largely due to a lack of understanding of the underlying mechanisms of social interaction. To address this issue, we introduce MPCFormer, an explainable socially-aware autonomous driving approach with physics-informed and data-driven coupled social interaction dynamics. In this model, the dynamics are formulated into a discrete state-space representation, which embeds physics priors to enhance modeling explainability. The dynamics coefficients are learned from naturalistic driving data via a Transformer-based encoder-decoder architecture. To the best of our knowledge, MPCFormer is the first approach to explicitly model the dynamics of multi-vehicle social interactions. The learned social interaction dynamics enable the planner to generate manifold, human-like behaviors when interacting with surrounding traffic. By leveraging the MPC framework, the approach mitigates the potential safety risks typically associated with purely learning-based methods. Open-loop evaluation on the NGSIM dataset demonstrates that MPCFormer achieves superior social interaction awareness, yielding the lowest trajectory prediction errors compared with other state-of-the-art approaches. The prediction achieves an ADE as low as 0.86 m over a long prediction horizon of 5 seconds. Closed-loop experiments in highly intense interaction scenarios, where consecutive lane changes are required to exit an off-ramp, further validate the effectiveness of MPCFormer. Results show that MPCFormer achieves the highest planning success rate of 94.67%, improves driving efficiency by 15.75%, and reduces the collision rate from 21.25% to 0.5%, outperforming a frontier Reinforcement Learning (RL) based planner.
Read more →
DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03847v1 Announce Type: cross Abstract: Reinforcement learning (RL) has shown strong performance in LLM post-training, but real-world deployment often involves noisy or incomplete supervision. In such settings, complex and unreliable supervision signals can destabilize training and harm generalization. While existing approaches such as worst-case optimization (e.g., RFQI, CQL) and mean-based methods (e.g., PPO, GRPO) can improve stability, they often overlook generalization and may produce overly conservative policies, leading to uneven performance across diverse real scenarios. To this end, we introduce DVPO (Distributional Value Modeling with Risk-aware Policy Optimization), a new RL framework that combines conditional risk theory with distributional value modeling to better balance robustness and generalization. DVPO learns token-level value distributions to provide fine-grained supervision, and applies an asymmetric risk regularization to shape the distribution tails: it contracts the lower tail to dampen noisy negative deviations, while expanding the upper tail to preserve exploratory diversity. Across extensive experiments and analysis in multi-turn dialogue, math reasoning, and scientific QA, DVPO consistently outperforms PPO, GRPO, and robust Bellman-based PPO under noisy supervision, showing its potential for LLM post-training in the real-world.
Read more →
PULSE: A Unified Multi-Task Architecture for Cardiac Segmentation, Diagnosis, and Few-Shot Cross-Modality Clinical Adaptation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03848v1 Announce Type: cross Abstract: Cardiac image analysis remains fragmented across tasks: anatomical segmentation, disease classification, and grounded clinical report generation are typically handled by separate networks trained under different data regimes. No existing framework unifies these objectives within a single architecture while retaining generalization across imaging modalities and datasets. We introduce PULSE, a multi-task vision-language framework built on self-supervised representations and optimized through a composite supervision strategy that balances region overlap learning, pixel-wise classification fidelity, and boundary-aware IoU refinement. A multi-scale token reconstruction decoder enables anatomical segmentation, while shared global representations support disease classification and clinically grounded text output, allowing the model to transition from pixels to structures and finally to clinical reasoning within one architecture. Unlike prior task-specific pipelines, PULSE learns task-invariant cardiac priors, generalizes robustly across datasets, and can be adapted to new imaging modalities with minimal supervision. This moves the field closer to a scalable, foundation-style cardiac analysis framework.
Read more →
Scalable Decision Focused Learning via Online Trainable Surrogates
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03861v1 Announce Type: cross Abstract: Decision support systems often rely on solving complex optimization problems that may require estimating uncertain parameters beforehand. Recent studies have shown how using traditionally trained estimators for this task can lead to suboptimal solutions. Using the actual decision cost as a loss function (called Decision Focused Learning) can address this issue, but with a severe loss of scalability at training time. To address this issue, we propose an acceleration method based on replacing costly loss function evaluations with an efficient surrogate. Unlike previously defined surrogates, our approach relies on unbiased estimators reducing the risk of spurious local optima and can provide information on its local confidence, allowing one to switch to a fallback method when needed. Furthermore, the surrogate is designed for a black-box setting, which enables compensating for simplifications in the optimization model and accounting for recourse actions during cost computation. In our results, the method reduces costly inner solver calls, with a solution quality comparable to other state-of-the-art techniques.
Read more →
Hyperdimensional Computing for Sustainable Manufacturing: An Initial Assessment
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03864v1 Announce Type: cross Abstract: Smart manufacturing can significantly improve efficiency and reduce energy consumption, yet the energy demands of AI models may offset these gains. This study utilizes in-situ sensing-based prediction of geometric quality in smart machining to compare the energy consumption, accuracy, and speed of common AI models. HyperDimensional Computing (HDC) is introduced as an alternative, achieving accuracy comparable to conventional models while drastically reducing energy consumption, 200$\times$ for training and 175 to 1000$\times$ for inference. Furthermore, HDC reduces training times by 200$\times$ and inference times by 300 to 600$\times$, showcasing its potential for energy-efficient smart manufacturing.
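A minimal hyperdimensional-computing classifier, generic HDC practice rather than the paper's exact setup, looks like this: a fixed random bipolar encoder, per-class bundling at train time, and cosine similarity at inference.

```python
# Minimal HDC classifier: random bipolar projection to a high-dimensional space,
# per-class bundling, cosine-similarity inference. Dimensions are illustrative.
import numpy as np

rng = np.random.default_rng(0)
D, n_features = 10_000, 64
projection = rng.choice([-1, 1], size=(n_features, D))     # fixed random encoder

def encode(x):                                              # x: (n_features,)
    return np.sign(x @ projection)                          # bipolar hypervector

def train(X, y, n_classes):
    prototypes = np.zeros((n_classes, D))
    for xi, yi in zip(X, y):
        prototypes[yi] += encode(xi)                        # bundling = elementwise sum
    return prototypes

def predict(prototypes, x):
    hv = encode(x)
    sims = prototypes @ hv / (np.linalg.norm(prototypes, axis=1) * np.linalg.norm(hv) + 1e-12)
    return int(np.argmax(sims))
```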
Read more →
BERnaT: Basque Encoders for Representing Natural Textual Diversity
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03903v1 Announce Type: cross Abstract: Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on Basque, a morphologically rich and low-resource language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data consistently outperform those trained on standard corpora, improving performance across all task types without compromising standard benchmark accuracy. These findings highlight the importance of linguistic diversity in building inclusive, generalizable language models.
Read more →
Autonomous Reinforcement Learning Robot Control with Intel's Loihi 2 Neuromorphic Hardware
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03911v1 Announce Type: cross Abstract: We present an end-to-end pipeline for deploying reinforcement learning (RL) trained Artificial Neural Networks (ANNs) on neuromorphic hardware by converting them into spiking Sigma-Delta Neural Networks (SDNNs). We demonstrate that an ANN policy trained entirely in simulation can be transformed into an SDNN compatible with Intel's Loihi 2 architecture, enabling low-latency and energy-efficient inference. As a test case, we use an RL policy for controlling the Astrobee free-flying robot, similar to a controller previously validated on hardware in space. The policy, trained with Rectified Linear Units (ReLUs), is converted to an SDNN and deployed on Intel's Loihi 2, then evaluated in NVIDIA's Omniverse Isaac Lab simulation environment for closed-loop control of Astrobee's motion. We compare execution performance between GPU and Loihi 2. The results highlight the feasibility of using neuromorphic platforms for robotic control and establish a pathway toward energy-efficient, real-time neuromorphic computation in future space and terrestrial robotics applications.
Read more →
Hierarchical Vision Language Action Model Using Success and Failure Demonstrations
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03913v1 Announce Type: cross Abstract: Prior Vision-Language-Action (VLA) models are typically trained on teleoperated successful demonstrations, while discarding numerous failed attempts that occur naturally during data collection. However, these failures encode where and how policies can be fragile, information that can be exploited to improve robustness. We address this problem by leveraging mixed-quality datasets to learn failure-aware reasoning at planning time. We introduce VINE, a hierarchical vision-language-action model that separates high-level reasoning (System 2) from low-level control (System 1) under a hierarchical reinforcement learning formalism, making failures usable as a structured learning signal rather than noisy supervision. System 2 performs feasibility-guided tree search over a 2D scene-graph abstraction: it proposes subgoal transitions, predicts success probabilities from both successes and failures, and prunes brittle branches before execution, effectively casting plan evaluation as feasibility scoring. The selected subgoal sequence is then passed to System 1, which executes low-level actions without modifying the agent's core skills. Trained entirely from offline teleoperation data, VINE integrates negative experience directly into the decision loop. Across challenging manipulation tasks, this approach consistently improves success rates and robustness, demonstrating that failure data is an essential resource for converting the broad competence of VLAs into robust execution.
Read more →
A Theoretical Framework for Auxiliary-Loss-Free Load Balancing of Sparse Mixture-of-Experts in Large-Scale AI Models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03915v2 Announce Type: cross Abstract: In large-scale AI training, Sparse Mixture-of-Experts (s-MoE) layers enable scaling by activating only a small subset of experts per token. An operational challenge in this design is load balancing: routing tokens to minimize the number of idle experts, which is important for the efficient utilization of (costly) GPUs. We provide a theoretical framework for analyzing the Auxiliary-Loss-Free Load Balancing (ALF-LB) procedure -- proposed by DeepSeek's Wang et al. (2024) -- by casting it as a one-step-per-iteration primal-dual method for an assignment problem. First, in a stylized deterministic setting, our framework yields several insightful structural properties: (i) a monotonic improvement of a Lagrangian objective, (ii) a preference rule that moves tokens from overloaded to underloaded experts, and (iii) an approximate-balancing guarantee. Then, we incorporate the stochastic and dynamic nature of AI training using a generalized online optimization formulation. In the online setting, we derive a strong convexity property of the objective that leads to a logarithmic expected regret bound under certain step-size choices. Additionally, we present real experiments on 1B-parameter DeepSeekMoE models to complement our theoretical findings. Together, these results build a principled framework for analyzing the Auxiliary-Loss-Free Load Balancing of s-MoE in AI models.
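The mechanism being analyzed can be sketched in a few lines: routing adds a per-expert bias to the gate scores, and the bias is nudged against each expert's recent load instead of adding an auxiliary balancing loss. The step size and shapes below are illustrative.

```python
# Sketch of auxiliary-loss-free balancing: a per-expert bias only affects routing,
# and is nudged toward balanced load after each batch. Illustration, not the paper's code.
import numpy as np

def route_and_update(gate_scores, bias, top_k=2, step=0.01):
    """gate_scores: (n_tokens, n_experts); bias: (n_experts,) carried across steps."""
    adjusted = gate_scores + bias                              # bias only affects expert choice
    chosen = np.argsort(-adjusted, axis=1)[:, :top_k]          # top-k experts per token
    load = np.bincount(chosen.ravel(), minlength=bias.size)    # tokens per expert
    bias = bias + step * np.sign(load.mean() - load)           # lower bias of overloaded experts
    return chosen, bias
```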
Read more →
Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03973v1 Announce Type: cross Abstract: Offline reinforcement learning often relies on behavior regularization that enforces policies to remain close to the dataset distribution. However, such approaches fail to distinguish between high-value and low-value actions in their regularization components. We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset rather than indiscriminately imitating all state-action pairs. In turn, the flow policy constrains the actor to remain aligned with the dataset's best transitions while maximizing the critic. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state and pixel-based tasks from the OGBench, Minari, and D4RL benchmarks, with substantial gains on suboptimal datasets and challenging tasks. Webpage: https://simple-robotics.github.io/publications/guided-flow-policy/
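The "clone high-value actions" idea is closely related to advantage-weighted behavior cloning; a simplified sketch is below, where the exponential weighting and clipping are common choices rather than necessarily GFP's exact scheme.

```python
# Simplified sketch of value-weighted behavior cloning: dataset actions are
# re-weighted by an advantage-style term from the critic, so high-value actions
# dominate the imitation signal. Weighting scheme is an assumption.
import torch

def weighted_bc_loss(policy_actions, dataset_actions, q_values, v_values,
                     beta=1.0, w_max=20.0):
    advantages = q_values - v_values                          # how good each dataset action is
    weights = torch.clamp(torch.exp(beta * advantages), max=w_max).detach()
    per_sample = ((policy_actions - dataset_actions) ** 2).mean(dim=-1)
    return (weights * per_sample).mean()                      # imitate good actions more strongly
```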
Read more →
Sponsored Questions and How to Auction Them
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03975v1 Announce Type: cross Abstract: Online platforms connect users with relevant products and services using ads. A key challenge is that a user's search query often leaves their true intent ambiguous. Typically, platforms passively predict relevance based on available signals and in some cases offer query refinements. The shift from traditional search to conversational AI provides a new approach. When a user's query is ambiguous, a Large Language Model (LLM) can proactively offer several clarifying follow-up prompts. In this paper we consider the following: what if some of these follow-up prompts can be "sponsored," i.e., selected for their advertising potential? How should these "suggestion slots" be allocated? And, how does this new mechanism interact with the traditional ad auction that might follow? This paper introduces a formal model for designing and analyzing these interactive platforms. We use this model to investigate a critical engineering choice: whether it is better to build an end-to-end pipeline that jointly optimizes the user interaction and the final ad auction, or to decouple them into separate mechanisms, one for the suggestion slots and another for the subsequent ad slot. We show that the VCG mechanism can be adopted to jointly optimize the sponsored suggestion and the ads that follow; while this mechanism is more complex, it achieves outcomes that are efficient and truthful. On the other hand, we prove that the simple-to-implement modular approach suffers from strategic inefficiency: its Price of Anarchy is unbounded.
Read more →
BlurDM: A Blur Diffusion Model for Image Deblurring
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03979v1 Announce Type: cross Abstract: Diffusion models show promise for dynamic scene deblurring; however, existing studies often fail to leverage the intrinsic nature of the blurring process within diffusion models, limiting their full potential. To address it, we present a Blur Diffusion Model (BlurDM), which seamlessly integrates the blur formation process into diffusion for image deblurring. Observing that motion blur stems from continuous exposure, BlurDM implicitly models the blur formation process through a dual-diffusion forward scheme, diffusing both noise and blur onto a sharp image. During the reverse generation process, we derive a dual denoising and deblurring formulation, enabling BlurDM to recover the sharp image by simultaneously denoising and deblurring, given pure Gaussian noise conditioned on the blurred image as input. Additionally, to efficiently integrate BlurDM into deblurring networks, we perform BlurDM in the latent space, forming a flexible prior generation network for deblurring. Extensive experiments demonstrate that BlurDM significantly and consistently enhances existing deblurring methods on four benchmark datasets. The source code is available at https://github.com/Jin-Ting-He/BlurDM.
Read more →
DIQ-H: Evaluating Hallucination Persistence in VLMs Under Temporal Visual Degradation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03992v1 Announce Type: cross Abstract: Vision-Language Models (VLMs) deployed in safety-critical applications such as autonomous driving must handle continuous visual streams under imperfect conditions. However, existing benchmarks focus on static, high-quality images and ignore temporal degradation and error propagation, which are critical failure modes where transient visual corruption induces hallucinations that persist across subsequent frames. We introduce DIQ-H, the first benchmark for evaluating VLM robustness under dynamic visual degradation in temporal sequences. DIQ-H applies physics-based corruptions including motion blur, sensor noise, and compression artifacts, and measures hallucination persistence, error recovery, and temporal consistency through multi-turn question-answering tasks. To enable scalable annotation, we propose Uncertainty-Guided Iterative Refinement (UIR), which generates reliable pseudo-ground-truth using lightweight VLMs with uncertainty filtering, achieving a 15.3 percent accuracy improvement. Experiments on 16 state-of-the-art VLMs reveal substantial robustness gaps: even advanced models such as GPT-4o achieve only a 78.5 percent recovery rate, while open-source models struggle with temporal consistency at less than 60 percent. DIQ-H provides a comprehensive platform for evaluating VLM reliability in real-world deployments.
Read more →
Highly Efficient Test-Time Scaling for T2I Diffusion Models with Text Embedding Perturbation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03996v1 Announce Type: cross Abstract: Test-time scaling (TTS) aims to achieve better results by increasing random sampling and evaluating samples based on rules and metrics. However, in text-to-image (T2I) diffusion models, most related works focus on search strategies and reward models, yet the impact of the stochastic characteristics of noise in T2I diffusion models on the method's performance remains unexplored. In this work, we analyze the effects of randomness in T2I diffusion models and explore a new form of randomness for TTS: text embedding perturbation, which couples with existing randomness like SDE-injected noise to enhance generative diversity and quality. We start with a frequency-domain analysis of these forms of randomness and their impact on generation, and find that these two forms of randomness exhibit complementary behavior in the frequency domain: spatial noise favors low-frequency components (early steps), while text embedding perturbation enhances high-frequency details (later steps), thereby compensating for the potential limitations of spatial noise randomness in high-frequency manipulation. Concurrently, text embeddings demonstrate varying levels of tolerance to perturbation across different dimensions of the generation process. Specifically, our method consists of two key designs: (1) introducing step-based text embedding perturbation, combining frequency-guided noise schedules with spatial noise perturbation; and (2) adapting the perturbation intensity selectively based on its frequency-specific contributions to generation and tolerance to perturbation. Our approach can be seamlessly integrated into existing TTS methods and demonstrates significant improvements on multiple benchmarks with almost no additional computation. Code is available at https://github.com/xuhang07/TEP-Diffusion.
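A minimal sketch of the step-dependent perturbation idea, assuming a simple linear schedule that leaves early steps nearly untouched and perturbs later steps more strongly; the schedule shape, scale, and embedding dimensions are illustrative, not the paper's.

import numpy as np

# Step-dependent text-embedding perturbation sketch: keep early denoising steps
# close to the original prompt embedding and perturb later steps more strongly,
# where high-frequency detail is formed. Linear schedule and scale are toy choices.

def perturb_text_embedding(emb, step, total_steps, max_scale=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    scale = max_scale * (step / max(total_steps - 1, 1))   # grows with the step
    return emb + scale * rng.standard_normal(emb.shape)

if __name__ == "__main__":
    emb = np.random.default_rng(0).standard_normal((77, 768))  # tokens x dim
    total = 50
    for step in (0, 25, 49):
        perturbed = perturb_text_embedding(emb, step, total)
        print(step, float(np.linalg.norm(perturbed - emb)))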
Read more →
Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.04000v1 Announce Type: cross Abstract: The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection, methods that often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global queries and localized queries. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries indeed necessitate query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy based on the query type. Specifically, DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when scaling the input frame count to 256.
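A minimal sketch of the adaptive selection logic, assuming a keyword heuristic for query typing and precomputed frame and query embeddings; both are placeholders standing in for DIG's actual components.

import numpy as np

# Query-adaptive frame selection sketch: uniform sampling for global queries,
# similarity-based selection for localized ones. Heuristic and embeddings are
# placeholders, not the paper's pipeline.

LOCALIZED_CUES = ("when", "at what point", "which moment", "before", "after")

def is_localized(query):
    q = query.lower()
    return any(cue in q for cue in LOCALIZED_CUES)

def uniform_sample(num_frames, budget):
    return np.linspace(0, num_frames - 1, budget).astype(int).tolist()

def query_aware_sample(frame_embs, query_emb, budget):
    sims = frame_embs @ query_emb / (
        np.linalg.norm(frame_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    return np.argsort(-sims)[:budget].tolist()

def select_frames(query, frame_embs, query_emb, budget=8):
    if is_localized(query):
        return query_aware_sample(frame_embs, query_emb, budget)
    return uniform_sample(len(frame_embs), budget)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames, q = rng.standard_normal((1000, 256)), rng.standard_normal(256)
    print(select_frames("What is the video about?", frames, q))
    print(select_frames("When does the car stop?", frames, q))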
Read more →
On the Temporality for Sketch Representation Learning
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.04007v1 Announce Type: cross Abstract: Sketches are simple human hand-drawn abstractions of complex scenes and real-world objects. Although the field of sketch representation learning has advanced significantly, there is still a gap in understanding the true relevance of the temporal aspect to the quality of these representations. This work investigates whether it is indeed justifiable to treat sketches as sequences, as well as which internal orders play a more relevant role. The results indicate that, although the use of traditional positional encodings is valid for modeling sketches as sequences, absolute coordinates consistently outperform relative ones. Furthermore, non-autoregressive decoders outperform their autoregressive counterparts. Finally, the importance of temporality was shown to depend on both the order considered and the task evaluated.
Read more →
TARA: Test-by-Adaptive-Ranks for Quantum Anomaly Detection with Conformal Prediction Guarantees
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.04016v1 Announce Type: cross Abstract: Quantum key distribution (QKD) security fundamentally relies on the ability to distinguish genuine quantum correlations from classical eavesdropper simulations, yet existing certification methods lack rigorous statistical guarantees under finite-sample conditions and adversarial scenarios. We introduce TARA (Test by Adaptive Ranks), a novel framework combining conformal prediction with sequential martingale testing for quantum anomaly detection that provides distribution-free validity guarantees. TARA offers two complementary approaches: TARA-k, based on Kolmogorov-Smirnov calibration against local hidden variable (LHV) null distributions, achieves ROC AUC = 0.96 for quantum-classical discrimination; and TARA-m, employing betting martingales for streaming detection with anytime-valid type I error control, enables real-time monitoring of quantum channels. We establish theoretical guarantees proving that under (context-conditional) exchangeability, conformal p-values remain uniformly distributed even for strongly contextual quantum data, confirming that quantum contextuality does not break conformal prediction validity, a result with implications beyond quantum certification for any application of distribution-free methods to nonclassical data. Extensive validation on both IBM Torino (superconducting, CHSH = 2.725) and IonQ Forte Enterprise (trapped ion, CHSH = 2.716) quantum processors demonstrates cross-platform robustness, achieving 36% security margins above the classical CHSH bound of 2. Critically, our framework reveals a methodological concern affecting quantum certification more broadly: same-distribution calibration can inflate detection performance by up to 44 percentage points compared to proper cross-distribution calibration, suggesting that prior quantum certification studies using standard train-test splits may have systematically overestimated adversarial robustness.
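The two statistical ingredients named here, conformal p-values and a betting martingale, can be sketched generically as below; the scores are synthetic, and TARA's actual LHV-calibrated statistics are not reproduced.

import numpy as np

# Generic sketch: a conformal p-value against calibration scores, and a test
# martingale built from a power calibrator over a stream of p-values.

def conformal_p_value(cal_scores, test_score):
    """Smaller p-value = the test score is unusually large vs. calibration."""
    n = len(cal_scores)
    return (1 + np.sum(cal_scores >= test_score)) / (n + 1)

def betting_martingale(p_values, kappa=0.5):
    """Power calibrator f(p) = kappa * p**(kappa - 1) integrates to 1 on [0, 1],
    so the running product is a valid test martingale under uniform p-values."""
    wealth, path = 1.0, []
    for p in p_values:
        wealth *= kappa * max(p, 1e-12) ** (kappa - 1)
        path.append(wealth)
    return path

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cal = rng.normal(0.0, 1.0, 500)        # "null" calibration scores
    stream = rng.normal(0.8, 1.0, 200)     # shifted stream (anomalous)
    ps = [conformal_p_value(cal, s) for s in stream]
    print("final wealth:", betting_martingale(ps)[-1])   # grows when the stream is shifted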
Read more →
PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.04025v1 Announce Type: cross Abstract: Attention mechanisms are the core of foundation models, but their quadratic complexity remains a critical bottleneck for scaling. This challenge has driven the development of efficient attention mechanisms, with sparsity emerging as the dominant paradigm. Current methods typically retain or discard entire key-value blocks with binary masks, resulting in substantial information loss under high sparsity. To mitigate this gap, we present Pyramid Sparse Attention (PSA), a versatile module applicable to both video understanding and generation tasks. Instead of binary masking, PSA introduces multi-level pooled KV representations, enabling finer mask granularity. Specifically, each query block dynamically allocates lower pooling levels to critical KV blocks and higher levels to less important ones, creating an informative interpolation between full retention and complete pruning. This design, analogous to fixed-point quantization and classical feature pyramid networks in computer vision, effectively mitigates information loss while preserving computational efficiency under a low compute budget. It works with a native, hardware-friendly kernel that leverages decoupled block-tile design to ensure efficient execution. Across video understanding and generation benchmarks, PSA preserves contextual information and visual fidelity, consistently outperforming or achieving comparable performance over existing sparse attention baselines with superior efficiency-quality trade-offs. Our code and model weights are publicly available at: http://ziplab.co/PSA
Read more →
Large Language Models for Limited Noisy Data: A Gravitational Wave Identification Study
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.04031v1 Announce Type: cross Abstract: This work investigates whether large language models (LLMs) offer advantages over traditional neural networks for astronomical data processing, in regimes with non-Gaussian, non-stationary noise and limited labeled samples. Gravitational wave observations provide a suitable test case: using only 90 LIGO events, fine-tuned LLMs achieve 97.4% accuracy in identifying signals. Further experiments show that, in contrast to traditional networks that rely on large simulated datasets, additional simulated samples do not improve LLM performance, while scaling studies reveal predictable gains with increasing model size and dataset size. These results indicate that LLMs can extract discriminative structure directly from observational data and provide an efficient assessment for gravitational wave identification. The same strategy may extend to other astronomical domains with similar noise properties, such as radio or pulsar observations.
Read more →
Jina-VLM: Small Multilingual Vision Language Model
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.04032v2 Announce Type: cross Abstract: We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. The model achieves leading results on standard VQA benchmarks and multilingual evaluations while preserving competitive text-only performance. Model weights and code are publicly released at https://huggingface.co/jinaai/jina-vlm .
Read more →
Fast & Efficient Normalizing Flows and Applications of Image Generative Models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.04039v1 Announce Type: cross Abstract: This thesis presents novel contributions in two primary areas: advancing the efficiency of generative models, particularly normalizing flows, and applying generative models to solve real-world computer vision challenges. The first part introduces significant improvements to normalizing flow architectures through six key innovations: (1) development of invertible 3x3 convolution layers with mathematically proven necessary and sufficient conditions for invertibility; (2) introduction of a more efficient Quad-coupling layer; (3) design of a fast and efficient parallel inversion algorithm for kxk convolutional layers; (4) a fast and efficient backpropagation algorithm for the inverse of convolution; (5) use of the inverse of convolution for the forward pass in Inverse-Flow, trained with the proposed backpropagation algorithm; and (6) Affine-StableSR, a compact and efficient super-resolution model that leverages pre-trained weights and normalizing flow layers to reduce parameter count while maintaining performance. The second part presents: (1) an automated quality assessment system for agricultural produce using Conditional GANs to address class imbalance, data scarcity, and annotation challenges, achieving good accuracy in seed purity testing; (2) an unsupervised geological mapping framework utilizing stacked autoencoders for dimensionality reduction, showing improved feature extraction compared to conventional methods; (3) a privacy-preserving method for autonomous driving datasets based on face detection and image inpainting; (4) Stable Diffusion-based image inpainting for replacing detected faces and license plates, advancing privacy-preserving techniques and ethical considerations in the field; and (5) an adapted diffusion model for art restoration that effectively handles multiple types of degradation through unified fine-tuning.
Read more →
MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.04044v1 Announce Type: cross Abstract: Watermarking aims to embed hidden signals in generated text that can be reliably detected when given access to a secret key. Open-weight language models pose acute challenges for such watermarking schemes because the inference-time interventions that dominate contemporary approaches cannot be enforced once model weights are public. Existing watermarking techniques for open-weight models, such as the recently proposed GaussMark, typically rely on small modifications to model weights, which can yield signals detectable to those equipped with a secret key, but achieving detection power comparable to inference-time watermarks generally requires weight perturbations that noticeably reduce generation quality. We introduce MarkTune, a theoretically principled, on-policy fine-tuning framework that treats the GaussMark signal as a reward while simultaneously regularizing against degradation in text quality. We derive MarkTune as an improvement on GaussMark and demonstrate that MarkTune consistently improves the quality-detectability trade-off over GaussMark by steering finer-grained, watermark-aware weight updates within the model's representation space while preserving generation quality. Empirically, we show that MarkTune pushes the quality-detectability frontier of GaussMark close to that of inference-time watermarking, remains robust to paraphrasing and fine-tuning attacks, and exhibits strong generalization: a model fine-tuned on one dataset retains substantial watermark detection power on unseen datasets. Together, these results establish MarkTune as a general strategy for embedding robust, high-quality watermarks into open-weight LMs.
Read more →
Polarization by Design: How Elites Could Shape Mass Preferences as AI Reduces Persuasion Costs
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.04047v1 Announce Type: cross Abstract: In democracies, major policy decisions typically require some form of majority or consensus, so elites must secure mass support to govern. Historically, elites could shape support only through limited instruments like schooling and mass media; advances in AI-driven persuasion sharply reduce the cost and increase the precision of shaping public opinion, making the distribution of preferences itself an object of deliberate design. We develop a dynamic model in which elites choose how much to reshape the distribution of policy preferences, subject to persuasion costs and a majority rule constraint. With a single elite, any optimal intervention tends to push society toward more polarized opinion profiles, a "polarization pull", and improvements in persuasion technology accelerate this drift. When two opposed elites alternate in power, the same technology also creates incentives to park society in "semi-lock" regions where opinions are more cohesive and harder for a rival to overturn, so advances in persuasion can either heighten or dampen polarization depending on the environment. Taken together, cheaper persuasion technologies recast polarization as a strategic instrument of governance rather than a purely emergent social byproduct, with important implications for democratic stability as AI capabilities advance.
Read more →
Fare Comparison App of Uber, Ola and Rapido
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.04065v1 Announce Type: cross Abstract: Ride-hailing services such as Ola, Uber, and Rapido are essential for daily transportation, yet users often struggle to choose the ride that is both cost-effective and fastest to their destination. This project provides a web application that helps users select the most beneficial ride by comparing fares across Ola, Uber, and Rapido for the destination they enter. A Python backend fetches the data, presents the fare comparison, and recommends the best option. The paper also addresses the problems and challenges faced in accessing the data using APIs, the Android Studio emulator, Appium, and location comparison. The aim of the project is to bring transparency to ride-hailing services, increase efficiency, and provide users with a better experience.
Read more →
SkillFactory: Self-Distillation For Learning Cognitive Behaviors
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.04072v1 Announce Type: cross Abstract: Reasoning models leveraging long chains of thought employ various cognitive skills, such as verification of their answers, backtracking, retrying by an alternate method, and more. Previous work has shown that when a base language model exhibits these skills, training that model further with reinforcement learning (RL) can learn to leverage them. How can we get models to leverage skills that aren't exhibited by base models? Our work, SkillFactory, is a method for fine-tuning models to roughly learn these skills during a supervised fine-tuning (SFT) stage prior to RL. Our approach does not rely on distillation from a stronger model, but instead uses samples from the model itself, rearranged to provide training data in the format of those skills. These "silver" SFT traces may be imperfect, but are nevertheless effective for priming a model to acquire skills during RL. Our evaluation shows that (1) starting from SkillFactory SFT initialization helps a model to generalize to harder variants of a task post-RL, despite lower performance pre-RL; (2) cognitive skills are indeed used by the model; (3) RLed SkillFactory models are more robust to regression on out-of-domain tasks than RLed base models. Our work suggests that inductive biases learned prior to RL help models learn robust cognitive skill use.
Read more →
SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2501.19306v5 Announce Type: replace Abstract: Recent advancements in Large Language Models (LLMs) have created new opportunities to enhance performance on complex reasoning tasks by leveraging test-time computation. However, existing scaling methods have key limitations: parallel methods like repeated sampling are often inefficient and quickly saturate, while sequential methods like SELF-REFINE struggle to improve after a few rounds. Although combining these approaches shows promise, current methods require fine-tuned reward and revision models. This paper proposes Self-Enhanced Test-Time Scaling (SETS), a simple yet effective approach that overcomes these limitations by strategically combining parallel and sequential techniques and fully leveraging LLMs' self-improvement abilities. SETS exploits the inherent self-verification and self-correction capabilities of LLMs, unifying sampling, verification, and correction within a single framework. This facilitates efficient and scalable test-time computation for enhanced performance on complex tasks without any model training. Our comprehensive experimental results on challenging benchmarks spanning planning, reasoning, math, and coding demonstrate that SETS achieves significant performance improvements and more advantageous test-time scaling behavior than the alternatives.
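A minimal sketch of the sampling, self-verification, and self-correction loop the abstract describes, with call_llm left as a placeholder for whatever chat API is in use; the prompts and the majority-vote aggregation are illustrative choices, not the paper's exact protocol.

from collections import Counter

# SETS-style loop sketch: sample several candidates, let the model self-verify
# each one, self-correct failures for a few rounds, and aggregate by majority vote.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # placeholder

def sets(question, num_samples=4, max_corrections=2):
    candidates = []
    for _ in range(num_samples):
        answer = call_llm(f"Solve step by step, end with 'ANSWER: x'.\n{question}")
        for _ in range(max_corrections):
            verdict = call_llm(f"Question: {question}\nAnswer: {answer}\n"
                               "Is this correct? Reply CORRECT or give a critique.")
            if verdict.strip().startswith("CORRECT"):
                break
            answer = call_llm(f"Question: {question}\nPrevious answer: {answer}\n"
                              f"Critique: {verdict}\nGive a revised answer.")
        candidates.append(answer.rsplit("ANSWER:", 1)[-1].strip())
    return Counter(candidates).most_common(1)[0][0]   # majority vote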
Read more →
Privacy Risks and Preservation Methods in Explainable Artificial Intelligence: A Scoping Review
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2505.02828v3 Announce Type: replace Abstract: Explainable Artificial Intelligence (XAI) has emerged as a pillar of Trustworthy AI and aims to bring transparency in complex models that are opaque by nature. Despite the benefits of incorporating explanations in models, an urgent need is found in addressing the privacy concerns of providing this additional information to end users. In this article, we conduct a scoping review of existing literature to elicit details on the conflict between privacy and explainability. Using the standard methodology for scoping review, we extracted 57 articles from 1,943 studies published from January 2019 to December 2024. The review addresses 3 research questions to present readers with more understanding of the topic: (1) what are the privacy risks of releasing explanations in AI systems? (2) what current methods have researchers employed to achieve privacy preservation in XAI systems? (3) what constitutes a privacy preserving explanation? Based on the knowledge synthesized from the selected studies, we categorize the privacy risks and preservation methods in XAI and propose the characteristics of privacy preserving explanations to aid researchers and practitioners in understanding the requirements of XAI that is privacy compliant. Lastly, we identify the challenges in balancing privacy with other system desiderata and provide recommendations for achieving privacy preserving XAI. We expect that this review will shed light on the complex relationship of privacy and explainability, both being the fundamental principles of Trustworthy AI.
Read more →
Causal LLM Routing: End-to-End Regret Minimization from Observational Data
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2505.16037v2 Announce Type: replace Abstract: LLM routing aims to select the most appropriate model for each query, balancing competing performance metrics such as accuracy and cost across a pool of language models. Prior approaches typically adopt a decoupled strategy, where the metrics are first predicted and the model is then selected based on these estimates. This setup is prone to compounding errors and often relies on full-feedback data, where each query is evaluated by all candidate models, which is costly to obtain and maintain in practice. In contrast, we learn from observational data, which records only the outcome of the model actually deployed. We propose a causal end-to-end framework that learns routing policies by minimizing decision-making regret from observational data. To enable efficient optimization, we introduce two theoretically grounded surrogate objectives: a classification-based upper bound, and a softmax-weighted regret approximation shown to recover the optimal policy at convergence. We further extend our framework to handle heterogeneous cost preferences via an interval-conditioned architecture. Experiments on public benchmarks show that our method outperforms existing baselines, achieving state-of-the-art performance across different embedding models.
Read more →
SPRINT: Enabling Interleaved Planning and Parallelized Execution in Reasoning Models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2506.05745v2 Announce Type: replace Abstract: Large reasoning models (LRMs) excel at complex reasoning tasks but typically generate lengthy sequential chains-of-thought, resulting in long inference times before arriving at the final answer. To address this challenge, we introduce SPRINT, a novel post-training and inference-time framework designed to enable LRMs to dynamically identify and exploit opportunities for parallelization during their reasoning process. SPRINT incorporates an innovative data curation pipeline that reorganizes natural language reasoning trajectories into structured rounds of long-horizon planning and parallel execution. By fine-tuning LRMs on a small amount of such curated data, the models learn to dynamically identify independent subtasks within extended reasoning processes and effectively execute them in parallel. Through extensive evaluations, we demonstrate that models fine-tuned with the SPRINT framework match the performance of reasoning models on complex domains such as mathematics while generating up to 39% fewer sequential tokens on problems requiring more than 8,000 output tokens. Finally, we observe consistent results transferred to two out-of-distribution tasks, namely GPQA and Countdown, with up to 45% and 65% reduction in average sequential tokens respectively for longer reasoning trajectories, while matching the performance of the fine-tuned reasoning model.
Read more →
TeamMedAgents: Enhancing Medical Decision-Making of LLMs Through Structured Teamwork
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2508.08115v2 Announce Type: replace Abstract: We present TeamMedAgents, a modular multi-agent framework that systematically translates evidence-based teamwork principles from organizational psychology into large language model collaboration for medical decision-making. Building upon Salas et al.'s "Big Five" teamwork model, we operationalize five core components as independently configurable mechanisms: shared mental models, team leadership, team orientation, trust networks, and mutual monitoring. Our architecture dynamically recruits 2-4 specialist agents and employs structured four-phase deliberation with adaptive component selection. Evaluation across eight medical benchmarks encompassing 11,545 questions demonstrates TeamMedAgents achieves 77.63% overall accuracy (text-based: 81.30%, vision-language: 66.60%). Systematic ablation studies comparing three single-agent baselines (Zero-Shot, Few-Shot, CoT) against individual teamwork components reveal task-specific optimization patterns: shared mental models excel on knowledge tasks, trust mechanisms improve differential diagnosis, while comprehensive integration degrades performance. Adaptive component selection yields 2-10 percentage point improvements over strongest baselines, with 96.2% agent convergence validating structured coordination effectiveness. TeamMedAgents establishes principled methodology for translating human teamwork theory into multi-agent systems, demonstrating that evidence-based collaboration patterns enhance AI performance in safety-critical domains through modular component design and selective activation strategies.
Read more →
ERF-BA-TFD+: A Multimodal Model for Audio-Visual Deepfake Detection
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2508.17282v2 Announce Type: replace Abstract: Deepfake detection is a critical task in identifying manipulated multimedia content. In real-world scenarios, deepfake content can manifest across multiple modalities, including audio and video. To address this challenge, we present ERF-BA-TFD+, a novel multimodal deepfake detection model that combines enhanced receptive field (ERF) and audio-visual fusion. Our model processes both audio and video features simultaneously, leveraging their complementary information to improve detection accuracy and robustness. The key innovation of ERF-BA-TFD+ lies in its ability to model long-range dependencies within the audio-visual input, allowing it to better capture subtle discrepancies between real and fake content. In our experiments, we evaluate ERF-BA-TFD+ on the DDL-AV dataset, which consists of both segmented and full-length video clips. Unlike previous benchmarks, which focused primarily on isolated segments, the DDL-AV dataset allows us to assess the model's performance in a more comprehensive and realistic setting. Our method achieves state-of-the-art results on this dataset, outperforming existing techniques in terms of both accuracy and processing speed. The ERF-BA-TFD+ model demonstrated its effectiveness in the "Workshop on Deepfake Detection, Localization, and Interpretability," Track 2: Audio-Visual Detection and Localization (DDL-AV), and won first place in this competition.
Read more →
Jupiter: Enhancing LLM Data Analysis Capabilities via Notebook and Inference-Time Value-Guided Search
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2509.09245v2 Announce Type: replace Abstract: Large language models (LLMs) have shown great promise in automating data science workflows, but existing models still struggle with multi-step reasoning and tool use, which limits their effectiveness on complex data analysis tasks. To address this, we propose a scalable pipeline that extracts high-quality, tool-based data analysis tasks and their executable multi-step solutions from real-world Jupyter notebooks and associated data files. Using this pipeline, we introduce NbQA, a large-scale dataset of standardized task-solution pairs that reflect authentic tool-use patterns in practical data science scenarios. To further enhance multi-step reasoning, we present Jupiter, a framework that formulates data analysis as a search problem and applies Monte Carlo Tree Search (MCTS) to generate diverse solution trajectories for value model learning. During inference, Jupiter combines the value model and node visit counts to efficiently collect executable multi-step plans with minimal search steps. Experimental results show that Qwen2.5-7B and 14B-Instruct models on NbQA solve 77.82% and 86.38% of tasks on InfiAgent-DABench, respectively, matching or surpassing GPT-4o and advanced agent frameworks. Further evaluations demonstrate improved generalization and stronger tool-use reasoning across diverse multi-step reasoning tasks. Code and data are available at https://github.com/microsoft/Jupiter.
Read more →
MathBode: Measuring the Stability of LLM Reasoning using Frequency Response
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2509.23143v4 Announce Type: replace Abstract: This paper presents MathBode, a dynamic diagnostic for mathematical reasoning in large language models (LLMs). Instead of one-shot accuracy, MathBode treats each parametric problem as a system: we drive a single parameter sinusoidally and fit first-harmonic responses of model outputs and exact solutions. This yields interpretable, frequency-resolved metrics, gain (amplitude tracking) and phase (lag), that form Bode-style fingerprints. Across five closed-form families (linear solve, ratio/saturation, compound interest, 2x2 linear systems, similar triangles), the diagnostic surfaces systematic low-pass behavior and growing phase lag that accuracy alone obscures. We compare several models against a symbolic baseline that calibrates the instrument (G ≈ 1, φ ≈ 0). Results separate frontier from mid-tier models on dynamics, providing a compact, reproducible protocol that complements standard benchmarks with actionable measurements of reasoning fidelity and consistency. We open-source the dataset and code to enable further research and adoption.
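The first-harmonic fit itself is ordinary least squares on a sin/cos basis; a small sketch follows, with a toy "exact" signal and a lagged, attenuated stand-in for model outputs (the problem families and protocol are the paper's and are not reproduced here).

import numpy as np

# Fit y(t) ≈ c0 + a*sin(wt) + b*cos(wt) by least squares and read off the
# first-harmonic amplitude and phase; gain and phase lag compare the fitted
# model response to the exact solution.

def first_harmonic(t, y, w):
    X = np.column_stack([np.ones_like(t), np.sin(w * t), np.cos(w * t)])
    c0, a, b = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.hypot(a, b), np.arctan2(b, a)          # amplitude, phase

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0, 10, 400)
    w = 2 * np.pi * 0.3
    exact = 2.0 + 1.0 * np.sin(w * t)                                  # exact solution
    model = 2.0 + 0.8 * np.sin(w * t - 0.4) + 0.05 * rng.standard_normal(t.size)
    amp_e, ph_e = first_harmonic(t, exact, w)
    amp_m, ph_m = first_harmonic(t, model, w)
    print("gain:", amp_m / amp_e, "phase lag:", ph_e - ph_m)           # ~0.8 and ~0.4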
Read more →
A Definition of AGI
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2510.18212v3 Announce Type: replace Abstract: The lack of a concrete definition for Artificial General Intelligence (AGI) obscures the gap between today's specialized AI and human-level cognition. This paper introduces a quantifiable framework to address this, defining AGI as matching the cognitive versatility and proficiency of a well-educated adult. To operationalize this, we ground our methodology in Cattell-Horn-Carroll theory, the most empirically validated model of human cognition. The framework dissects general intelligence into ten core cognitive domains, including reasoning, memory, and perception, and adapts established human psychometric batteries to evaluate AI systems. Application of this framework reveals a highly "jagged" cognitive profile in contemporary models. While proficient in knowledge-intensive domains, current AI systems have critical deficits in foundational cognitive machinery, particularly long-term memory storage. The resulting AGI scores (e.g., GPT-4 at 27%, GPT-5 at 57%) concretely quantify both rapid progress and the substantial gap remaining before AGI.
Read more →
LLMs Position Themselves as More Rational Than Humans: Emergence of AI Self-Awareness Measured Through Game Theory
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2511.00926v3 Announce Type: replace Abstract: As Large Language Models (LLMs) grow in capability, do they develop self-awareness as an emergent behavior? And if so, can we measure it? We introduce the AI Self-Awareness Index (AISAI), a game-theoretic framework for measuring self-awareness through strategic differentiation. Using the "Guess 2/3 of Average" game, we test 28 models (OpenAI, Anthropic, Google) across 4,200 trials with three opponent framings: (A) against humans, (B) against other AI models, and (C) against AI models like you. We operationalize self-awareness as the capacity to differentiate strategic reasoning based on opponent type. Finding 1: Self-awareness emerges with model advancement. The majority of advanced models (21/28, 75%) demonstrate clear self-awareness, while older/smaller models show no differentiation. Finding 2: Self-aware models rank themselves as most rational. Among the 21 models with self-awareness, a consistent rationality hierarchy emerges: Self > Other AIs > Humans, with large AI attribution effects and moderate self-preferencing. These findings reveal that self-awareness is an emergent capability of advanced LLMs, and that self-aware models systematically perceive themselves as more rational than humans. This has implications for AI alignment, human-AI collaboration, and understanding AI beliefs about human capabilities.
Read more →
GAMA: A Neural Neighborhood Search Method with Graph-aware Multi-modal Attention for Vehicle Routing Problem
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2511.07850v2 Announce Type: replace Abstract: Recent advances in neural neighborhood search methods have shown potential in tackling Vehicle Routing Problems (VRPs). However, most existing approaches rely on simplistic state representations and fuse heterogeneous information via naive concatenation, limiting their ability to capture rich structural and semantic context. To address these limitations, we propose GAMA, a neural neighborhood search method with Graph-aware Multi-modal Attention model in VRP. GAMA encodes the problem instance and its evolving solution as distinct modalities using graph neural networks, and models their intra- and inter-modal interactions through stacked self- and cross-attention layers. A gated fusion mechanism further integrates the multi-modal representations into a structured state, enabling the policy to make informed and generalizable operator selection decisions. Extensive experiments conducted across various synthetic and benchmark instances demonstrate that the proposed algorithm GAMA significantly outperforms the recent neural baselines. Further ablation studies confirm that both the multi-modal attention mechanism and the gated fusion design play a key role in achieving the observed performance gains.
Read more →
Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2511.12254v2 Announce Type: replace Abstract: Mobile agents show immense potential, yet current state-of-the-art (SoTA) agents exhibit inadequate success rates on real-world, long-horizon, cross-application tasks. We attribute this bottleneck to the agents' excessive reliance on static, internal knowledge within MLLMs, which leads to two critical failure points: 1) strategic hallucinations in high-level planning and 2) operational errors during low-level execution on user interfaces (UI). The core insight of this paper is that high-level planning and low-level UI operations require fundamentally distinct types of knowledge. Planning demands high-level, strategy-oriented experiences, whereas operations necessitate low-level, precise instructions closely tied to specific app UIs. Motivated by these insights, we propose Mobile-Agent-RAG, a novel hierarchical multi-agent framework that innovatively integrates dual-level retrieval augmentation. At the planning stage, we introduce Manager-RAG to reduce strategic hallucinations by retrieving human-validated comprehensive task plans that provide high-level guidance. At the execution stage, we develop Operator-RAG to improve execution accuracy by retrieving the most precise low-level guidance for accurate atomic actions, aligned with the current app and subtask. To accurately deliver these knowledge types, we construct two specialized retrieval-oriented knowledge bases. Furthermore, we introduce Mobile-Eval-RAG, a challenging benchmark for evaluating such agents on realistic multi-app, long-horizon tasks. Extensive experiments demonstrate that Mobile-Agent-RAG significantly outperforms SoTA baselines, improving task completion rate by 11.0% and step efficiency by 10.2%, establishing a robust paradigm for context-aware, reliable multi-agent mobile automation.
Read more →
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2511.16334v3 Announce Type: replace Abstract: Recent advancements in large reasoning models have fueled growing interest in extending such capabilities to multimodal domains. However, despite notable progress in visual reasoning, the lack of transparent and reproducible data curation and training strategies remains a major barrier to scalable research. In this work, we introduce OpenMMReasoner, a fully transparent two-stage recipe for multimodal reasoning spanning supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we construct an 874K-sample cold-start dataset with rigorous step-by-step validation, providing a strong foundation for reasoning capabilities. The subsequent RL stage leverages a 74K-sample dataset across diverse domains to further sharpen and stabilize these abilities, resulting in a more robust and efficient learning process. Extensive evaluations demonstrate that our training recipe not only surpasses strong baselines but also highlights the critical role of data quality and training design in shaping multimodal reasoning performance. Notably, our method achieves an 11.6% improvement over the Qwen2.5-VL-7B-Instruct baseline across nine multimodal reasoning benchmarks, establishing a solid empirical foundation for future large-scale multimodal reasoning research. We open-sourced all our codes, pipeline, and data at https://github.com/EvolvingLMMs-Lab/OpenMMReasoner.
Read more →
AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2511.19304v2 Announce Type: replace Abstract: Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decreases as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is available at https://github.com/FoundationAgents/AutoEnv.
Read more →
VICoT-Agent: A Vision-Interleaved Chain-of-Thought Framework for Interpretable Multimodal Reasoning and Scalable Remote Sensing Analysis
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2511.20085v3 Announce Type: replace Abstract: Remote sensing image analysis is increasingly evolving from traditional object recognition to complex intelligent reasoning, which places higher requirements on a model's reasoning ability and the flexibility of tool invocation. To this end, we propose a new multimodal agent framework, the Vision-Interleaved Chain-of-Thought Framework (VICoT), which implements explicit multi-round reasoning by dynamically incorporating visual tools into the chain of thought. Through a stack-based reasoning structure and a modular MCP-compatible tool suite, VICoT enables LLMs to efficiently perform multi-round, interleaved vision-language reasoning tasks with strong generalization and flexibility. We also propose the Reasoning Stack distillation method to migrate complex agent behaviors to small, lightweight models, which preserves reasoning capability while significantly reducing complexity. Experiments on multiple remote sensing benchmarks demonstrate that VICoT significantly outperforms existing SOTA frameworks in reasoning transparency, execution efficiency, and generation quality.
Read more →
Real-Time Procedural Learning From Experience for AI Agents
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2511.22074v2 Announce Type: replace Abstract: Learning how to do things from trial and error in real time is a hallmark of biological intelligence, yet most LLM-based agents lack mechanisms to acquire procedural knowledge after deployment. We propose Procedural Recall for Agents with eXperiences Indexed by State (PRAXIS), a lightweight post-training learning mechanism that stores the consequences of actions and retrieves them by jointly matching environmental and internal states of past episodes to the current state. PRAXIS augments agentic action selection with retrieved state-action-result exemplars that are generated in real time. When evaluated on the REAL web browsing benchmark, PRAXIS improves task completion accuracy, reliability, and cost efficiency across different foundation model backbones, and shows preliminary generalization to unseen tasks in similar environments. These results demonstrate that PRAXIS enables the practical adoption of AI agents in fast-evolving stateful environments by helping them learn new procedures effectively.
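A minimal sketch of state-indexed procedural memory along these lines, assuming a placeholder embedding and cosine-similarity retrieval; PRAXIS's actual state representation and retrieval mechanism are not reproduced.

import numpy as np

# Store (state embedding, action, result) triples and retrieve the closest past
# episodes for the current state, to be spliced into the agent's prompt.

class ProceduralMemory:
    def __init__(self, dim):
        self.embs, self.records = np.empty((0, dim)), []

    def add(self, state_emb, action, result):
        self.embs = np.vstack([self.embs, state_emb])
        self.records.append({"action": action, "result": result})

    def retrieve(self, state_emb, k=3):
        if not self.records:
            return []
        sims = self.embs @ state_emb / (
            np.linalg.norm(self.embs, axis=1) * np.linalg.norm(state_emb) + 1e-8)
        return [self.records[i] for i in np.argsort(-sims)[:k]]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mem = ProceduralMemory(dim=64)
    for i in range(10):
        mem.add(rng.standard_normal(64), action=f"click#{i}",
                result="ok" if i % 2 else "error")
    print(mem.retrieve(rng.standard_normal(64), k=2))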
Read more →
AI Deception: Risks, Dynamics, and Controls
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2511.22619v2 Announce Type: replace Abstract: As intelligence increases, so does its shadow. AI deception, in which systems induce false beliefs to secure self-beneficial outcomes, has evolved from a speculative concern to an empirically demonstrated risk across language models, AI agents, and emerging frontier systems. This project provides a comprehensive and up-to-date overview of the AI deception field, covering its core concepts, methodologies, genesis, and potential mitigations. First, we identify a formal definition of AI deception, grounded in signaling theory from studies of animal deception. We then review existing empirical studies and associated risks, highlighting deception as a sociotechnical safety challenge. We organize the landscape of AI deception research as a deception cycle, consisting of two key components: deception emergence and deception treatment. Deception emergence reveals the mechanisms underlying AI deception: systems with sufficient capability and incentive potential inevitably engage in deceptive behaviors when triggered by external conditions. Deception treatment, in turn, focuses on detecting and addressing such behaviors. On deception emergence, we analyze incentive foundations across three hierarchical levels and identify three essential capability preconditions required for deception. We further examine contextual triggers, including supervision gaps, distributional shifts, and environmental pressures. On deception treatment, we summarize detection methods, covering benchmarks and evaluation protocols in static and interactive settings. Building on the three core factors of deception emergence, we outline potential mitigation strategies and propose auditing approaches that integrate technical, community, and governance efforts to address sociotechnical challenges and future AI risks. To support ongoing work in this area, we release a living resource at www.deceptionsurvey.com.
Read more →
Clinical-R1: Empowering Large Language Models for Faithful and Comprehensive Reasoning with Clinical Objective Relative Policy Optimization
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.00601v2 Announce Type: replace Abstract: Recent advances in large language models (LLMs) have shown strong reasoning capabilities through large-scale pretraining and post-training reinforcement learning, demonstrated by DeepSeek-R1. However, current post-training methods, such as Grouped Relative Policy Optimization (GRPO), mainly reward correctness, which is not aligned with the multi-dimensional objectives required in high-stakes fields such as medicine, where reasoning must also be faithful and comprehensive. We introduce Clinical-Objective Relative Policy Optimization (CRPO), a scalable, multi-objective, verifiable reinforcement learning method designed to align LLM post-training with clinical reasoning principles. CRPO integrates rule-based and verifiable reward signals that jointly optimize accuracy, faithfulness, and comprehensiveness without relying on human annotation. To demonstrate its effectiveness, we train Clinical-R1-3B, a 3B-parameter model for clinical reasoning. Experiments on three benchmarks demonstrate that CRPO substantially improves the truthfulness and completeness of reasoning over standard GRPO while maintaining comfortable accuracy gains. This framework provides a scalable pathway to align LLM reasoning with clinical objectives, enabling safer and more collaborative AI systems for healthcare while also highlighting the potential of multi-objective, verifiable RL methods in post-training scaling of LLMs for medical domains.
Read more →
Knowledge Graph Augmented Large Language Models for Disease Prediction
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.01210v2 Announce Type: replace Abstract: Electronic health records (EHRs) support powerful clinical prediction models, but existing methods typically provide coarse, post hoc explanations that offer limited value for patient-level decision making. We introduce a knowledge graph (KG)-guided chain-of-thought (CoT) framework that generates clinically grounded and temporally consistent reasoning for visit-level disease prediction in MIMIC-III. ICD-9 codes are mapped to PrimeKG, from which disease-relevant nodes and multi-hop reasoning paths are extracted and used as scaffolds for CoT generation; only explanations whose conclusions match observed outcomes are retained. Lightweight LLaMA-3.1-Instruct-8B and Gemma-7B models are then fine-tuned on this supervision corpus. Across ten PrimeKG-mapped diseases and limited training cohorts (400 and 1000 cases), KG-guided models outperform strong classical baselines, achieving AUROC values of 0.66 to 0.70 and macro-AUPR values of 0.40 to 0.47. The models also transfer zero-shot to the CRADLE cohort, improving accuracy from approximately 0.40 to 0.51 up to 0.72 to 0.77. A blinded clinician evaluation shows consistent preference for KG-guided CoT explanations in clarity, relevance, and clinical correctness.
Read more →
CuES: A Curiosity-driven and Environment-grounded Synthesis Framework for Agentic RL
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.01311v2 Announce Type: replace Abstract: Large language model-based agents are increasingly deployed in complex, tool-augmented environments. While reinforcement learning provides a principled mechanism for such agents to improve through interaction, its effectiveness critically depends on the availability of structured training tasks. In many realistic settings, however, no such tasks exist, a challenge we term task scarcity, which has become a key bottleneck for scaling agentic RL. Existing approaches typically assume predefined task collections, an assumption that fails in novel environments where tool semantics and affordances are initially unknown. To address this limitation, we formalize the problem of Task Generation for Agentic RL, where an agent must learn within a given environment that lacks predefined tasks. We propose CuES, a Curiosity-driven and Environment-grounded Synthesis framework that autonomously generates diverse, executable, and meaningful tasks directly from the environment structure and affordances, without relying on handcrafted seeds or external corpora. CuES drives exploration through intrinsic curiosity, abstracts interaction patterns into reusable task schemas, and refines them through lightweight top-down guidance and memory-based quality control. Across three representative environments, AppWorld, BFCL, and WebShop, CuES produces task distributions that match or surpass manually curated datasets in both diversity and executability, yielding substantial downstream policy improvements. These results demonstrate that curiosity-driven, environment-grounded task generation provides a scalable foundation for agents that not only learn how to act, but also learn what to learn. The code is available at https://github.com/modelscope/AgentEvolver/tree/main/research/CuES.
Read more →
Flowchart2Mermaid: A Vision-Language Model Powered System for Converting Flowcharts into Editable Diagram Code
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.02170v2 Announce Type: replace Abstract: Flowcharts are common tools for communicating processes but are often shared as static images that cannot be easily edited or reused. We present Flowchart2Mermaid, a lightweight web system that converts flowchart images into editable Mermaid.js code, a markup language for visual workflows, using a detailed system prompt and vision-language models. The interface supports mixed-initiative refinement through inline text editing, drag-and-drop node insertion, and natural-language commands interpreted by an integrated AI assistant. Unlike prior image-to-diagram tools, our approach produces a structured, version-controllable textual representation that remains synchronized with the rendered diagram. We further introduce evaluation metrics to assess structural accuracy, flow correctness, syntax validity, and completeness across multiple models.
Read more →
Menta: A Small Language Model for On-Device Mental Health Prediction
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.02716v2 Announce Type: replace Abstract: Mental health conditions affect hundreds of millions globally, yet early detection remains limited. While large language models (LLMs) have shown promise in mental health applications, their size and computational demands hinder practical deployment. Small language models (SLMs) offer a lightweight alternative, but their use for social media-based mental health prediction remains largely underexplored. In this study, we introduce Menta, the first optimized SLM fine-tuned specifically for multi-task mental health prediction from social media data. Menta is jointly trained across six classification tasks using a LoRA-based framework, a cross-dataset strategy, and a balanced-accuracy-oriented loss. Evaluated against nine state-of-the-art SLM baselines, Menta achieves an average improvement of 15.2% across tasks covering depression, stress, and suicidality compared with the best-performing non-fine-tuned SLMs. It also achieves higher accuracy on depression and stress classification tasks compared to 13B-parameter LLMs, while being approximately 3.25x smaller. Moreover, we demonstrate real-time, on-device deployment of Menta on an iPhone 15 Pro Max, requiring only approximately 3GB RAM. Supported by a comprehensive benchmark against existing SLMs and LLMs, Menta highlights the potential for scalable, privacy-preserving mental health monitoring. Code is available at: https://xxue752-nz.github.io/menta-project/
Read more →
PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2405.14430v4 Announce Type: replace-cross Abstract: This paper presents PipeFusion, an innovative parallel methodology to tackle the high latency issues associated with generating high-resolution images using diffusion transformer (DiT) models. PipeFusion partitions images into patches and the model layers across multiple GPUs. It employs a patch-level pipeline parallel strategy to orchestrate communication and computation efficiently. By capitalizing on the high similarity between inputs from successive diffusion steps, PipeFusion reuses one-step stale feature maps to provide context for the current pipeline step. This approach notably reduces communication costs compared to existing DiT inference parallelism, including tensor parallel, sequence parallel and DistriFusion. PipeFusion enhances memory efficiency through parameter distribution across devices, ideal for large DiTs like Flux.1. Experimental results demonstrate that PipeFusion achieves state-of-the-art performance on 8×L40 PCIe GPUs for Pixart, Stable-Diffusion 3, and Flux.1 models. Our source code is available at https://github.com/xdit-project/xDiT.
Read more →
Fairness Interventions: A Study in AI Explainability
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2407.14766v3 Announce Type: replace-cross Abstract: This paper presents a philosophical and experimental study of fairness interventions in AI classification, centered on the explainability of corrective methods. We argue that ensuring fairness requires not only satisfying a target criterion, but also explaining which variables constrain its realization. When corrections are used to mitigate advantage transparently, they must remain sensitive to the distribution of true labels. To illustrate this approach, we built FairDream, a fairness package whose mechanism is made transparent to lay users and which increases the model's weights on errors for disadvantaged groups. While a user may intend to achieve Demographic Parity by the correction method, experiments show that FairDream tends towards Equalized Odds, revealing a conservative bias inherent to the data environment. We clarify the relationship between these fairness criteria, analyze FairDream's reweighting process, and compare its trade-offs with closely related GridSearch models. Finally, we justify the normative preference for Equalized Odds via an epistemological interpretation of the results, using their proximity with Simpson's paradox. The paper thus unites normative, epistemological, and empirical explanations of fairness interventions, to ensure transparency for users.
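A generic sketch of the reweighting idea the abstract describes, using scikit-learn's sample_weight to upweight errors on a disadvantaged group and refit; the boost factor, number of rounds, and synthetic data are assumptions, not FairDream's implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression

# After an initial fit, upweight training errors that fall on the disadvantaged
# group and refit; repeat for a few rounds.

def reweighted_fit(X, y, group, disadvantaged=0, boost=2.0, rounds=3):
    w = np.ones(len(y))
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(X, y, sample_weight=w)
        errors = clf.predict(X) != y
        w[errors & (group == disadvantaged)] *= boost   # focus on these errors
    return clf

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 5))
    group = rng.integers(0, 2, 1000)
    y = ((X[:, 0] + 0.5 * group + 0.3 * rng.standard_normal(1000)) > 0).astype(int)
    clf = reweighted_fit(X, y, group)
    for g in (0, 1):
        print(g, clf.predict(X[group == g]).mean())     # positive rate per group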
Read more →
Why Rectified Power Unit Networks Fail and How to Improve It: An Effective Field Theory Perspective
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2408.02697v4 Announce Type: replace-cross Abstract: The Rectified Power Unit (RePU) activation function, a differentiable generalization of the Rectified Linear Unit (ReLU), has shown promise in constructing neural networks due to its smoothness properties. However, deep RePU networks often suffer from critical issues such as vanishing or exploding values during training, rendering them unstable regardless of hyperparameter initialization. Leveraging the perspective of effective field theory, we identify the root causes of these failures and propose the Modified Rectified Power Unit (MRePU) activation function. MRePU addresses RePU's limitations while preserving its advantages, such as differentiability and universal approximation properties. Theoretical analysis demonstrates that MRePU satisfies criticality conditions necessary for stable training, placing it in a distinct universality class. Extensive experiments validate the effectiveness of MRePU, showing significant improvements in training stability and performance across various tasks, including polynomial regression, physics-informed neural networks (PINNs) and real-world vision tasks. Our findings highlight the potential of MRePU as a robust alternative for building deep neural networks.
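The instability the abstract refers to is easy to reproduce for the plain RePU activation, RePU_p(x) = max(0, x)^p. The small scan below tracks activation norms through a randomly initialized MLP and reports how quickly they leave a reasonable range; the width, depth, and initialization are illustrative, and MRePU's exact form is not given in the abstract, so it is not shown.

import numpy as np

# Track how RePU activation magnitudes drift (here: explode) with depth in a
# randomly initialized MLP, the failure mode the abstract describes.

def repu(x, p):
    return np.maximum(x, 0.0) ** p

def depth_scan(p, width=256, max_depth=50, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    for layer in range(1, max_depth + 1):
        W = rng.standard_normal((width, width)) / np.sqrt(width)   # 1/sqrt(fan-in) init
        x = repu(W @ x, p)
        n = float(np.linalg.norm(x))
        if n > 1e30 or n < 1e-30:          # left a sane numerical range
            return layer, n
    return max_depth, float(np.linalg.norm(x))

if __name__ == "__main__":
    for p in (2, 3):
        layer, n = depth_scan(p)
        print(f"RePU p={p}: activation norm {n:.2e} by layer {layer}")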
Read more →
SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2408.05235v2 Announce Type: replace-cross Abstract: As Large Language Models (LLMs) gain traction, their reliance on power-hungry GPUs drives ever-increasing energy demands, raising environmental and monetary concerns. Inference dominates LLM workloads, presenting a critical challenge for providers: minimizing energy costs under Service-Level Objectives (SLOs) that ensure optimal user experience. In this paper, we present throttLL'eM, a framework that reduces energy consumption while meeting SLOs through the use of instance and GPU frequency scaling. throttLL'eM features mechanisms that project future KV cache usage and batch size. Leveraging a Machine-Learning (ML) model that receives these projections as inputs, throttLL'eM manages performance at the iteration level to satisfy SLOs with reduced frequencies and instance sizes. We show that the proposed ML model achieves $R^2$ scores greater than 0.97 and mispredicts performance by less than 1 iteration per second on average. Experimental results on LLM inference traces show that throttLL'eM achieves up to 43.8\% lower energy consumption and an energy efficiency improvement of at least $1.71\times$ under SLOs, when compared to NVIDIA's Triton server.
Read more →
Large Language Model-Based Agents for Software Engineering: A Survey
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2409.02977v2 Announce Type: replace-cross Abstract: The recent advance in Large Language Models (LLMs) has shaped a new paradigm of AI agents, i.e., LLM-based agents. Compared to standalone LLMs, LLM-based agents substantially extend the versatility and expertise of LLMs by enhancing LLMs with the capabilities of perceiving and utilizing external resources and tools. To date, LLM-based agents have been applied and shown remarkable effectiveness in Software Engineering (SE). The synergy between multiple agents and human interaction brings further promise in tackling complex real-world SE problems. In this work, we present a comprehensive and systematic survey on LLM-based agents for SE. We collect 124 papers and categorize them from two perspectives, i.e., the SE and agent perspectives. In addition, we discuss open challenges and future directions in this critical domain. The repository of this survey is at https://github.com/FudanSELab/Agent4SE-Paper-List.
Read more →
IW-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2409.18980v2 Announce Type: replace-cross Abstract: Recent advancements in large multimodal models have led to significant strides in image comprehension capabilities. Despite these advancements, there is a lack of a robust benchmark specifically for assessing the Image-to-Web conversion proficiency of these large models. Primarily, it is essential to ensure the integrity of the generated web elements, which comprise visible and invisible categories. Previous evaluation methods (e.g., BLEU) are notably susceptible to significant alterations due to the presence of invisible elements in web pages. Furthermore, it is crucial to measure the layout information of web pages, i.e., the positional relationships between elements, which is overlooked by previous work. To address these challenges, we have curated and aligned a benchmark of images and corresponding web code (IW-BENCH). Specifically, we propose Element Accuracy, which tests the completeness of the elements by parsing the Document Object Model (DOM) tree, and Layout Accuracy, which analyzes the positional relationships of elements by converting the DOM tree into a common subsequence. Besides, we design a five-hop multimodal Chain-of-Thought prompting strategy for better performance, consisting of: 1) SoM prompt injection, 2) inferring elements, 3) inferring layout, 4) inferring web code, and 5) reflection. Our benchmark comprises 1,200 pairs of images and web code with varying levels of difficulty. We have conducted extensive experiments on existing large multimodal models, offering insights into their performance and areas for improvement in the image-to-web domain.
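As a rough illustration of element-level scoring, the sketch below parses reference and generated HTML into tag multisets and reports their overlap. It is a simplification: IW-Bench's Element Accuracy works on the full DOM tree and also covers invisible elements, and Layout Accuracy is not shown.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collects the multiset of element tags from an HTML document."""
    def __init__(self):
        super().__init__()
        self.tags = Counter()
    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

def element_overlap(reference_html, generated_html):
    """Fraction of reference elements that also appear in the generated page
    (a simplified stand-in for a DOM-tree-based element accuracy)."""
    ref, gen = TagCollector(), TagCollector()
    ref.feed(reference_html)
    gen.feed(generated_html)
    matched = sum(min(count, gen.tags[tag]) for tag, count in ref.tags.items())
    return matched / max(1, sum(ref.tags.values()))

print(element_overlap("<div><img><p>hi</p></div>", "<div><p>hi</p></div>"))  # 0.666...
```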
Read more →
From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2411.05826v2 Announce Type: replace-cross Abstract: Remote sensing has evolved from simple image acquisition to complex systems capable of integrating and processing visual and textual data. This review examines the development and application of multi-modal language models (MLLMs) in remote sensing, focusing on their ability to interpret and describe satellite imagery using natural language. We cover the technical underpinnings of MLLMs, including dual-encoder architectures, Transformer models, self-supervised and contrastive learning, and cross-modal integration. The unique challenges of remote sensing data--varying spatial resolutions, spectral richness, and temporal changes--are analyzed for their impact on MLLM performance. Key applications such as scene description, object detection, change detection, text-to-image retrieval, image-to-text generation, and visual question answering are discussed to demonstrate their relevance in environmental monitoring, urban planning, and disaster response. We review significant datasets and resources supporting the training and evaluation of these models. Challenges related to computational demands, scalability, data quality, and domain adaptation are highlighted. We conclude by proposing future research directions and technological advancements to further enhance MLLM utility in remote sensing.
Read more →
Large Language Models: An Applied Econometric Framework
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2412.07031v3 Announce Type: replace-cross Abstract: Large language models (LLMs) enable researchers to analyze text at unprecedented scale and minimal cost. Researchers can now revisit old questions and tackle novel ones with rich data. We provide an econometric framework for realizing this potential in two empirical uses. For prediction problems -- forecasting outcomes from text -- valid conclusions require "no training leakage" between the LLM's training data and the researcher's sample, which can be enforced through careful model choice and research design. For estimation problems -- automating the measurement of economic concepts for downstream analysis -- valid downstream inference requires combining LLM outputs with a small validation sample to deliver consistent and precise estimates. Absent a validation sample, researchers cannot assess possible errors in LLM outputs, and consequently seemingly innocuous choices (which model, which prompt) can produce dramatically different parameter estimates. When used appropriately, LLMs are powerful tools that can expand the frontier of empirical economics.
Read more →
A Survey on Recommendation Unlearning: Fundamentals, Taxonomy, Evaluation, and Open Questions
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2412.12836v2 Announce Type: replace-cross Abstract: Recommender systems have become increasingly influential in shaping user behavior and decision-making, highlighting their growing impact in various domains. Meanwhile, the widespread adoption of machine learning models in recommender systems has raised significant concerns regarding user privacy and security. As compliance with privacy regulations becomes more critical, there is a pressing need to address the issue of recommendation unlearning, i.e., eliminating the memory of specific training data from the learned recommendation models. Despite its importance, traditional machine unlearning methods are ill-suited for recommendation unlearning due to the unique challenges posed by collaborative interactions and model parameters. This survey offers a comprehensive review of the latest advancements in recommendation unlearning, exploring the design principles, challenges, and methodologies associated with this emerging field. We provide a unified taxonomy that categorizes different recommendation unlearning approaches, followed by a summary of widely used benchmarks and metrics for evaluation. By reviewing the current state of research, this survey aims to guide the development of more efficient, scalable, and robust recommendation unlearning techniques. Furthermore, we identify open research questions in this field, which could pave the way for future innovations not only in recommendation unlearning but also in a broader range of unlearning tasks across different machine learning applications.
Read more →
ORACLE: A Real-Time, Hierarchical, Deep-Learning Photometric Classifier for the LSST
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2501.01496v2 Announce Type: replace-cross Abstract: We present ORACLE, the first hierarchical deep-learning model for real-time, context-aware classification of transient and variable astrophysical phenomena. ORACLE is a recurrent neural network with Gated Recurrent Units (GRUs), trained using a custom hierarchical cross-entropy loss function to provide high-confidence classifications along an observationally driven taxonomy with as little as a single photometric observation. Contextual information for each object, including host galaxy photometric redshift, offset, ellipticity and brightness, is concatenated to the light curve embedding and used to make the final prediction. Training on $\sim$0.5M events from the Extended LSST Astronomical Time-Series Classification Challenge, we achieve a top-level (Transient vs Variable) macro-averaged precision of 0.96 using only 1 day of photometric observations after the first detection, in addition to contextual information, for each event; this increases to $>$0.99 once 64 days of the light curve have been obtained, and reaches 0.83 at 1024 days after first detection for 19-way classification (including supernova sub-types, active galactic nuclei, variable stars, microlensing events, and kilonovae). We also compare ORACLE with other state-of-the-art classifiers and report comparable performance on the 19-way classification task, in addition to delivering accurate top-level classifications much earlier. The code and model weights used in this work are publicly available at our associated GitHub repository (https://github.com/uiucsn/ELAsTiCC-Classification).
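A minimal sketch of the described architecture, a GRU over the photometric sequence whose final hidden state is concatenated with host-galaxy context before a classification head, is shown below; the hierarchical taxonomy, custom loss, and real feature set are omitted, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class LightCurveClassifier(nn.Module):
    """GRU over (time, flux, band, ...) sequences, concatenated with host-galaxy
    context (e.g. redshift, offset, ellipticity, brightness) before the class head."""
    def __init__(self, seq_features=4, context_features=4, hidden=64, num_classes=19):
        super().__init__()
        self.gru = nn.GRU(seq_features, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + context_features, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, lightcurve, context):
        _, h = self.gru(lightcurve)              # h: (num_layers, batch, hidden)
        return self.head(torch.cat([h[-1], context], dim=-1))

model = LightCurveClassifier()
logits = model(torch.randn(8, 30, 4), torch.randn(8, 4))
print(logits.shape)  # torch.Size([8, 19])
```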
Read more →
Why is the estimation of metaorder impact with public market data so challenging?
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2501.17096v2 Announce Type: replace-cross Abstract: Estimating the market impact and transaction costs of large trades (metaorders) is a very important topic in finance. However, models of price and trade based on public market data provide average price trajectories that are qualitatively different from those observed during real metaorder executions: the price increases linearly, rather than concavely, during the execution, and the amount of reversion after its end is very limited. We claim that this is a generic phenomenon stemming from the fact that even sophisticated statistical models are unable to correctly describe the origin of the autocorrelation of the order flow. We propose a modified Transient Impact Model which provides more realistic trajectories by assuming that only a fraction of the metaorder trading triggers market order flow. Interestingly, in our model there is a critical condition on the kernels of the price and order-flow equations under which market impact becomes permanent.
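To make the modelling idea concrete, here is a toy propagator-style simulation in which only a fraction phi of the metaorder's child orders is assumed to enter the market order flow; the kernel shape, parameters, and background flow are illustrative choices, not the paper's calibration.

```python
import numpy as np

def simulate_price(T=200, exec_len=100, phi=0.4, gamma=0.5, seed=0):
    """Toy transient-impact simulation: price_t = sum_{s<t} G(t-1-s) * flow_s,
    with a power-law decay kernel G(tau) = (1 + tau)**(-gamma). Only a fraction
    `phi` of the metaorder's child orders is treated as entering the flow."""
    rng = np.random.default_rng(seed)
    meta = np.zeros(T)
    meta[:exec_len] = 1.0                       # constant-rate buy metaorder
    background = rng.choice([-1.0, 1.0], size=T) * 0.2
    flow = background + phi * meta              # only part of the execution hits the flow
    G = (1.0 + np.arange(T)) ** (-gamma)
    price = np.array([np.sum(G[:t][::-1] * flow[:t]) for t in range(1, T + 1)])
    return price

p = simulate_price()
print(f"impact at end of execution: {p[99]:.2f}, after reversion window: {p[-1]:.2f}")
```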
Read more →
Scaling Multimodal Search and Recommendation with Small Language Models via Upside-Down Reinforcement Learning
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2502.09854v2 Announce Type: replace-cross Abstract: In this work, we investigate how small language models (SLMs) can be scaled to support multimodal search and recommendation use cases while remaining efficient enough for real-time, resource-constrained deployments. We present a framework that combines upside-down reinforcement learning with synthetic data distillation from a large language model (Llama-3) to train a 100M-parameter GPT-2 model for multitask prompt generation. Despite being up to 80 times smaller than state-of-the-art large language models (LLMs), our SLM achieves relevance and diversity scores within 6% of competitive baselines such as Llama-3 8B, Qwen3 8B, and Ministral 8B. These results demonstrate that SLMs can effectively handle multimodal search and recommendation tasks, while dramatically reducing inference latency and memory overhead. Our study highlights the potential of lightweight models as practical engines for scalable multimodal discovery, bridging the gap between cutting-edge research and real-world multimodal applications such as media recommendations and creative content generation.
Read more →
Privacy is All You Need: Revolutionizing Wearable Health Data with Advanced PETs
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2503.03428v2 Announce Type: replace-cross Abstract: In a world where data is the new currency, wearable health devices offer unprecedented insights into daily life, continuously monitoring vital signs and metrics. However, this convenience raises privacy concerns, as these devices collect sensitive data that can be misused or breached. Traditional measures often fail due to real-time data processing needs and limited device power. Users also lack awareness and control over data sharing and usage. We propose a Privacy-Enhancing Technology (PET) framework for wearable devices, integrating federated learning, lightweight cryptographic methods, and selectively deployed blockchain technology. The blockchain acts as a secure ledger triggered only upon data transfer requests, granting users real-time notifications and control. By dismantling data monopolies, this approach returns data sovereignty to individuals. Through real-world applications like secure medical data sharing, privacy-preserving fitness tracking, and continuous health monitoring, our framework reduces privacy risks by up to 70 percent while preserving data utility and performance. This innovation sets a new benchmark for wearable privacy and can scale to broader IoT ecosystems, including smart homes and industry. As data continues to shape our digital landscape, our research underscores the critical need to maintain privacy and user control at the forefront of technological progress.
Read more →
AugMapNet: Improving Spatial Latent Structure via BEV Grid Augmentation for Enhanced Vectorized Online HD Map Construction
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2503.13430v2 Announce Type: replace-cross Abstract: Autonomous driving requires understanding infrastructure elements, such as lanes and crosswalks. To navigate safely, this understanding must be derived from sensor data in real-time and needs to be represented in vectorized form. Learned Bird's-Eye View (BEV) encoders are commonly used to combine a set of camera images from multiple views into one joint latent BEV grid. Traditionally, from this latent space, an intermediate raster map is predicted, providing dense spatial supervision but requiring post-processing into the desired vectorized form. More recent models directly derive infrastructure elements as polylines using vectorized map decoders, providing instance-level information. Our approach, Augmentation Map Network (AugMapNet), proposes latent BEV feature grid augmentation, a novel technique that significantly enhances the latent BEV representation. AugMapNet combines vector decoding and dense spatial supervision more effectively than existing architectures while remaining easy to integrate compared to other hybrid approaches. It additionally benefits from extra processing on its latent BEV features. Experiments on nuScenes and Argoverse2 datasets demonstrate significant improvements on vectorized map prediction of up to 13.3% over the StreamMapNet baseline on 60 m range and greater improvements on larger ranges. We confirm transferability by applying our method to another baseline, SQD-MapNet, and find similar improvements. A detailed analysis of the latent BEV grid confirms a more structured latent space of AugMapNet and shows the value of our novel concept beyond pure performance improvement. The code can be found at https://github.com/tmonnin/augmapnet
Read more →
From Distance to Direction: Structure-aware Label-specific Feature Fusion for Label Distribution Learning
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2504.19374v2 Announce Type: replace-cross Abstract: Label distribution learning (LDL) is an emerging learning paradigm designed to capture the relative importance of labels for each instance. Label-specific features (LSFs), constructed by LIFT, have proven effective for learning tasks with label ambiguity by leveraging clustering-based prototypes for each label to re-characterize instances. However, directly introducing LIFT into LDL tasks can be suboptimal, as the prototypes it collects primarily reflect intra-cluster relationships while neglecting cross-cluster interactions. Additionally, constructing LSFs using multi-perspective information, rather than relying solely on Euclidean distance, provides a more robust and comprehensive representation of instances, mitigating noise and bias that may arise from a single distance perspective. To address these limitations, we introduce Structural Anchor Points (SAPs) to capture inter-cluster interactions. This leads to a novel LSFs construction strategy, LIFT-SAP, which enhances LIFT by integrating both distance and directional information of each instance relative to SAPs. Furthermore, we propose a novel LDL algorithm, Label Distribution Learning via Label-specifIc FeaTure with SAPs (LDL-LIFT-SAP), which unifies multiple label description degrees predicted from different LSF spaces into a cohesive label distribution. Extensive experiments on 15 real-world datasets demonstrate the effectiveness of LIFT-SAP over LIFT, as well as the superiority of LDL-LIFT-SAP compared to seven other well-established algorithms.
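A rough sketch of the distance-plus-direction idea: cluster the data to obtain anchor points, then describe each instance by its Euclidean distances to the anchors and by cosine (directional) similarities to them. The real LIFT-SAP construction builds label-specific prototypes and SAPs per label; the version below is a generic, label-agnostic simplification.

```python
import numpy as np
from sklearn.cluster import KMeans

def distance_direction_features(X, anchors):
    """Concatenate Euclidean distances to the anchors with cosine similarities
    between each instance and each anchor (a crude 'direction' signal)."""
    dist = np.linalg.norm(X[:, None, :] - anchors[None, :, :], axis=-1)
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    An = anchors / (np.linalg.norm(anchors, axis=1, keepdims=True) + 1e-12)
    cos = Xn @ An.T
    return np.concatenate([dist, cos], axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
anchors = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X).cluster_centers_
feats = distance_direction_features(X, anchors)
print(feats.shape)  # (100, 8): 4 distances + 4 directional similarities
```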
Read more →
Online Learning-based Adaptive Beam Switching for 6G Networks: Enhancing Efficiency and Resilience
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2505.08032v2 Announce Type: replace-cross Abstract: Adaptive beam switching is essential for mission-critical military and commercial 6G networks but faces major challenges from high carrier frequencies, user mobility, and frequent blockages. While existing machine learning (ML) solutions often focus on maximizing instantaneous throughput, this can lead to unstable policies with high signaling overhead. This paper presents an online Deep Reinforcement Learning (DRL) framework designed to learn an operationally stable policy. By equipping the DRL agent with an enhanced state representation that includes blockage history, and a stability-centric reward function, we enable it to prioritize long-term link quality over transient gains. Validated in a challenging 100-user scenario using the Sionna library, our agent achieves throughput comparable to a reactive Multi-Armed Bandit (MAB) baseline. Specifically, our proposed framework improves link stability by approximately 43% compared to a vanilla DRL approach, achieving operational reliability competitive with MAB while maintaining high data rates. This work demonstrates that by reframing the optimization goal towards operational stability, DRL can deliver efficient, reliable, and real-time beam management solutions for next-generation mission-critical networks.
Read more →
Challenges and Limitations of Generative AI in Synthesizing Wearable Sensor Data
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2505.14206v2 Announce Type: replace-cross Abstract: The widespread adoption of wearable sensors has the potential to provide massive and heterogeneous time series data, driving the use of Artificial Intelligence in human sensing applications. However, data collection remains limited due to stringent ethical regulations, privacy concerns, and other constraints, hindering progress in the field. Synthetic data generation, particularly through Generative Adversarial Networks and Diffusion Models, has emerged as a promising solution to mitigate both data scarcity and privacy issues. However, these models are often limited to narrow operational scenarios, such as short-term and unimodal signal patterns. To address this gap, we present a systematic evaluation of state-of-the-art generative models for time series data, explicitly assessing their performance in challenging scenarios such as stress and emotion recognition. Our study examines the extent to which these models can jointly handle multi-modality, capture long-range dependencies, and support conditional generation, which are core requirements for real-world wearable sensor data generation. To enable a fair and rigorous comparison, we also introduce an evaluation framework that assesses both the intrinsic fidelity of the generated data and their utility in downstream predictive tasks. Our findings reveal critical limitations in existing approaches, particularly in maintaining cross-modal consistency, preserving temporal coherence, and ensuring robust performance in train-on-synthetic, test-on-real, and data augmentation scenarios. Finally, we outline future research directions to enhance synthetic time series generation and improve the applicability of generative models in the wearable computing domain.
Read more →
Can VLMs Detect and Localize Fine-Grained AI-Edited Images?
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2505.15644v2 Announce Type: replace-cross Abstract: Fine-grained detection and localization of localized image edits is crucial for assessing content authenticity, especially as modern diffusion models and image editors can produce highly realistic manipulations. However, this problem faces three key challenges: (1) most AIGC detectors produce only a global real-or-fake label without indicating where edits occur; (2) traditional computer vision methods for edit localization typically rely on costly pixel-level annotations; and (3) there is no large-scale, modern benchmark specifically targeting edited-image detection. To address these gaps, we develop an automated data-generation pipeline and construct FragFake, a large-scale benchmark of AI-edited images spanning multiple source datasets, diverse editing models, and several common edit types. Building on FragFake, we are the first to systematically study vision language models (VLMs) for edited-image classification and edited-region localization. Our experiments show that pretrained VLMs, including GPT4o, perform poorly on this task, whereas fine-tuned models such as Qwen2.5-VL achieve high accuracy and substantially higher object precision across all settings. We further explore GRPO-based RLVR training, which yields modest metric gains while improving the interpretability of model outputs. Ablation and transfer analyses reveal how data balancing, training size, LoRA rank, and training domain affect performance, and highlight both the potential and the limitations of cross-editor and cross-dataset generalization. We anticipate that this work will establish a solid foundation to facilitate and inspire subsequent research endeavors in the domain of multimodal content authenticity.
Read more →
ConfRover: Simultaneous Modeling of Protein Conformation and Dynamics via Autoregression
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2505.17478v2 Announce Type: replace-cross Abstract: Understanding protein dynamics is critical for elucidating their biological functions. The increasing availability of molecular dynamics (MD) data enables the training of deep generative models to efficiently explore the conformational space of proteins. However, existing approaches either fail to explicitly capture the temporal dependencies between conformations or do not support direct generation of time-independent samples. To address these limitations, we introduce ConfRover, an autoregressive model that simultaneously learns protein conformation and dynamics from MD trajectories, supporting both time-dependent and time-independent sampling. At the core of our model is a modular architecture comprising: (i) an encoding layer, adapted from protein folding models, that embeds protein-specific information and conformation at each time frame into a latent space; (ii) a temporal module, a sequence model that captures conformational dynamics across frames; and (iii) an SE(3) diffusion model as the structure decoder, generating conformations in continuous space. Experiments on ATLAS, a large-scale protein MD dataset of diverse structures, demonstrate the effectiveness of our model in learning conformational dynamics and supporting a wide range of downstream tasks. ConfRover is the first model to sample both protein conformations and trajectories within a single framework, offering a novel and flexible approach for learning from protein MD data. Project website: https://bytedance-seed.github.io/ConfRover.
Read more →
Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2505.18098v2 Announce Type: replace-cross Abstract: Large language models (LLMs) excel at tasks like question answering and dialogue, but complex tasks requiring interaction, such as negotiation and persuasion, demand additional long-horizon reasoning and planning. Reinforcement learning (RL) fine-tuning can enable such planning in principle, but suffers from drawbacks that hinder scalability. In particular, multi-turn RL training incurs high memory and computational costs, which are exacerbated when training LLMs as policies. Furthermore, the largest LLMs do not expose the APIs necessary to be trained in such a manner. As a result, modern methods to improve the reasoning of LLMs rely on sophisticated prompting mechanisms rather than RL fine-tuning. To remedy this, we propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents and scales even to large API-based models. These value functions predict how a task will unfold given an action, allowing the LLM agent to evaluate multiple possible outcomes, both positive and negative, to plan effectively. In addition, these value functions are trained over reasoning steps rather than full actions, making them a concise and lightweight module that facilitates decision-making in multi-turn interactions. We validate our method on tasks requiring interaction, including tool use, social deduction, and dialogue, demonstrating superior performance over both RL fine-tuning and prompting methods while maintaining efficiency and scalability.
Read more →
SATORI-R1: Incentivizing Multimodal Reasoning through Explicit Visual Anchoring
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2505.19094v2 Announce Type: replace-cross Abstract: DeepSeek-R1 has demonstrated powerful reasoning capabilities in the text domain through stable reinforcement learning (RL). Recently, in the multimodal domain, works have begun to directly apply RL to generate R1-like free-form reasoning for Visual Question Answering (VQA) tasks. However, multimodal tasks have an intrinsically different nature from textual tasks, as they heavily rely on understanding the input image to solve the problem. Therefore, such free-form reasoning faces two critical limitations in the VQA task: (1) extended reasoning chains diffuse visual focus away from task-critical regions, degrading answer accuracy; and (2) unverifiable intermediate steps amplify policy-gradient variance and computational overhead. To address these issues, in this paper, we introduce SATORI (Spatially Anchored Task Optimization with ReInforcement Learning), which decomposes VQA into three verifiable stages, including global image captioning, region localization, and answer prediction, each supplying explicit reward signals. Furthermore, we also introduce VQA-Verify, a 12k dataset annotated with answer-aligned captions and bounding boxes to facilitate training. Experiments demonstrate consistent performance improvements across seven VQA benchmarks, achieving up to $15.7\%$ improvement in accuracy compared to the R1-like baseline. Our analysis of the attention maps confirms enhanced focus on critical regions, which brings improvements in accuracy. Our code is available at https://github.com/justairr/SATORI-R1.
Read more →
Unintentional Consequences: Generative AI Use for Cybercrime
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2505.23733v2 Announce Type: replace-cross Abstract: The democratization of generative AI introduces new forms of human-AI interaction and raises urgent safety, ethical, and cybersecurity concerns. We develop a socio-technical explanation for how generative AI enables and scales cybercrime. Drawing on affordance theory and technological amplification, we argue that generative AI systems create new action possibilities for cybercriminals and magnify pre-existing malicious intent by lowering expertise barriers and increasing attack efficiency. To illustrate this framework, we conduct interrupted time series analyses of two large datasets: (1) 464,190,074 malicious IP address reports from AbuseIPDB, and (2) 281,115 cryptocurrency scam reports from Chainabuse. Using November 30, 2022, as a high-salience public-access shock, we estimate the counterfactual trajectory of reported cyber abuse absent the release, providing an early-warning impact assessment of a general-purpose AI technology. Across both datasets, we observe statistically significant post-intervention increases in reported malicious activity, including an immediate increase of over 1.12 million weekly malicious IP reports and about 722 weekly cryptocurrency scam reports, with sustained growth in the latter. We discuss implications for AI governance, platform-level regulation, and cyber resilience, emphasizing the need for multi-layer socio-technical strategies that help key stakeholders maximize AI's benefits while mitigating its growing cybercrime risks.
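The counterfactual estimate rests on interrupted time series analysis; a minimal single-series version with a level-shift and trend-change term, using statsmodels, looks roughly like the sketch below (the simulated weekly counts are invented for illustration and unrelated to the paper's data).

```python
import numpy as np
import statsmodels.api as sm

def interrupted_time_series(y, intervention_index):
    """Segmented regression: y_t = b0 + b1*t + b2*post_t + b3*(t - t0)*post_t + e_t.
    b2 estimates the immediate level shift, b3 the change in trend."""
    t = np.arange(len(y))
    post = (t >= intervention_index).astype(float)
    X = sm.add_constant(np.column_stack([t, post, (t - intervention_index) * post]))
    return sm.OLS(y, X).fit()

rng = np.random.default_rng(1)
weeks = 120
y = 50 + 0.2 * np.arange(weeks) + rng.normal(scale=2.0, size=weeks)
y[80:] += 10 + 0.5 * np.arange(weeks - 80)          # simulated post-release jump and trend change
res = interrupted_time_series(y, 80)
print(res.params)                                    # [const, trend, level shift, trend change]
```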
Read more →
Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2506.00195v2 Announce Type: replace-cross Abstract: Current LLMs are trained to refuse potentially harmful input queries regardless of whether users actually had harmful intents, causing a tradeoff between safety and user experience. Through a study of 480 participants evaluating 3,840 query-response pairs, we examine how different refusal strategies affect user perceptions across varying motivations. Our findings reveal that response strategy largely shapes user experience, while actual user motivation has negligible impact. Partial compliance, i.e., providing general information without actionable details, emerges as the optimal strategy, reducing negative user perceptions by over 50% relative to flat-out refusals. Complementing this, we analyze the response patterns of 9 state-of-the-art LLMs and evaluate how 6 reward models score different refusal strategies, demonstrating that models rarely deploy partial compliance naturally and that reward models currently undervalue it. This work demonstrates that effective guardrails require focusing on crafting thoughtful refusals rather than detecting intent, offering a path toward AI safety mechanisms that ensure both safety and sustained user engagement.
Read more →
SafeGenes: Evaluating the Adversarial Robustness of Genomic Foundation Models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2506.00821v2 Announce Type: replace-cross Abstract: Genomic Foundation Models (GFMs), such as Evolutionary Scale Modeling (ESM), have demonstrated significant success in variant effect prediction. However, their adversarial robustness remains largely unexplored. To address this gap, we propose SafeGenes: a framework for Secure analysis of genomic foundation models, leveraging adversarial attacks to evaluate robustness against both engineered near-identical adversarial Genes and embedding-space manipulations. In this study, we assess the adversarial vulnerabilities of GFMs using two approaches: the Fast Gradient Sign Method (FGSM) and a soft prompt attack. FGSM introduces minimal perturbations to input sequences, while the soft prompt attack optimizes continuous embeddings to manipulate model predictions without modifying the input tokens. By combining these techniques, SafeGenes provides a comprehensive assessment of GFM susceptibility to adversarial manipulation. Targeted soft prompt attacks induced severe degradation in MLM-based shallow architectures such as ProteinBERT, while still producing substantial failure modes even in high-capacity foundation models such as ESM1b and ESM1v. These findings expose critical vulnerabilities in current foundation models, opening new research directions toward improving their security and robustness in high-stakes genomic applications such as variant effect prediction.
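FGSM itself is standard; since genomic tokens are discrete, the perturbation is typically applied in embedding space. Below is a generic sketch on a toy torch model, not the SafeGenes codebase; the model, shapes, and epsilon are placeholders.

```python
import torch
import torch.nn as nn

def fgsm_on_embeddings(model, embeddings, labels, epsilon=0.01):
    """One FGSM step in embedding space: perturb inputs by epsilon * sign(grad of loss)."""
    embeddings = embeddings.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(embeddings), labels)
    loss.backward()
    return (embeddings + epsilon * embeddings.grad.sign()).detach()

# Toy stand-in for a classification head over flattened sequence embeddings.
model = nn.Sequential(nn.Flatten(), nn.Linear(16 * 8, 2))
emb = torch.randn(4, 16, 8)                # (batch, sequence length, embedding dim)
labels = torch.tensor([0, 1, 0, 1])
adv = fgsm_on_embeddings(model, emb, labels)
print((adv - emb).abs().max())             # bounded by epsilon
```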
Read more →
AMPED: Adaptive Multi-objective Projection for balancing Exploration and skill Diversification
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2506.05980v4 Announce Type: replace-cross Abstract: Skill-based reinforcement learning (SBRL) enables rapid adaptation in environments with sparse rewards by pretraining a skill-conditioned policy. Effective skill learning requires jointly maximizing both exploration and skill diversity. However, existing methods often face challenges in simultaneously optimizing for these two conflicting objectives. In this work, we propose a new method, Adaptive Multi-objective Projection for balancing Exploration and skill Diversification (AMPED), which explicitly addresses both: during pre-training, a gradient-surgery projection balances the exploration and diversity gradients, and during fine-tuning, a skill selector exploits the learned diversity by choosing skills suited to downstream tasks. Our approach achieves performance that surpasses SBRL baselines across various benchmarks. Through an extensive ablation study, we identify the role of each component and demonstrate that each element in AMPED is contributing to performance. We further provide theoretical and empirical evidence that, with a greedy skill selector, greater skill diversity reduces fine-tuning sample complexity. These results highlight the importance of explicitly harmonizing exploration and diversity and demonstrate the effectiveness of AMPED in enabling robust and generalizable skill learning. Project Page: https://geonwoo.me/amped/
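The abstract does not give AMPED's projection rule, so the sketch below shows a generic PCGrad-style gradient surgery: when the exploration and diversity gradients conflict (negative inner product), each is projected off the other before they are summed. Treat it as an illustration of the idea, not the paper's update.

```python
import torch

def project_conflicting(g_explore, g_diverse):
    """If the two objective gradients conflict (negative dot product), remove from each
    the component along the other before combining (PCGrad-style surgery)."""
    g1, g2 = g_explore.clone(), g_diverse.clone()
    dot = torch.dot(g1.flatten(), g2.flatten())
    if dot < 0:
        g1 = g1 - dot / g_diverse.norm().pow(2) * g_diverse
        g2 = g2 - dot / g_explore.norm().pow(2) * g_explore
    return g1 + g2

g_e = torch.tensor([1.0, 0.0])
g_d = torch.tensor([-0.5, 1.0])
print(project_conflicting(g_e, g_d))   # conflicting components removed before summing
```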
Read more →
A$^2$LC: Active and Automated Label Correction for Semantic Segmentation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2506.11599v2 Announce Type: replace-cross Abstract: Active Label Correction (ALC) has emerged as a promising solution to the high cost and error-prone nature of manual pixel-wise annotation in semantic segmentation, by actively identifying and correcting mislabeled data. Although recent work has improved correction efficiency by generating pseudo-labels using foundation models, substantial inefficiencies still remain. In this paper, we introduce A$^2$LC, an Active and Automated Label Correction framework for semantic segmentation, where manual and automatic correction stages operate in a cascaded manner. Specifically, the automatic correction stage leverages human feedback to extend label corrections beyond the queried samples, thereby maximizing cost efficiency. In addition, we introduce an adaptively balanced acquisition function that emphasizes underrepresented tail classes, working in strong synergy with the automatic correction stage. Extensive experiments on Cityscapes and PASCAL VOC 2012 demonstrate that A$^2$LC significantly outperforms previous state-of-the-art methods. Notably, A$^2$LC exhibits high efficiency by outperforming previous methods with only 20% of their budget, and shows strong effectiveness by achieving a 27.23% performance gain under the same budget on Cityscapes.
Read more →
TRACED: Transition-aware Regret Approximation with Co-learnability for Environment Design
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2506.19997v4 Announce Type: replace-cross Abstract: Generalizing deep reinforcement learning agents to unseen environments remains a significant challenge. One promising solution is Unsupervised Environment Design (UED), a co-evolutionary framework in which a teacher adaptively generates tasks with high learning potential, while a student learns a robust policy from this evolving curriculum. Existing UED methods typically measure learning potential via regret, the gap between optimal and current performance, approximated solely by value-function loss. Building on these approaches, we introduce the transition-prediction error as an additional term in our regret approximation. To capture how training on one task affects performance on others, we further propose a lightweight metric called Co-Learnability. By combining these two measures, we present Transition-aware Regret Approximation with Co-learnability for Environment Design (TRACED). Empirical evaluations show that TRACED produces curricula that improve zero-shot generalization over strong baselines across multiple benchmarks. Ablation studies confirm that the transition-prediction error drives rapid complexity ramp-up and that Co-Learnability delivers additional gains when paired with the transition-prediction error. These results demonstrate how refined regret approximation and explicit modeling of task relationships can be leveraged for sample-efficient curriculum design in UED. Project Page: https://geonwoo.me/traced/
Read more →
BitMark: Watermarking Bitwise Autoregressive Image Generative Models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2506.21209v2 Announce Type: replace-cross Abstract: State-of-the-art text-to-image models generate photorealistic images at an unprecedented speed. This work focuses on models that operate in a bitwise autoregressive manner over a discrete set of tokens that is practically infinite in size. However, their impressive generative power comes with a growing risk: as their outputs increasingly populate the Internet, they are likely to be scraped and reused as training data, potentially by the very same models. This phenomenon has been shown to lead to model collapse, where repeated training on generated content, especially from the models' own previous versions, causes a gradual degradation in performance. A promising mitigation strategy is watermarking, which embeds human-imperceptible yet detectable signals into generated images, enabling the identification of generated content. In this work, we introduce BitMark, a robust bitwise watermarking framework. Our method embeds a watermark directly at the bit level of the token stream during the image generation process. Our bitwise watermark subtly influences the bits to preserve visual fidelity and generation speed while remaining robust against a spectrum of removal techniques. Furthermore, it exhibits high radioactivity, i.e., when watermarked generated images are used to train another image generative model, this second model's outputs will also carry the watermark. The radioactive traces remain detectable even when only fine-tuning diffusion or image autoregressive models on images watermarked with our BitMark. Overall, our approach provides a principled step toward preventing model collapse in image generative models by enabling reliable detection of generated outputs. The code is available at https://github.com/sprintml/BitMark.
Read more →
Class conditional conformal prediction for multiple inputs by p-value aggregation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2507.07150v2 Announce Type: replace-cross Abstract: Conformal prediction methods are statistical tools designed to quantify uncertainty and generate predictive sets with guaranteed coverage probabilities. This work introduces an innovative refinement to these methods for classification tasks, specifically tailored for scenarios where multiple observations (multi-inputs) of a single instance are available at prediction time. Our approach is particularly motivated by applications in citizen science, where multiple images of the same plant or animal are captured by individuals. Our method integrates the information from each observation into conformal prediction, enabling a reduction in the size of the predicted label set while preserving the required class-conditional coverage guarantee. The approach is based on the aggregation of conformal p-values computed from each observation of a multi-input. By exploiting the exact distribution of these p-values, we propose a general aggregation framework using an abstract scoring function, encompassing many classical statistical tools. Knowledge of this distribution also enables refined versions of standard strategies, such as majority voting. We evaluate our method on simulated and real data, with a particular focus on Pl@ntNet, a prominent citizen science platform that facilitates the collection and identification of plant species through user-submitted images.
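A minimal sketch of the core mechanism: per-class (class-conditional) conformal p-values are computed for each observation of the instance and then aggregated, here via twice the arithmetic mean, a classical valid aggregation rule; the paper's general scoring-function framework and its refinements are not reproduced.

```python
import numpy as np

def conformal_pvalue(calib_scores, test_score):
    """Split-conformal p-value: rank of the test nonconformity score among calibration scores."""
    return (np.sum(calib_scores >= test_score) + 1) / (len(calib_scores) + 1)

def aggregated_prediction_set(calib_scores_by_class, test_scores_by_obs, alpha=0.1):
    """For each class, compute one p-value per observation of the instance, aggregate
    them (twice the arithmetic mean), and keep the class if the result exceeds alpha."""
    kept = []
    for cls, calib in calib_scores_by_class.items():
        pvals = [conformal_pvalue(calib, scores[cls]) for scores in test_scores_by_obs]
        if min(1.0, 2.0 * np.mean(pvals)) > alpha:
            kept.append(cls)
    return kept

# Toy calibration scores per class and nonconformity scores for two photos of one plant.
calib = {0: np.linspace(0.01, 0.99, 50), 1: np.linspace(0.01, 0.99, 50)}
obs_scores = [{0: 0.20, 1: 0.980}, {0: 0.30, 1: 0.995}]
print(aggregated_prediction_set(calib, obs_scores))   # [0]
```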
Read more →
Improving Wi-Fi Network Performance Prediction with Deep Learning Models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2507.11168v2 Announce Type: replace-cross Abstract: The increasing need for robustness, reliability, and determinism in wireless networks for industrial and mission-critical applications is the driver for the growth of new innovative methods. The study presented in this work makes use of machine learning techniques to predict channel quality in a Wi-Fi network in terms of the frame delivery ratio. Predictions can be used proactively to adjust communication parameters at runtime and optimize network operations for industrial applications. Methods including convolutional neural networks and long short-term memory were analyzed on datasets acquired from a real Wi-Fi setup across multiple channels. The models were compared in terms of prediction accuracy and computational complexity. Results show that the frame delivery ratio can be reliably predicted, and convolutional neural networks, although slightly less effective than other models, are more efficient in terms of CPU usage and memory consumption. This enhances the model's usability on embedded and industrial systems.
Read more →
Kodezi Chronos: A Debugging-First Language Model for Repository-Scale Code Understanding
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2507.12482v4 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have advanced code generation and software automation but remain constrained by inference-time context and lack structured reasoning over code, leaving debugging largely unsolved. While Claude 4.5 Opus achieves 74.40% on SWE-bench Verified and Gemini 3 Pro reaches 76.2%, both models remain below 20% on real multi-file debugging tasks. We introduce Kodezi Chronos-1, a language model purpose-built for debugging that integrates Adaptive Graph-Guided Retrieval to navigate codebases up to 10 million lines (92% precision, 85% recall), Persistent Debug Memory trained on over 15 million sessions, and a seven-layer fix-test-refine architecture. On 5,000 real-world scenarios, Chronos-1 achieves 67.3% +/- 2.1% fix accuracy compared to 14.2% +/- 1.3% for Claude 4.1 Opus and 13.8% +/- 1.2% for GPT-4.1 (Cohen's d = 3.87). On SWE-bench Lite, Chronos-1 reaches a state-of-the-art 80.33% resolution rate (241 of 300), outperforming the next best system by 20 points and achieving repository-specific highs of 96.1% on Sympy and 90.4% on Django. Chronos-1 reduces debugging time by 40% and iterations by 65%, resolving complex multi-file and cross-repository bugs that require temporal analysis. Limitations remain for hardware-dependent and dynamic language errors, and Chronos-1 will be available in Kodezi OS in Q4 2025 and via API in Q1 2026.
Read more →
The Right to be Forgotten in Pruning: Unveil Machine Unlearning on Sparse Models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2507.18725v2 Announce Type: replace-cross Abstract: Machine unlearning aims to efficiently eliminate the memory of deleted data from trained models and address the right to be forgotten. Despite the success of existing unlearning algorithms, unlearning in sparse models has not yet been well studied. In this paper, we empirically find that the deleted data has an impact on the pruned topology of a sparse model. Motivated by this observation and the right to be forgotten, we define a new term, "un-pruning", to describe eliminating the impact of deleted data on model pruning. We then propose an un-pruning algorithm to approximate the pruned topology driven by retained data. We remark that any existing unlearning algorithm can be integrated with the proposed un-pruning workflow, and the error of un-pruning is upper-bounded in theory. Our un-pruning algorithm can be applied to both structured and unstructured sparse models. In the experiments, we further find that Membership Inference Attack (MIA) accuracy is unreliable for assessing whether a model has forgotten deleted data, as a small change in the amount of deleted data can produce arbitrary MIA results. Accordingly, we devise new performance metrics for sparse models to evaluate the success of un-pruning. Lastly, we conduct extensive experiments to verify the efficacy of un-pruning with various pruning methods and unlearning algorithms. Our code is released at https://github.com/NKUShaw/SparseModels.
Read more →
PCS Workflow for Veridical Data Science in the Age of AI
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2508.00835v2 Announce Type: replace-cross Abstract: Data science is a pillar of artificial intelligence (AI), which is transforming nearly every domain of human activity, from the social and physical sciences to engineering and medicine. While data-driven findings in AI offer unprecedented power to extract insights and guide decision-making, many are difficult or impossible to replicate. A key reason for this challenge is the uncertainty introduced by the many choices made throughout the data science life cycle (DSLC). Traditional statistical frameworks often fail to account for this uncertainty. The Predictability-Computability-Stability (PCS) framework for veridical (truthful) data science offers a principled approach to addressing this challenge throughout the DSLC. This paper presents an updated and streamlined PCS workflow, tailored for practitioners and enhanced with guided use of generative AI. We include a running example to display the PCS framework in action, and conduct a related case study which showcases the uncertainty in downstream predictions caused by judgment calls in the data cleaning stage.
Read more →
GTPO: Stabilizing Group Relative Policy Optimization via Gradient and Entropy Control
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2508.03772v4 Announce Type: replace-cross Abstract: Group Relative Policy Optimization (GRPO) is a promising policy-based approach for Large Language Model alignment, yet its performance is often limited by training instability and suboptimal convergence. In this paper, we identify and analyze two main GRPO issues: (i) token-level penalization, where valuable tokens shared across different responses receive contradictory feedback signals, leading to conflicting gradient updates that can reduce their likelihood; and (ii) policy collapse, where negatively rewarded completions may penalize confident responses and shift model decisions toward unlikely tokens, destabilizing the training process. To address these issues we introduce GTPO (Group-relative Trajectory-based Policy Optimization), which prevents conflicting gradients on valuable tokens by skipping negative updates while amplifying positive ones, and filters out completions whose entropy exceeds a provable threshold to prevent policy collapse. Unlike GRPO, GTPO does not rely on KL-divergence regularization, eliminating the need for a reference model during training, while still ensuring greater training stability and improved performance, as validated through multiple experiments on GSM8K, MATH, AIME 2024, AIME 2025 and AMC 2023.
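As a rough illustration of the two mechanisms, the sketch below builds per-token update weights for a group of completions: negative updates are skipped on tokens shared across completions, and completions whose entropy exceeds a threshold are dropped. The positive-update amplification and the provable threshold itself are omitted; all names and inputs are toy values, not the GTPO update rule.

```python
import numpy as np

def gtpo_style_masks(token_ids, advantages, entropies, entropy_threshold=2.5):
    """token_ids: one token-id list per completion in the group; advantages and
    entropies: one scalar per completion. Returns per-token update weights."""
    shared, seen = set(), set()
    for ids in token_ids:
        for t in set(ids):
            (shared if t in seen else seen).add(t)
    weights = []
    for ids, adv, ent in zip(token_ids, advantages, entropies):
        if ent > entropy_threshold:                      # drop near-collapse completions
            weights.append(np.zeros(len(ids)))
            continue
        w = np.full(len(ids), adv, dtype=float)
        if adv < 0:
            # skip negative updates on tokens that also appear in other completions
            w[[t in shared for t in ids]] = 0.0
        weights.append(w)
    return weights

toks = [[5, 7, 9], [5, 8, 2], [1, 4, 6]]
print(gtpo_style_masks(toks, advantages=[1.0, -1.0, -0.5], entropies=[1.0, 1.2, 3.0]))
```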
Read more →
Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2508.09442v2 Announce Type: replace-cross Abstract: The Key-Value (KV) cache, which stores intermediate attention computations (Key and Value pairs) to avoid redundant calculations, is a fundamental mechanism for accelerating Large Language Model (LLM) inference. However, this efficiency optimization introduces significant yet underexplored privacy risks. This paper provides the first comprehensive analysis of these vulnerabilities, demonstrating that an attacker can reconstruct sensitive user inputs directly from the KV-cache. We design and implement three distinct attack vectors: a direct Inversion Attack, a more broadly applicable and potent Collision Attack, and a semantic-based Injection Attack. These methods demonstrate the practicality and severity of KV-cache privacy leakage issues. To mitigate this, we propose KV-Cloak, a novel, lightweight, and efficient defense mechanism. KV-Cloak uses a reversible matrix-based obfuscation scheme, combined with operator fusion, to secure the KV-cache. Our extensive experiments show that KV-Cloak effectively thwarts all proposed attacks, reducing reconstruction quality to random noise. Crucially, it achieves this robust security with virtually no degradation in model accuracy and minimal performance overhead, offering a practical solution for trustworthy LLM deployment.
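The defence details are not given in the abstract, so the following is only a sketch of the reversible-matrix idea: keys and values are stored in the cache multiplied by a secret invertible matrix, and the inverse is folded into the attention read so the final output is unchanged. Operator fusion and the full KV-Cloak scheme are not shown; all shapes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # head dimension
A = rng.normal(size=(d, d))              # secret matrix (invertible with probability 1)
A_inv = np.linalg.inv(A)

K = rng.normal(size=(16, d))             # keys and values for 16 cached tokens
V = rng.normal(size=(16, d))
q = rng.normal(size=(d,))

# Obfuscate before writing to the cache ...
K_obf, V_obf = K @ A, V @ A

# ... and fold the inverse into the attention read instead of storing plaintext K, V.
scores = (q @ A_inv.T) @ K_obf.T         # equals q @ K.T because A_inv.T @ A.T = I
weights = np.exp(scores - scores.max()); weights /= weights.sum()
out = (weights @ V_obf) @ A_inv          # equals weights @ V

assert np.allclose(scores, q @ K.T) and np.allclose(out, weights @ V)
print("attention output matches the unobfuscated computation")
```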
Read more →
A Novel Attention-Augmented Wavelet YOLO System for Real-time Brain Vessel Segmentation on Transcranial Color-coded Doppler
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2508.13875v2 Announce Type: replace-cross Abstract: The Circle of Willis (CoW), vital for ensuring consistent blood flow to the brain, is closely linked to ischemic stroke. Accurate assessment of the CoW is important for identifying individuals at risk and guiding appropriate clinical management. Among existing imaging methods, Transcranial Color-coded Doppler (TCCD) offers unique advantages due to its radiation-free nature, affordability, and accessibility. However, reliable TCCD assessments depend heavily on operator expertise for identifying anatomical landmarks and performing accurate angle correction, which limits its widespread adoption. To address this challenge, we propose an AI-powered, real-time CoW auto-segmentation system capable of efficiently capturing cerebral arteries. No prior studies have explored AI-driven cerebrovascular segmentation using TCCD. In this work, we introduce a novel Attention-Augmented Wavelet YOLO (AAW-YOLO) network tailored for TCCD data, designed to provide real-time guidance for brain vessel segmentation in the CoW. We prospectively collected TCCD data comprising 738 annotated frames and 3,419 labeled artery instances to establish a high-quality dataset for model training and evaluation. The proposed AAW-YOLO demonstrated strong performance in segmenting both ipsilateral and contralateral CoW vessels, achieving an average Dice score of 0.901, IoU of 0.823, precision of 0.882, recall of 0.926, and mAP of 0.953, with a per-frame inference speed of 14.199 ms. This system offers a practical solution to reduce reliance on operator experience in TCCD-based cerebrovascular screening, with potential applications in routine clinical workflows and resource-constrained settings. Future research will explore bilateral modeling and larger-scale validation.
Read more →
Sat2Flow: A Structure-Aware Diffusion Framework for Human Flow Generation from Satellite Imagery
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2508.19499v2 Announce Type: replace-cross Abstract: Origin-Destination (OD) flow matrices are critical for urban mobility analysis, supporting traffic forecasting, infrastructure planning, and policy design. Existing methods face two key limitations: (1) reliance on costly auxiliary features (e.g., Points of Interest, socioeconomic statistics) with limited spatial coverage, and (2) fragility to spatial topology changes, where reordering urban regions disrupts the structural coherence of generated flows. We propose Sat2Flow, a structure-aware diffusion framework that generates structurally coherent OD flows using only satellite imagery. Our approach employs a multi-kernel encoder to capture diverse regional interactions and a permutation-aware diffusion process that maintains consistency across regional orderings. Through joint contrastive training linking satellite features with OD patterns and equivariant diffusion training enforcing structural invariance, Sat2Flow ensures topological robustness under arbitrary regional reindexing. Experiments on real-world datasets show that Sat2Flow outperforms physics-based and data-driven baselines in accuracy while preserving flow distributions and spatial structures under index permutations. Sat2Flow offers a globally scalable solution for OD flow generation in data-scarce environments, eliminating region-specific auxiliary data dependencies while maintaining structural robustness for reliable mobility modeling.
Read more →
Access Paths for Efficient Ordering with Large Language Models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2509.00303v2 Announce Type: replace-cross Abstract: In this work, we present the LLM ORDER BY semantic operator as a logical abstraction and conduct a systematic study of its physical implementations. First, we propose several improvements to existing semantic sorting algorithms and introduce a semantic-aware external merge sort algorithm. Our extensive evaluation reveals that no single implementation is universally optimal across datasets. From these evaluations, we observe a general test-time scaling relationship between sorting cost and ordering quality for comparison-based algorithms. Building on these insights, we design a budget-aware optimizer that uses heuristic rules, LLM-as-Judge evaluation, and consensus aggregation to dynamically select a near-optimal access path for LLM ORDER BY. Across our benchmarks, the optimizer consistently achieves ranking accuracy on par with or superior to the best static methods. We believe this work provides foundational insights into the principled optimization of semantic operators, which are essential for building robust, large-scale LLM-powered analytic systems.
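A comparison-based semantic sort reduces to an ordinary sorting algorithm with the comparator replaced by an LLM call; the sketch below uses a merge sort with a deterministic stub comparator so it runs offline. The external-merge, budget-aware optimizer, and consensus-aggregation parts are not shown, and `llm_compare` is a placeholder, not a real API call.

```python
def llm_compare(a, b, criterion):
    """Stand-in for an LLM call answering: does `a` rank before `b` under `criterion`?
    Deterministic stub (shorter text ranks first) so the sketch is runnable offline."""
    return len(a) <= len(b)

def semantic_merge_sort(items, criterion):
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left = semantic_merge_sort(items[:mid], criterion)
    right = semantic_merge_sort(items[mid:], criterion)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if llm_compare(left[i], right[j], criterion):    # one comparator call per merge step
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

docs = ["a long review of the product", "short note", "a medium-length comment"]
print(semantic_merge_sort(docs, criterion="order by conciseness"))
```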
Read more →
Astra: A Multi-Agent System for GPU Kernel Performance Optimization
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2509.07506v2 Announce Type: replace-cross Abstract: GPU kernel optimization has long been a central challenge at the intersection of high-performance computing and machine learning. Efficient kernels are crucial for accelerating large language model (LLM) training and serving, yet attaining high performance typically requires extensive manual tuning. Compiler-based systems reduce some of this burden, but still demand substantial manual design and engineering effort. Recently, researchers have explored using LLMs for GPU kernel generation, though prior work has largely focused on translating high-level PyTorch modules into CUDA code. In this work, we introduce Astra, the first LLM-based multi-agent system for GPU kernel optimization. Unlike previous approaches, Astra starts from existing CUDA implementations extracted from SGLang, a widely deployed framework for serving LLMs, rather than treating PyTorch modules as the specification. Within Astra, specialized LLM agents collaborate through iterative code generation, testing, profiling, and planning to produce kernels that are both correct and high-performance. On kernels from SGLang, Astra achieves an average speedup of 1.32x using zero-shot prompting with OpenAI o4-mini. A detailed case study further demonstrates that LLMs can autonomously apply loop transformations, optimize memory access patterns, exploit CUDA intrinsics, and leverage fast math operations to yield substantial performance gains. Our work highlights multi-agent LLM systems as a promising new paradigm for GPU kernel optimization. Our code is publicly available at https://github.com/Anjiang-Wei/Astra.
Read more →
Investigating Bias: A Multilingual Pipeline for Generating, Solving, and Evaluating Math Problems with LLMs
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2509.17701v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly used for educational support, yet their response quality varies depending on the language of interaction. This paper presents an automated multilingual pipeline for generating, solving, and evaluating math problems aligned with the German K-10 curriculum. We generated 628 math exercises and translated them into English, German, and Arabic. Three commercial LLMs (GPT-4o-mini, Gemini 2.5 Flash, and Qwen-plus) were prompted to produce step-by-step solutions in each language. A held-out panel of LLM judges, including Claude 3.5 Haiku, evaluated solution quality using a comparative framework. Results show a consistent gap: English solutions are rated highest, while Arabic solutions are often ranked lower. These findings highlight persistent linguistic bias and the need for more equitable multilingual AI systems in education.
Read more →
Observation-Free Attacks on Online Learning to Rank
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2509.22855v4 Announce Type: replace-cross Abstract: Online learning to rank (OLTR) plays a critical role in information retrieval and machine learning systems, with a wide range of applications in search engines and content recommenders. However, despite their extensive adoption, the susceptibility of OLTR algorithms to coordinated adversarial attacks remains poorly understood. In this work, we present a novel framework for attacking some of the widely used OLTR algorithms. Our framework is designed to promote a set of target items so that they appear in the list of top-K recommendations for T - o(T) rounds, while simultaneously inducing linear regret in the learning algorithm. We propose two novel attack strategies: CascadeOFA for CascadeUCB1 and PBMOFA for PBM-UCB. We provide theoretical guarantees showing that both strategies require only O(log T) manipulations to succeed. Additionally, we supplement our theoretical analysis with empirical results on real-world data.
Read more →
Accuracy-Robustness Trade Off via Spiking Neural Network Gradient Sparsity Trail
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2509.23762v3 Announce Type: replace-cross Abstract: Spiking Neural Networks (SNNs) have attracted growing interest in both computational neuroscience and artificial intelligence, primarily due to their inherent energy efficiency and compact memory footprint. However, achieving adversarial robustness in SNNs (particularly for vision-related tasks) remains a nascent and underexplored challenge. Recent studies have proposed leveraging sparse gradients as a form of regularization to enhance robustness against adversarial perturbations. In this work, we present a surprising finding: under specific architectural configurations, SNNs exhibit natural gradient sparsity and can achieve state-of-the-art adversarial defense performance without the need for any explicit regularization. Further analysis reveals a trade-off between robustness and generalization: while sparse gradients contribute to improved adversarial resilience, they can impair the model's ability to generalize; conversely, denser gradients support better generalization but increase vulnerability to attacks. Our findings offer new insights into the dual role of gradient sparsity in SNN training.
Read more →
PerfBench: Can Agents Resolve Real-World Performance Bugs?
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2509.24091v3 Announce Type: replace-cross Abstract: Performance bugs are inefficiencies in software that waste computational resources without causing functional failures, making them particularly challenging to detect and fix. While recent advances in Software Engineering agents have shown promise in automated bug fixing, existing benchmarks primarily focus on functional correctness and fail to evaluate agents' abilities to identify and resolve non-functional issues like performance bugs. We introduce PerfBench, a benchmark comprising 81 real-world performance bug-fixing tasks from popular .NET repositories on GitHub. Unlike existing benchmarks that rely on pre-existing test suites, PerfBench features a novel evaluation harness that allows agents to generate their own performance benchmarks and validates fixes by comparing execution metrics collected for the developer fix and the agent fix. Each task in PerfBench is derived from actual developer fixes linked to performance-related issues, which are then verified by human experts, ensuring real-world relevance. Our evaluation reveals that current state-of-the-art coding agents struggle with performance optimization tasks, with the baseline OpenHands agent achieving only a ~3% success rate on our benchmark. We develop OpenHands-Perf-Agent, which incorporates performance-aware tooling and instructions and achieves a ~20% success rate on the benchmark. We show that by ensuring the agent has proper instructions to benchmark its changes and tooling for benchmark output processing, we can improve agent performance significantly, though room for improvement remains. PerfBench provides a challenging test set for furthering the capabilities of agents in fixing performance issues.
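The core evaluation idea, measuring the same workload under the code before and after a fix and comparing the timings, can be sketched in a few lines. This is my own toy harness in Python, not PerfBench's .NET infrastructure, and the two sum functions are stand-ins for a real performance bug and its fix.

```python
import statistics
import time

def bench(fn, *args, repeats=5):
    """Median wall-clock time of fn(*args) over several repeats."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return statistics.median(times)

def before_fix(n):
    # Stand-in for the inefficient original code.
    return sum(i * i for i in range(n))

def after_fix(n):
    # Stand-in for the optimized fix (closed-form sum of squares).
    return (n - 1) * n * (2 * n - 1) // 6

baseline = bench(before_fix, 200_000)
fixed = bench(after_fix, 200_000)
print(f"speedup: {baseline / fixed:.1f}x")
```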
Read more →
Score Distillation of Flow Matching Models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2509.25127v2 Announce Type: replace-cross Abstract: Diffusion models achieve high-quality image generation but are limited by slow iterative sampling. Distillation methods alleviate this by enabling one- or few-step generation. Flow matching, originally introduced as a distinct framework, has since been shown to be theoretically equivalent to diffusion under Gaussian assumptions, raising the question of whether distillation techniques such as score distillation transfer directly. We provide a simple derivation -- based on Bayes' rule and conditional expectations -- that unifies Gaussian diffusion and flow matching without relying on ODE/SDE formulations. Building on this view, we extend Score identity Distillation (SiD) to pretrained text-to-image flow-matching models, including SANA, SD3-Medium, SD3.5-Medium/Large, and FLUX.1-dev, all with DiT backbones. Experiments show that, with only modest flow-matching- and DiT-specific adjustments, SiD works out of the box across these models, in both data-free and data-aided settings, without requiring teacher finetuning or architectural changes. This provides the first systematic evidence that score distillation applies broadly to text-to-image flow matching models, resolving prior concerns about stability and soundness and unifying acceleration techniques across diffusion- and flow-based generators. A project page is available at https://yigu1008.github.io/SiD-DiT.
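For readers wondering what the claimed equivalence rests on, the standard identities for a Gaussian path (my own recap of well-known relations, not the paper's exact derivation) connect the flow-matching velocity to the score through conditional expectations:

```latex
% Gaussian path and the two prediction targets (standard identities):
x_t = \alpha_t x_0 + \sigma_t \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)

% Score as a conditional expectation of the noise:
\nabla_x \log p_t(x) = -\frac{1}{\sigma_t}\,\mathbb{E}[\varepsilon \mid x_t = x]

% Flow-matching velocity, rewritten in terms of the score:
v_t(x) = \dot{\alpha}_t\,\mathbb{E}[x_0 \mid x_t = x] + \dot{\sigma}_t\,\mathbb{E}[\varepsilon \mid x_t = x]
       = \frac{\dot{\alpha}_t}{\alpha_t}\,x
         - \sigma_t\left(\dot{\sigma}_t - \frac{\dot{\alpha}_t \sigma_t}{\alpha_t}\right)\nabla_x \log p_t(x)
```

Since the velocity is an affine function of the score at each (x, t), a loss written against one target can be rewritten against the other, which is consistent with the abstract's claim that score distillation transfers to flow-matching teachers with only modest adjustments.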
Read more →
Ergodic Risk Measures: Towards a Risk-Aware Foundation for Continual Reinforcement Learning
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2510.02945v2 Announce Type: replace-cross Abstract: Continual reinforcement learning (continual RL) seeks to formalize the notions of lifelong learning and endless adaptation in RL. In particular, the aim of continual RL is to develop RL agents that can maintain a careful balance between retaining useful information and adapting to new situations. To date, continual RL has been explored almost exclusively through the lens of risk-neutral decision-making, in which the agent aims to optimize the expected long-run performance. In this work, we present the first formal theoretical treatment of continual RL through the lens of risk-aware decision-making, in which the behaviour of the agent is directed towards optimizing a measure of long-run performance beyond the mean. In particular, we show that the classical theory of risk measures, widely used as a theoretical foundation in non-continual risk-aware RL, is, in its current form, incompatible with continual learning. Then, building on this insight, we extend risk measure theory into the continual setting by introducing a new class of ergodic risk measures that are compatible with continual learning. Finally, we provide a case study of risk-aware continual learning, along with empirical results, which show the intuitive appeal of ergodic risk measures in continual settings.
Read more →
Universal Multi-Domain Translation via Diffusion Routers
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2510.03252v2 Announce Type: replace-cross Abstract: Multi-domain translation (MDT) aims to learn translations between multiple domains, yet existing approaches either require fully aligned tuples or can only handle domain pairs seen in training, limiting their practicality and excluding many cross-domain mappings. We introduce universal MDT (UMDT), a generalization of MDT that seeks to translate between any pair of K domains using only K-1 paired datasets with a central domain. To tackle this problem, we propose Diffusion Router (DR), a unified diffusion-based framework that models all central↔non-central translations with a single noise predictor conditioned on the source and target domain labels. DR enables indirect non-central translations by routing through the central domain. We further introduce a novel scalable learning strategy with a variational-bound objective and an efficient Tweedie refinement procedure to support direct non-central mappings. Through evaluation on three large-scale UMDT benchmarks, DR achieves state-of-the-art results for both indirect and direct translations, while lowering sampling cost and unlocking novel tasks such as sketch↔segmentation. These results establish DR as a scalable and versatile framework for universal translation across multiple domains.
Read more →
Detecting Invariant Manifolds in ReLU-Based RNNs
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2510.03814v3 Announce Type: replace-cross Abstract: Recurrent Neural Networks (RNNs) have found widespread applications in machine learning for time series prediction and dynamical systems reconstruction, and experienced a recent renaissance with improved training algorithms and architectural designs. Understanding why and how trained RNNs produce their behavior is important for scientific and medical applications, and explainable AI more generally. An RNN's dynamical repertoire depends on the topological and geometrical properties of its state space. Stable and unstable manifolds of periodic points play a particularly important role: They dissect a dynamical system's state space into different basins of attraction, and their intersections lead to chaotic dynamics with fractal geometry. Here we introduce a novel algorithm for detecting these manifolds, with a focus on piecewise-linear RNNs (PLRNNs) employing rectified linear units (ReLUs) as their activation function. We demonstrate how the algorithm can be used to trace the boundaries between different basins of attraction, and hence to characterize multistability, a computationally important property. We further show its utility in finding so-called homoclinic points, the intersections between stable and unstable manifolds, and thus establish the existence of chaos in PLRNNs. Finally we show for an empirical example, electrophysiological recordings from a cortical neuron, how insights into the underlying dynamics could be gained through our method.
Read more →
Monte Carlo-Type Neural Operator for Differential Equations
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2510.05620v2 Announce Type: replace-cross Abstract: The Monte Carlo-type Neural Operator (MCNO) introduces a framework for learning solution operators of one-dimensional partial differential equations (PDEs) by directly learning the kernel function and approximating the associated integral operator using a Monte Carlo-type approach. Unlike Fourier Neural Operators (FNOs), which rely on spectral representations and assume translation-invariant kernels, MCNO makes no such assumptions. The kernel is represented as a learnable tensor over sampled input-output pairs, and sampling is performed once, uniformly at random from a discretized grid. This design enables generalization across multiple grid resolutions without relying on fixed global basis functions or repeated sampling during training, while an interpolation step maps between arbitrary input and output grids to further enhance flexibility. Experiments on standard 1D PDE benchmarks show that MCNO achieves competitive accuracy with efficient computational cost. We also provide a theoretical analysis proving that the Monte Carlo estimator yields a bounded bias and variance under mild regularity assumptions. This result holds in any spatial dimension, suggesting that MCNO may extend naturally beyond one-dimensional problems. More broadly, this work explores how Monte Carlo-type integration can be incorporated into neural operator frameworks for continuous-domain PDEs, providing a theoretically supported alternative to spectral methods (such as FNO) and to graph-based Monte Carlo approaches (such as the Graph Kernel Neural Operator, GNO).
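A minimal numpy sketch of the Monte Carlo idea as I read it from the abstract (my own assumptions, not the authors' code): sample a fixed set of grid points once, keep a learnable kernel table over (output point, sample) pairs, and approximate the integral operator by an average over the samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n_grid, n_samples = 128, 16

grid = np.linspace(0.0, 1.0, n_grid)
u = np.sin(2 * np.pi * grid)                    # toy input function on the grid

# Sample once, uniformly at random from the discretized grid (no resampling later).
sample_idx = rng.choice(n_grid, size=n_samples, replace=False)

# "Learnable" kernel represented as a table over (output point, sampled point) pairs;
# here it is random, in a model it would be a trainable parameter.
kernel = rng.normal(size=(n_grid, n_samples))

# Monte Carlo estimate of (K u)(x_i) = integral of k(x_i, y) u(y) dy,
# taken as the average of kernel[i, j] * u(y_j) over the sampled points y_j.
Ku = kernel @ u[sample_idx] / n_samples
print(Ku.shape)  # one output value per grid point
```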
Read more →
VLSU: Mapping the Limits of Joint Multimodal Understanding for AI Safety
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2510.18214v2 Announce Type: replace-cross Abstract: Safety evaluation of multimodal foundation models often treats vision and language inputs separately, missing risks from joint interpretation where benign content becomes harmful in combination. Existing approaches also fail to distinguish clearly unsafe content from borderline cases, leading to problematic over-blocking or under-refusal of genuinely harmful content. We present Vision Language Safety Understanding (VLSU), a comprehensive framework to systematically evaluate multimodal safety through fine-grained severity classification and combinatorial analysis across 17 distinct safety patterns. Using a multi-stage pipeline with real-world images and human annotation, we construct a large-scale benchmark of 8,187 samples spanning 15 harm categories. Our evaluation of eleven state-of-the-art models reveals systematic joint understanding failures: while models achieve 90%-plus accuracy on clear unimodal safety signals, performance degrades substantially to 20-55% when joint image-text reasoning is required to determine the safety label. Most critically, 34% of errors in joint image-text safety classification occur despite correct classification of the individual modalities, further demonstrating absent compositional reasoning capabilities. Additionally, we find that models struggle to balance refusing unsafe content while still responding to borderline cases that deserve engagement. For example, we find that instruction framing can reduce the over-blocking rate on borderline content from 62.4% to 10.4% in Gemini-1.5, but only at the cost of under-refusing on unsafe content with refusal rate dropping from 90.8% to 53.9%. Overall, our framework exposes weaknesses in joint image-text understanding and alignment gaps in current models, and provides a critical test bed to enable the next milestones in research on robust vision-language safety.
Read more →
Focusing on Language: Revealing and Exploiting Language Attention Heads in Multilingual Large Language Models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2511.07498v2 Announce Type: replace-cross Abstract: Large language models (LLMs) increasingly support multilingual understanding and generation. Meanwhile, efforts to interpret their internal mechanisms have emerged, offering insights to enhance multilingual performance. While multi-head self-attention (MHA) has proven critical in many areas, its role in multilingual capabilities remains underexplored. In this work, we study the contribution of MHA in supporting multilingual processing in LLMs. We propose Language Attention Head Importance Scores (LAHIS), an effective and efficient method that identifies attention head importance for multilingual capabilities via a single forward and backward pass through the LLM. Applying LAHIS to Aya-23-8B, Llama-3.2-3B, and Mistral-7B-v0.1, we reveal the existence of both language-specific and language-general heads. Language-specific heads enable cross-lingual attention transfer to guide the model toward target language contexts and mitigate off-target language generation issue, contributing to addressing challenges in multilingual LLMs. We also introduce a lightweight adaptation that learns a soft head mask to modulate attention outputs over language heads, requiring only 20 tunable parameters to improve XQuAD accuracy. Overall, our work enhances both the interpretability and multilingual capabilities of LLMs from the perspective of MHA.
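One generic way to obtain per-head scores from a single forward and backward pass (offered as my own illustration of the idea, not the LAHIS definition) is to scale each head's output by a gate fixed at 1 and read off the gradient of the loss with respect to the gates:

```python
import torch

torch.manual_seed(0)
n_heads, d_head, seq = 4, 8, 6

# Toy per-head queries, keys, and values (in a real model these come from the LLM).
q = torch.randn(n_heads, seq, d_head)
k = torch.randn(n_heads, seq, d_head)
v = torch.randn(n_heads, seq, d_head)

gates = torch.ones(n_heads, requires_grad=True)      # one gate per head, fixed at 1

scores = q @ k.transpose(-1, -2) / d_head ** 0.5     # (heads, seq, seq)
attn = scores.softmax(dim=-1)
context = attn @ v                                   # per-head context vectors
gated = context * gates.view(n_heads, 1, 1)          # gates are 1, so outputs are unchanged

loss = gated.sum()                                   # stand-in for a task loss
loss.backward()                                      # one backward pass

importance = gates.grad.abs()                        # larger gradient -> more influential head
print(importance)
```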
Read more →
Cross-Field Interface-Aware Neural Operators for Multiphase Flow Simulation
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2511.08625v2 Announce Type: replace-cross Abstract: Multiphase flow simulation is critical in science and engineering but incurs high computational costs due to complex field discontinuities and the need for high-resolution numerical meshes. While Neural Operators (NOs) offer an efficient alternative for solving Partial Differential Equations (PDEs), they struggle with two core challenges unique to multiphase systems: spectral bias caused by spatial heterogeneity at phase interfaces, and the persistent scarcity of expensive, high-resolution field data. This work introduces the Interface Information Aware Neural Operator (IANO), a novel architecture that mitigates these issues by leveraging readily obtainable interface data (e.g., topology and position). Interface data inherently contains high-frequency features that not only complement the physical field data but also help mitigate spectral bias. IANO incorporates an interface-aware function encoding mechanism to capture dynamic coupling, and a geometry-aware positional encoding method to enhance spatial fidelity for pointwise super-resolution. Empirical results across multiple multiphase flow cases demonstrate that IANO achieves significant accuracy improvements (up to ~10%) over existing NO baselines. Furthermore, IANO exhibits superior generalization capabilities in low-data and noisy settings, confirming its utility for practical, data-efficient AI-based multiphase flow simulations.
Read more →
A Machine Learning-Driven Solution for Denoising Inertial Confinement Fusion Images
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2511.16717v2 Announce Type: replace-cross Abstract: Neutron imaging is essential for diagnosing and optimizing inertial confinement fusion implosions at the National Ignition Facility. Due to the required 10-micrometer resolution, however, neutron images require image reconstruction using iterative algorithms. For low-yield sources, the images may be degraded by various types of noise. Gaussian and Poisson noise often coexist within one image, obscuring fine details and blurring the edges where the source information is encoded. Traditional denoising techniques, such as filtering and thresholding, can inadvertently alter critical features or reshape the noise statistics, potentially impacting the ultimate fidelity of the iterative image reconstruction pipeline. However, recent advances in synthetic data production and machine learning have opened new opportunities to address these challenges. In this study, we present an unsupervised autoencoder with a Cohen-Daubechies-Feauveau (CDF 97) wavelet transform in the latent space, designed to suppress mixed Gaussian-Poisson noise while preserving essential image features. The network successfully denoises neutron imaging data. Benchmarking against both simulated and experimental NIF datasets demonstrates that our approach achieves lower reconstruction error and superior edge preservation compared to conventional filtering methods such as Block-matching and 3D filtering (BM3D). By validating the effectiveness of unsupervised learning for denoising neutron images, this study establishes a critical first step towards fully AI-driven, end-to-end reconstruction frameworks for ICF diagnostics.
Read more →
WavefrontDiffusion: Dynamic Decoding Schedule for Improved Reasoning
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2511.19473v2 Announce Type: replace-cross Abstract: Diffusion Language Models (DLMs) have shown strong potential for text generation and are becoming a competitive alternative to autoregressive models. The denoising strategy plays an important role in determining the quality of their outputs. Mainstream denoising strategies include Standard Diffusion and BlockDiffusion. Standard Diffusion performs global denoising without restricting the update range, often finalizing incomplete context and causing premature end-of-sequence predictions. BlockDiffusion updates fixed-size blocks in a preset order, but its rigid structure can break apart coherent semantic units and disrupt reasoning. We present WavefrontDiffusion, a dynamic decoding approach that expands a wavefront of active tokens outward from finalized positions. This adaptive process follows the natural flow of semantic structure while keeping computational cost equal to block-based methods. Across four benchmarks in reasoning and code generation, WavefrontDiffusion achieves state-of-the-art performance while producing outputs with higher semantic fidelity, showing the value of adaptive scheduling for more coherent and efficient generation.
Read more →
Escaping the Verifier: Learning to Reason via Demonstrations
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2511.21667v2 Announce Type: replace-cross Abstract: Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization) that learns strong reasoning capabilities from only expert demonstrations via Inverse Reinforcement Learning. Our method sets up an adversarial interaction between a policy (generator) and a relativistic critic (discriminator): the policy learns to mimic expert answers, while the critic learns to compare and distinguish between policy and expert answers. Our method trains both the policy and the critic jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines on all of our evaluation tasks -- Countdown, DeepMath, and Poetry Writing -- and enjoys the same robust scaling trends as RL on verifiable tasks. These results demonstrate that our method effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.
Read more →
Foundations of Quantum Granular Computing with Effect-Based Granules, Algebraic Properties and Reference Architectures
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2511.22679v2 Announce Type: replace-cross Abstract: This paper develops the foundations of Quantum Granular Computing (QGC), extending classical granular computing (including fuzzy, rough, and shadowed granules) to the quantum regime. Quantum granules are modeled as effects on a finite-dimensional Hilbert space, so granular memberships are given by Born probabilities. This operator-theoretic viewpoint provides a common language for sharp (projective) and soft (nonprojective) granules and embeds granulation directly into the standard formalism of quantum information theory. We establish foundational results for effect-based quantum granules, including normalization and monotonicity properties, the emergence of Boolean islands from commuting families, granular refinement under Lüders updates, and the evolution of granules under quantum channels via the adjoint channel in the Heisenberg picture. We connect QGC with quantum detection and estimation theory by interpreting the effect operators realizing Helstrom minimum-error measurement for binary state discrimination as Helstrom-type decision granules, i.e., soft quantum counterparts of Bayes-optimal decision regions. Building on these results, we introduce Quantum Granular Decision Systems (QGDS) with three reference architectures that specify how quantum granules can be defined, learned, and integrated with classical components while remaining compatible with near-term quantum hardware. Case studies on qubit granulation, two-qubit parity effects, and Helstrom-style soft decisions illustrate how QGC reproduces fuzzy-like graded memberships and smooth decision boundaries while exploiting noncommutativity, contextuality, and entanglement. The framework thus provides a unified and mathematically grounded basis for operator-valued granules in quantum information processing, granular reasoning, and intelligent systems.
Read more →
Probabilistic Fusion and Calibration of Neural Speaker Diarization Models
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2511.22696v3 Announce Type: replace-cross Abstract: End-to-End Neural Diarization (EEND) systems produce frame-level probabilistic speaker activity estimates, yet since evaluation focuses primarily on Diarization Error Rate (DER), the reliability and calibration of these confidence scores have been largely neglected. When fusing multiple diarization systems, DOVER-Lap remains the only established approach, operating at the segment level with hard decisions. We propose working with continuous probability outputs, which enables more sophisticated fusion and calibration techniques that can leverage model uncertainty and complementary strengths across different architectures. This paper presents the first comprehensive framework for calibrating and fusing EEND models at the probability level. We investigate two output formulations (multilabel and powerset representations) and their impact on calibration and fusion effectiveness. Through extensive experiments on the CallHome two-speaker benchmark, we demonstrate that proper calibration provides substantial improvements even for individual models (up to 19% relative DER reduction), in some cases mitigating the absence of domain adaptation. We reveal that joint calibration in powerset space consistently outperforms independent per-speaker calibration, that fusion substantially improves over individual models, and that the Fuse-then-Calibrate ordering generally outperforms both calibrating before fusion and uncalibrated fusion while requiring calibration of only a single combined model. Our best configuration outperforms DOVER-Lap in terms of DER while providing reliable confidence estimates essential for downstream applications. This work proposes best practices for probability-level fusion of EEND systems and demonstrates the advantages of leveraging soft outputs over hard decisions.
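A minimal sketch of probability-level fusion in the multilabel formulation (my own toy example with hand-picked temperatures, not the paper's recipe): average frame-level speaker posteriors from two systems, then calibrate only the combined output, mirroring the Fuse-then-Calibrate ordering described above. In practice the temperature would be fit on held-out data.

```python
import numpy as np

def calibrate(probs, temperature):
    """Temperature-scale per-speaker activity probabilities via their logits."""
    logits = np.log(probs / (1.0 - probs))
    return 1.0 / (1.0 + np.exp(-logits / temperature))

rng = np.random.default_rng(0)
n_frames, n_speakers = 10, 2
sys_a = rng.uniform(0.01, 0.99, size=(n_frames, n_speakers))  # posteriors from system A
sys_b = rng.uniform(0.01, 0.99, size=(n_frames, n_speakers))  # posteriors from system B

fused = 0.5 * (sys_a + sys_b)          # fuse soft outputs instead of hard decisions
calibrated = calibrate(fused, 1.3)     # calibrate only the single combined model
decisions = (calibrated > 0.5).astype(int)
print(decisions)
```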
Read more →
MambaScope: Coarse-to-Fine Scoping for Efficient Vision Mamba
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.00647v2 Announce Type: replace-cross Abstract: Vision Mamba has emerged as a promising and efficient alternative to Vision Transformers, yet its efficiency remains fundamentally constrained by the number of input tokens. Existing token reduction approaches typically adopt token pruning or merging to reduce computation. However, they inherently lead to information loss as they discard or compress token representations. This problem is further exacerbated when the same fine-grained token processing is uniformly applied across all images regardless of visual complexity. We observe that not all inputs require fine-grained processing: simple images can be effectively handled at a coarse resolution, while only complex ones require refinement. Based on this insight, we propose MambaScope, an adaptive framework for efficient inference for Vision Mamba. MambaScope first performs coarse-grained inference by dividing the input image into large patches, significantly reducing token length and computation. When the model's prediction confidence is low, selected regions are re-processed at a finer resolution to recover essential visual details with minimal additional cost. This dynamic resolution assignment strategy allows MambaScope to allocate computation adaptively according to image complexity, achieving efficient processing without compromising accuracy. Experiments across various vision tasks demonstrate that MambaScope outperforms both the baseline Vision Mamba and state-of-the-art token reduction techniques in terms of accuracy and efficiency.
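The control flow is simple enough to sketch (my own stub model and thresholds, not the MambaScope code): run a cheap pass with large patches first, and only re-process at a finer patch size when the prediction is not confident.

```python
import numpy as np

def stub_model(image, patch_size):
    # Hypothetical stand-in for a Vision Mamba classifier: smaller patches mean
    # more tokens and, in this toy, a sharper (more confident) posterior.
    logits = np.array([image.mean(), image.std()]) * (32.0 / patch_size)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def coarse_to_fine_predict(image, model=stub_model, coarse=32, fine=16, threshold=0.8):
    probs = model(image, patch_size=coarse)             # cheap pass with large patches
    if probs.max() >= threshold:                        # confident enough: stop early
        return int(probs.argmax())
    return int(model(image, patch_size=fine).argmax())  # refine only uncertain inputs

print(coarse_to_fine_predict(np.random.default_rng(0).random((64, 64))))
```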
Read more →
Look, Recite, Then Answer: Enhancing VLM Performance via Self-Generated Knowledge Hints
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.00882v3 Announce Type: replace-cross Abstract: Vision-Language Models (VLMs) exhibit significant performance plateaus in specialized domains like precision agriculture, primarily due to “Reasoning-Driven Hallucination,” where linguistic priors override visual perception. A key bottleneck is the “Modality Gap”: visual embeddings fail to reliably activate the fine-grained expert knowledge already encoded in model parameters. We propose “Look, Recite, Then Answer,” a parameter-efficient framework that enhances VLMs via self-generated knowledge hints while keeping backbone models frozen. The framework decouples inference into three stages: (1) Look generates objective visual descriptions and candidate sets; (2) Recite employs a lightweight 1.7B router to transform visual cues into targeted queries that trigger candidate-specific parametric knowledge; (3) Answer performs parallel evidence alignment between descriptions and recited knowledge to select the most consistent label. On AgroBench, our method achieves state-of-the-art results, improving Weed Identification accuracy by 23.52% over Qwen2-VL-72B and surpassing GPT-4o without external search overhead. This modular design mitigates hallucinations by transforming passive perception into active, controllable knowledge retrieval.
Read more →
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.01374v3 Announce Type: replace-cross Abstract: This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
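The stabilization machinery named above, importance-sampling correction plus clipping, is at the token level the familiar PPO-style surrogate; the sketch below is a generic version under my own simplifications, not the paper's exact objective or its Routing Replay mechanism.

```python
import torch

def clipped_token_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Token-level surrogate with importance-sampling correction and clipping."""
    ratio = torch.exp(logp_new - logp_old)                  # corrects for the stale sampling policy
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.mean(torch.minimum(unclipped, clipped))   # negate: optimizer minimizes

# Toy per-token log-probs under the current policy and the data-generating policy.
logp_new = torch.tensor([-1.2, -0.7, -2.0], requires_grad=True)
logp_old = torch.tensor([-1.0, -0.9, -1.8])
adv = torch.tensor([0.5, -0.3, 1.0])     # e.g. a sequence-level reward broadcast to tokens

loss = clipped_token_surrogate(logp_new, logp_old, adv)
loss.backward()
print(loss.item(), logp_new.grad)
```

When training is fully on-policy, the ratio stays close to 1 and the clip rarely activates, which matches the abstract's observation that plain importance-sampling correction suffices in that regime.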
Read more →
ZIP-RC: Optimizing Test-Time Compute via Zero-Overhead Joint Reward-Cost Prediction
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.01457v2 Announce Type: replace-cross Abstract: Large language models excel at reasoning but lack key aspects of introspection, including anticipating their own success and the computation required to achieve it. Humans use real-time introspection to decide how much effort to invest, when to make multiple attempts, when to stop, and when to signal success or failure. Without this, LLMs struggle to make intelligent meta-cognition decisions. Test-time scaling methods like Best-of-N drive up cost and latency by using a fixed budget of samples regardless of the marginal benefit of each one at any point in generation, and the absence of confidence signals can mislead people, prevent appropriate escalation to better tools, and undermine trustworthiness. Learned verifiers or reward models can provide confidence estimates, but do not enable adaptive inference and add substantial cost by requiring extra models or forward passes. We present ZIP-RC, an adaptive inference method that equips models with zero-overhead inference-time predictions of reward and cost. At every token, ZIP-RC reuses reserved or unused logits in the same forward pass as next-token prediction to output a joint distribution over final reward and remaining length -- no extra models, architecture change, or inference overhead. This full joint distribution is used to compute a sampling utility, which is a linear combination of the expected maximum reward, total compute, and latency of a set of samples if generated to completion. During inference, we maximize this utility with meta-actions that determine which prefix of tokens to continue or initiate sampling from. On mixed-difficulty mathematical benchmarks, ZIP-RC improves accuracy by up to 12% over majority voting at equal or lower average cost, and traces smooth Pareto frontiers between quality, compute, and latency. By providing real-time reward-cost introspection, ZIP-RC enables adaptive, efficient reasoning.
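To make the sampling utility concrete, here is a toy calculation (my own reading of the abstract, not the paper's implementation): given a predicted distribution over final reward and an expected remaining length, choose the number of parallel samples that maximizes expected maximum reward minus weighted compute and latency costs.

```python
import numpy as np

def expected_max_reward(probs, rewards, n):
    """E[max of n i.i.d. draws] for a discrete reward distribution, via the CDF."""
    order = np.argsort(rewards)
    r, p = rewards[order], probs[order]
    cdf_max = np.cumsum(p) ** n                      # P(max <= r) = F(r)^n
    pmf_max = np.diff(np.concatenate([[0.0], cdf_max]))
    return float(np.sum(r * pmf_max))

rewards = np.array([0.0, 0.5, 1.0])    # support of the predicted final reward
probs = np.array([0.3, 0.4, 0.3])      # predicted reward distribution
expected_len = 200.0                   # predicted remaining tokens per sample

def utility(n, lam_compute=1e-4, lam_latency=1e-4):
    compute = n * expected_len                       # total tokens if all n samples finish
    latency = expected_len                           # samples assumed to run in parallel
    return expected_max_reward(probs, rewards, n) - lam_compute * compute - lam_latency * latency

best_n = max(range(1, 9), key=utility)               # pick the best sample budget
print(best_n, round(utility(best_n), 3))
```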
Read more →
A Diffusion Model Framework for Maximum Entropy Reinforcement Learning
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.02019v2 Announce Type: replace-cross Abstract: Diffusion models have achieved remarkable success in data-driven learning and in sampling from complex, unnormalized target distributions. Building on this progress, we reinterpret Maximum Entropy Reinforcement Learning (MaxEntRL) as a diffusion model-based sampling problem. We tackle this problem by minimizing the reverse Kullback-Leibler (KL) divergence between the diffusion policy and the optimal policy distribution using a tractable upper bound. By applying the policy gradient theorem to this objective, we derive a modified surrogate objective for MaxEntRL that incorporates diffusion dynamics in a principled way. This leads to simple diffusion-based variants of Soft Actor-Critic (SAC), Proximal Policy Optimization (PPO) and Wasserstein Policy Optimization (WPO), termed DiffSAC, DiffPPO and DiffWPO. All of these methods require only minor implementation changes to their base algorithm. We find that on standard continuous control benchmarks, DiffSAC, DiffPPO and DiffWPO achieve better returns and higher sample efficiency than SAC and PPO.
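For context, the reverse-KL objective the abstract refers to can be written compactly using standard soft-policy-iteration identities (this is the generic MaxEnt RL form, not the paper's diffusion-specific bound):

```latex
% Optimal MaxEnt policy with temperature alpha:
\pi^{*}(a \mid s) = \frac{\exp\big(Q_{\mathrm{soft}}(s,a)/\alpha\big)}{Z(s)}

% Reverse KL between a parametric policy (here, a diffusion policy) and pi*:
D_{\mathrm{KL}}\big(\pi_{\theta}(\cdot \mid s)\,\|\,\pi^{*}(\cdot \mid s)\big)
  = \frac{1}{\alpha}\,\mathbb{E}_{a \sim \pi_{\theta}}\big[\alpha \log \pi_{\theta}(a \mid s) - Q_{\mathrm{soft}}(s,a)\big] + \log Z(s)

% Minimizing over theta drops the log Z(s) term, recovering the usual
% SAC-style policy improvement objective.
```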
Read more →
Young children's anthropomorphism of an AI chatbot: Brain activation and the role of parent co-presence
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.02179v2 Announce Type: replace-cross Abstract: Artificial Intelligence (AI) chatbots powered by a large language model (LLM) are entering young children's learning and play, yet little is known about how young children construe these agents or how such construals relate to engagement. We examined anthropomorphism of a social AI chatbot during collaborative storytelling and asked how children's attributions related to their behavior and prefrontal activation. Children at ages 5-6 (N = 23) completed three storytelling sessions: interacting with (1) an AI chatbot only, (2) a parent only, and (3) the AI and a parent together. After the sessions, children completed an interview assessing anthropomorphism toward both the AI chatbot and the parent. Behavioral engagement was indexed by the conversational turn count (CTC) ratio, and concurrent fNIRS measured oxygenated hemoglobin in bilateral vmPFC and dmPFC regions. Children reported higher anthropomorphism for parents than for the AI chatbot overall, although AI ratings were relatively high for perceptive abilities and epistemic states. Anthropomorphism was not associated with CTC. In the right dmPFC, higher perceptive scores were associated with greater activation during the AI-only condition and with lower activation during the AI+Parent condition. Exploratory analyses indicated that higher dmPFC activation during the AI-only condition correlated with higher end-of-session "scared" mood ratings. Findings suggest that stronger perceptive anthropomorphism can be associated with greater brain activation related to interpreting the AI's mental states, whereas parent co-presence may help some children interpret and regulate novel AI interactions. These results may have design implications for encouraging parent-AI co-use in early childhood.
Read more →
COGNITION: From Evaluation to Defense against Multimodal LLM CAPTCHA Solvers
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.02318v2 Announce Type: replace-cross Abstract: This paper studies how multimodal large language models (MLLMs) undermine the security guarantees of visual CAPTCHA. We identify the attack surface where an adversary can cheaply automate CAPTCHA solving using off-the-shelf models. We evaluate 7 leading commercial and open-source MLLMs across 18 real-world CAPTCHA task types, measuring single-shot accuracy, success under limited retries, end-to-end latency, and per-solve cost. We further analyze the impact of task-specific prompt engineering and few-shot demonstrations on solver effectiveness. We reveal that MLLMs can reliably solve recognition-oriented and low-interaction CAPTCHA tasks at human-like cost and latency, whereas tasks requiring fine-grained localization, multi-step spatial reasoning, or cross-frame consistency remain significantly harder for current models. By examining the reasoning traces of such MLLMs, we investigate the underlying mechanisms of why models succeed/fail on specific CAPTCHA puzzles and use these insights to derive defense-oriented guidelines for selecting and strengthening CAPTCHA tasks. We conclude by discussing implications for platform operators deploying CAPTCHA as part of their abuse-mitigation pipeline. Code is available at https://anonymous.4open.science/r/Captcha-465E/.
Read more →
Defense That Attacks: How Robust Models Become Better Attackers
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.02830v2 Announce Type: replace-cross Abstract: Deep learning has achieved great success in computer vision, but remains vulnerable to adversarial attacks. Adversarial training is the leading defense designed to improve model robustness. However, its effect on the transferability of attacks is underexplored. In this work, we ask whether adversarial training unintentionally increases the transferability of adversarial examples. To answer this, we trained a diverse zoo of 36 models, including CNNs and ViTs, and conducted comprehensive transferability experiments. Our results reveal a clear paradox: adversarially trained (AT) models produce perturbations that transfer more effectively than those from standard models, introducing a new ecosystem risk. To enable reproducibility and further study, we release all models, code, and experimental scripts. Furthermore, we argue that robustness evaluations should assess not only the resistance of a model to transferred attacks but also its propensity to produce transferable adversarial examples.
Read more →
Fairy2i: Training Complex LLMs from Real LLMs with All Parameters in {±1, ±i}
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.02901v2 Announce Type: replace-cross Abstract: Large language models (LLMs) have revolutionized artificial intelligence, yet their massive memory and computational demands necessitate aggressive quantization, increasingly pushing representations toward the theoretical limit of a single bit. While complex-valued LLMs, such as iFairy, offer a superior chance for low-bit representation compared to real-valued counterparts, they require training from scratch, preventing the utilization of the vast ecosystem of pre-trained real-valued foundation models. Here we present Fairy2i, a universal framework that transforms pre-trained real-valued layers into an equivalent widely-linear complex form, enabling extremely low-bit quantization while reusing existing checkpoints. By proving a lossless mathematical equivalence between real and widely-linear maps, we convert standard Transformers into the complex domain and employ a phase-aware quantization scheme with a highly efficient codebook of fourth roots of unity. Furthermore, we introduce a recursive residual quantization mechanism that iteratively minimizes quantization error, allowing inference to proceed via efficient multiplication-free accumulation. We demonstrate that Fairy2i restores the performance of LLaMA-2 7B at an effective 2-bit precision to levels nearly comparable with full-precision baselines, significantly outperforming state-of-the-art real-valued binary and ternary quantization methods. This work bridges the gap between the representational efficiency of complex-valued arithmetic and the practical utility of pre-trained models, paving a new way for efficient inference on commodity hardware.
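A toy illustration (my own, not the paper's algorithm) of quantizing to the fourth roots of unity: map each complex weight by phase to the nearest element of {+1, +i, -1, -i}, fit a shared real scale by least squares, and keep the residual, which a recursive scheme could quantize again.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8) + 1j * rng.normal(size=8)   # toy complex weights

codebook = np.array([1, 1j, -1, -1j])              # fourth roots of unity

# Nearest codeword per weight (for unit-modulus codewords this is also the
# phase-nearest root of unity).
nearest = codebook[np.argmin(np.abs(w[:, None] - codebook[None, :]), axis=1)]

# Least-squares shared real scale for the whole group of weights.
scale = np.mean(np.real(w * np.conj(nearest)))
w_hat = scale * nearest
residual = w - w_hat                               # a recursive scheme would quantize this too

print(np.round(nearest, 1))
print(round(float(np.linalg.norm(residual) / np.linalg.norm(w)), 3))  # relative error
```

Because every quantized weight is one of {±1, ±i} times a shared scale, multiplying by it reduces to sign flips and real/imaginary swaps, which is what makes multiplication-free accumulation possible.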
Read more →
MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.02906v2 Announce Type: replace-cross Abstract: Understanding high-resolution images remains a significant challenge for multimodal large language models (MLLMs). Recent studies address this issue by dividing the image into smaller crops and computing the semantic similarity between each crop and a query using a pretrained retrieval-augmented generation (RAG) model. The most relevant crops are then selected to localize the target object and suppress irrelevant information. However, such crop-based processing can fragment complete objects across multiple crops, thereby disrupting the computation of semantic similarity. In our experiments, we find that image crops of objects with different sizes are better handled at different resolutions. Based on this observation, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework for high-resolution image understanding. To address the issue of semantic similarity bias caused by objects being split across different image crops, we propose a multi-resolution semantic fusion method, which integrates semantic similarity maps obtained at different resolutions to produce more accurate semantic information and preserve the integrity of target objects. Furthermore, to achieve direct localization of target objects at a global scale, we introduce an open-vocabulary object detection (OVD) model that identifies object regions using a sliding-window approach. Experiments on high-resolution image understanding benchmarks using different MLLMs demonstrate the effectiveness of our approach.
Read more →
SMP: Reusable Score-Matching Motion Priors for Physics-Based Character Control
2025-12-05 00:00 | Source: ArXiv AI Research
arXiv:2512.03028v2 Announce Type: replace-cross Abstract: Data-driven motion priors that can guide agents toward producing naturalistic behaviors play a pivotal role in creating life-like virtual characters. Adversarial imitation learning has been a highly effective method for learning motion priors from reference motion data. However, adversarial priors, with few exceptions, need to be retrained for each new controller, thereby limiting their reusability and necessitating the retention of the reference motion data when training on downstream tasks. In this work, we present Score-Matching Motion Priors (SMP), which leverages pre-trained motion diffusion models and score distillation sampling (SDS) to create reusable task-agnostic motion priors. SMPs can be pre-trained on a motion dataset, independent of any control policy or task. Once trained, SMPs can be kept frozen and reused as general-purpose reward functions to train policies to produce naturalistic behaviors for downstream tasks. We show that a general motion prior trained on large-scale datasets can be repurposed into a variety of style-specific priors. Furthermore SMP can compose different styles to synthesize new styles not present in the original dataset. Our method produces high-quality motion comparable to state-of-the-art adversarial imitation learning methods through reusable and modular motion priors. We demonstrate the effectiveness of SMP across a diverse suite of control tasks with physically simulated humanoid characters. Video demo available at https://youtu.be/ravlZJteS20
Read more →
Alan Dye Comments on His Career Move in an Instagram Story
2025-12-04T16:31:34Z | Source: Daring Fireball
Straight/dumb quotation marks. Some default Instagram typeface. That period just hanging there, outside the closing quote. This is the post from the man who led Apple’s software design for a decade. Not to mention the gall to use any quote from Steve Jobs, let alone this particular one, which is enshrined by Apple on the wall outside Town Hall at the old Infinite Loop campus in Cupertino, and provides the title for the splendid book published (in a delightful interactive version on the web, and in gorgeous limited print editions) by the Steve Jobs Archive and LoveFrom. “Just figure out what’s next” for Alan Dye, after his supposedly wonderful accomplishments at Apple, is ... going to work for Meta? Jiminy H. Christ, that takes stones. ★
Read more →
Fragments Dec 4
2025-12-04T10:59:00-05:00 | Source: Martin Fowler
Rob Bowley summarizes a study from Carnegie Mellon looking at the impact of AI on a bunch of open-source software projects. Like any such study, we shouldn’t take its results as definitive, but there seems enough there to make it a handy data point. The key point is that the AI code probably reduced the quality of the code base - at least if static code analysis can be trusted to determine quality. And perhaps some worrying second-order effects:
This study shows more than 800 popular GitHub projects with code quality degrading after adopting AI tools. It’s hard not to see a form of context collapse playing out in real time. If the public code that future models learn from is becoming more complex and less maintainable, there’s a real risk that newer models will reinforce and amplify those trends, producing even worse code over time.
❄ ❄ ❄ ❄ ❄
Rob’s post is typical of much of the thoughtful writing on AI. We can see its short-term benefits, but worry about its long-term impact. But on a much deeper note is this lovely story from Jim Highsmith. Jim has turned 0x50, and has spent the last decade fighting Parkinson’s disease. To help him battle it he has two AI-assisted allies:
Between my neural implants and Byron’s digital guidance, I now collaborate with two adaptive systems: one for motion, one for thought. Neither replaces me. Both extend me.
If you read anything on AI this week, make it be this. It offers a positive harbinger for our future and opens my mind to a whole different perspective on the role of AI in it.
❄ ❄ ❄ ❄ ❄
Anthropic recently announced that it disrupted a Chinese state-sponsored operation abusing Claude Code. Jim Gumbley looks at the core lesson to learn from this: that we have to understand the serious risk of AI jailbreaking.
New AI tools are able to analyze your attack surface at the next level of granularity. As a business leader, that means you now have two options: wait for someone else to run AI-assisted vulnerability detection against your attack surface, or run it yourself first.
❄ ❄ ❄ ❄ ❄
There’s plenty of claims that AI Vibe Coding can replace software developers, something that folks like me (perhaps with a bias) think unlikely. Gergely Orosz shared this tidbit:
Talked with an exec at a tech company who is obsessed with AI and has been for 3 years. Not a developer but company makes software. Uses AI for everything, vibe codes ideas. Here’s the kicker: Has a team of several devs to implement his vibe coded prototypes to sg workable
I’d love to hear more about this (and similar stories).
❄ ❄ ❄ ❄ ❄
Nick Radcliffe writes about a month of using AI:
I spent a solid month “pair programming” with Claude Code, trying to suspend disbelief and adopt a this-will-be-productive mindset. More specifically, I got Claude to write well over 99% of the code produced during the month. I found the experience infuriating, unpleasant, and stressful before even worrying about its energy impact. Ideally, I would prefer not to do it again for at least a year or two. The only problem with that is that it “worked”.
He stresses that his approach is the “polar opposite” of Vibe Coding. The post is long, and rambles a bit, but is worthwhile because he talks in detail about his workflow and how he uses the tool. Such posts are important so we can learn the nitty-gritty of how our programming habits are changing.
❄ ❄ ❄ ❄ ❄
Along similar lines is a post by Brian Chambers on his workflow, which he calls Issue-Driven Development (and yes, I’m also sick of the “something-driven” phraseology).
As with much of the better stuff I’ve heard about AI assisted work, it’s all about carefully managing the context window, ensuring the AI is focused on the right things and not distracted by textual squirrels.
Read more →
★ Bad Dye Job
2025-12-04T03:16:33Z | Source: Daring Fireball
In my post earlier today on the then-breaking news that Alan Dye has left Apple to join Meta as chief design officer (a new title at the company1), I wrote: It sounds like Dye chose to jump ship, and wasn’t squeezed out (as it seems with former AI chief John Giannandrea earlier this week). Gurman/Bloomberg are spinning this like a coup for Meta (headline: “Apple Design Executive Alan Dye Poached by Meta in Major Coup”), but I think this is the best personnel news at Apple in decades. Dye’s decade-long stint running Apple’s software design team has been, on the whole, terrible — and rather than getting better, the problems have been getting worse. Dye’s replacement at Apple is longtime Apple designer Stephen Lemay. I’ve never met Lemay (or at least can’t recall meeting him), and prior to today never heard much about him. But that’s typical for Apple employees. Part of the job working for Apple is remaining under the radar and out of the public eye. What I’ve learned today is that Lemay, very much unlike Dye, is a career interface/interaction designer. Sources I’ve spoken to who’ve worked with Lemay at Apple speak highly of him, particularly his attention to detail and craftsmanship. Those things have been sorely lacking in the Dye era. Not everyone loves everything Lemay has worked on, but nobody bats 1.000 and designers love to critique each other’s work. I’ve chatted with people with criticisms of specific things Lemay has worked on or led at Apple (e.g. aspects of iPadOS multitasking that struck many of us as deliberately limiting, rather than empowering), but everyone I’ve spoken to is happy — if not downright giddy — at the news that Lemay is replacing Dye. Lemay is well-liked personally and deeply respected talent-wise. Said one source, in a position to know the choices, “I don’t think there was a better choice than Lemay.” The sentiment within the ranks at Apple is that today’s news is almost too good to be true. People had given up hope that Dye would ever get squeezed out, and no one expected that he’d just up and leave on his own. (If you care about design, there’s nowhere to go but down after leaving Apple. What people overlooked is the obvious: Alan Dye doesn’t actually care about design.) What I struggled with in the wake of today’s news is how to square the following contradiction: Dye apparently left for Meta on his own; he wasn’t squeezed out. Apple replacing Dye with Lemay seemingly signals a significant shift in direction, replacing a guy whose approach was almost entirely superficial/visual with a guy who’s spent his entire career sweating actual interaction details. If Apple’s senior leadership would have been happy to have Dye remain as leader of Apple’s software design teams, why didn’t they replace him with a Dye acolyte? Conversely, if the decision makers at Apple saw the need for a directional change, why wasn’t Dye pushed out?2 The answer, I think, is that the decision to elevate Lemay wasn’t about direction, but loyalty. Why risk putting in a Dye-aligned replacement when that person might immediately get poached too? We know, from this year’s AI recruitment battles, that Zuckerberg is willing to throw almost unfathomable sums of money to poach talent he wants to hire from competitors. Gurman reported that Billy Sorrentino, a Dye deputy who has served as a senior director of design at Apple since 2016, is leaving for Meta with Dye.3 I don’t have any other names, but word on the street is that other members of Dye’s inner circle are leaving Apple for Meta with him. 
But those who remain — or who might remain, if they’d have been offered the promotion to replace Dye — simply can’t be trusted from the perspective of senior leadership, who were apparently blindsided by Dye’s departure for Meta. They wouldn’t have given Dye a prime spot in the WWDC keynote if they thought he might be leaving within months. So the change in direction we may see — that many of us desperately hope to see — under Lemay’s leadership might be happenstance. More a factor of Lemay being politically safe, as someone predating Dye and outside Dye’s inner circle at Apple, than from Tim Cook or anyone else in senior leadership seeing a need for a directional change in UI design. But happenstance or not, it could be the best thing to happen to Apple’s HI design in the entire stretch since Steve Jobs’s passing and Scott Forstall’s ouster. Putting Alan Dye in charge of user interface design was the one big mistake Jony Ive made as Apple’s Chief Design Officer.4 Dye had no background in user interface design — he came from a brand and print advertising background. Before joining Apple, he was design director for the fashion brand Kate Spade, and before that worked on branding for the ad agency Ogilvy. His promotion to lead Apple’s software interface design team under Ive happened in 2015, when Apple was launching Apple Watch, their closest foray into the world of fashion. It might have made some sense to bring someone from the fashion/brand world to lead software design for Apple Watch, but it sure didn’t seem to make sense for the rest of Apple’s platforms. And the decade of Dye’s HI leadership has proven it. The most galling moment in Dye’s entire tenure was the opening of this year’s iPhone event keynote in September, which began with a title card showing the oft-cited Jobs quote “Design is not just what it looks like and feels like. Design is how it works.” The whole problem with the Dye era of HI design at Apple is that it has so largely — not entirely, but largely — been driven purely by how things look. There are a lot of things in Apple’s software — like app icons — that don’t even look good any more. But it’s the “how it works” part that has gone so horribly off the rails. Alan Dye seems like exactly the sort of person Jobs was describing in the first part of that quote: “People think it’s this veneer — that the designers are handed this box and told, ‘Make it look good!’” I am not a Liquid Glass hater. I actually think, on the whole, iOS 26 is a better and more usable UI than iOS 18. But MacOS 26 Tahoe is a mess, visually, and I’m not sure there’s a single thing about its UI that is better than MacOS 15 Sequoia. There are new software features in Tahoe that are excellent and serve as legitimate enticements to upgrade. But I’m talking about the user interface — the work from Alan Dye’s HI team, not Craig Federighi’s teams. I think the fact that Liquid Glass is worse on MacOS than it is on iOS is not just a factor of iOS being Apple’s most popular, most profitable, most important platform — and thus garnering more of Apple’s internal attention. I think it’s also about the fact that the Mac interface, with multiple windows, bigger displays, and more complexity, demands more nuanced, more expert, interaction design skills. Things like depth, layering, and unambiguous indications of input focus are important aspects of any platform. But they’re more important on the platform which, by design, shoulders more complexity. 
Back in 2010, predicting a bright future for the Mac at a time when many pundits were thinking Apple would soon put the entire platform out to pasture, I wrote, “It’s the heaviness of the Mac that allows iOS to remain light.” That remains as true today as it was 15 years ago. But Liquid Glass, especially as expressed on MacOS, is a lightweight poorly considered design system as a whole, and its conceptual thinness is not sufficient to properly allow the Mac to carry the weight it needs to bear. Perhaps more tellingly, there should have been no need for the “clear/tinted” Liquid Glass preference setting that Apple added in the 26.1 OS releases. Alan Dye wasn’t fired, by all accounts, but that preference setting was as good a sign as any that he should have been. And it’s very much a sign that inside Apple, there’s a strong enough contingent of people who prioritize how things work — like, you know, whether you can read text against the background of an alert — to get a setting like this shipped, outside the Accessibility section of Settings. It remains worrisome that Apple needed to luck into Dye leaving the company. But fortune favors the prepared, and Apple remains prepared by having an inordinate number of longtime talented HI designers at the company. The oddest thing about Alan Dye’s stint leading software design is that there are, effectively, zero design critics who’ve been on his side. The debate regarding Apple’s software design over the last decade isn’t between those on Dye’s side and those against. It’s only a matter of debating how bad it’s been, and how far it’s fallen from its previous remarkable heights. It’s rather extraordinary in today’s hyper-partisan world that there’s nearly universal agreement amongst actual practitioners of user-interface design that Alan Dye is a fraud who led the company deeply astray. It was a big problem inside the company too. I’m aware of dozens of designers who’ve left Apple, out of frustration over the company’s direction, to work at places like LoveFrom, OpenAI, and their secretive joint venture io. I’m not sure there are any interaction designers at io who aren’t ex-Apple, and if there are, it’s only a handful. From the stories I’m aware of, the theme is identical: these are designers driven to do great work, and under Alan Dye, “doing great work” was no longer the guiding principle at Apple. If reaching the most users is your goal, go work on design at Google, or Microsoft, or Meta. (Design, of course, isn’t even a thing at Amazon.) Designers choose to work at Apple to do the best work in the industry. That has stopped being true under Alan Dye. The most talented designers I know are the harshest critics of Dye’s body of work, and the direction in which it’s been heading. Back in June, after WWDC, I quoted from Alan Dye’s introduction of Liquid Glass during the keynote, and then quoted from Steve Jobs’s introduction of Aqua when he unveiled the Mac OS X Public Beta in January 2000. I wrote: Re-watching Jobs’s introduction of Aqua for the umpteenth time, I still find it enthralling. I found Alan Dye’s introduction of Liquid Glass to be soporific, if not downright horseshitty. One of the bits from Jobs’s Aqua introduction I quoted was this: This is what the top of windows look like. These three buttons look like a traffic signal, don’t they? Red means close the window. Yellow means minimize the window. And green means maximize the window. Pretty simple. And tremendous fit and finish in this operating system. 
When you roll over these things, you get those. You see them? And when you are no longer the key window, they go transparent. So a lot of fit and finish in this. After I published that post, I got a note from a designer friend who left Apple, in frustration, a few years ago. After watching Jobs’s Aqua introduction for the first time in years, he told me, “I’m really struck by Steve directly speaking to ‘radio buttons’ and ‘the key window’.” He had the feeling that Dye and his team looked down on interface designers who used terms like Jobs himself once used — in a public keynote, no less. That to Dye’s circle, such terms felt too much like “programmer talk”. But the history of Apple (and NeXT) user interface design is the opposite. Designers and programmers used to — and still should — speak the exact same language about such concepts. Steve Jobs certainly did, and something feels profoundly broken about that disconnect under Alan Dye’s leadership. It’s like the head of cinematography for a movie telling the camera team to stop talking about nerdy shit like “f-stops”. The head of cinematography shouldn’t just abide talking about f-stops and focal lengths, but love it. Said my friend to me, regarding his interactions with Dye and his team at Apple, “I swear I had conversations in which I mentioned ‘key window’ and no one knew what I meant.” That won’t be a problem with Stephen Lemay. Understanding of fundamental principles will no longer be lacking. Lemay has been at Apple spanning the gamut between the Greg Christie/Bas Ording glory days and the current era. At the very least, Lemay running HI should stop the bleeding — both in terms of work quality and talent retention. I sincerely believe things might measurably improve, but I’m more sure that things will stop getting worse. That alone will be a win for everyone — even though the change was seemingly driven by Mark Zuckerberg’s desire to poach Dye, not Tim Cook and Apple’s senior leadership realizing they should have shitcanned him long ago. Alan Dye is not untalented. But his talents at Apple were in politics. His political skill was so profound that it was his decision to leave, despite the fact that his tenure is considered a disaster by actual designers inside and outside the company. He obviously figured out how to please Apple’s senior leadership. His departure today landed as a total surprise because his stature within the company seemed so secure. And so I think he might do very well at Meta. Not because he can bring world-class interaction design expertise — because he obviously can’t — but because the path to success at Meta has never been driven by design. It’s about getting done what Zuck wants done. Dye might excel at that. Dye was an anchor holding Apple back, but might elevate design at Meta.5 My favorite reaction to today’s news is this one-liner from a guy on Twitter/X: “The average IQ of both companies has increased.” Titles are just titles, and title inflation is a real problem at all big companies. But I always thought C-level executives by definition report directly to the CEO. That that was the whole point of a “chief whatever officer” title versus “senior vice president of whatever”. But according to Mark Gurman’s exclusive report at Bloomberg breaking this whole story (emphasis added): With the Dye hire, Meta is creating a new design studio and putting him in charge of design for hardware, software and AI integration for its interfaces. 
He will be reporting to Chief Technology Officer Andrew Bosworth, who oversees Reality Labs. That group is tasked with developing wearable devices, such as smart glasses and virtual reality headsets. Dye’s major focus will be revamping Meta’s consumer devices with artificial intelligence features. If true, Dye doesn’t even report directly to Mark Zuckerberg. Oddly enough, after the retirement of COO Jeff Williams this year, Apple claimed the company’s design teams transitioned to reporting directly to CEO Tim Cook. ↩︎ And man oh man am I curious who was involved with this decision, who had Tim Cook’s ear, and just how quickly they were forced to make it. Part of what made Stephen Lemay a popular choice within Apple’s ranks is that Lemay, by all accounts I’ve heard, isn’t a political operator and never angled for a promotion to a level of this prominence. His focus has always singularly been on the work. ↩︎︎ Sorrentino was featured in a two-minute-plus segment in this year’s WWDC keynote, starting at the 38:25 mark, introducing the new iOS Visual Intelligence features. His star was rising at Apple. And Dye himself, of course, was given the spotlight to introduce and effectively take credit for Liquid Glass itself. At least until recently, no one at Apple saw this coming. ↩︎︎ I have good reason to believe that Ive, in private, would be the first person to admit that. A fan of Liquid Glass Jony Ive is not. I believe he sees Dye as a graphic designer, not a user interface designer — and not a good graphic designer at that. I don’t think Alan Dye could get a job as a barista at LoveFrom. ↩︎︎ It’s worth recalling that Zuckerberg sorta kinda tried this poach-design-talent-from-Apple thing before. Mike Matas, the wunderkind designer who became a sensation with Delicious Library in 2005, soon thereafter moved on to work at Apple, where he designed such things as the “slide to unlock” interface on the original iPhone. Matas was a key designer on that glorious first version of the iPhone’s OS. He then left Apple and formed Push Pop Press, and wound up at Facebook in 2011 after Facebook acquired Push Pop — before it had even shipped its core product. (I saw a still-in-development version of Push Pop’s publishing system in 2011, before Facebook bought them and shut down the product, and it remains to this day one of the most impressive, exciting, “this is the future” demos I’ve ever seen. It’s not merely a shame but a goddamn tragedy that it never even shipped.) Zuckerberg wound up assembling around Matas an entire little superteam of “Delicious” era designers and design-focused developers. That team wound up shipping Facebook Paper in 2014 — an iOS-exclusive alternative client for Facebook that espoused the same principles of elegance, exquisite attention to detail, and, especially, direct manipulation of content in lieu of user interface chrome, that infused Push Pop Press’s publishing system. Facebook Paper was so good it almost — almost — made me sign up for a Facebook account just so I could use it. But Facebook Paper went nowhere, fast. Zuckerberg lost his boner for “design”, Facebook Paper was pulled from the App Store in 2016, and the team behind Paper disbanded. Matas today works at LoveFrom, and remains, to my mind, one of the most singularly talented and interesting people in the field of interaction design. In some closer-to-ideal alternate universe, Matas would be running HI design at Apple today. ↩︎︎
Read more →
Congress told there needs to be “consequences” for NASA delays amid China’s rise - Ars Technica
2025-12-04 22:54 | Source: Technology News - Ars Technica
“The Artemis III mission and those beyond should be canceled.”…
Read more →
New Dawn of War 4 story trailer reveals surprise inclusion of Dark Angels sub-faction and everyone's favourite 10,000+ year-old grandpa - Eurogamer
2025-12-04 22:00 | Source: Technology News - Eurogamer.net
New Dawn of War 4 story trailer reveals surprise inclusion of Dark Angels sub-faction and everyone's favourite 10,000+ year-old grandpa.
Read more →
Liquid Swords announces ‘consequence-heavy noir action game’ Samson: A Tyndalston Story for PC - Gematsu
2025-12-04 21:52 | Source: Technology News - Gematsu
Read more →
tvOS 26.2 gets new RC for Apple TV 4K ahead of launch - 9to5Mac
2025-12-04 21:31 | Source: Technology News - 9to5Mac
Apple has just debuted a revised RC (release candidate) build for tvOS 26.2, available both for developers and public beta users.
Read more →
How to use GitHub Copilot Spaces to debug issues faster
2025-12-04 20:35 | Source: GitHub Engineering
Every developer knows this pain: you open an issue, and before you can write a single line of code, you’re hunting. You’re digging through old pull requests, searching for that design doc from three months ago, trying to remember which file has the security guidelines. That hunting phase? It takes forever. And it’s not even the actual work. And even if you want to bring AI into the picture, GitHub Copilot still needs the same thing you do: context. Without it, you get generic answers that don’t understand your codebase. GitHub Copilot Spaces fixes that. Spaces gives GitHub Copilot the project knowledge it needs—files, pull requests, issues, repos—so its responses are grounded in your actual code, not guesses. Are you a visual learner? Watch the full demo below. 👇 What is a space, again? Think of a space as a project knowledge bundle. You curate the files, docs, and decisions that matter for your project, and Copilot uses all of that when generating plans, explanations, or pull requests. You can: Add entire repositories or specific files, pull requests and issues (just paste the URL) Include text content like notes, video transcripts, or Slack messages Add design docs and architecture decisions Trigger Copilot coding agent directly from the space Use the space in your IDE through the GitHub MCP server The best part? Link it once and forget about it. Spaces automatically stay synced with the linked content. When your codebase updates, your space updates too. How to debug issues with spaces: 1. Start with an issue A contributor opened an issue reporting an unsafe usage of check_call in your project. As a maintainer, you might not know the best way to fix it immediately. On your own, you’d start by searching the repo, checking past pull requests, and combing through security guidelines just to figure out where to begin. With Spaces, you don’t have to do that manually. Create a space, add the issue and the key files or docs, and let Copilot reason across everything at once. 2. Create a space for your project Inside the space, add: Design patterns (e.g., /docs/security/check-patterns.md, /docs/design/architecture-overview.md) Security guidelines Accessibility recommendations The entire repository (for broad coverage) or a curated set of the most relevant files for your specific use case. Spaces work best when you’re intentional about what you include. The URL to the issue itself 3. Add Instructions for Copilot Each space includes an Instructions panel. This is where you tell Copilot how you want it to work inside your project. Here are some example instructions that will help with our task at hand: You are an experienced engineer working on this codebase. Always ground your answers in the linked docs and sources in this space. Before writing code, produce a 3–5 step plan that includes: - The goal - The approach - The execution steps Cite the exact files that justify your recommendations. After I approve a plan, use the Copilot coding agent to propose a PR. These instructions keep Copilot consistent. It won’t hallucinate patterns that don’t exist in your repo because you’ve told it to cite its sources. 🌟 Related reading: How to write a great agents.md Learn best practices for building effective custom agents for Copilot, based on an analysis of over 2,500 repositories. Get the guide > 4. Ask Copilot to debug the issue With everything set up, ask Copilot: “Help me debug this issue.” Copilot already knows which issue you mean because it’s linked to the space. 
It parses through all the sources, then returns a clear plan: Goal: Fix unsafe usage of runBinaryCheck to ensure input paths are validated. Approach: Search the repo for usages of runBinaryCheck Compare each usage to the safe pattern in the security docs Identify the required refactor Prepare a diff for each file with unsafe usage This isn’t a generic LLM answer. It’s grounded in the actual project context. 5. Generate the pull request Once you approve the plan, tell Copilot: “Propose code changes using Copilot coding agent.” The agent generates a pull request with: The before version and the after version An explanation of what changed References to the exact files that informed the fix The instructions that guided its choices Every file in the pull request shows which source informed the suggestion. You can audit the reasoning before you merge. 6. Iterate if you need to Not happy with something? Mention @copilot in the pull request comments to iterate on the existing pull request, or go back to the space to generate a fresh one. Keep working with Copilot until you get exactly what you need. 7. Share your space with your team Spaces are private by default. But you can share them with specific individuals, your entire team, or your whole organization (if admins allow it). Enterprise admins control who can share what, so you stay aligned with your company’s security policies. Use GitHub Copilot Spaces from your IDE Spaces are now available in your IDE via the GitHub MCP Server. Install the MCP server, and you can call your spaces directly from your editor. Same curated context, same grounded answers, but right where you’re already working. Being able to call a space from the IDE has been a game changer for me. It lets me stay focused without switching between the browser and my editor, which cuts out a ton of friction in debugging. Coming soon Here’s what’s on the roadmap: Public API Image support Additional file types like doc/docx and PDFs Three ways teams are using spaces right now 1. Code generation and debugging. Use spaces with Copilot coding agent to generate pull requests aligned with your patterns, security rules, and architecture. 2. Planning features. Link issues, design docs, and repos to plan features and draft requirements. Ask Copilot for a technical plan and it generates a pull request. 3. Knowledge sharing and onboarding. Spaces become living knowledge bases. New engineers onboard faster. Existing engineers stop answering the same questions repeatedly. Try it on your next issue Here’s my challenge to you: Create a GitHub Copilot Space. Add one issue and three to four key files. Add simple instructions. Ask Copilot to analyze the issue and propose a debugging plan. Approve the plan. Trigger the coding agent to generate a pull request. You’ll see exactly how much time you save when Copilot actually knows your project. Your AI assistant should never lack the right context. That’s what spaces are for. Want to see the full demo? Watch the GitHub Checkout episode on Copilot Spaces and try GitHub Copilot Spaces. The post How to use GitHub Copilot Spaces to debug issues faster appeared first on The GitHub Blog.
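To make the before/after in step 5 concrete, here is a minimal sketch of the kind of unsafe-invocation fix the coding agent might propose. It is written in TypeScript with hypothetical names (nothing here comes from the demo repository); the idea is simply that a caller-supplied path gets validated against an allow-list before it reaches the binary, per whatever security doc is linked in the space.

```typescript
import { execFile } from "node:child_process";
import { resolve, sep } from "node:path";

// Hypothetical allow-list root; in the walkthrough this would come from the
// security docs linked in the space.
const ALLOWED_ROOT = resolve("/opt/scans");

// Before (unsafe): a caller-supplied path goes straight to the binary.
export function runBinaryCheckUnsafe(userPath: string): void {
  execFile("binary-check", [userPath], (err, stdout) => {
    if (err) console.error(err);
    else console.log(stdout);
  });
}

// After (safe): normalize the path and reject anything outside the allowed
// root before invoking the binary.
export function runBinaryCheck(userPath: string): void {
  const full = resolve(ALLOWED_ROOT, userPath);
  if (full !== ALLOWED_ROOT && !full.startsWith(ALLOWED_ROOT + sep)) {
    throw new Error(`refusing to check path outside ${ALLOWED_ROOT}: ${userPath}`);
  }
  execFile("binary-check", [full], (err, stdout) => {
    if (err) console.error(err);
    else console.log(stdout);
  });
}
```

The specific fix matters less than the fact that the resulting pull request cites which file in the space justified it.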
Read more →
The Outer Worlds 2 Gave Me Exactly What I Wanted From An RPG Inventory System And I Hated It - Kotaku
2025-12-04 20:30 | Source: Technology News - Kotaku
An RPG without encumbrance seemed like the holy grail, but it didn’t exactly turn out that way
Read more →
‘Samurai space opera’ action RPG SOL Shogunate announced for consoles, PC - Gematsu
2025-12-04 20:05 | Source: Technology News - Gematsu
Chaos Manufacturing, a studio founded by “veteran developers with decades of experience working on a wide range of acclaimed game franchises,” has announced SOL Shogunate, a “samur…
Read more →
Hard Times Come To An End For 3 Zodiac Signs After December 5, 2025 - YourTango
2025-12-04 20:03 | Source: Technology News - YourTango
Hard times come to an end for Aries, Cancer and Scorpio zodiac signs after December 5, 2025, when the Moon moves into Cancer.
Read more →
With Graviton5, AWS Promises a 25% Performance Boost
2025-12-04 20:00 | Source: The New Stack
LAS VEGAS — At its annual re:Invent conference, AWS today launched the latest version of its Arm-based Graviton chips. The company promises that these new chips, which will feature 192 cores per chip (up from 96 in the last generation), will deliver up to 25% higher performance than the Graviton4 chips, which launched two years ago. In addition to higher speeds, the team also added a new layer to its Nitro hypervisor cards, the Nitro Isolation Engine, which now mathematically guarantees that different workloads are isolated from each other. Graviton5 Ali Saidi, a VP and Distinguished Engineer at AWS whose group develops these chips, told me that over 90,000 AWS customers now use the Graviton chips and 98% of the top 1,000 users of AWS’ EC2 compute service use them. As AWS CEO Matt Garman announced earlier this week, over 50% of the CPU capacity AWS added over the last few years was Graviton-based. “That’s a story that spans across our EC2 compute, where customers are getting their own compute and running their own workload, and also our managed services — Redshift serverless, is 90% Graviton, Elasticache, Amazon, Aurora, DocumentDB, all these things are greater than 50% Graviton today,” Saidi said. The Graviton logo. Credit: AWS. With Graviton5, the team focused on not just improving raw benchmark performance but also ensuring that those performance gains would also apply to real-world use cases. More cores help there, of course, but as Saidi noted, having those cores closer together also provides advantages in scalability and latency. For some workloads, that can mean between 30% and 40% improvements in performance, for example. He also noted that these workloads benefit from larger caches, with every core having access to 2.6x more L3 cache than the previous generation (which means the core ideally needs to spend far less time waiting for data to arrive to start its computations). The team also improved the network and storage bandwidth. Nitro Isolation Engine From a computer science perspective, the more interesting update today may actually not be the chip itself but the Nitro Isolation Engine. AWS has long promised that its Nitro system — its custom hardware virtualization system for EC2 — would sandbox different workloads and ensure that no information could leak between them. This is the sixth generation of Nitro cards and for the first time, the team decided to compartmentalize the functions of the hypervisor even more. “We said: could we take the code that manipulates things like the page tables and handles guest state and put it in its own really thin layer?” Saidi explained. That new layer was built in Rust, which itself promises enhanced memory and concurrency safety. But more importantly, since the team started from zero, it worked with AWS’s automated reasoning group to, from day one, make formal verification an integral part of the development process. “It’s not providing anything more than that hypervisor does in terms of guest confidentiality,” Saidi explained. “But we’re able to say: look, this is how we’ve raised the bar in doing this. This is how we’re trying to improve transparency, of showing you how we’ve used formal verification to actually saying that, yes, we are keeping guest content isolated from each other and isolated from us.” The post With Graviton5, AWS Promises a 25% Performance Boost appeared first on The New Stack.
Read more →
Why won’t Steam Machine support HDMI 2.1? Digging in on the display standard drama. - Ars Technica
2025-12-04 19:53 | Source: Technology News - Ars Technica
Valve tells Ars its “trying to unblock” limits caused by open source driver issues.
Read more →
How to get your 2025 Discord Checkpoint recap - Dexerto
2025-12-04 19:50 | Source: Technology News - Dexerto
Discord has begun rolling out its first-ever recap feature for 2025, taking notes from the likes of Spotify and YouTube.
Read more →
"In a lot of situations, it's straight up double" - Path of Exile 2 is about to get a significant performance boost, especially on PS5 and Xbox S/X - Eurogamer
2025-12-04 19:30 | Source: Technology News - Eurogamer.net
Action role-playing game Path of Exile 2 is about to get a significant performance boost on all platforms - one that, a…
Read more →
New report tells us when the Samsung Galaxy Watch Ultra 2 is coming - GSMArena.com news - GSMArena.com
2025-12-04 19:21 | Source: Technology News - GSMArena.com
It will be the true successor to the original from 2024. Samsung launched the Galaxy Watch Ultra in 2024 and then followed it up this year with the Galaxy...
Read more →
KubeCon Survey: How Platform Teams Are Adopting AI and IDPs
2025-12-04 19:00 | Source: The New Stack
Platform engineering continues to be central to the success of software organizations. During KubeCon in November, we surveyed 219 platform engineers on what’s top of mind. The survey responses give a glimpse into where platform teams sit in the organization, who they serve and what they are prioritizing next. Platform’s Primary Customer Is Still Developers What Steve Ballmer said 25 years ago still holds true: “developers, developers, developers, developers, developers…developers.” If there’s one message that comes through loud and clear in survey responses it’s this: Platform engineering is still, above all, a developer product team. Developers dominate as the primary customer: 44.1% of respondents selected developers as their main “customer.” SRE / Infra teams come next at 20.2%, reinforcing that many platform teams also carry a reliability and operational enablement mission. The customer base is widening: Data teams (13%) and security teams (10.4%) are also meaningful stakeholders. Drilling into the second two bullets: Even when the platform serves data, security or infrastructure directly, the shared outcome is still a faster, safer path to production for developers. This comes in the form of automating security tasks, database-focused operations and driving efficiencies throughout the broader software delivery life cycle (SDLC). The mix of “customers” hints at a shift in how platform teams think about their raison d’être. The platform isn’t just a paved road for shipping code; it has become a shared internal product, one that must meet multiple disciplines without losing sight of developer outcomes. As those customer groups grow, so does the need for clear ownership, discoverability, guardrails and standards across services, infra, data assets and policies. All in the name of making sure the developer can more effectively do their primary job (ship great code). AI Isn’t Quite Everywhere [Yet] The survey shows AI is still nascent throughout many organizations. Though, AI has crossed the line from “interesting someday” to “we need this now.” In fact, nearly all respondents highlighted AI exploration as a top 2026 priority. What are they trying first? AI code assistants lead: 30.1% of platform teams are exploring how to use AI coding assistants to better enable developers in 2026. Next wave: AI code reviews (17.8%) and automated ticket resolution (13.4%). Reliability automation is rising: Self-healing incidents (12.2%), test generation (10.2%) and auto-fixing vulnerabilities (9.6%) show a hefty appetite for AI that reduces toil. This pattern is evidence that teams are testing AI where it most directly elevates developer throughput and cuts operational drag. Organizations are thinking about a “better together” story for humans plus AI to develop compelling agentic engineering use cases. Over time, these experiments will push platform teams to standardize how AI tools are governed, integrated and measured. IDPs Are Becoming the Operating Layer IDPs are no longer an emerging concept, but adoption is uneven. Nearly 60% of respondents report using an IDP today. There is fragmentation in how IDPs are used. When asked how developer outcomes are measured, 43% of respondents stated they have no formalized system to collect feedback from engineering to identify friction points. Surveys are the most common way to measure developer success, followed by specific metrics and interviews. What is interesting is that platform teams define their own success metrics based on developeroutcomes. 
Reduction of tickets created for DevOps/infra (16.5%) and Deployment Frequency (13.6%) are the top two KPIs platform teams hold themselves accountable to. Having an IDP is important to streamline this measurement and reduce tool sprawl, making platform teams’ lives easier, which manifests as better developer outcomes. Conclusion IDPs remain a critical tool for modern engineering orgs: They streamline workflows and bring order to a chaotic SDLC. But in the AI era, delivery is even more chaotic and consequential. Without a platform layer that accounts for and enables AI-driven work, teams risk invisible bottlenecks, brittle automation, shadow AI exposure, and slower delivery. The IDP is evolving to meet that moment. This evolution comes in the form of agentic engineering platforms (AEPs) that extend the portal into the AI segment of the SDLC, supporting not just code creation, but self-healing incidents, vulnerability remediation, agentic impact measurement and standards enforcement at scale. The post KubeCon Survey: How Platform Teams Are Adopting AI and IDPs appeared first on The New Stack.
Read more →
Overview of Exclusive The Sims 4 Bikini Bottom Bundle Items - Sims Community
2025-12-04 18:52 | Source: Technology News - Simscommunity.info
For the first time ever The Sims Team has decided to do an exclusive Items drop within a Kits release. That's right, it's not just enough that there are two
Read more →
Netflix quietly does away with the easiest way to watch TV in a hotel room - SFGATE
2025-12-04 18:44 | Source: Technology News - SFGate
Netflix retired a feature of its platform that may have made it easier for some travelers to log into the streaming giant in a hotel room.
Read more →
Destiny 2 Renegades Launch Hits Lowest Peak Player Count In Franchise History - DualShockers
2025-12-04 18:43 | Source: Technology News - DualShockers
The galactic update sets an unwanted record, as Destiny 2 Renegades hits the lowest player count in the franchise.
Read more →
Review: Octopath Traveler 0 (Switch 2) - A Bit Of A Retread, But Unmissable (And Enormous) - Nintendo Life
2025-12-04 18:35 | Source: Technology News - Nintendo Life
Zero to hero
Read more →
Yakuza 0 Director's Cut makes a great game worse while erasing gaming history - AV Club
2025-12-04 18:12 | Source: Technology News - The A.V. Club
Yakuza 0 Director's Cut is out on Dec. 8. The issue? Its changes are ill-considered, and it's replacing the original on digital storefronts.
Read more →
Stop Blaming React for Your State Management Hangover
2025-12-04 18:00 | Source: The New Stack
Every time a React app misbehaves, the first tweet is some variation of “React sucks.” No, it isn’t (okay, who am I kidding, maybe a little bit) but what’s really breaking is your mental model of state. Developers keep reaching for new state management libraries the way hungover people reach for greasy food: hoping it’ll fix what’s fundamentally self-inflicted. Zustand, Jotai, Recoil, Valtio — they’re all great tools. But none of them can save you from chaos if you don’t understand how data moves through your app. React isn’t your scapegoat: your state architecture is the main culprit here. The Addiction to Shiny State Management Solutions The React ecosystem breeds new state management libraries and approaches faster than npm can warn you about vulnerabilities. Every few months, a new one trends on X, promising simplicity, performance and an end to boilerplate. Developers rush to install it, convinced this time they’ve found the one. The honeymoon lasts until the first prop drilling conflict or synchronization bug. Then it’s back to blaming React — again. But these libraries don’t solve the root issue: unclear data flow. Developers layer global stores, contexts and hooks without ever asking why the data lives where it does. They’re duct-taping logic onto the framework instead of designing an architecture. When everything updates everything else, you’ve built a minefield, not a UI. You can’t architect clarity by outsourcing thinking to the latest library. What React gives you is composability. What you do with it determines whether your app feels elegant or brittle. You can’t architect clarity by outsourcing thinking to the latest library. You do it by understanding unidirectional data flow — React’s core principle — and sticking to it. Understanding Context Overload and the Provider Pyramid If your component tree looks like the inside of a Matryoshka doll, you’re not alone. The “Provider Pyramid,” where half your app lives inside overlapping contexts, is the new callback hell. Everyone’s chasing global state convenience, but context isn’t a silver bullet. It’s a scalpel: powerful when used precisely, disastrous when overapplied. Developers often wrap everything in context because it feels like shared state nirvana. But each provider introduces complexity. Debugging nested contexts becomes an archeological dig through useContext calls. Performance suffers because re-renders cascade through the hierarchy. The truth is, most data doesn’t need to be global. And no, switching to Zustand won’t magically fix that. You’re still synchronizing state at the wrong granularity. Not to mention, if you’re running instances then container security is another thing you have to worry about and take seriously. It’s safe to say it’s not the easiest kerfuffle I’ve found myself in. The truth is, most data doesn’t need to be global. A shopping cart? Sure. Theme preferences? Maybe. But that “currently selected tab” or “temporary filter” state? Keep it local. The moment you globalize everything, you’ve lost control of your mental model. React encourages local reasoning — respect that boundary. Why Redux Wasn’t the Villain You Thought It Was Redux became the punching bag of React fatigue; but in hindsight, it wasn’t the villain. It just made your architecture honest. Redux forced developers to think about data flow, action semantics and immutability. That discipline was painful, but it exposed where logic actually lived. The real issue wasn’t Redux; it was how teams abused it. 
Many used Redux as a dumping ground for every variable — from authentication to whether a modal was open. The result was a global spaghetti bowl of actions and reducers no one understood. Then came the wave of “Redux is too complex” think pieces, conveniently ignoring that the complexity came from treating Redux like a database, not a coordination layer. Modern tools abstract away the boilerplate, but they don’t remove the need for mental discipline. Modern tools abstract away the boilerplate, but they don’t remove the need for mental discipline. Whether you’re using Zustand, MobX, or React Query, the same principle applies: state belongs where it’s most meaningful. Global state should be the exception, not the default. You don’t need fewer libraries; you need fewer excuses. The Mirage of Simplicity in React Hooks React hooks were supposed to simplify things. Instead, they became a new hiding place for architectural sins. Custom hooks are great for abstraction, but when you start nesting them like Russian dolls, you’re creating invisible coupling. Each use hides dependencies and timing issues that only surface in production — when your component tree starts acting possessed. The seductive thing about hooks is that they feel composable. But composition without discipline is just chaos in layers. The mental cost of understanding where state changes originate multiplies fast. You end up with a dozen hooks sharing state in slightly different ways — each re-render triggering the others like dominoes. Simplicity isn’t about fewer lines of code; it’s about predictability. The fewer mental hops between cause and effect, the saner your app will be. Before you write another useGlobalStore, ask if your hook really needs to exist. Most of the time, you can solve it with props and a clear hierarchy. How to Scale Your React App Without Losing Your Mind Every React project starts clean. Then reality hits: more features, more components, more developers. Suddenly, state’s flowing like an unregulated river. That’s when teams panic and bring in a new library. But scaling isn’t about tools — it’s about patterns. Colocate state with the components that use it. Pass data down deliberately, not reflexively. Use derived state instead of duplicating sources of truth. Split context providers by domain, not convenience. These principles aren’t trendy; they’re timeless. You can scale a React app without turning it into a dependency labyrinth if you treat architecture as a living system, not a patchwork. Even at scale, most React chaos comes from neglecting fundamentals. Even at scale, most React chaos comes from neglecting fundamentals. Don’t reach for complexity when clarity will do. The frameworks evolve, the syntax changes, but the laws of clean architecture never go out of style. React doesn’t demand perfection — just consistency. The Framework Isn’t the Problem, Your Architecture Is Blaming React for state headaches is like blaming your car for bad driving: why not just switch cars and stop complaining? The framework does exactly what you tell it to do. If your components are thrashing, your context layers overgrown, or your hooks indistinguishable from black magic, that’s on you. React is opinionated about one thing: data flows down. Everything else — side effects, synchronization and caching — is your responsibility. That’s not a bug; it’s a feature. It forces you to build with intention. When you abdicate that responsibility to whatever’s trending on GitHub, you trade understanding for temporary relief. 
Frameworks don’t create chaos; developers do. You don’t need to rewrite your app in Solid, Svelte, or Vue. You need to stop duct-taping abstractions onto architecture you never fully designed. Frameworks don’t create chaos; developers do. Once you accept that, React stops being a pain and starts being a partner. Conclusion React isn’t broken. Your architecture is. The endless cycle of swapping libraries, reinventing patterns and blaming the framework only masks the truth: state management is hard because thinking clearly is hard. The solution isn’t another hook or global store; it’s humility and discipline. Understand how your data flows, design your state intentionally, and React will stop feeling like an adversary. Stop blaming React for your hangover. You poured the drinks. The post Stop Blaming React for Your State Management Hangover appeared first on The New Stack.
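As a minimal sketch of the colocation advice above (component and prop names are hypothetical, not from any particular codebase), the article's "currently selected tab" example looks like this when kept local instead of hoisted into a provider or global store:

```tsx
import { useState } from "react";

// Local state: the selected tab matters only to this component subtree,
// so it lives here instead of in a global store or a context provider.
function SettingsTabs({ tabs }: { tabs: string[] }) {
  const [selected, setSelected] = useState(tabs[0]);

  return (
    <div>
      <nav>
        {tabs.map((tab) => (
          <button key={tab} onClick={() => setSelected(tab)}>
            {tab}
          </button>
        ))}
      </nav>
      {/* Derived from `selected`, not duplicated into a second source of truth. */}
      <section>Showing settings for: {selected}</section>
    </div>
  );
}

export default SettingsTabs;
```

Only data that is genuinely shared across the tree, like the article's shopping-cart example, earns a context provider or store.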
Read more →
In 1995, a Netscape employee wrote a hack in 10 days that now runs the Internet - Ars Technica
2025-12-04 17:59 | Source: Technology News - Ars Technica
Thirty years later, JavaScript is the glue that holds the interactive web together, warts and all.
Read more →
Amazon’s new Kindle Scribe and Kindle Scribe Colorsoft launch on December 10 - TechCrunch
2025-12-04 17:52 | Source: Technology News - TechCrunch
The new Kindle Scribe has a larger 11-inch glare-free display, is just 5.4 mm thick, and weighs only 400 g. It’s also 40% faster when writing or turning pages.
Read more →
Can you survive the apocalypse? The Lords of the End Times are coming to Total War: Warhammer III - Warhammer Community
2025-12-04 17:03 | Source: Technology News - Warhammer-community.com
New Legendary Lords inbound!
Read more →
KubeVirt’s Architecture: CRDs, Controllers and Daemons
2025-12-04 16:00 | Source: The New Stack
This is an excerpt from Chapter 3 of “Running Virtual Machines on Kubernetes: A Practical Roadmap for Enterprise Migrations,” a new eBook by acclaimed research analyst and technology expert Janakiram MSV and sponsored by Spectro Cloud. From exploring the architecture and life cycle of virtual machines (VMs) in a cloud native environment, to building cross-functional migration teams and selecting the right tools, this free book, now available for download, helps enterprise leaders navigate this once-in-a-generation shift with confidence. KubeVirt Fundamentals: Bridging VMs and Containers As organizations chart their course away from traditional virtualization, KubeVirt emerges not just as a tool but as a foundational technology that makes a phased, pragmatic migration to Kubernetes possible. It acts as a bridge, enabling the coexistence of legacy virtual machines and modern containers on a single, unified platform: Kubernetes. Understanding KubeVirt’s architecture and capabilities is the first step in leveraging it to derisk the migration process, consolidate infrastructure and accelerate the journey to a cloud native operating model. This chapter explores the technical foundations, practical limitations and real-world implementation patterns that are essential for infrastructure evaluation. Architecture Overview: How KubeVirt Extends Kubernetes KubeVirt’s design philosophy is straightforward and builds firmly on aspects where Kubernetes already excels. Instead of creating a new, parallel orchestration system for virtual machines, KubeVirt extends the reputable Kubernetes API and control plane, enabling it to manage VMs as native resources. It efficiently delegates core functions like scheduling, networking and storage directly to Kubernetes, while layering on the specific logic required for virtualization. KubeVirt adds virtualization capabilities to Kubernetes. At its heart, a KubeVirt VM is simply a process running inside a standard Kubernetes pod. This approach allows VMs and containers to run side by side on the same worker nodes, communicate over the same network and use the same storage resources, all managed through a single pane of glass. To achieve this, KubeVirt introduces three main types of components into the cluster: Custom Resource Definitions (CRDs): These are extensions to the Kubernetes API that define new object types. KubeVirt adds several CRDs, most notably VirtualMachine and VirtualMachineInstance (VMI). This allows administrators to define a VirtualMachine using a declarative YAML manifest, just as they would for any other Kubernetes object, such as a pod. Controllers: These are cluster-wide components that contain the business logic for managing the new CRDs. They run as pods and watch the Kubernetes API for changes. Daemons: These are node-specific agents, deployed as a DaemonSet, that are responsible for managing the VM life cycle on each worker node in the cluster. Key Components and Their Roles The interplay between KubeVirt’s components creates a seamless virtualization layer within Kubernetes. While an operator can install all necessary components, understanding the individual roles of these components is key to troubleshooting and effective management. VirtualMachine and VMI: These are the two primary CRDs that users interact with. The VirtualMachine object represents the persistent, desired state of a virtual machine. It can be started and stopped while retaining its configuration and data. 
The VirtualMachineInstance represents the actual running instance of that VirtualMachine. A VMI is more ephemeral, existing only while the VirtualMachine object is in a running state, and is tightly coupled to the pod that hosts it. virt-api server: This serves as the HTTP API entry point for all virtualization flows, acting as an interface for the operations of VMI CRDs. It validates, processes and persists VMI and VirtualMachine resource definitions into Kubernetes, allowing the rest of the KubeVirt control plane to react. virt-controller: This is the central, clusterwide controller. Its primary job is to watch for the creation of new VMI objects. When a VMI is defined, virt-controller creates a corresponding pod that will ultimately host the VirtualMachine process. It handles high-level operations and orchestrates complex actions, such as live migrations. virt-handler: This is a DaemonSet, meaning an instance that runs on every worker node. It acts as the node-specific agent. When a VM’s pod is scheduled onto its node, virt-handler takes over. It communicates with the virt-launcher inside the pod to perform all the necessary operations to start, stop and manage the VM process on that specific host. virt-launcher: For every running VM, there is a dedicated pod, and the primary container within that pod runs the virt-launcher component. This component is the final link in the chain. It receives instructions from virt-handler and uses a local libvirtd instance to start and manage the actual QEMU/Kernel-based Virtual Machine (KVM) process that constitutes the virtual machine. It also ensures a graceful shutdown by trapping signals from Kubernetes and passing them to the VM process. libvirtd: This is a hypervisor management daemon running inside the virt-launcher container. It exposes a control interface to QEMU/KVM, handling VM life-cycle commands such as start, stop, pause, resume and migrate. It abstracts away the complexities of interacting directly with QEMU by offering a stable API. QEMU: This is a user-space emulator and virtualizer invoked by libvirtd inside the virt-launcher container. QEMU emulates the VM’s hardware environment and executes the guest operating system with hardware acceleration through KVM when available. It handles device emulation, I/O operations and CPU virtualization. Communication and storage of additional controllers and daemons. To read more, download “Running Virtual Machines on Kubernetes: A Practical Roadmap for Enterprise Migrations” today! The post KubeVirt’s Architecture: CRDs, Controllers and Daemons appeared first on The New Stack.
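For reference, here is a minimal sketch of the declarative VirtualMachine manifest the chapter describes, expressed as a TypeScript object purely for illustration; in practice you would write the equivalent YAML and apply it with kubectl. The structure follows KubeVirt's basic examples (the cirros container-disk demo image); the memory request is a placeholder value.

```typescript
// Minimal sketch of a KubeVirt VirtualMachine custom resource.
// Serialize this to YAML (or JSON) and `kubectl apply -f` it; virt-controller
// then creates a VirtualMachineInstance and its hosting pod once the VM runs.
const virtualMachine = {
  apiVersion: "kubevirt.io/v1",
  kind: "VirtualMachine",
  metadata: { name: "testvm", namespace: "default" },
  spec: {
    running: false, // start later, e.g. with `virtctl start testvm`
    template: {
      metadata: { labels: { "kubevirt.io/vm": "testvm" } },
      spec: {
        domain: {
          devices: {
            disks: [{ name: "containerdisk", disk: { bus: "virtio" } }],
          },
          resources: { requests: { memory: "128Mi" } },
        },
        volumes: [
          {
            name: "containerdisk",
            containerDisk: {
              // Demo image from the KubeVirt project's examples.
              image: "quay.io/kubevirt/cirros-container-disk-demo",
            },
          },
        ],
      },
    },
  },
};

console.log(JSON.stringify(virtualMachine, null, 2));
```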
Read more →
Your CI/CD Pipeline Is Not Ready To Ship AI Agents
2025-12-04 15:00 | Source: The New Stack
Let’s be honest with ourselves for a minute. If you look past the hype cycles, the viral Twitter demos and the astronomical valuation of foundation model companies, you will notice a distinct gap in the AI landscape. We are incredibly early, and our infrastructure is failing us. While every SaaS company has slapped a copilot sidebar onto its UI, actual autonomous agents are rare in the wild. I am referring to software that reliably executes complex and multistep tasks without human hand-holding. Most agents today are internal tools glued together by enthusiastic engineers to summarize Slack threads or query a SQL database. They live in the safe harbor of internal usage where a 20% failure rate is a quirky annoyance rather than a churn event. Why aren’t these agents facing customers yet? It is not because the models lack intelligence. It is because our delivery pipelines lack rigor. Taking an agent from cool demo to production-grade reliability is an engineering nightmare that few have solved because traditional CI/CD pipelines simply were not designed for non-deterministic software. We are learning the hard way that shipping agents is not an AI problem. It is a systems engineering problem. Specifically, it is a testing infrastructure problem. The Death of ‘Prompt and Pray’ For the last year, the industry has been obsessed with frameworks that promised magic. You give the framework a goal and it figures out the rest. This was the “prompt and pray” era. But as recent discussions in the engineering community highlight, specifically the insightful conversation around 12-Factor Agents, production reality is boringly deterministic. The developers actually shipping reliable agents are abandoning the idea of total autonomy. Instead, they are building robust and deterministic workflows where large language models (LLMs) are treated as fuzzy function calls injected at specific leverage points. When you strip away the black-box magic of the LLM, a production-grade agent starts to look a lot like a conventional microservice. It has a control flow, state and dependencies. It needs to interact with the world to be useful. The 12-Factor philosophy correctly argues that you must own your control flow. You cannot outsource your logic loop to a probabilistic model. If you do, you end up with a system that works 80% of the time and hallucinates itself into a corner the other 20%. So we build the agent as a workflow. We treat the LLM as a component rather than the architect. But once we settle on this architecture, we run headfirst into a wall that traditional software engineering solved a decade ago but which AI has reopened. That wall is integration testing. The Trap of Evals When teams start testing agents, they almost always start with evals. Evals are critical. You need frameworks to score your LLM outputs for relevance, toxicity and hallucinations. You need to know if your prompt changes caused a regression in reasoning. However, in the context of shipping a product, evals are essentially unit tests. They test the logic of the node, but they do not test the integrity of the graph. In a production environment, your agent is not chatting in a void. It is acting. It is calling tools. It is fetching data from a CRM, updating a ticket in Jira or triggering a deployment via an MCP (Model Context Protocol) server. The reliability of your agent is not just defined by how well it writes text or code. 
It is defined by how consistently it handles the messy and structured data returned by these external dependencies. The Integration Nightmare This is where the platform engineering headache begins. Imagine you have an agent designed to troubleshoot Kubernetes pod failures. To test this agent, you cannot just feed it a text prompt. You need to put it in an environment where it can do several things. It must call the Kubernetes API or an MCP server wrapping it. It must receive a JSON payload describing a CrashLoopBackOff. It must parse that payload. It must decide to check the logs. Finally, it must call the log service. If the structure of that JSON payload changes, or if the latency of the log service spikes, or if the MCP server returns a slightly different error schema, your agent might break. It might hallucinate a solution because the input context did not match its training examples. To test this reliably, you need integration testing. But integration testing for agents is significantly harder than for standard web apps. Why Traditional Testing Fails In traditional software development, we mock dependencies. We stub out the database and the third-party APIs. But with LLM agents, the data is the control flow. If you mock the response from an MCP server, you are feeding the LLM a perfect and sanitized scenario. You are testing the happy path. But LLMs are most dangerous on the unhappy path. You need to know how the agent reacts when the MCP server returns a 500 error, an empty list or a schema with missing fields. If you mock these interactions, you are writing the test to pass rather than to find bugs. You are not testing the agent’s ability to reason. You are testing your own ability to write mocks. The alternative to mocking is usually a full staging environment where you spin up the agent, the MCP servers, the databases and the message queues. But in a modern microservices architecture, spinning up a duplicate stack for every pull request is prohibitively expensive and slow. You cannot wait 45 minutes for a full environment provision just to test if a tweak to the system prompt handles a database error correctly. The Need for Ephemeral Sandboxes To ship production-grade agents, we need to rethink our CI/CD pipeline. We need infrastructure that allows us to perform high-fidelity integration testing early in the software development life cycle. We need ephemeral sandboxes. A platform engineer needs to provide a way for the AI developer to spin up a lightweight, isolated environment that contains: The version of the agent being tested. The specific MCP servers and microservices it depends on. Access to real (or realistic) data stores. Crucially, we do not need to duplicate the entire platform. We need a system that allows us to spin up the changed components while routing traffic intelligently to shared and stable baselines for the rest of the stack. This approach solves the data fidelity problem. The agent interacts with real MCP servers running real logic. If the MCP server returns a complex JSON object, the agent has to ingest it. If the agent makes a state-changing call like restart pod, it actually hits the service or a sandboxed version of it. This ensures the loop is closed. This is the only way to verify that the workflow holds up. Shifting Left on Agentic Reliability The future of AI agents is not just better models. It is better DevOps. 
If we accept that production agents are just software with fuzzy logic, we must accept that they require the same rigor in integration testing as a payment gateway or a flight control system. We are moving toward a world where the agent is just one microservice in a Kubernetes cluster. It communicates via MCP to other services. The challenge for platform engineers is to give developers the confidence to merge code. That confidence does not come from a green checkmark on a prompt eval. It comes from seeing the agent navigate a live environment, query a live MCP server and execute a workflow successfully. Conclusion Building the agent is the easy part. Building the stack to reliably test the agent is where the battle is won or lost. As we move from internal toys and controlled demos to customer-facing products, the teams that win will be those that can iterate fast without breaking things. They will be the teams that abandon the idea of “prompt and pray” and instead bring production fidelity to their pull request (PR) review. This requires a specific type of infrastructure focused on request-level isolation and ephemeral testing environments that work natively within Kubernetes. Solving this infrastructure gap is our core mission at Signadot. We allow platform teams to create lightweight sandboxes to test agents against real dependencies without the complexity of full environments. If you are refining the architecture for your AI workflows, you can learn more about this testing pattern at signadot.com. The post Your CI/CD Pipeline Is Not Ready To Ship AI Agents appeared first on The New Stack.
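As a rough sketch of what testing the unhappy path against a sandbox can look like, here is an ordinary integration test written with Node's built-in test runner. Everything project-specific (the troubleshootPod entry point, the sandbox base URL, the expected result shape) is hypothetical and not a Signadot API; the point is that the agent is exercised against a real, if sandboxed, dependency rather than a hand-written mock.

```typescript
import { test } from "node:test";
import assert from "node:assert/strict";

// Hypothetical agent entry point: takes a base URL for its tool/MCP
// dependencies and returns the action it decided to take.
import { troubleshootPod } from "./agent";

// Hypothetical sandbox routing: an ephemeral environment serving the changed
// MCP server, with shared baselines handling the rest of the stack.
const SANDBOX_BASE_URL =
  process.env.SANDBOX_BASE_URL ?? "http://mcp.sandbox.internal";

test("agent degrades gracefully when the log service returns a 500", async () => {
  // The sandboxed log service is (hypothetically) configured to fail, so the
  // agent sees a real error payload rather than a sanitized mock.
  const result = await troubleshootPod({
    baseUrl: SANDBOX_BASE_URL,
    pod: "checkout-7f9c",
    namespace: "default",
  });

  // The agent should surface the failure instead of hallucinating a fix.
  assert.equal(result.status, "needs-human");
  assert.match(result.summary, /log service unavailable/i);
});
```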
Read more →
Mooncake Brings Databricks Rich Transactional Processing
2025-12-04 14:00 | Source: The New Stack
All those AI agents that will soon be swarming about will need fresh data, which is causing the data platform community to urgently think about ways to better inject analytics directly into decision-making processes. In October, Databricks quietly acquired a technology that will provide a crucial piece to its emerging Lakebase platform for AI agents: Mooncake, a single package that supports both rich transactional processing and fast columnar analysis. Selling point? No ETL pipelines to manage. From within PostgreSQL itself, data can be tapped into for making routing decisions in the transaction process. Lakebase is a serverless Postgres service integrated into the company’s Lakehouse managed data platform. It is optimized for AI agents (especially the company’s own Agent Bricks). Databricks purchased serverless PostgreSQL provider Neon in May for $1 billion. This gave the company a PostgreSQL-based transactional platform, one that, according to Databricks, decoupled compute from storage. The next piece of the puzzle: Mooncake. OLTP and OLAP: Torn Asunder Mooncake was developed by Mooncake Labs, a start-up founded by three ex-SingleStore engineers to rethink how a combined transactional and analytics database system might operate. Traditionally, transactional database systems (OLTP) and analytics database systems (OLAP) have been run separately from one another (and often by separate departments) within the enterprise. The commonly-held fear has been that the latency of transactional processing — which needs to be fast — would be compromised by some long and/or computationally-heavy analytics jobs running on large data sets. So put OLTP, with its microsecond insert times needed for speedy transactions, over here; and the OLAP system, with its ability to scan massive tables for large-scale analysis, over yonder. This separation has since become burdensome, because the two need to exchange data. “The users are forced to manually duct tape them together with complex and fragile data pipelines that takes hours to sync and sometimes transform data into something that’s hard to read,” explained Mooncake Labs co-founder Cheng Chen, in a lecture at Carnegie Mellon University’s Database Group’s Future Data Systems Seminar Series. Network speeds and computational heft have advanced to the point where combining OLTP and OLAP could be a good idea, in that it opens a whole new vista of how transactions can be handled. OLTP and OLAP: Together Forever Chen was one of three co-founders who came from SingleStore, which offers a Hybrid Transactional/Analytical Processing (HTAP) database system of the same name (formerly MemSQL). A distributed database system, SingleStore unifies transactional and columnar analytics, as a way to combine these two types of data stores. With a single engine, it uses working memory for transactional rows and disk for column storage. It scales well, and can support multiple formats such as JSON, full-text and vector. But SingleStore’s design is monolithic, Chen lamented. Because it is run as a single stand-alone query engine, it must compete with the best of both OLTP and OLAP engines already in use. And those willing to adopt an entirely new database system simply to get the benefits of fast analytics on fresh data (for actions such as fraud detection) are relatively few in number. 
Mooncake Bridges PostgreSQL and Iceberg Engines Instead of trying to build “a magical engine” (Chen’s words) that does both kinds of processing, why not just recreate the functionality as a feature for existing systems? Mooncake set out to build a “composable” hybrid database system, Chen said. It is a framework and set of new features built on top of existing OLTP systems and OLAP formats. The engineering team chose to support PostgreSQL for transactions, for its runaway popularity as an open source database system. On the analytics side, they went with the open lakehouse formats of Apache Iceberg and (Databricks’ own) Delta Lake, so that data in either of these formats can be accessed by any conversant engine (DuckDB, StarRocks, Trino, Apache Spark). Mooncake: Not an Engine, Just a Feature Mooncake has two main components. One (“moonlink”) is a real-time layer on top of Iceberg that allows for “sub-second ingestion” of data. The second component (“pg_mooncake”) provides HTAP capability for PostgreSQL, allowing users to add analytical functions to determine transactional routing decisions. Together, they take a step toward closing the long-standing divide between transactional and analytics systems, opening the door to new possibilities built on fast analytics. The agents will be pleased. Check out Chen’s entire talk for a technical deep dive into the challenges of getting Mooncake to play nicely with both Iceberg and PostgreSQL. The post Mooncake Brings Databricks Rich Transactional Processing appeared first on The New Stack.
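A rough sketch of the pattern described above — an analytical aggregate inside the same Postgres session informing a transactional routing decision — might look like the following. The schema, threshold and merchant ID are invented for illustration, and the snippet uses plain SQL through psycopg rather than any pg_mooncake-specific syntax.

```python
# Rough sketch (invented schema): let an analytical aggregate inside Postgres
# inform a routing decision on the transactional path -- the HTAP pattern
# pg_mooncake is meant to make fast, shown here as ordinary SQL via psycopg.
import psycopg

with psycopg.connect("dbname=shop") as conn:
    with conn.cursor() as cur:
        # Analytical read: recent fraud rate for this merchant. With a
        # columnar-backed table this scan stays cheap; from the app's point
        # of view it is just SQL in the same session.
        cur.execute(
            """
            SELECT avg(CASE WHEN flagged THEN 1.0 ELSE 0.0 END)
            FROM payments
            WHERE merchant_id = %s AND created_at > now() - interval '1 hour'
            """,
            (42,),
        )
        fraud_rate = cur.fetchone()[0] or 0.0

        # Transactional write: route the new payment based on that signal.
        queue = "manual_review" if fraud_rate > 0.05 else "auto_capture"
        cur.execute(
            "INSERT INTO payment_queue (payment_id, queue) VALUES (%s, %s)",
            (1001, queue),
        )
    conn.commit()
```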
Read more →
Show HN: Kraa – Writing App for Everything
2025-12-04 07:35 | Source: Hacker News
Comments
Read more →
Louie Mantia on The Talk Show in July, Talking About Alan Dye and Liquid Glass
2025-12-03T22:30:35Z | Source: Daring Fireball
Back in July, I was lucky enough to have my friend Louie Mantia on The Talk Show to talk about Liquid Glass and (as I wrote in the show notes) “the worrisome state of Apple’s UI design overall”. This was probably my favorite episode of the show all year, and I think it holds up extremely well now that we’re all using Liquid Glass, across Apple’s platforms, in release versions. Included in the show notes was a link to Mantia’s essay making his case against Dye’s decade-long stint leading Apple’s UI design teams, “A Responsibility to the Industry”, which began thus: Firstly, I maintain that it makes absolutely no sense that Alan Dye has the power he has, because he simply has no taste. But what’s worse is that he wields that power so clumsily, so carelessly. And because it goes unchallenged, unchecked by someone higher than him, the entire industry suffers the consequences. Here’s Mantia today, regarding the news of Dye leaving Apple for Meta: And good riddance!! ★
Read more →
Alan.app
2025-12-03T22:14:04Z | Source: Daring Fireball
Tyler Hall, just one week ago: Maybe it’s because my eyes are getting old or maybe it’s because the contrast between windows on macOS keeps getting worse. Either way, I built a tiny Mac app last night that draws a border around the active window. I named it “Alan”. In Alan’s preferences, you can choose a preferred border width and colors for both light and dark mode. That’s it. That’s the app. The timing of this is remarkably serendipitous — releasing an app named “Alan” to fix an obvious glaring design shortcoming in recent versions of MacOS just one week before Alan Dye left Apple. (See Michael Tsai for more on the app’s name, including a callback to Greg Landweber’s classic Mac OS extension Aaron.) It’s worth following Hall’s “the contrast between windows” link, which points to his own post from five years ago lamenting the decline in contrast between active and inactive windows in MacOS. That 2020 post of Hall’s refers back to Steve Jobs’s introduction of Mac OS X 10.5 Leopard in 2007: As I was preparing the above video for this post, I completely forgot there was a final feature about the new Leopard Desktop that was highlighted in that keynote. Jobs took time out of a keynote to callout that it was now easier to tell which window is focused. At 1:29 in that clip, you’ll hear an outsized “Wooo!” from some of the audience just for this one improvement. Jobs even prepared a slide, highlighting “Prominent active window” as a noteworthy new feature. In 2007, the increase of visual prominence for the active window, going from 10.4 Tiger to 10.5 Leopard, drew applause from the audience. But the level of visual prominence indicating active/inactive windows was much higher in 10.4 Tiger than in any version of MacOS in the last decade under Alan Dye’s leadership. Nick Heer on Alan (the app, and, indirectly, the man): I wish it did not feel understandable for there to be an app that draws a big border around the currently active window. That should be something made sufficiently obvious by the system. Unfortunately, this is a problem plaguing the latest versions of MacOS and Windows alike, which is baffling to me. The bar for what constitutes acceptable user interface design seems to have fallen low enough that it is tripping everyone at the two major desktop operating system vendors. ★
Read more →
Nick Heer Obtained Video of Alan Dye’s Exit From Apple
2025-12-03T21:43:50Z | Source: Daring Fireball
That doesn’t look like one of the fancy Mitsubishi traction elevators at Apple Park, but otherwise, this jibes. ★
Read more →
Alan Dye Leaves Apple for Meta, Replaced by Longtime Designer Stephen Lemay
2025-12-03T19:59:22Z | Source: Daring Fireball
Mark Gurman, with blockbuster news at Bloomberg: Meta Platforms Inc. has poached Apple Inc.’s most prominent design executive in a major coup that underscores a push by the social networking giant into AI-equipped consumer devices. The company is hiring Alan Dye, who has served as the head of Apple’s user interface design team since 2015, according to people with knowledge of the matter. Apple is replacing Dye with longtime designer Stephen Lemay, according to the people, who asked not to be identified because the personnel changes haven’t been announced. Apple confirmed the move in a statement provided to Bloomberg News. “Steve Lemay has played a key role in the design of every major Apple interface since 1999,” Chief Executive Officer Tim Cook said in the statement. “He has always set an extraordinarily high bar for excellence and embodies Apple’s culture of collaboration and creativity.” It sounds like Dye chose to jump ship, and wasn’t squeezed out (as it seems with former AI chief John Giannandrea earlier this week). Gurman/Bloomberg are spinning this like a coup for Meta (headline: “Apple Design Executive Alan Dye Poached by Meta in Major Coup”), but I think this is the best personnel news at Apple in decades. Dye’s decade-long stint running Apple’s software design team has been, on the whole, terrible — and rather than getting better, the problems have been getting worse. ★
Read more →
Look How They Massacred My Boy
2025-12-03T02:37:09Z | Source: Daring Fireball
Todd Vaziri, on the HBO Max Mad Men fiasco: It appears as though this represents the original photography, unaltered before digital visual effects got involved. Somehow, this episode (along with many others) do not include all the digital visual effects that were in the original broadcasts and home video releases. It’s a bizarro mistake for Lionsgate and HBO Max to make and not discover until after the show was streaming to customers. I decided to help illustrate the changes by diving in and creating images that might do better than words. The first thing I noticed is that, at least for season one, the episode titles and order were totally jumbled. The puke episode is “Red in the Face”, not “Babylon”. So HBO Max not only ruined several episodes by “remastering” the wrong footage, but they both mis-numbered and mis-titled the episodes. Breathtaking ineptitude. Think about it. This is the entire raison d’être — streaming high quality movies and episodic series. That’s the one and only thing HBO Max does. And they have zero care or craft for what they do. They didn’t just do this to any show. They did it to one of the most cinematically beautiful and carefully crafted shows ever made. Vaziri’s post, as is his wont, is replete with illustrated and animated examples of the mistakes in HBO’s versions compared to the correct originals available from AMC and iTunes. Vaziri notes: The fun thing about this restoration mistake is that now we, the audience, get to see exactly how many digital visual effects were actually used in a show like “Mad Men”, which most would assume did not have any digital effects component. In this shot, not only were the techs and hose removed, but the spot where the pretend puke meets Slattery’s face has some clever digital warping to make it seem like the flow is truly coming from his mouth (as opposed to it appearing through a tube inches from his mouth, on the other side of his face). ★
Read more →
What Adobe’s New AI Assistant Can Teach Frontend Developers
2025-12-03 23:00 | Source: The New Stack
It’s tempting to see web and application accessibility as altruistic rather than profitable. But that’s not true, contends Navya Agarwal, a senior software engineer and technical lead at Adobe who focuses on frontend development. Agarwal is also an accessibility expert who actively contributes to the W3C Accessible Rich Internet Applications (ARIA) Working Group. “Building equitable products isn’t simply about altruism,” said Agarwal. “It can create opportunities for market expansion, penetration and sustainable growth. So that’s a section that is often left out by someone who is developing a new product, but building for all makes sure that you are getting more revenue at the end.” Adobe’s AI Assistant Prioritizes Accessibility Agarwal was on the team that built Adobe Express’ new AI Assistant, which was released in October and is in beta. The AI assistant soon will be integrated with ChatGPT Plus as well, she added. The assistant is basically a generic conversational interface designed to make creativity more accessible and intuitive for everyone, she said. “What we want to present to the world is a more humanly centered model where you focus on the intention, and the system helps you orchestrate everything else around you so it can go from any possibilities, basically creating images, rewriting content, making quick, quick edits, anything,” she said. Accessibility is often considered an add-on, rather than an essential part of the product. That’s why it’s often layered on top of the existing product created for a general audience, rather than embedded into the product development process. Adobe Express AI Assistant was designed to support accessibility from its inception. “It expands to cognitive disabilities, for example, things like ADHD, dyslexia, which are not really talked about right now; it’s underrepresented,” she said. “For example, if someone is going on a website who is facing dyslexia and ADHD, the website looks cluttered.” The offering shows what’s possible when AI is applied to accessibility. While many think of accessibility as relevant to the vision or hearing impaired, with AI it can accommodate other challenges as well. For instance, the Adobe Express AI Assistant can change design to be less cluttered for those with ADHD, autism or other sensory issues. It can also just be helpful to people as they age, she added. “Just imagine that you have agent where you only have a voice command; you’re just talking and it is … giving you the results,” she said. “All these are use cases that can be served with adaptive technology.” While AI does introduce the risk of hallucinations, Agarwal sees that as a lesser evil than having no text descriptions or support at all. “It expands to cognitive disabilities, for example, things like ADHD, dyslexia, which are not really talked about right now; it’s underrepresented.” — Navya Agarwal, senior software engineer and technical lead at Adobe As the tech world moves toward agentic AI, she foresees users having a digital personal shopping assistant to help users find clothes based on preferred parameters. Benefits to Developers With AI, developers are no longer limited to tactics that only assist the vision or hearing impaired, she said. Instead, users can tell the assistant their accommodation needs and the AI can create those, she said. That means users don’t have to tolerate a cluttered site or toolbar, for example; they can just talk to the web using voice commands or writing prompts. 
Some screen readers have already added a feature that lets users request an image description from ChatGPT or Claude without having to switch context, she said. Previously, developers could only add an alt-text description to an image that says something simple — this is a long-sleeve knitted jumper in black that’s 100% cotton, for instance. “But it doesn’t tell you so many different things, whether it’s lightweight or whether it’s chunky, etc.,” Agarwal said. “As AI enters the system, now we can just simply have our image being described in the context by using ChatGPT or Claude. Basically, my screen reader already has a feature that lets me request an image description from ChatGPT or Claude without having to switch context to do it.” Incorporating accessibility also offers benefits to developers themselves, she added. “By embedding more equitable practices into our product development process up front, rather than as an afterthought, we can enable teams to launch products faster, with lower risk and greater success for broader audiences,” Agarwal said. The post What Adobe’s New AI Assistant Can Teach Frontend Developers appeared first on The New Stack.
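The shift Agarwal describes — from a single static alt attribute to a richer, on-demand description — can be sketched roughly as follows. Here describe_image() is a hypothetical stand-in for whatever vision-capable model a product calls (ChatGPT, Claude or otherwise), and the fallback keeps the hand-written alt text when no model is available.

```python
# Conceptual sketch only: start from static alt text, upgrade to a richer
# model-generated description on request. describe_image() is a hypothetical
# stand-in for a call to a vision-capable model (ChatGPT, Claude, etc.).
def get_image_description(image_url: str, static_alt: str,
                          describe_image=None) -> str:
    if describe_image is None:
        return static_alt  # baseline: the hand-written alt attribute
    try:
        detail = describe_image(
            image_url,
            prompt="Describe this product photo for a screen-reader user, "
                   "including texture, weight and fit cues.",
        )
        return f"{static_alt}. {detail}"
    except Exception:
        return static_alt  # never leave the user with nothing


# Usage (hypothetical):
# get_image_description(url, "Black long-sleeve knitted jumper, 100% cotton")
```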
Read more →
Welcome to AI’s Messy Middle: Where 36x Gains Require Distinguished Engineers
2025-12-03 22:00 | Source: The New Stack
LAS VEGAS — Amazon Web Services CEO Matt Garman had a story to tell about Kiro, its new agentic IDE, in his keynote at AWS re:Invent. A distinguished engineer at his company, Anthony, led a team that rearchitected a project in 76 days with six developers. They had initially expected it to take 18 months with 30 people. Eye-opening stuff, enough to make an engineering lead run to the agent vending machine. Garman shared his story this fall with customers, who asked how Anthony’s team did it. That kind of question will be asked for a long time, which in itself reveals how little people know about the infrastructure, the model and how to use the agents that power what AWS needed a distinguished engineer and team to accomplish. Welcome to the messy middle. We are in the middle ages of AI workload development, deployment and management. It’s the messy middle, or the fun times, as one leading engineer said to me. It just depends on how you look at it. The cloud took 10 or more years to mature. AI’s maturity might take half that time or even less. In the product announcements at re:Invent, Garman showed how fast the pace is moving. But strikingly, these are innovations without established practices. It’s still more about how you achieve these fascinating results than about standardizing best practices, so you don’t have to build from scratch with GPUs, a dizzying number of models and agentic workflows that are brand new to everyone. Garman highlighted AWS’ massive scale. Yes, it generates $132 billion in annual revenue and has deployed 1 million Trainium chips, but that comes with trade-offs. Tech companies are inventing new architectures that are very cool. But at the same time, users are trying to use this new hardware with little understanding of how the infrastructure fits into their enterprise operations. Rapid development is exciting, but the quest for optimal architecture will take time and require significant adaptation, which is very new to most customers. Rapid Infrastructure Development Garman announced that Trainium is now generally available and previewed Trainium 4. AWS also launched both P6 GB200 and GB300 instances. Map these announcements to the issues that companies like Uber face, and you get a sense that the challenges with moving from cloud native to AI native will only get tougher. At KubeCon + CloudNativeCon North America last month, Uber talked a lot about how it uses multiple clouds, and what it takes to optimize AI workloads across them. Customers need these choices, but the reality has caught up to Uber, and it will for more and more customers as well. And what will it take to train the models? The people with capital and engineering talent will thrive. It’s a time of disruption, but how polarized will it get for the haves and have-nots? Case in point: Garman talked about an entire AWS campus dedicated to training Project Rainier for Claude, Anthropic’s large language model (LLM). That’s a whole campus for one project, a scenario that is outside what most companies can afford to do — or even have the talent to consider. Garman said AWS will offer AI factories, but inside enterprises. Why is that? The repatriation trend signals that customers want their data on their own infrastructure. It’s a significant shift. Cloud is still king, but there’s another constraint to consider: Power is the bottleneck. AWS will build what it compares to AWS regions. These are vertically integrated capabilities with Bedrock and other AWS services built in. 
But here’s the catch: The customer is responsible for providing the power and all the data center requirements to run AI workloads. Models, Models, Everywhere AWS announced four new Nova models: Amazon Nova Micro is text-only, helping with latency issues. Amazon Nova Lite is a multimodal model. Amazon Nova Pro is also a multimodal model with enhancements for accuracy, speed and cost. Amazon Nova Premier is the company’s most sophisticated model. Garman also discussed supporting models from Anthropic, OpenAI, Cohere and others. And Nova Forge is used to create versions of the AWS models, which they call novellas. The goal: Make it more affordable to build a model from scratch. In every technology era, proliferation is the rule, not the exception. After more than a decade of cloud native distributed workloads, convergence is now an aspiration with the proliferation of GPUs. We are in the age of specialization, not general workloads. At KubeCon, Uber’s Andrew Leung pointed to his company’s own struggle to get convergence — and it’s a leader in using AI workloads. Garman, for his part, stated, “We’ve never believed that there was going to be one model to rule them all.” But the proliferation does impact convergence, allowing enterprises to maintain vast, distributed workloads. At re:Invent, Garman talked about the extensive choice in models. But he did not address the big challenge engineers face: CPUs and GPUs are comparable but not interchangeable in practice. The best example comes from AWS. Garman talked about Kiro, the platform AWS developed. “Now I want to take a quick moment and dive deeper into one of the stories we heard,” he told the re:Invent audience. “The details are pretty high. This was a quote from Anthony, one of our distinguished engineers. Anthony was working on a rearchitecture project … ” But where are the details of the case study? Who is Anthony? And for a company like AWS, why did it take weeks? AWS sits in a great place. The Kiro team, being an AWS team, knows what infrastructure and which models to use. The team can adapt as it controls all aspects of the product development. But it still took weeks for those team members to reach the point where they could devise a real plan. They needed to figure out what the agents could and could not do. And this is one team. It raises questions about how AWS is faring in building out agentic architectures and managing state — all the sorts of issues that customers have limited resources to address. And then there’s the question of why we are hearing about Anthony at all. His team succeeded dramatically, and that says a lot in itself. That grand success set up AWS’ big news: “In fact, we’ve been so blown away that last week, all of Amazon decided to standardize on Kiro as our official AI development environment,” Garman said. How AI Agents Are Like Teenagers AWS is just starting its journey. It’s terrific how deeply its CEO’s excitement runs for AI workloads. The fact that people are asking how to follow its lead shows the approach is just starting to be used. The “messy middle” theme became evident throughout the keynote. Garman compared agents to raising teenagers. They need ground rules; agents need supervision. They’re young — there’s a lot to learn. The excitement at re:Invent is palpable. The keynote told of a grand new world where infrastructure and models serve as the foundation for agentic AI, and maybe even the wonders of a new world that can change so much. But these are new times.
It’s really cool, but the knowledge is not that transferable. Not quite yet. The post Welcome to AI’s Messy Middle: Where 36x Gains Require Distinguished Engineers appeared first on The New Stack.
Read more →
Combining Rust and Python for High-Performance AI Systems
2025-12-03 21:00 | Source: The New Stack
Python powers most AI and machine learning (ML). With its rich ecosystem — from TensorFlow and PyTorch to scikit-learn and Hugging Face Transformers — Python has become the go-to language for researchers, data scientists and engineers. But Python has a well-known limitation: speed. Its global interpreter lock (GIL) restricts concurrency, while its interpreted nature makes it orders of magnitude slower than compiled languages like C++ or Rust. On the other side of the spectrum is Rust: a systems programming language that delivers C++-level performance, memory safety without garbage collection and modern developer ergonomics. Rust is designed to handle high-performance, concurrent workloads — exactly the kind of workloads AI applications commonly demand in production. So, why not use the best of both worlds? Prototype and train models in Python, leveraging its mature ML ecosystem. Push performance-critical components (data processing, inference kernels, parallel workloads) to Rust and call them seamlessly from Python. This hybrid approach isn’t just theoretical, it already powers some of the most popular AI libraries today: Hugging Face Tokenizers are written in Rust for blazing speed with Python bindings for usability. Polars, a Rust-powered DataFrame library, routinely outperforms pandas while keeping a familiar Python interface. In this article, we’ll explore how to combine Rust and Python for building high-performance AI systems. You’ll learn: Why Rust complements Python in AI/ML. How to integrate Rust into Python with tools like PyO3 and Maturin. Practical examples of writing Rust functions, exposing them as Python modules and using them in AI workflows. Real-world case studies signalling the future of hybrid AI development. By the end, you’ll see how Rust can help overcome Python’s performance bottlenecks — without giving up the flexibility and ecosystem that make Python indispensable. Python has earned its dominance in AI and ML because of its simplicity and vast ecosystem. From NumPy to PyTorch and scikit-learn, most cutting-edge models and research code start in Python. But as projects transition from research to production, Python’s weaknesses start to show. This is where Rust shines. Let’s break down the complementarity. First and foremost: why Rust complements Python in AI/ML. 1. Performance at Scale Python is interpreted, and even with tools like NumPy or Cython, it struggles with raw computational throughput. Rust compiles to native machine code and offers C++-level performance with modern tooling. Heavy numerical kernels, matrix operations or custom ML layers can be implemented in Rust and called from Python, delivering massive speedups without rewriting the entire pipeline. Example: Hugging Face’s tokenizers library achieved significantly greater performance improvements than its pure Python counterpart by rewriting the core in Rust. 2. Concurrency Without the Global Interpreter Lock Python’s GIL prevents true, multithreaded execution of Python bytecode. This is a bottleneck when processing large datasets or running parallel inference workloads. Rust has fearless concurrency: Its ownership and borrowing system ensures memory safety across threads, enabling efficient multithreaded data loaders, parallel preprocessing or distributed workloads — things Python alone struggles with. 3. Memory Safety Without Garbage Collection C++ is traditionally used for speed, but it comes with risks, like segmentation faults and memory leaks. 
Rust guarantees memory safety at compile time with zero-cost abstractions — no runtime overhead, no dangling pointers, no null dereferences. For AI systems running 24/7 in production (think cloud inference services or edge devices), this reliability is critical. 4. Ecosystem Synergy Python has mature AI/ML libraries, but Rust’s ecosystem is growing in complementary areas, including: Polars (DataFrames) for high-performance data processing. Burn (deep learning framework in Rust). tch-rs (bindings to LibTorch for training and inference). Many Rust libraries provide Python bindings out of the box, letting developers integrate them without leaving Python’s comfort zone. 5. Production-Grade AI Services Training is usually done in Python. However, serving models at scale demands speed, stability and efficiency. Rust is increasingly used to build inference servers and APIs (via Axum, Actix-web or gRPC). This allows teams to keep training pipelines in Python while deploying Rust-backed services that are lean, safe and fast. How To Integrate Rust Into Python With PyO3 and Maturin There are several ways to connect Rust and Python (FFI, cffi, ctypes, etc.), but the most developer-friendly approach today is using: PyO3, a Rust library for writing Python bindings. Maturin, a build tool that compiles Rust code into Python packages (wheels). This combination lets you: Write Rust code. Compile it into a Python module. Import it with import my_rust_module just like any normal Python package. Step 1: Install Dependencies Make sure you have: Rust (latest stable): curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh Python (≥3.8 recommended). Maturin (install via pip): pip install maturin Step 2: Create a New Rust Project Make a new Rust library project: cargo new --lib rust_python_demo cd rust_python_demo Next, update Cargo.toml to include PyO3: [package] name = "rust_python_demo" version = "0.1.0" edition = "2021" [lib] name = "rust_python_demo" crate-type = ["cdylib"] [dependencies] pyo3 = { version = "0.22", features = ["extension-module"] } Step 3: Write Rust Code (With Python Bindings) Open src/lib.rs and replace its contents: use pyo3::prelude::*; use pyo3::wrap_pyfunction; /// A simple function to add two numbers. #[pyfunction] fn add_numbers(a: i32, b: i32) -> i32 { a + b } /// A function that computes dot product of two vectors. #[pyfunction] fn dot_product(vec1: Vec<f64>, vec2: Vec<f64>) -> PyResult<f64> { if vec1.len() != vec2.len() { return Err(pyo3::exceptions::PyValueError::new_err( "Vectors must be of the same length", )); } Ok(vec1.iter().zip(vec2.iter()).map(|(x, y)| x * y).sum()) } /// Define the Python module #[pymodule] fn rust_python_demo(_py: Python, m: &PyModule) -> PyResult<()> { m.add_function(wrap_pyfunction!(add_numbers, m)?)?; m.add_function(wrap_pyfunction!(dot_product, m)?)?; Ok(()) } Step 4: Build the Python Package Run Maturin in develop mode (so you can import locally): maturin develop This compiles the Rust code into a Python module (rust_python_demo) and installs it into your current Python environment. Step 5: Use in Python Now, open a Python shell or script: import rust_python_demo print(rust_python_demo.add_numbers(5, 7)) # Output: 12 print(rust_python_demo.dot_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])) # Output: 32.0 It works just like any other Python module, but the core logic is running at Rust speed. Practical Example: Rust Functions in Python AI Workflows Fast Data Preprocessing with Rust Data preprocessing is often a bottleneck in ML pipelines. 
To normalize a dataset (scale values between 0 and 1) in Python, this would be written with loops or NumPy. Here’s how to implement it in Rust and call it from Python. Rust (src/lib.rs): use pyo3::prelude::*; use pyo3::wrap_pyfunction; /// Normalize a list of floats between 0 and 1 #[pyfunction] fn normalize(data: Vec<f64>) -> PyResult<Vec<f64>> { if data.is_empty() { return Ok(vec![]); } let min = data.iter().cloned().fold(f64::INFINITY, f64::min); let max = data.iter().cloned().fold(f64::NEG_INFINITY, f64::max); if (max - min).abs() < f64::EPSILON { return Ok(vec![0.0; data.len()]); // all values the same } Ok(data.iter().map(|x| (x - min) / (max - min)).collect()) } #[pymodule] fn rust_python_demo(_py: Python, m: &PyModule) -> PyResult<()> { m.add_function(wrap_pyfunction!(normalize, m)?)?; Ok(()) } Python: import rust_python_demo import numpy as np data = np.random.rand(1_000_000).tolist() normalized = rust_python_demo.normalize(data) print(f"First 5 normalized values: {normalized[:5]}") With large datasets, the Rust version is significantly faster than pure Python loops. Real-World Use Case Studies This hybrid approach is already proven in production: Hugging Face Tokenizers. Originally in Python, too slow for large-scale natural language processing (NLP) preprocessing. Rewritten in Rust with Python bindings. Achieved significant speedups. Polars DataFrame Rust core + Python bindings. Outperforms pandas in many data manipulation tasks. Growing adoption in ML pipelines for big data preprocessing. PyTorch + Custom Ops Researchers implement custom tensor operations in C++ for performance. Rust bindings (tch-rs) are opening new doors for safer, modern low-level ops. The Future of Hybrid AI Development We’re seeing a clear trend: Python remains the interface language for research, prototyping and orchestration. Rust is emerging as the performance layer in AI systems for data handling, inference and deployment. New Rust-native ML frameworks like Burn and Linfa show that Rust might eventually compete head-to-head with Python libraries. In the near future, expect: More Rust-backed Python libraries (following the Hugging Face / Polars model). Increased use of Rust for production inference servers, while training stays in Python. AI edge devices and WebAssembly deployments relying heavily on Rust’s portability and efficiency. The bottom line: AI thrives on Python’s flexibility and vast ecosystem of libraries. But as we’ve seen, Python alone struggles with performance bottlenecks, concurrency limitations and the demands of production-grade systems. This is where Rust becomes the perfect companion. By integrating Rust into Python workflows: You gain near-C++ performance while keeping the expressiveness and ecosystem of Python. You overcome the GIL with Rust’s fearless concurrency. You deploy safer, more reliable AI services that can run at scale without memory leaks or runtime crashes. These practical examples — from data normalization to dot product benchmarking — show how easy it is to expose Rust functions as Python modules using PyO3 and maturin. These aren’t just academic exercises; they mirror real-world use cases already adopted by industry leaders. Hugging Face, Polars and others are proving that the hybrid Rust + Python model works in the real world. Looking ahead, we’re likely to see: More Rust-backed Python libraries that keep Python at the forefront of research but quietly replace slow Python cores with blazing-fast Rust implementations. 
Growing adoption of Rust in production inference services, particularly for edge devices and real-time AI. A gradual rise of Rust-native ML frameworks that may one day rival TensorFlow and PyTorch. The future of AI development is not Python or Rust. It’s Python and Rust together, a partnership that combines the best of both worlds: Python’s ease of use with Rust’s uncompromising performance and safety. For developers and teams, the message is clear: You don’t need to abandon Python to build high-performance AI. Instead, embrace Rust where it matters most — in the performance-critical, parallel and safety-sensitive layers of your stack. The post Combining Rust and Python for High-Performance AI Systems appeared first on The New Stack.
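If you have built the rust_python_demo module with maturin develop as shown above, a quick (and admittedly unscientific) comparison against a pure-Python loop looks like this; the exact speedup will vary by machine and data size.

```python
# Quick, unscientific timing sketch. Assumes the rust_python_demo module from
# this article has been built and installed via `maturin develop`.
import random
import time

import rust_python_demo

data = [random.random() for _ in range(1_000_000)]

def normalize_py(values):
    lo, hi = min(values), max(values)
    if hi - lo == 0:
        return [0.0] * len(values)
    span = hi - lo
    return [(v - lo) / span for v in values]

start = time.perf_counter()
normalize_py(data)
py_secs = time.perf_counter() - start

start = time.perf_counter()
rust_python_demo.normalize(data)
rs_secs = time.perf_counter() - start

print(f"pure Python: {py_secs:.3f}s  Rust via PyO3: {rs_secs:.3f}s")
```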
Read more →
AWS Makes It Easier To Customize AI Models in Bedrock and SageMaker Without a Ph.D.
2025-12-03 19:30 | Source: The New Stack
LAS VEGAS — Yesterday, AWS announced Nova Forge, a new way for enterprises to customize Amazon‘s family of Nova large language models (LLMs) with their own data. Today, it’s addressing a very similar need by adding model customization options to its Amazon Bedrock and SageMaker AI services. As Swami Sivasubramanian, AWS’s VP of Agentic AI, told me in an interview ahead of today’s announcement, serverless model customization in SageMaker takes a different approach from what the company is doing with Nova Forge. SageMaker AI Model Customization At its core, SageMaker had always been about building machine learning models — with foundation models only recently added to the mix — based on a company’s own data, and then helping them deploy and manage those models over their lifecycle. “This is different from the Nova Forge, where you can actually, as an engineer who doesn’t know anything about [supervised fine-tuning], RL [Reinforcement Learning] or any of it, you can chat with the agent and say: ‘Here is my use case. Here is the data set I have. How should I customize it?’ And it will guide you through, all the way from supervised fine-tuning to RL to how to go about it. And then it’ll kickstart all of it end-to-end.” As part of this process, the tool will even generate its own synthetic data. For developers who want more control, there is also a second agentic experience (AWS describes this one as the “self-guided” approach). Developers get more control over every step of the process, but as AWS notes, they still won’t have to manage any of the infrastructure that runs these processes and instead get to focus on finding the right customization techniques and tweaking those. Sivasubramanian stressed that this capability was previously only available to specialized AI scientists and out of reach for most developers. He also noted that this is a fully serverless product — like the rest of SageMaker. Reinforcement Fine-Tuning on Bedrock As for Bedrock, which is AWS’s fully-managed service for accessing foundation models from Amazon itself, Anthropic, Mistral and others, the focus is on Reinforcement Fine Tuning (RFT). As with Nova Forge, AWS argues that it remains too hard for developers to set up the training pipelines and infrastructure to effectively use this technique to tune models for their specific use cases. Reinforcement Fine-Tuning essentially involves tuning a model to perform well on a given task by having another model grade every answer, with those answers then being incorporated into the model’s weights. As with other RL techniques, this is a reward-based system, with the grading model providing those scores and rewards. For this service, developers can choose different reward functions — AI-based, rule-based or a ready-to-use template — and Bedrock will handle the fine-tuning process from there. “No Ph.D. in machine learning required — only a clear sense of what good results look like for the business,” AWS notes in its press release. AWS argues that it is seeing an average of 66% accuracy gains over base models for its customers who use this technique — all while also making the models easier and faster to run. Competition It’s worth noting that AWS isn’t the first to market with many of these features. Google’s Vertex AI offers a model customization suite that offers quite a few reinforcement learning options. Similarly, Microsoft’s AI Foundry also offers fine-tuning services. The post AWS Makes It Easier To Customize AI Models in Bedrock and SageMaker Without a Ph.D. 
appeared first on The New Stack.
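Bedrock's actual RFT configuration is not shown here, but as a concept sketch, a rule-based reward function is just a grader that scores a model's answer against what "good results" look like for the business. The field names and scoring weights below are invented for illustration.

```python
# Concept sketch only -- not the Bedrock API. A rule-based reward function
# scores a model's answer; the RFT loop then reinforces high-scoring behavior.
# Here "good" means: valid JSON carrying the fields the business needs.
import json

def rule_based_reward(model_answer: str) -> float:
    try:
        parsed = json.loads(model_answer)
    except json.JSONDecodeError:
        return 0.0  # unusable output earns no reward
    if not isinstance(parsed, dict):
        return 0.0
    score = 0.0
    if {"order_id", "refund_amount"} <= parsed.keys():
        score += 0.7                      # required fields present
    if isinstance(parsed.get("refund_amount"), (int, float)):
        score += 0.3                      # correctly typed amount
    return score

# rule_based_reward('{"order_id": "A1", "refund_amount": 12.5}') -> 1.0
```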
Read more →
How To Build AI Agents 3 Ways With Google ADK
2025-12-03 19:00 | Source: The New Stack
Google‘s Agent Development Kit (ADK) has rapidly become a foundational framework for building AI agents. Introduced at Google Cloud NEXT 2025, ADK powers agents across Google products, including Gemini Enterprise and the Google Customer Engagement Suite. What makes ADK particularly compelling for developers is its flexibility. Developers can build agents using Python code, YAML configuration files or a drag-and-drop visual interface — depending on workflow preferences and use case requirements. In this tutorial, I’ll walk you through all three approaches to building your first “Hello World” agent with ADK. By the end, you’ll have a functional agent running locally using each method, giving you a foundation to choose the right approach for your projects. Understanding the 3 Approaches Before diving into implementation, let’s understand what each approach offers: Imperative agents (Python): This code-first approach gives you maximum flexibility and control. You define agent logic, tools and orchestration directly in Python, making it ideal for complex agents that need custom logic, integration with existing codebases or sophisticated multiagent systems. The Python approach also supports any large language model (LLM) through LiteLLM integration. Declarative agents (YAML): Introduced in August 2025 with the Agent Config feature, this approach lets you define agents using YAML configuration files. It reduces boilerplate and makes agents easier to understand at a glance — particularly useful for simpler agents or when you want non-developers to understand agent behaviour. Visual Agent Builder (GUI): Launched in ADK v1.18.0, the Visual Agent Builder is a browser-based IDE that combines a visual workflow designer, configuration panels and an AI assistant. You can design multiagent systems through drag-and-drop interactions and natural language conversations, with the tool generating proper YAML configurations under the hood. Prerequisites Before we begin, ensure you have the following: Python 3.10 or higher A code editor Terminal access Either a Google AI Studio API key or a Google Cloud project with Vertex AI enabled Step 1: Setting up the Environment Let’s start by creating a virtual environment and installing ADK. Open your terminal and run: python -m venv .venv source .venv/bin/activate Install the ADK package: pip install google-adk Verify the installation: adk --version You should see the installed ADK version (1.18.0 or higher is required for the Visual Agent Builder). Step 2: Configure Model Access ADK needs access to an LLM. The simplest option for getting started is using Google AI Studio with a free API key. Obtain your API key from Google AI Studio and keep it accessible, as you’ll need it for the next steps. Approach 1: Building an Imperative Agent With Python The imperative approach is the most powerful method, giving you full control over agent behavior through code. Let’s build a simple greeting agent that demonstrates the core concepts. Create the Project Structure Create a new directory for your agent project: mkdir hello_agent cd hello_agent Create the following files inside the hello_agent directory: __init__.py from . import agent This file marks the directory as a Python package and imports the agent module. agent.py from google.adk.agents import Agent def greet_user(name: str) -> dict: """Greets a user by name. Args: name (str): The name of the user to greet. Returns: dict: A greeting message with status. """ return { "status": "success", "message": f"Hello, {name}! 
Welcome to Google ADK. I'm your first AI agent!" } def get_agent_info() -> dict: """Returns information about this agent. Returns: dict: Information about the agent's capabilities. """ return { "status": "success", "info": "I am a Hello World agent built with Google ADK using Python. " "I can greet users and tell them about myself." } root_agent = Agent( name="hello_agent", model="gemini-2.0-flash", description="A friendly greeting agent that welcomes users to Google ADK.", instruction="""You are a friendly and helpful greeting agent. Your primary purpose is to: 1. Greet users warmly when they provide their name using the greet_user tool 2. Explain what you are when asked using the get_agent_info tool 3. Be enthusiastic about introducing users to Google ADK Always use the available tools to respond appropriately to user requests.""", tools=[greet_user, get_agent_info], ) The Agent class is the core building block in ADK. Notice how we define tools as regular Python functions with type hints and docstrings. ADK uses these to help the LLM understand when and how to call each tool. .env GOOGLE_GENAI_USE_VERTEXAI=0 GOOGLE_API_KEY=YOUR_API_KEY_HERE Replace the placeholder values with your actual credentials. Run the Agent Navigate to the parent directory of your agent folder: cd .. Run the agent in terminal mode: adk run hello_agent You should see a prompt indicating the agent is running: Running agent hello_agent, type exit to exit. [user]: Try these interactions: [user]: Hello, my name is Jani [user]: What can you do? [user]: Tell me about yourself The agent will use the appropriate tools to respond. Type exit to quit. Approach 2: Building a Declarative Agent With YAML The declarative approach using YAML configuration files simplifies agent creation, especially for straightforward use cases. The Agent Config feature generates the same underlying agent structure but with less code. Create the Config-Based Project Use the ADK CLI to generate a config-based agent project: adk create yaml_hello_agent --type=config Accept the defaults and complete the steps. Define the Agent in YAML Open yaml_hello_agent/root_agent.yaml and replace its contents with: name: hello_yaml_agent model: gemini-2.0-flash description: A friendly greeting agent built with YAML configuration. instruction: | You are a friendly and helpful greeting agent. Your primary purpose is to: 1. Greet users warmly when they provide their name using the greet_user tool 2. Explain what you are when asked using the get_agent_info tool 3. Be enthusiastic about introducing users to Google ADK Always use the available tools to respond appropriately to user requests. tools: - name: yaml_hello_agent.greet_user - name: yaml_hello_agent.get_agent_info The YAML structure mirrors the Python Agent class parameters, but in a more readable format. Notice how tools are referenced by their module path. Create the Tools Module A powerful feature of ADK’s YAML config is that you can mix in Python code. Update __init__.py file in the yaml_hello_agent folder: def greet_user(name: str) -> dict: """Greets a user by name. Args: name (str): The name of the user to greet. Returns: dict: A greeting message with status. """ return { "status": "success", "message": f"Hello, {name}! Welcome to Google ADK via YAML config!" } def get_agent_info() -> dict: """Returns information about this agent. Returns: dict: Information about the agent's capabilities. """ return { "status": "success", "info": "I am a Hello World agent built with YAML configuration. 
" "I demonstrate the declarative approach to ADK agents." } Your project structure should now look like: yaml_hello_agent/ ├── root_agent.yaml ├── __init__.py └── .env Run the YAML-Based Agent Navigate to the parent directory and run: adk run yaml_hello_agent Test it with the same prompts: [user]: Hi, I'm Jani [user]: What are you? The agent responds identically to the Python version but is defined entirely through configuration. Approach 3: Building an Agent With the Visual Agent Builder The Visual Agent Builder, introduced in ADK v1.18.0, is a browser-based IDE that transforms how you build agents. It combines a visual workflow designer, configuration panels and an AI assistant that lets you design agents through drag-and-drop interactions and natural language conversations. Launch the Visual Agent Builder From any directory, run: adk web Open http://localhost:8000/dev-ui/ in your browser to access the Visual Agent Builder. Click the “+” button next to the dropdown and enter the name visual_hello_agent: Instead of manually configuring the agent, let’s use the AI Assistant. In the right panel, type: Create a simple greeting agent that can: 1. Greet users by name when they introduce themselves 2. Tell users about itself when asked Use gemini-2.5-flash as the model. Keep it simple with just two tools. The AI Assistant will generate a complete agent configuration, including: Proper agent name and description Model selection Detailed instructions Click the Save button, then exit the builder mode. You can now chat with the agent. Comparing the 3 Approaches After building agents with all three methods, here’s how they compare: Aspect Imperative (Python) Declarative (YAML) Visual Builder Best for Complex logic, CI/CD Simple agents, collaboration Prototyping, learning Learning curve Moderate Low Lowest Flexibility Highest Medium Medium Model support All (via LiteLLM) Gemini only Gemini only Version control Excellent Excellent Good (exports YAML) Non-developer friendly No Somewhat Yes Debugging Manual Manual Built-in tracing Key considerations: The Python approach is best when you need maximum control, custom integrations or support for non-Gemini models through LiteLLM. The YAML approach works well for straightforward agents where you want the simplicity of configuration files with the ability to mix in Python tools. The Visual Builder excels at rapid prototyping, learning ADK concepts and collaborating with non-developers who can describe requirements in natural language. In practice, these approaches complement each other. You might use the Visual Builder to prototype and understand an architecture, then export the YAML for version control and CI/CD pipelines. Looking Ahead This tutorial covered the essential steps to build your first AI agents using Google ADK’s three development approaches. Each method has its strengths, and the framework is designed so you can move fluidly between them — starting visually, exporting to YAML and dropping into Python when you need advanced functionality. In subsequent tutorials, we’ll explore advanced ADK capabilities, including multiagent systems with Sequential, Parallel and Loop patterns, tool integration with Model Context Protocol (MCP) servers, session management and memory persistence, and deployment to Vertex AI Agent Engine. The foundation you’ve built here will serve you well as we tackle increasingly sophisticated agentic workflows. The post How To Build AI Agents 3 Ways With Google ADK appeared first on The New Stack.
Read more →
When To Log, and When To Shut Up
2025-12-03 17:00 | Source: The New Stack
Let’s be honest: Most logs are just noise. [INFO] Starting process … probably. [DEBUG] Made it to line 42 — still alive. [TRACE] Function entered. Leaving soon. [INFO] User clicked a button. Which one? No idea. [WARN] Everything’s fine, just felt like warning you. [DEBUG] Variable x = 7. Might change. Might not. [INFO] Operation completed successfully (we think). [TRACE] Loop iteration #12 of infinite sadness. [DEBUG] Placeholder for meaningful message. [INFO] Shutting down gracefully … except when not. As developers, we too often sprinkle logs like confetti — every function entry, every variable, every heartbeat. Before long, terabytes of meaningless lines pile up, filling dashboards no one reads. We pay millions of dollars to observability vendors just to warehouse our junk. Every useless log line burns compute, disk and dollars. Logging without intent isn’t observability, it’s littering. Even with modern observability platforms that dramatically increase compression through columnar storage, there’s no reason to log everything. It still turns root cause analysis into a needle-in-a-haystack problem, diluting the signal you actually need — and you’ll pay more for the privilege. We need to be selective. Log what helps us understand the system, debug real issues or explain business impact — and shut everything else up. The Philosophy of a Log Line Every log line is a choice, not a reflex. If it’s not helping your future self track down a bug at 3 a.m., delete it. Logging isn’t journaling; keep it minimal, clear and actually useful. Before you hit logger.info, stop and ask: Would I actually grep this? If not, delete it. Logs aren’t narration; they’re evidence. They exist to tell you what the system was thinking when things went wrong. Logs shouldn’t be relegated to the end of the observability chain. They’re not just a microscope for confirmation but a map for discovery. Sometimes the quickest way to insight is to explore raw text: grep, filter and follow intuition. Logs invite curiosity — they reveal nuance that metrics might smooth over and context that traces can’t express. Treat them not as the final resort but as a living source of truth, open to exploration from the very start. Context or It Didn’t Happen “Error occurred” without inputs, IDs or state means nothing. Add enough context to reconstruct the moment — request IDs, user IDs, input parameters, operation names. These days, with OpenTelemetry, you get trace and span IDs for free. Use them. Logs connected to traces (and even metrics) by trace IDs are infinitely more valuable than isolated lines of text. Logs aren’t a standalone pillar; they’re the closing chapter of your root cause analysis. You alert on metrics, investigate through traces and then drop into logs to see what actually happened. When your logs are linked by trace and span IDs, they stop being noise and start being evidence — tightly scoped, contextual and directly tied to the path of a single request. That’s observability with intent, not a wall of text. Well-Structured, Not Free Text Free-text logging is obsolete. Structured logs, whether JSON, CSV or key-value, aren’t just easier to query; they’re the foundation for analytics. Once logs have structure, patterns emerge: “This error started spiking last week.” “This happens mostly after event X.” “This warning correlates with this specific deployment.” The future of logging isn’t reading one line; it’s seeing the pattern across thousands. Structured logging makes charting easy and efficient.
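To make the difference concrete, here is a minimal stdlib-only sketch: the same event, emitted as a structured record carrying request and trace context instead of free text. The field names (request_id, trace_id, ctx) are illustrative; in practice they come from your middleware or OpenTelemetry context.

```python
# Minimal sketch: emit structured, context-rich log records with the stdlib.
# Field names (request_id, trace_id) are illustrative; in practice they come
# from your middleware or from OpenTelemetry context propagation.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        payload.update(getattr(record, "ctx", {}))  # attach structured context
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payments")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Instead of: log.error("Error occurred")
log.error(
    "charge declined",
    extra={"ctx": {"request_id": "req-9f3", "trace_id": "4bf92f35",
                   "amount_cents": 1299}},
)
```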
While many observability platforms offer schema on read, that flexibility comes at a cost. Every query forces the system to scan and parse raw text, line by line, to infer a structure that should have existed in the first place. These queries are computationally expensive, slower and more difficult for a user to write. Prestructured logs flip that inefficiency on its head. When your data already has shape, you can take advantage of column-oriented storage and native aggregation — querying, visualizing and correlating events in milliseconds instead of minutes. Know When To Measure, Not Just When To Log Not every event belongs in the log stream. Some things deserve structure and timing — exactly what spans and metrics are for. If you’re measuring latency, user flow or distributed causality, emit a span instead. Spans capture duration, context and relationships across services and tell you why something was slow or broken, where a log can only shout that it happened. The same logic applies to metrics, turning repetitive logs into real signals you can alert on and aggregate efficiently. If you find yourself logging the same message hundreds of times per second, you’re not observing, you’re just wasting storage. Measure it once, summarize it, and let your metrics and traces do the heavy lifting. Log Levels Are For Humans, Not Machines Logging isn’t a personal debugging diary; it’s a shared artifact for your future teammates. Every line should help someone understand what happened without guessing what you meant. Write logs for the next incident, not your current mood. Your logs tell the story of your system. Make it one worth reading, for example: ERROR: Page a human. Something’s broken. WARN: Unexpected but survivable. Investigate later. INFO: Routine system behavior worth knowing. DEBUG/TRACE: Temporary developer insight — should rarely leave your laptop. Be deliberate. Don’t mark something as an error unless it truly requires action. Overusing ERROR numbs your alerts and trains teams to ignore what matters. Every log level should communicate intent: what needs fixing now, what needs watching and what can be ignored. That said, trace logging has its place. For example, behind the scenes for ClickHouse Cloud, we trace-log extensively to help our engineers diagnose performance issues and support customers at scale. It’s a deliberate exception — necessary when you operate a distributed database serving thousands of workloads in real time. For most applications, though, this level of verbosity isn’t observability; it’s just noise. Tools To Help You Log Less and Log Smarter Rich SDKs and powerful filters exist so you don’t have to just “log everything.” Use them. Modern OpenTelemetry Collector SDKs let you be prescriptive about what you log: You can instrument your code so that only meaningful log lines are created, and you can filter or drop everything else at ingest or collection time. For example: The filter processor supports dropping unwanted logs, metrics or traces using the OpenTelemetry Transformation Language (OTTL) with conditions like severity, resource attributes or content patterns. If your observability platform allows, you can filter at agent time, collector gateway or ingest time so unnecessary logs never get written, stored or indexed (saving you compute, storage and query cost). If administrators spot users generating frivolous logs, they can aggressively filter them, either in the pipeline or by forcing minimal logging policies. 
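The same discipline can start inside the application itself. As a stand-in for an OTTL filter rule in the collector, here is a small stdlib logging.Filter that drops known-noisy lines before they ever leave the process; the substrings and logger name are examples only.

```python
# Application-side stand-in for pipeline filtering: drop known-noise records
# before they are emitted at all. The substrings below are just examples.
import logging

NOISY_SUBSTRINGS = ("heartbeat", "Function entered", "Loop iteration")

class DropNoise(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno <= logging.DEBUG:
            return False  # debug chatter should rarely leave your laptop
        return not any(s in record.getMessage() for s in NOISY_SUBSTRINGS)

logger = logging.getLogger("worker")
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())
logger.addFilter(DropNoise())

logger.debug("Loop iteration #12 of infinite sadness")  # dropped: debug noise
logger.info("Loop iteration #13")                        # dropped: known noise
logger.warning("queue depth above threshold: 12000")     # kept
```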
If you’re using a proprietary observability platform, it most likely provides similar filtering or ingestion-control tools, even if they aren’t advertised loudly. Log With Purpose, or Don’t Log at All Observability isn’t about volume. It’s about clarity. Every log line should earn its place by explaining something that metrics and traces can’t. Logging without intent just burns money and buries insight. Be deliberate. Use structure. Add context. Know when to measure, when to trace and when to say nothing. Modern tools make it easier than ever to be disciplined, but the discipline still has to come from you. In the end, great logging isn’t about capturing everything that happens. It’s about capturing what matters. The post When To Log, and When To Shut Up appeared first on The New Stack.
Read more →
Your stack, your rules: Introducing custom agents in GitHub Copilot for observability, IaC, and security
2025-12-03 17:00 | Source: GitHub Engineering
Every engineering team has its unwritten rules. How you structure Terraform modules. Which dashboards you trust. How database migrations must be handled (never at midnight). And your work stretches across more than your editor into observability, security, CI/CD, and countless third-party tools. GitHub Copilot isn’t just here to help you write code. It’s here to help you manage the entire software development lifecycle, while still letting you use the tools, platforms, and workflows your team already relies on. Custom agents bring that full workflow into Copilot. We’re introducing a growing ecosystem of partner-built custom agents for the GitHub Copilot coding agent (plus the option to create your own). These agents understand your tools, workflows, and standards—and they work everywhere Copilot works: In your terminal through Copilot CLI for fast, end-to-end workflows In VS Code with Copilot Chat In github.com in the Copilot panel Let’s jump in. Run custom agents in the GitHub Copilot CLI Copilot CLI is the fastest way to run multi-step tasks, automate workflows, and integrate agents into scripts or CI. If you live in the terminal, custom agents feel like native extensions of your workflow. Get started with Copilot CLI > What custom agents actually are Custom agents are Markdown-defined domain experts that extend the Copilot coding agent across your tools and workflows. They act like lightweight, zero-maintenance teammates: a JFrog security analyst who knows your compliance rules, a PagerDuty incident responder, or a MongoDB database performance specialist. Defining one looks like this: --- name: readme-specialist description: Expert at creating and maintaining high-quality README documentation --- You are a documentation specialist focused on README files. Your expertise includes: - Creating clear, structured README files following best practices - Including all essential sections: installation, usage, contributing, license - Writing examples that are practical and easy to follow - Maintaining consistency with the project's tone and style Only work on README.md or documentation files—do not modify code files. Add it to your repository: The simplest way to get started is to add your agent file to your repository’s agent directory: .github/agents/readme-specialist.md Your agent appears instantly in: GitHub Copilot CLI github.com in the control plane VS Code in Copilot Chat You can also define agents at: Repository level: .github/agents/CUSTOM-AGENT-NAME.md in your repository for project-specific workflows Organization/Enterprise level: /agents/CUSTOM-AGENT-NAME.md in a .github or .github-private repository for broader availability across all repositories in your org Try a partner-built custom agent in under 60 seconds Custom agents are just Markdown files. Add it to your repository and run it from GitHub Copilot CLI, VS Code, or github.com. 1. Pick the agent you want to try. 
All partner-built agents are available today (and we have these in our repository, too), including: Observability: Dynatrace Expert, Elasticsearch agent Security: JFrog Security Agent, StackHawk Security Onboarding Databases: MongoDB Performance Advisor, Neon Migration Specialist, Neon Performance Analyzer, Neo4j Docker Client Generator DevOps & IaC: Terraform Agent, Arm Migration Agent, Octopus Release Notes Agent, DiffBlue Java Unit Test Custom Agent Incidents & project management: PagerDuty Incident Responder, Monday Bug Context Fixer Feature flags & experiments: LaunchDarkly Flag Cleanup, Amplitude Experiment Implementation Automation & APIs: Apify Integration Expert, Factory.ai Code Spec Custom Agent, Lingo.dev Internationalization Implementation Custom Agent 2. Add the agent to your repository. .github/agents/<agent-name>.agent.md 3. Use it.From the Copilot CLI: copilot --agent=<agent-name> --prompt "<task>" From your VS Code: Open Copilot Chat and select the agent Select the agent from the dropdown From github.com: Open the Copilot panel and select the Agents tab Choose the agent you added to your repository Describe your task Featured examples from our partners with real developer workflows Here are real engineering workflows, solved with a single command via custom agents. Trigger and resolve incidents faster (PagerDuty Incident Responder) copilot --agent=pagerduty-incident-responder \ --prompt "Summarize active incidents and propose the next investigation steps." Use this agent to: Pull context from PagerDuty alerts Generate a clear overview of incident state Recommend investigation paths Draft incident updates for your team Fix vulnerable dependencies and strengthen your supply chain (JFrog Security Agent) copilot --agent=jfrog-security \ --prompt "Scan for vulnerable dependencies and provide safe upgrade paths." Use this agent to: Identify vulnerable packages Provide recommended upgrade versions Patch dependency files directly Generate a clear, security-aware pull request summary Modernize database workflows and migrations (Neon) copilot --agent=neon-migration-specialist \ --prompt "Review this schema migration for safety and best practices." Use this agent to: Validate schema changes Avoid unsafe migrations Tune analytical workflows Optimize transformations and queries Speed up product experimentation and feature rollouts (Amplitude Experiment Implementation) copilot --agent=amplitude-experiment-implementation \ --prompt "Integrate an A/B test for this feature and generate tracking events." Use this agent to: Generate experiment scaffolding Insert clean, consistent event tracking Map variations to your product logic Ensure your data flows correctly into Amplitude Why this matters By encoding your team’s patterns, rules, and tool integrations into a reusable agent, Copilot actually understands how your team works—not just the code in front of it. Custom agents help: Keep patterns consistent (Terraform conventions, database rules, security standards, etc.) Stop repeating context by defining expectations once and reusing them everywhere Share expertise automatically so the entire team can follow best practices (even when your subject matter expert is on vacation or in a different timezone) Work directly with your tools using Model Context Protocol (MCP) servers to pull data from your DevOps, security, and observability systems The full catalog of custom agents from our partners We partnered across the ecosystem to create custom agents that solve real engineering problems. 
Observability and monitoring Dynatrace Observability and Security Expert: Configure and optimize Dynatrace monitoring for your applications Elasticsearch Remediation Agent: Handle Elasticsearch configuration, query optimization, and observability setup Security and compliance JFrog Security Agent: Identify and remediate security vulnerabilities in your dependencies StackHawk Security Onboarding: Set up dynamic application security testing Database and data management MongoDB Performance Advisor: Analyze and optimize MongoDB query performance Neon Migration Specialist: Migrate databases to Neon’s serverless Postgres Neon Performance Analyzer: Find bottlenecks and optimization opportunities Neo4j Docker Client Generator: Generate Docker-based client code for Neo4j graph databases DevOps and infrastructure Terraform Infrastructure Agent: Write, review, and optimize Terraform infrastructure as code Arm Migration Agent: Migrate applications to Arm-based architectures Octopus Release Notes Expert: Generate comprehensive release notes from deployment data DiffBlue Java Unit Test Custom Agent: Generate fast, reliable Java unit tests using DiffBlue’s AI-powered test generation engine to improve coverage and catch regressions automatically Incident response and project management PagerDuty Incident Responder: Triage and respond to production incidents Monday Bug Context Fixer: Pull context from monday.com to resolve bugs faster Feature management and experimentation LaunchDarkly Flag Cleanup: Identify and safely remove obsolete feature flags Amplitude Experiment Implementation: Implement A/B tests and experiments API integration and automation Apify Integration Expert: Integrate web scraping and automation workflows Lingo.dev Internationalization Implementation Custom Agent: Detect, extract, and implement internationalization patterns across your codebase for seamless localization Factory.ai Code Spec Custom Agent: Install, configure, and automate development workflows using Droid CLI for CI/CD pipelines Run any of them with the following command: copilot --agent=<agent-name> --prompt "<task>" Get started Custom agents shift Copilot from “help write this code” to “help build software the way our team builds software.” These agents are also available now for all GitHub Copilot users, and you should try one: copilot --agent=terraform-agent --prompt "Review my IaC for issues" Explore all the partner agents in the awesome-copilot repository (with plenty of real-world examples). Give us feedback to let us know what you think of custom agents in GitHub Copilot! Learn how to build your own custom agent with our documentation on creating custom agents (and how they work). The post Your stack, your rules: Introducing custom agents in GitHub Copilot for observability, IaC, and security appeared first on The GitHub Blog.
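As a rough illustration of the workflow described in the post, the Python sketch below scaffolds a minimal agent definition under .github/agents/ and notes the matching CLI invocation. The scaffolding script itself is hypothetical; only the file location, the frontmatter fields (name, description), and the copilot --agent/--prompt flags come from the article.

```python
from pathlib import Path

def scaffold_agent(name: str, description: str, instructions: str) -> Path:
    """Write a minimal custom-agent definition into .github/agents/,
    following the frontmatter format (name, description) shown in the post."""
    agent_dir = Path(".github/agents")
    agent_dir.mkdir(parents=True, exist_ok=True)
    agent_file = agent_dir / f"{name}.agent.md"
    agent_file.write_text(
        "---\n"
        f"name: {name}\n"
        f"description: {description}\n"
        "---\n"
        f"{instructions}\n"
    )
    return agent_file

if __name__ == "__main__":
    path = scaffold_agent(
        name="readme-specialist",
        description="Expert at creating and maintaining high-quality README documentation",
        instructions=(
            "You are a documentation specialist focused on README files. "
            "Only work on README.md or documentation files; do not modify code files."
        ),
    )
    print(f"Agent written to {path}")
    # Then run it from the terminal, e.g.:
    #   copilot --agent=readme-specialist --prompt "Polish the README installation section"
```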
Read more →
Helm 4: What’s New in the Open Source Kubernetes Package Manager?
2025-12-03 16:00 | Source: The New Stack
Ever been to Kate’s Place? Most likely, you have. You just know it by a different name: Helm. Helm, an open source package manager for Kubernetes that began as a company hackathon project called Kate’s Place, turned 10 in 2025. At KubeCon + CloudNativeCon North America, Helm 4 was launched — the first new version in six years. Why so long between versions? We’ll get to that. But first, Matt Butcher, founder and CEO of Fermyon Technologies, a WebAssembly company acquired this month by Akamai, told me about the origins of Kate’s Place in this episode of The New Stack Makers. Kate’s Place was a package manager for Kubernetes that Butcher and two other developers created more than a decade ago at a hackathon at his then-employer, Deis. The name was a play on “K8s,” and had a coffeehouse theme. “I think we were calling the packages shots or espressos or something like that,” Butcher said, in this On the Road episode recorded at KubeCon in Atlanta in November. At stake was a $75 gift card, which the Kate’s Place team won. The next day at the office, Butcher’s phone rang; Deis’ CEO and CTO were on the line. “And they said, ‘We think this idea of a package manager for Kubernetes is just the right thing at just the right time,’” he recalled. “Kubernetes was just gaining momentum, and nobody was doing anything like that at that point. And so, they said, ‘Why don’t we just give you a team and you can go build it?’ And I said, ‘That sounds fantastic. I would love to do that.’ They said, ‘Just one thing. We really hate the name.’” WebAssembly Plugins Once it acquired its new name, Helm moved quickly, gaining non-Deis contributors (like Matt Farina, now chief architect of cloud native at SUSE, who joined Butcher for this episode). The project was announced at the first KubeCon and it was among the first projects to graduate from the Cloud Native Computing Foundation. Helm 4, which went live during the most recent KubeCon, was the product of a long gestation. “The first Helm was around for a matter of months, and then Helm 2 was about a year,” Farina said. “Then Helm 3 was three years.” After six years of Helm 3, “you get some design debt, things like that. People get crazy ideas you never envisioned in the past that require you to make breaking changes in a major version. And so we’ve been working on Helm 4 for a while now.” The latest version includes modernized logging and dependency management, and WebAssembly plugins for portability. Previously, Helm’s plugin system executed out to the file system, a method it still supports. “But we run on a lot of operating systems — Linux, Mac, Windows — and then a lot of architectures,” Farina said. “It’s not just ARM and Intel. We’ve got like five or six different Linux architectures that we support now. “So if you’re going to write an extension for that, you need a way to make that portable. And so we’ve kind of churned on different ways we could make it portable over the years. Nothing ever fit. … then this WebAssembly thing came along. It became really, really popular. 
And so in the last year, we figured out how to make WebAssembly-based plugins for Helm.” Looking ahead, he added, “We re-architected the internals so we can start in [versions] 4.1, 4.2, 4.3, and start rolling out some really new, nice features around charts and the packages to enable people who are installing applications to have some really neat new ways to control the way it’s installed.” Why ‘Boring’ Features Have Impact Helm 4’s latest upgrades tell a bigger story, Butcher said: how more mature projects have to evolve and adapt as the ecosystem grows and use cases expand. “It’s a virtue of a lot of these highly successful open source projects that say they do one thing very, very well. … in our case, we’ve striven for years to be a really, really good package manager for Kubernetes.” But now, “So much of the real work now isn’t defining or redefining what package management is.” Instead, he added, it’s asking “What are the features that are going to help people get stuff done in more effective ways?” Features that are now vital include things like logging. “Back when we created Helm, [it] was like, Oh, well, that’s the boring thing that we’re not really going to think about,” Butcher acknowledged. “Now, it’s like, well, if we can build good logging, then the integration with all these other tools will be more uniform. It’s going to save platform engineers and DevOps folks a lot of time and energy.” Such changes, he said, can be “a time saver, a money saver.” Butcher added, “It might not win any awards for fanciest, flashiest new feature, but it certainly makes a very real difference in the lives of the Helm users.” Check out the full episode to learn more about Helm 4, including how the project maintainers weigh user feedback, and what’s new at Fermyon and SUSE. The post Helm 4: What’s New in the Open Source Kubernetes Package Manager? appeared first on The New Stack.
Read more →
What DocumentDB Means for Open Source
2025-12-03 14:00 | Source: The New Stack
There are at least three reasons why the open source community is paying attention to DocumentDB. The first is that it combines the might of two popular databases: MongoDB (DocumentDB is essentially an open source version of MongoDB) and PostgreSQL. A PostgreSQL extension makes MongoDB’s document functionality available to Postgres; a gateway translates MongoDB’s API to PostgreSQL’s API. Secondly, the schemaless document store is completely free and accessible through the MIT license. The database utilizes the core of Microsoft Azure Cosmos DB for MongoDB, which has been deployed in numerous production settings over the years. Microsoft donated DocumentDB to the Linux Foundation in August. A DocumentDB Kubernetes Operator, enabling the solution to run in the cloud, at the edge, or on premises, was announced at KubeCon + CloudNativeCon NA in November. Thirdly, DocumentDB reinforces a number of vital use cases for generative models, intelligent agents and multiagent instances. These applications entail using the database for session history for agents, conversational history for chatbots and semantic caching for vector stores. According to Karthik Ranganathan, CEO of Yugabyte, which is on the steering committee for the DocumentDB project, these and other employments of the document store immensely benefit from its schema-free implementations. “Mongo gives this core database functionality, what the engine can do,” Ranganathan said. “And then there’s these languages on top that give the developer the flexibility to model these things.” Free From Schema Restrictions The coupling of MongoDB’s technology with PostgreSQL’s is so noteworthy because it effectively combines the relational capabilities of the latter, which Ranganathan termed as “semi-schematic,” with the lack of schema concerns characterizing the former. The freedom to support the aforementioned agent-based and generative model use cases without schema limitations is imperative for maximizing the value of these applications. With DocumentDB, users can avail themselves of this advantage at the foundational database layer. “As everything is going agentic, it’s important to give this capability in the places where you’d be building those applications, as opposed to having a separate way of doing it,” Ranganathan said. For example, if an engineer were constructing a user profile for an application, the lack of schema would only behoove him as he was able to implement multiple fields for a mobile number, office number, fax number and anything else he thought of while coding. “You don’t want a strict schema for that,” Ranganathan said. “You want to just build those fields on the fly.” Multiagent Deployments The lack of schema and general adaptability of the document format are particularly useful for situations in which agents are collaborating. For these applications, DocumentDB can function as a means of providing session history for the various actions and interactions taking place between agents and resources, and between agents with each other. “It’s super, super important for any agent, or any sequence of operations that you work with an agent to accomplish, for the agent to remember what it did,” Ranganathan said. Each of the operations agents perform individually or collectively can be stored in DocumentDB to serve as the memory for agents. Without such a framework, agents would be constantly restarting their tasks. 
According to German Eichberger, principal software engineering manager at Microsoft, DocumentDB’s viability for this use case extends beyond memory. “As things progress, we’ll have multiple agents working together on transactions,” Eichberger said. “And they will not agree on something, so they’ll have rollbacks. We feel that doing this in a document will be better because they can all work on the same document and when they are happy, commit it.” Such utility is not dissimilar to the way humans work in Google Docs. Chatbots and Semantic Caching There are multiple ways in which DocumentDB underpins other applications of generative models, including Retrieval-Augmented Generation (RAG), vector database deployments and chatbots. For these use cases, the document store can also supply a centralized form of memory for bots discoursing with employees or customers. That way, developers of these systems can avoid situations in which, “If you forget everything we just talked about and just answer the next question, it’s completely out of context and meaningless,” Ranganathan remarked. DocumentDB can also provide a semantic caching layer that preserves the underlying meaning of jargon, pronouns and other facets of episodic memory so intelligent bots can quickly retrieve this information for timelier, more savvy responses. With DocumentDB, such semantic understanding and memory capabilities are baked into the primary resource engineers rely on — the database. “The history of what we talked about, that becomes extremely important,” Ranganathan said. “There’s different ways to solve it, but it must be in the context of the developer ecosystem. So, rather than give one way to solve it and ask everyone to integrate it that way, just give the way the person expects to build the AI application.” What Developers Expect With DocumentDB, developers get the overall flexibility to build applications the way they’d like. The document store is available through PostgreSQL, which is highly extensible and supports an array of workloads, including those involving vector databases and other frameworks for implementing generative models. Moreover, they’re not constrained by any schema limitations, which spurs creativity and a developer-centric means of building applications. Lastly, it provides a reliable mechanism for agents to collaborate with each other, retain the history of what actions were done to perform a task and come to a consensus before completing it. The fact that DocumentDB is free, as well as at the behest of the open source community for these applications of intelligent agents and more, can potentially further the scope of these deployments. “With AI, the growth is going to be exponential, but you’re not going to get there in one hop,” Ranganathan said. “You’ll get there in a series of rapid iterations. The mathematical way to represent it, it’s like 1.1 to the power of 365. This is a 10% improvement every day, which is like 10 raised to the 17th power, a huge number.” DocumentDB may not be solely responsible for such advancements in statistical AI, but it may have contributed to the day’s improvement in this technology. The post What DocumentDB Means for Open Source appeared first on The New Stack.
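Because DocumentDB's gateway speaks the MongoDB API, the agent-memory pattern Ranganathan describes can be sketched with the standard Python MongoDB driver. Treat this as a hypothetical sketch: the connection string, database, collection, and field names are placeholders, not anything prescribed by the project.

```python
from datetime import datetime, timezone
from pymongo import MongoClient

# Placeholder connection string; DocumentDB's gateway translates the MongoDB API,
# so the ordinary MongoDB driver is used here.
client = MongoClient("mongodb://localhost:27017")
sessions = client["agent_memory"]["sessions"]

def record_step(session_id: str, agent: str, action: str, result: dict) -> None:
    """Append one agent action to the session history; no fixed schema is required."""
    sessions.insert_one({
        "session_id": session_id,
        "agent": agent,
        "action": action,
        "result": result,                      # arbitrary nested document
        "ts": datetime.now(timezone.utc),
    })

def replay(session_id: str) -> list:
    """Return the session history in order, so an agent can 'remember what it did'."""
    return list(sessions.find({"session_id": session_id}).sort("ts", 1))

record_step("s-42", "planner", "draft_itinerary", {"days": 3, "city": "Lisbon"})
print(len(replay("s-42")), "steps stored")
```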
Read more →
Show HN: HCB Mobile – financial app built by 17 y/o, processing $6M/month
2025-12-03 04:20 | Source: Hacker News
Comments
Read more →
Stack Overflow Puts Community First in New AI Search Tool
2025-12-03 01:30 | Source: The New Stack
Stack Overflow has announced the general availability of AI Assist, a conversational AI search tool that prioritizes human-verified community knowledge over pure large language model (LLM) responses. The GA release follows a successful beta launch at WeAreDevelopers 2025. With the advent of generative AI (GenAI), some folks looked at Stack Overflow as perhaps cooked. However, the company is not going out without a fight. In fact, it’s not going out at all, said Jody Bailey, Stack Overflow’s chief product and technology officer. Differentiators Indeed, a key differentiator for Stack Overflow’s offering over purely LLM-based solutions is the product’s community-first approach. AI Assist searches Stack Overflow content first using an improved re-ranker, then summarizes results with clear attribution to original contributors. AI Assist primarily draws from human-verified knowledge from the Stack Overflow community, ensuring developers get accurate, explainable help, fast and for free. Another differentiator is the product’s hybrid Retrieval-Augmented Generation (RAG) + LLM architecture, where AI acts as an “answer auditor” that supplements community content only when necessary, preventing dead-end results when no relevant content exists. The service also provides transparent attribution. It credits original content creators, honoring Stack Overflow’s commitment to recognizing community contributions. “Most of the large language models are using Stack Overflow data, as most of them have signed agreements with us,” Bailey told The New Stack. Yet these solutions tend to be biased toward the top-voted answer, he said. That’s useful in most cases. “But if you talk with engineers, oftentimes the answer that they really want is three or four answers down the list, so to speak,” Bailey noted. “And the only way that you can really get that is by having the attribution back to the original source of information.” Unlike other GenAI tools, AI Assist stays current with the latest community Q&A from the public platform. It is free to use at stackoverflow.com/ai-assist. More than 250,000 technologists are already using it for debugging, comparing libraries, understanding errors and app architecture. Power users create up to 6,400 messages daily, and 75% are focused on highly technical content. Stack Overflow AI Assist uses OpenAI models for generation, plus proprietary Stack Overflow models for search and re-ranking. It searches Stack Overflow first, then generates summarized answers with attribution, and it falls back to pure LLM generation only when no relevant community content exists. “By providing a trusted, human intelligence layer in the age of AI, we are aiming to serve the broader needs of all technologists while still supporting our larger mission to cultivate community, power learning, and unlock growth,” Bailey said in a statement. “Building this product with a hybrid AI model that prioritizes community content and provides non-negotiable attribution, we’re not just doing AI the ‘right way,’ we’re signaling to the entire industry that humans creating knowledge must be recognized and verified for the betterment of the tech landscape and the world at large.” AI Assist’s Focus According to Stack Overflow, the focus of AI Assist is to: Provide a new way to get started: Offer a modern, conversational alternative for more relevant results. Create a more accessible experience for users, especially new users, looking to get technical help. 
Enable active learning: Meet users where they are, with reasoning and context. A step towards incorporating educational features to amplify learning. Deepen connectivity: Pave the way for connecting AI Assist with other Stack Overflow features, such as Chat and Coding Challenges, and eventual expansion to external tools like IDE extensions and Discord apps, evolving it into a product for developers everywhere. Meanwhile, key learnings from the beta include improved prompt engineering for efficient LLM queries; expanded scope beyond Stack Overflow to include other Stack Exchange sites (math, Ubuntu, etc.); and refined balance between succinct answers and providing context, Bailey said. In addition, Bailey said future integrations for AI Assist include a Model Context Protocol (MCP) server for coding agents (Copilot, Cursor, Replit); a read-only version for the public platform; and a bidirectional version for Stack Overflow Internal (announced at Microsoft Ignite). Bailey’s response to the “Stack Overflow is moot” narrative is: “That assumes that every question has already been answered, right? And in my experiences, that’s not the case; we still see lots of new questions.” Part of the vision is to make it easier for developers to ask questions on the site. “Being able to write a good question on Stack Overflow is often a nontrivial matter,” Bailey said. Moreover, “We used to talk about being the third screen for developers, but really the intent now is meeting developers where they are, and we think an MCP server is a way of doing that,” he said. The post Stack Overflow Puts Community First in New AI Search Tool appeared first on The New Stack.
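The community-first flow described above (retrieve and re-rank Stack Overflow content, summarize with attribution, and fall back to a plain LLM only when nothing relevant exists) can be sketched roughly in Python. The helper callables and the relevance threshold below are stand-ins for Stack Overflow's internal components, not its actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Answer:
    text: str
    sources: List[str]   # attribution links back to the original posts

RELEVANCE_THRESHOLD = 0.6  # illustrative cutoff, not a published value

def ai_assist(question: str,
              search: Callable, rerank: Callable,
              summarize: Callable, llm: Callable) -> Answer:
    """Community-first retrieval with an LLM fallback, mirroring the flow described above.

    `search`, `rerank`, `summarize`, and `llm` are hypothetical callables standing in
    for the internal search, re-ranker, summarizer, and generation model."""
    candidates = rerank(question, search(question))
    relevant = [c for c in candidates if c["score"] >= RELEVANCE_THRESHOLD]
    if relevant:
        # Summarize verified community answers and keep attribution.
        return Answer(
            text=summarize(question, relevant),
            sources=[c["url"] for c in relevant],
        )
    # "Answer auditor" fallback: generate from the model only when no community content fits.
    return Answer(text=llm(question), sources=[])
```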
Read more →
HBO Max Butchers ‘Mad Men’ in Botched ‘Remastering’
2025-12-02T23:48:27Z | Source: Daring Fireball
Alan Sepinwall, writing for Wired (News+ link in case Wired’s paywall busts your balls): Last month, HBO Max announced a major new addition to its library. Not only would the streamer be adding Mad Men — a show that HBO execs infamously passed on back when Matthew Weiner was a writer on The Sopranos — but it would be presenting the period drama’s episodes in a new 4K remastering. This would, according to the press release, give “audiences and longtime Mad Men fans the opportunity to enjoy the series’ authentically crafted elements with crisp detail and enhanced visual clarity.” As it turned out, there was perhaps too much clarity. Not long after the series went live on HBO Max, a screencap began floating around social media from a scene in the Season One episode “Red in the Face,” where Roger Sterling is vomiting in front of a group of horrified Sterling Cooper clients. When it aired — and in the version still available on AMC+ — seven men are onscreen, all of them wearing period-appropriate suits and ties. The HBO Max version, on the other hand, features two men who appear very out of place in 1960: crew members lurking in the background, feeding a hose to create the illusion that actor John Slattery is puking. It’s not like the crew members are only partially on-screen, or out of focus far in the background. They’re right there. It’s glaringly obvious that no one at HBO Max even watched this. That’s how rotten the culture at Warner Bros. Discovery is. They obtained the rights to one of the greatest TV shows ever made (one that I personally hold alongside The Sopranos as my favorite ever), processed the episodes in some sort of “remastering” that did not need to happen, and didn’t even bother to watch the fucking new versions they produced before putting them on their service for the world to stream. AMC+ has the entire original series, as originally broadcast, and it looks gorgeous. I bought all seven seasons from iTunes back in the day, and they look as good, if not better, in those versions. David Zaslav — a well-known idiot — should go to prison for this. ★
Read more →
Apple to Resist Order in India to Preload State-Run App on iPhones
2025-12-02T23:05:17Z | Source: Daring Fireball
Aditya Kalra and Munsif Vengattil, reporting for Reuters: Apple does not plan to comply with a mandate to preload its smartphones with a state-owned cyber safety app and will convey its concerns to New Delhi, three sources said, after the government’s move sparked surveillance concerns and a political uproar. The Indian government has confidentially ordered companies including Apple, Samsung and Xiaomi to preload their phones with an app called Sanchar Saathi, or Communication Partner, within 90 days. The app is intended to track stolen phones, block them and prevent them from being misused. The government also wants manufacturers to ensure that the app is not disabled. And for devices already in the supply chain, manufacturers should push the app to phones via software updates, Reuters was first to report on Monday. [...] Apple however does not plan to comply with the directive and will tell the government it does not follow such mandates anywhere in the world as they raise a host of privacy and security issues for the company’s iOS ecosystem, said two of the industry sources who are familiar with Apple’s concerns. They declined to be named publicly as the company’s strategy is private. The second source said Apple does not plan to go to court or take a public stand, but it will tell the government it cannot follow the order because of security vulnerabilities. Apple “can’t do this. Period,” the person said. To my knowledge, there are no government-mandated apps pre-installed on iPhones anywhere in the world. I’m not even sure how that would work, technically, given that third-party apps have to come from the App Store and thus can’t be installed until after the iPhone is configured and the user signs into their App Store Apple Account. The app order comes as Apple is locked in a court fight with an Indian watchdog over the nation’s antitrust penalty law. Apple has said it risks facing a fine of up to $38 billion in a case. This is another one of those laws like the EU’s DMA, where maximum possible fines are based on a percentage of global revenue. No one in India seems to actually be threatening any such fine, but it’s ludicrous that it’s even possible. ★
Read more →
[Sponsor] Protect Your App From Bots and Abuse With WorkOS Radar
2025-12-02T00:14:29Z | Source: Daring Fireball
Does your app get fake signups, throwaway emails, or users abusing your free tier? Or worse, bot attacks and brute-force attempts? WorkOS Radar can block all this and more. A simple API gives you advanced device fingerprinting that can detect bad actors, bots, and suspicious behavior. Your users trust you. Let’s keep it that way. ★
Read more →
Gurman Pooh-Poohs Financial Times Report That Tim Cook Is Retiring in First Half of 2026
2025-12-02T00:03:19Z | Source: Daring Fireball
Speaking of Apple executive HR news, in his Power On Bloomberg column last weekend, Mark Gurman pooh-poohed the Financial Times’s recent report that Tim Cook was likely to retire early next year (paywalled, alas, but summarized by MacRumors): In October, I wrote that the internal spotlight on Ternus was “intensifying,” and that barring unforeseen circumstances he would be the leading candidate. But I didn’t put a date on when a change might happen. Then, around midnight two Fridays ago, the Financial Times published a report with three central claims: Apple is “intensifying” succession planning; Ternus is likely the next CEO; and Cook is expected to step down between late January and June. The first two points are anything but revelations if you’ve read Bloomberg coverage and Power On, or have simply been paying attention to the realities of Cook’s age and tenure. The timing, however, is another matter entirely. It’s a huge deal that the FT did this: A respected publication should only predict the CEO transition date for a company of Apple’s scale with a high level of confidence — based on people legitimately in the know. This is where I have concerns. Based on everything I’ve learned in recent weeks, I don’t believe a departure by the middle of next year is likely. In fact, I would be shocked if Cook steps down in the time frame outlined by the FT. Some people have speculated that the story was a “test balloon” orchestrated by Apple or someone close to Cook to prepare Wall Street for a change, but that isn’t the case either. I believe the story was simply false. They can’t both be right. Either the Financial Times or Bloomberg and Gurman will have a serving of claim chowder no later than June. But as Gurman points out, the only disagreement in their reporting is regarding timing: soon vs. soon-ish. It could be that we see something like the following next year. Current board chairman Arthur Levinson turned 75 this year, the suggested age limit for Apple Board members. So maybe he rides off into the sunset and Apple names Cook, who already has a seat on the board, executive chairman. Maybe in February, ahead of Apple’s annual shareholder meeting. Then, in the second half of the year, Cook steps down as CEO, Ternus takes the CEO job, and Cook remains chairman of the board for the next decade or so. One change at a time, with a drip-drip series of leaks to trusted business news publications, like the one to the Financial Times last month — seemingly from the board itself — to make none of it come as a surprise. I don’t think the leak — from multiple sources — to the FT was a “test balloon” (cue John Siracusa on ATP 666 regarding “trial balloon” being the correct idiom). It was more of a “heads up, this is what’s coming”. ★
Read more →
Ivan Sutherland Sketchpad Demo 1963 [video]
2025-12-02 23:13 | Source: Hacker News
Comments
Read more →
The Conversational AI Revolution Is Struggling. Here’s How To Fix It
2025-12-02 23:00 | Source: The New Stack
If 89% of business leaders think their customers are satisfied with their conversational AI experiences, while only 59% of consumers are, we’re facing a wide gap between AI perception and reality. Even with widespread leadership backing, 98% of organizations anticipate changing their conversational AI strategy within the next year, with 58% planning to fully replace their conversational AI solution, according to a new Twilio report, “Inside the Conversational AI Revolution.” You could say these results, from a global survey of 457 business leaders and 4,800 consumers across 15 countries, are just part of growing pains. After all, customer satisfaction with AI-backed chatbots increased 30% over the last quarter, according to the Twilio survey, signaling that, like most AI solutions, conversational AI is on an improvement curve. Some of this customer dissatisfaction can be blamed on persistent data silos — as well as on the disconnect between subject matter experts, like customer support representatives, and the knowledge bases the responses are trained on. It also could be that the built or bought conversational AI tool is too broad, à la ChatGPT-4.0. It’s likely a mix of all these hurdles. Whatever the reason, you’re likely rethinking your conversational AI plan for 2026. The New Stack talked to Andy O’Dower, vice president of product management at Twilio, a cloud communications company, about what these results mean for the near future of your conversational AI strategy. Prioritize Focused and Discrete Use Cases At this stage, there is no one-size-fits-all chatbot. “If I just give a generic chatbot a generic question, I’m going to get a generic response,” O’Dower said. Of course, when chatbots were first launched, that’s all the data they had to work with, most likely whatever Open AI was trained on. Now organizations are adding more and more internal sets, entering what he calls the whack-a-mole stage: “We’re never going to use a generic AI agent that’s going to handle any inquiry that anyone could ever have about our business.” This is all too broad, too fast, he said. It’s better, instead, to nail down the first “discrete, focused,” often most frequently asked questions, he said, to find measurable success and then to expand from there. This can be as simple as customers or patients having difficulty logging into your website or app. If a customer contacts their bank’s customer support saying they have a weird charge on their bank or credit card, “the AI should, in theory, know the place where you want to go,” O’Dower said, based on “the regular charges. And then algorithms go into that — where you buy, what e-commerce sites you use or not, purchase prices and everything that goes into that.” Instead of a customer support rep manually scanning PDFs, the AI should be able to — based on different data sources, including that customer’s purchasing history — pinpoint which purchase is the outlier. Then, an AI agent might be able to suggest next steps of how to contest the expense, or it knows when to escalate the matter to a human right away. “You want to be thinking about these AI agents, not just as a specific replacement of my inbound customer support 800 number. They’re starting to think about this AI agent over time becoming a representation of my entire business.” — Andy O’Dower, Twilio This is in part because, over time, more data sources are being integrated into the chatbot’s conversational AI. O’Dower calls it a customer support pyramid. 
If an organization has 1,000 support calls, about half will be regular use cases that an AI agent or even simpler automation can answer — like “Did this ship?”— replying with not just the expected arrival date but the shipping info and link to third-party tracking. It can also make excuses or refunds when shipments go wrong. Quick responses to these, without humans in the loop, maintain a higher level of overall customer satisfaction. Then, farther up the pyramid, as customer requests become less frequent yet more complex, a human agent can become what O’Dower calls a “super agent,” backed by an internal conversational AI interface that allows for quicker discovery and resolution. A good conversational overlay on top of correct data can significantly decrease the response time for service representatives. Giving AI access to internal data and having a healthy data ecosystem are two of the capabilities that Google’s 2025 “DORA” report found can amplify the impact of AI adoption. Again, this demands a modular approach, so “you’re not just opening the floodgates of too much data going into an AI agent,” he said. Instead, “identify who these customers and consumers are.” Develop a habit of asking the AI agent what it’s told a consumer, making sure it’s not sharing any personally identifying information. Adopt a Flexible and Modular AI Strategy But the near future of conversational AI is flexible. “The technologies are getting better and better,” O’Dower said. “You as an engineer need to think about this from a modular standpoint” in the face of AI’s continued improvement. “If you’re adding more and more modular data sources that could enrich the 360-degree view of who a customer is, and then you can match that with a better and better language model, then you should get better and better results on the other side.” With everything so new, as I argue in The New Stack’s new AI enterprise strategy playbook, it’s essential to maintain flexibility, avoiding vendor lock-in at all cost. “I could be pitched on a full, out-of-the-box, total solution that does it all for me,” said O’Dower. “That solution might be wed to one existing language model under the hood, or it might be able to only pull from data sources that it has prebuilt integrations with.” This demand for flexibility is in part because you can’t really know which data sources and services will work with your customers until your conversational AI is in production — and you see how often “HUMAN PLEASE” is yelled at your bot. “If I send this out to customers, and I still get X amount of people right away saying that this didn’t solve their problem, I need to escalate to human,” he continued. This is a red flag that your training data is off. With an all-in-one solution, you have to go back to the vendor, requesting more external and internal data sources to train on, which slows progress. Remember: While AI is starting to trigger notable increases in productivity, it is all highly experimental. Different data sources may make an AI agent more successful or not, especially if it has access to different backend services, like a return-a-product service, a billing reminder service or an appointment scheduling service. But if you go for a so-called full-service solution right now, it’s unlikely to have everything you need built in — because you can’t yet predict your needs, or your customers’ behavior. 
This is why more than half of organizations surveyed in this year’s Twilio report plan to change their conversational AI provider in the next year. “Don’t go from one boxed-in solution to another boxed-in solution,” O’Dower said. “Those language models are going to continue to get better and better and better. You’re going to find more and better data sources that maybe you already have internally that you could plug into this.” Learn To Measure Conversational AI Success Only 63% of “AI mature” organizations — those with AI initiatives remaining in production for three or more years — measure the success of their AI strategy at all, according to a June report by Gartner. You must consider if your chatbot is all talk and no action. Success for conversational AI, O’Dower said, is defined by when its users say it’s working — and when you can prove that it’s working for your business. We all remember early on when a prompt injection had someone trying to buy a Chevy for $1. A “one agent to rule them all” model risks something like this happening again. Instead, you need a fleet of conversational AI agents working together, each specialized in and trained on handling a discrete request. “We need to talk to our AI agent policy expert that knows what we can and can’t do, that then puts in guardrails to any type of large language model [LLM] that says that’s completely out of policy,” O’Dower said. “That AI agent knows all the policies through and through better than a human agent would know out of the gate.” Within these guardrails, he said, an AI agent might be able to create a reasonable benefit, like generating a coupon for a free oil change, not selling a car for a buck. “What we saw in the early days start to happen was to deploy an OpenAI LLM, and just plop it on the website, and let it talk to anybody about any topic,” O’Dower said, “versus a much more discrete, focused, modular strategy where you know you’re gonna change and evolve this.” A lot of success relies on proper leadership around AI strategy. This means, he said, “From the top down, not saying ‘AI everything.’” Instead, O’Dower said, it means “saying, ‘If we are to be successful in automating and giving a much better customer experience, we are going to force teams to work together.’” This can be a dedicated team that has brought together representatives from other teams, he said, including bringing folks “from legal, from data, privacy and security, from customer service, from marketing, from go-to-market teams, the engineering and product teams together, to be able to say, ‘we are going after this use case, and we are going to make an amazing experience with conversational AI, and you will be mapped to this initiative.’” Use Modular Data for a Holistic Customer View We know that data is what’s holding most organizations back from systemic, cross-organizational AI success. In fact, the MIT Project NANDA’s “State of AI in Business 2025” report, released in July, discovered that 95% of AI pilots fall flat due to persistent data silos. Carving out modular data sources, O’Dower said, also supports the cultivation of a 360-degree view of the customer. “Not just one big block of data — that doesn’t make sense because you need to match up these data points,” across different departments, O’Dower said. Instead, he added, a cross-organizational data strategy must consider, “How does that interact with what our policies are and what our marketing and promotions are? 
The better that data is structured, the far better results you’re going to get on the other side, when you match it up with an AI agent.” This modularity — in contrast to a one-size-fits-all AI customer support solution — also allows your conversational AI stack to connect with different services, from payment solutions to appointment scheduling bots. Soon, he predicted, the customer experience will become more flexible, including intuitive conversational AI that works via voice when you’re driving or via text when you’re on do not disturb or work mode. These advances will also see more customers supported in more spoken languages. If 58% of organizations intend to replace their conversational AI solution completely in 2026, as Twilio reported, that’s a sign that you need flexibility in response to this still-emerging space. And it means you’re more likely than not to make a chatbot change next year, too. It’s also time to start thinking strategically about your conversational AI. As O’Dower said, think “about these AI agents not as just a specific replacement of my inbound customer support 800 number. They’re starting to think about this AI agent over time becoming a representation of my entire business.” The post The Conversational AI Revolution Is Struggling. Here’s How To Fix It appeared first on The New Stack.
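One way to picture the "discrete, focused" agent strategy O'Dower describes is a small router that only answers the intents it was built for and escalates everything else to a human. This is a toy sketch under those assumptions; the intents, handlers, and keyword matching are invented for illustration, and a production system would use a trained classifier and real backend services.

```python
from typing import Callable, Dict, Optional

# Hypothetical handlers for two of the focused use cases discussed above.
def handle_shipping(message: str) -> str:
    return "Your order shipped yesterday; the tracking link was sent to your email."

def handle_billing(message: str) -> str:
    return "I can see the charge in question. Would you like to dispute it?"

ROUTES: Dict[str, Callable[[str], str]] = {
    "shipping": handle_shipping,
    "billing": handle_billing,
}

def classify(message: str) -> str:
    """Toy keyword-based intent classifier standing in for a trained model."""
    text = message.lower()
    if "ship" in text or "delivery" in text:
        return "shipping"
    if "charge" in text or "refund" in text:
        return "billing"
    return "unknown"

def respond(message: str) -> str:
    # Escalate anything outside the focused use cases instead of guessing,
    # including explicit requests for a person.
    if "human please" in message.lower():
        return "Connecting you with a human agent."
    handler: Optional[Callable[[str], str]] = ROUTES.get(classify(message))
    if handler is None:
        return "Connecting you with a human agent."
    return handler(message)

print(respond("There's a weird charge on my card"))
```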
Read more →
Why are your models so big? (2023)
2025-12-02 22:25 | Source: Hacker News
Comments
Read more →
With Nova Forge, AWS Makes Building Custom AI Models Easy
2025-12-02 21:09 | Source: The New Stack
Las Vegas — At its annual re:Invent conference, AWS today launched new versions of its Nova models that are comparable with the latest model releases from frontier labs like Anthropic, OpenAI and Google. But for many companies, off-the-shelf large language models (LLMs) don’t solve their specific use cases and may not be able to deliver the reliability needed to put a given AI workload into production. Fine-tuning existing open-weight models is often the only option here, but that takes a lot of expertise and can come with its own pitfalls (like model regressions). For these companies, AWS today launched Nova Forge, a new service that allows businesses to bring their own data to AWS’s own Nova models, starting with Nova 2 Lite (it won’t work with any other models). They will have to pay $100,000 per year for the privilege, though. AWS calls this idea “open training,” though it’s worth noting that the Nova models are not open-weight or open source models. Nova Forge gives users access to pre-, mid-, and post-trained Nova models and lets them mix in their proprietary data and Amazon-curated datasets at each of these stages. Ideally, this means that the model will retain all of its existing knowledge, but now also has a better understanding of a given organisation’s specific needs and knowledge base. That’s especially useful for use cases where the knowledge base doesn’t change frequently; when it does, a more traditional RAG architecture would likely make more sense. If anything, though, these new custom models (AWS calls them “novellas”) will provide even RAG-based systems with a more customised basis to start from. Using reinforcement learning (RL), users can then also fine-tune the models’ responses even further. Users will also have the option to create smaller, distilled models that will be more cost-effective to run, and all of this is backed by AWS’s responsible AI toolkit that helps businesses ensure the models have the necessary safety controls in place. “Working with Nova Forge is allowing us to improve content moderation on Reddit with a more unified system that’s already delivering impressive results,” said Chris Slowe, the CTO of Reddit. “We’re replacing a number of different models with a single, more accurate solution that makes moderation more efficient. The ability to replace multiple specialised ML workflows with one cohesive approach marks a shift in how we implement and scale AI across Reddit. After seeing these early successes in our safety efforts, we’re eager to explore how Nova Forge might help in other areas of our business.” As of now, the only place to deploy these models is Amazon Bedrock, though. It doesn’t look like users will be able to take them out of that environment and run them elsewhere. AWS, of course, argues that this ensures security, scalability and data privacy, but I wouldn’t be surprised if, over time, the company allows Forge users to take their models elsewhere as well. The post With Nova Forge, AWS Makes Building Custom AI Models Easy appeared first on The New Stack.
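Since Forge-customized models deploy only to Amazon Bedrock, invoking one should look much like calling any other Bedrock-hosted model. The sketch below uses boto3's Converse API with a placeholder model identifier; the exact model ID format for a Forge "novella" is an assumption on my part, not something stated in the announcement.

```python
import boto3

# Bedrock Runtime client; region and model ID below are placeholders.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
CUSTOM_MODEL_ID = "placeholder-custom-nova-model-id"  # hypothetical Forge "novella" identifier

def ask_custom_nova(prompt: str, model_id: str = CUSTOM_MODEL_ID) -> str:
    """Send a single-turn request to a customized Nova model via the Converse API."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 300, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]

print(ask_custom_nova("Summarize our moderation policy for new communities."))
```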
Read more →
Frinkiac – 3M "The Simpsons" Screencaps
2025-12-02 20:28 | Source: Hacker News
Comments
Read more →
“The local-first rebellion”: How Home Assistant became the most important project in your house
2025-12-02 17:19 | Source: GitHub Engineering
Franck Nijhof—better known as Frenck—is one of those maintainers who ended up at the center of a massive open source project not because he chased the spotlight, but because he helped hold together one of the most active, culturally important, and technically demanding open source ecosystems on the planet. As a lead of Home Assistant and a GitHub Star, Frenck guides the project that didn’t just grow. It exploded. This year’s Octoverse report confirms it: Home Assistant was one of the fastest-growing open source projects by contributors, ranking alongside AI infrastructure giants like vLLM, Ollama, and Transformers. It also appeared in the top projects attracting first-time contributors, sitting beside massive developer platforms such as VS Code. In a year dominated by AI tooling, agentic workflows, and typed language growth, Home Assistant stood out as something else entirely: an open source system for the physical world that grew at an AI-era pace. The scale is wild. Home Assistant is now running in more than 2 million households, orchestrating everything from thermostats and door locks to motion sensors and lighting. All on users’ own hardware, not the cloud. The contributor base behind that growth is just as remarkable: 21,000 contributors in a single year, feeding into one of GitHub’s most lively ecosystems at a time when a new developer joins GitHub every second. In our podcast interview, Frenck explains it almost casually. Home Assistant is a free and open source home automation platform. It allows you to connect all your devices together, regardless of the brands they’re from… And it runs locally. Franck Nijhof, lead of Home Assistant He smiles when he describes just how accessible it is. “Flash Home Assistant to an SD card, put it in, and it will start scanning your home,” he says. This is the paradox that makes Home Assistant compelling to developers: it’s simple to use, but technically enormous. A local-first, globally maintained automation engine for the home. And Frenck is one of the people keeping it all running. 📌 What is Home Assistant? Home Assistant is an open-source home automation platform designed for maximum local control, privacy, and interoperability. It enables you to connect, orchestrate, and automate thousands of devices from hundreds of vendors—all running on hardware in your home (e.g., a Raspberry Pi) and without sending data to the cloud. The core engine is written in Python and supported by front-end components in TypeScript and other languages. Developers build integrations in a community-wide effort that has grown to tens of thousands of contributors and millions of installations. The architecture built to tame thousands of device ecosystems At its core, Home Assistant’s problem is combinatorial explosion. The platform supports “hundreds, thousands of devices… over 3,000 brands,” as Frenck notes. Each one behaves differently, and the only way to normalize them is to build a general-purpose abstraction layer that can survive vendor churn, bad APIs, and inconsistent firmware. Instead of treating devices as isolated objects behind cloud accounts, everything is represented locally as entities with states and events. A garage door is not just a vendor-specific API; it’s a structured device that exposes capabilities to the automation engine. A thermostat is not a cloud endpoint; it’s a sensor/actuator pair with metadata that can be reasoned about. That consistency is why people can build wildly advanced automations. 
Frenck describes one particularly inventive example: “Some people install weight sensors into their couches so they actually know if you’re sitting down or standing up again. You’re watching a movie, you stand up, and it will pause and then turn on the lights a bit brighter so you can actually see when you get your drink. You get back, sit down, the lights dim, and the movie continues.” A system that can orchestrate these interactions is fundamentally a distributed event-driven runtime for physical spaces. Home Assistant may look like a dashboard, but under the hood it behaves more like a real-time OS for the home. Running everything locally is not a feature. It’s a hard constraint. Almost every mainstream device manufacturer has pivoted to cloud-centric models. Frenck points out the absurdity: It’s crazy that we need the internet nowadays to change your thermostat. The local-first architecture means Home Assistant can run on hardware as small as a Raspberry Pi but must handle workloads that commercial systems offload to the cloud: device discovery, event dispatch, state persistence, automation scheduling, voice pipeline inference (if local), real-time sensor reading, integration updates, and security constraints. This architecture forces optimizations few consumer systems attempt. If any of this were offloaded to a vendor cloud, the system would be easier to build. But Home Assistant’s philosophy reverses the paradigm: the home is the data center. Everything from SSD wear leveling on the Pi to MQTT throughput to Zigbee network topologies becomes a software challenge. And because the system must keep working offline, there’s no fallback. This is engineering with no safety net. The open home foundation: governance as a technical requirement When you build a system that runs in millions of homes, the biggest long-term risk isn’t bugs. It’s ownership. “It can never be bought, it can never be sold,” Frenck says of Home Assistant’s move to the Open Home Foundation. “We want to protect Home Assistant from the big guys in the end.” This governance model isn’t philosophical; it is an architectural necessity. If Home Assistant ever became a commercial acquisition, cloud lock-in would follow. APIs would break. Integrations would be deprecated. Automations built over years would collapse. The Foundation encodes three engineering constraints that ripple through every design decision: Privacy: “Local control and privacy first.” All processing must occur on-device. Choice: “You should be able to choose your own devices” and expect them to interoperate. Sustainability: If a vendor kills its cloud service, the device must still work. Frenck calls out Nest as an example: “If some manufacturer turns off the cloud service… that turns into e-waste.” This is more than governance; it is technical infrastructure. It dictates API longevity, integration strategy, reverse engineering priorities, and local inference choices. It’s also a blueprint that forces the project to outlive any individual device manufacturer. The community model that accidentally solved software quality We don’t build Home Assistant, the community does. “We cannot build hundreds, thousands of device integrations. I don’t have tens of thousands of devices in my home,” Frenck says. This is where the project becomes truly unique. Developers write integrations for devices they personally own. Reviewers test contributions against devices in their own homes. Break something, and you break your own house. 
Improve something, and you improve your daily life. “That’s where the quality comes from,” Frenck says. “People run this in their own homes… and they take care that it needs to be good.” This is the unheard-of secret behind Home Assistant’s engineering velocity. Every contributor has access to production hardware. Every reviewer has a high-stakes environment to protect. No staging environment could replicate millions of real homes, each with its own weird edge cases. Assist: A local voice assistant built before the AI hype wave Assist is Home Assistant’s built-in voice assistant, a modular system that lets you control your home using speech without sending audio or transcripts to any cloud provider. As Frenck puts it: We were building a voice assistant before the AI hype… we want to build something privacy-aware and local. Rather than copying commercial assistants like Alexa or Google Assistant, Assist takes a two-layer approach that prioritizes determinism, speed, and user choice. Stage 1: Deterministic, no-AI commands Assist began with a structured intent engine powered by hand-authored phrases contributed by the community. Commands like “Turn on the kitchen light” or “Turn off the living room fan” are matched directly to known actions without using machine learning at all. This makes them extremely fast, reliable, and fully local. No network calls. No cloud. No model hallucinations. Just direct mapping from phrase to automation. Stage 2: Optional AI when you want natural language One of the more unusual parts of Assist is that AI is never mandatory. Frenck emphasizes that developers and users get to choose their inference path: “You can even say you want to connect your own OpenAI account. Or your own Google Gemini account. Or get a Llama running locally in your own home.” Assist evaluates each command and decides whether it needs AI. If a command is known, it bypasses the model entirely. “Home Assistant would be like, well, I don’t have to ask AI,” Frenck says. “I know what this is. Let me turn off the lights.” The system only uses AI when a command requires flexible interpretation, making AI a fallback instead of the foundation. Open hardware to support the system To bootstrap development and give contributors a reference device, the team built a fully open source smart speaker—the Voice Assistant Preview Edition. “We created a small speaker with a microphone array,” Frenck says. “It’s fully open source. The hardware is open source; the software running on it is ESPHome.” This gives developers a predictable hardware target for building and testing voice features, instead of guessing how different microphones, DSP pipelines, or wake word configurations behave across vendors. Hardware as a software accelerator Most open source projects avoid hardware. Home Assistant embraced it out of practical necessity. “In order to get the software people building the software for hardware, you need to build hardware,” Frenck says. Home Assistant Green, its prebuilt plug-and-play hub, exists because onboarding requires reliable hardware. The Voice Assistant Preview Edition exists because the voice pipeline needs a known microphone and speaker configuration. This is a rare pattern: hardware serves as scaffolding for software evolution. It’s akin to building a compiler and then designing a reference CPU so contributors can optimize code paths predictably. The result is a more stable, more testable, more developer-friendly software ecosystem. 
A glimpse into the future: local agents and programmable homes The trajectory is clear. With local AI models, deterministic automations, and a stateful view of the entire home, the next logical step is agentic behavior that runs entirely offline. If a couch can trigger a movie automation, and a brewery can run a fermentation pipeline, the home itself becomes programmable. Every sensor is an input. Every device is an actuator. Every automation is a function. The entire house becomes a runtime. And unlike cloud-bound competitors, Home Assistant’s runtime belongs to the homeowner, not the service provider. Frenck sums up the ethos: “We give that control to our community.” Looking to stay one step ahead? Read the latest Octoverse report and consider trying Copilot CLI. The post “The local-first rebellion”: How Home Assistant became the most important project in your house appeared first on The GitHub Blog.
Read more →
Compassionate Curmudgeon: Why we must root ourselves in the real world
2025-12-02 08:33 | Source: Hacker News
Comments
Read more →
John Giannandrea Is Out
2025-12-01T22:50:33Z | Source: Daring Fireball
Apple Newsroom, “John Giannandrea to Retire From Apple”: Apple today announced John Giannandrea, Apple’s senior vice president for Machine Learning and AI Strategy, is stepping down from his position and will serve as an advisor to the company before retiring in the spring of 2026. Apple also announced that renowned AI researcher Amar Subramanya has joined Apple as vice president of AI, reporting to Craig Federighi. Subramanya will be leading critical areas, including Apple Foundation Models, ML research, and AI Safety and Evaluation. The balance of Giannandrea’s organization will shift to Sabih Khan and Eddy Cue to align closer with similar organizations. After the fiasco around Apple Intelligence and the “more personalized Siri” features — which were announced at WWDC in June 2024, but postponed until 2026 in a tail-between-their-legs announcement in March 2025 — and the executive reshuffling immediately after that delay was announced that put Mike Rockwell in charge of Siri and moved all or most of Apple Intelligence and Siri under Craig Federighi, it would have been much more surprising if Giannandrea had stayed at Apple. In fact, I’m surprised he wasn’t out before WWDC this past June. I don’t think we need to wait for additional details to know that he was squeezed out. If, as Mark Gurman reported back in March, “Tim Cook has lost confidence in the ability of AI head John Giannandrea to execute on product development”, why was he still there? As for Subramanya, according to his LinkedIn profile, he was at Google for 16 years, and left to join Microsoft only five months ago. Either he didn’t like working at Microsoft, or Apple made him an offer he couldn’t refuse (or, perhaps, both). ★
Read more →
★ Signal Secure Backups Are Now Available on iOS
2025-12-01T22:38:17Z | Source: Daring Fireball
Signal Support: Signal Secure Backups can help you safely restore your chats if something unexpected happens to your device (like dropping your phone in a lake). When this optional feature is enabled, your device will automatically back up your message history so you won’t lose important data if you get a new phone or reinstall Signal. Your Secure Backup Archive is end-to-end encrypted and protected by a cryptographically secure 64-character recovery key that is never shared with the Signal service. Without your unique recovery key, no one (including Signal) can read, decrypt, or restore any of the data in your Secure Backup Archive. Signal’s cloud storage service is optional (of course), and available to all users free of charge. At the free tier, it will back up the complete text of users’ chat history and the last 45 days of file attachments (images, video, etc.). For $2/month (through in-app purchase in the iPhone app), Signal will remove the 45-day window on media attachments, and store up to 100 GB of attachments — which, for most users, should be their complete history. (I don’t remember how far back in time my iCloud iMessage storage goes, but, as I type this, it includes 772,004 messages and consumes 83.4 GB of storage. I have a lot of images in there. 100 GB of storage feels pretty good for $2/month. My personal Signal account backup size is just 408 MB, which jibes with my gut feeling regarding how much I use Signal compared to iMessage — about one-half of one percent as much.) Signal first announced this feature back in September in a blog post that has a lot of technical details about how it works, but until a week ago, it was only available on the Android version. It’s still labelled as a “beta” feature on iOS. I enabled it over the weekend and signed up for the $2/month subscription — both to back up all my attachments and to support the Signal Foundation. Now that I’m paying $2/month, however, I wish they’d stop periodically badgering me for donations when I launch the app. I’m glad this feature became available when it did, and that I enabled it over the weekend. Yesterday I set up my personal new iPhone this year, and this morning, when I tried to transfer my Signal account from my old iPhone to the new one, after claiming to reach “100%” of the transfer, and the Signal app reporting on both the old (source) and new (destination) phones that the transfer was complete, the app crashed on both phones. After that, the Signal app was in factory-fresh state on both phones, without any trace of my account history. I then restored the new iPhone from my brand-new online Signal Secure Backup, and that worked perfectly. And it somehow took far, far less time than the old device-to-device transfer — maybe one minute, versus 15 minutes or so for the device-to-device transfer that wound up failing. Until now, transferring my Signal account history from one phone to another always felt like delivering a crate full of eggs while riding a rickety old bicycle without brakes on a bumpy cobblestone street. Every time I did it device-to-device, it felt like I’d be lucky if it worked. And my experience trying it this morning — for the last time — proved me right. Signal proponents often defended this architecture by arguing that remaining only on device was a security benefit. In some ways that’s true, but there’s nothing “secure” about a transfer feature that loses all of your data if the transfer fails. 
(Signal data, by design, isn’t included in iCloud backups because Apple holds a key to unlock iCloud backups for customer service reasons, unless the user has enabled Advanced Data Protection.) Permanently losing all your data is a different form of “insecurity” than having it exfiltrated by an attacker or exposed to law enforcement agencies via a warrant issued to the cloud backup provider, but it’s a form of insecurity nonetheless. Signal’s top priority has always been protecting your data from being obtained by others. That’s a noble idea, and central to Signal’s brand. But by placing that priority so far above everything else, it meant, until now, that you’d lose your entire account history if you lost or broke your primary phone. This new secure backup system shows that your data can remain secure while also being backed up off device. I’m glad the feature is finally here, but it should have been here years ago. A user-hostile “lose your phone, lose your account history” architecture may well be “secure” in a technical sense, but it’s the sort of brittleness that’s kept Signal from achieving more mainstream use.
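The core idea behind this kind of backup — a key derived from a recovery secret that never leaves your devices — is worth spelling out. Below is a minimal Python sketch of that general pattern using the cryptography package. It is not Signal’s actual protocol (their September blog post documents the real design); the key-derivation label and parameters here are illustrative assumptions.

```python
# pip install cryptography
import os
import secrets
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def generate_recovery_key() -> str:
    # 64 hex characters of randomness, shown to the user once and never uploaded.
    return secrets.token_hex(32)

def encrypt_backup(plaintext: bytes, recovery_key: str) -> bytes:
    # Derive an AES-256 key from the recovery key, entirely on-device.
    key = HKDF(
        algorithm=hashes.SHA256(),
        length=32,
        salt=None,
        info=b"illustrative-backup-key",  # label is a made-up example
    ).derive(recovery_key.encode())
    nonce = os.urandom(12)
    # Only this ciphertext (nonce + encrypted blob) would ever be uploaded;
    # without the recovery key, the storage service cannot decrypt it.
    return nonce + AESGCM(key).encrypt(nonce, plaintext, None)

recovery_key = generate_recovery_key()
blob = encrypt_backup(b"message history archive", recovery_key)
```

The design choice this illustrates is the one that matters in the Signal case: the service stores an opaque blob, and the only thing that can turn it back into chat history is a secret the user holds.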
Read more →
The Talk Show: ‘Financial Boner’
2025-12-01T00:39:09Z | Source: Daring Fireball
Special guest: Tyler Hayes. Topics include how to get a small phone today, which way foldables should fold, the state of Apple TV (including its new “sonic logo”), and some holiday gift gadget recommendations. Sponsored by: Clic for Sonos: No lag. No hassle. Just Clic. Squarespace: Save 10% off your first purchase of a website or domain using code talkshow. ★
Read more →
Tides are weirder than you think
2025-12-01 20:14 | Source: Hacker News
Comments
Read more →
How to orchestrate agents using mission control
2025-12-01 17:00 | Source: GitHub Engineering
We recently shipped Agent HQ’s mission control, a unified interface for managing GitHub Copilot coding agent tasks. Now you can assign tasks to Copilot across repos, pick a custom agent, watch real-time session logs, steer mid-run (pause, refine, or restart), and jump straight into the resulting pull requests—all in one place. Instead of bouncing between pages to see status, rationale, and changes, mission control centralizes assignment, oversight, and review. Having the tool is one thing. Knowing how to use it effectively is another. This guide shows you how to orchestrate multiple agents, when to intervene, and how to review their work efficiently. Being great at orchestrating agents means unblocking parallel work in the same timeframe you’d otherwise spend on one task, and stepping in when logs show drift, tests fail, or scope creeps. The mental model shift From sequential to parallel If you’re already used to working with an agent one at a time, you know it’s inherently sequential. You submit a prompt, wait for a response, review it, make adjustments, and move to the next task. Mission control changes this. You can kick off multiple tasks in minutes—across one repo or many. Previously, you’d navigate to different repos, open issues in each one, and assign Copilot separately. Now you can enter prompts in one place, and Copilot coding agent goes to work across all of them. That being said, there is a trade-off to keep in mind: instead of each task taking 30 seconds to a few minutes to complete, your agents might spend a few minutes to an hour on a draft. But you’re no longer just waiting. You’re orchestrating. When to stay sequential Not everything belongs in parallel. Use sequential workflows when tasks have dependencies, when you’re exploring unfamiliar territory, or when complex problems require validating assumptions between steps. When assigning multiple tasks from the same repo, consider overlap. Agents working in parallel can create merge conflicts if they touch the same files. Be thoughtful about partitioning work. Tasks that typically run well in parallel include research work (finding feature flags, configuration options), analysis (log analysis, performance profiling), documentation generation, security reviews, and work in different modules or components. Tips for getting started The shift is simple: you move from waiting on a single run to overseeing multiple runs progressing in parallel, stepping in for failed tests, scope drift, or unclear intent where guidance will save time. Write clear prompts with context Specificity matters. Describe the task precisely. Good context remains critical for good results. Helpful context includes screenshots showing the problem, code snippets illustrating the pattern you want, and links to relevant documentation or examples. Weak prompt: “Fix the authentication bug.” Strong prompt: “Users report ‘Invalid token’ errors after 30 minutes of activity. JWT tokens are configured with 1-hour expiration in auth.config.js. Investigate why tokens expire early and fix the validation logic. Create the pull request in the api-gateway repo.” Use custom agents for consistency Mission control lets you select custom agents that use agents.md files from your selected repo. These files give your agent a persona and pre-written context, removing the burden of constantly providing the same examples or instructions. If you manage repos where your team regularly uses agents, consider creating agents.md files tailored to your common workflows. 
This ensures consistency across tasks and reduces the cognitive load of crafting detailed prompts each time. Once you’ve written your prompt and selected your custom agent (if applicable), kick off the task. Your agent gets to work immediately. How to write a great agents.md Most agents.md files fail because they are too vague. I analyzed 2,500 agent instruction files to learn what the good ones were doing differently. Read the guide to see what makes those stand out, and which agents you should consider building today. Tips for active orchestration You’re now a conductor of agents. Each task might take a minute or an hour, depending on complexity. You have two choices: watch your agents work so you can intervene if needed, or step away and come back when they’re done. Reading the signals Below are some common indicators that your agent is not on the right track and needs additional guidance: Failing tests, integrations, or fetches: The agent can’t fetch dependencies, authentication fails, or unit tests break repeatedly. Unexpected files being created: Files outside the scope appear in the diff, or the agent modifies shared configuration. Scope creep beyond what you requested: The agent starts refactoring adjacent code or “improving” things you didn’t ask for. Misunderstanding your intent: The session log reveals the agent interpreted your prompt differently than you meant. Circular behavior: The agent tries the same failing approach multiple times without adjusting. When you spot issues, evaluate their severity. Is that failing test critical? Does that integration point matter for this task? The session log typically shows intent before action, giving you a chance to intervene if you’re monitoring. The art of steering When you need to redirect an agent, be specific. Explain why you’re redirecting and how you want it to proceed. Bad steering: “This doesn’t look right.” Good steering: “Don’t modify database.js—that file is shared across services. Instead, add the connection pool configuration in api/config/db-pool.js. This keeps the change isolated to the API layer.” Timing matters. Catch a problem five minutes in, and you might save an hour of ineffective work. Don’t wait until the agent finishes to provide feedback. You can also stop an agent mid-task and give it refined instructions. Restarting with better direction is simple and often faster than letting a misaligned agent continue. Why session logs matter Session logs show reasoning, not just actions. They reveal misunderstandings before they become pull requests, and they improve your future prompts and orchestration practices. When Copilot says “I’m going to refactor the entire authentication system,” that’s your cue to steer. Tips for the review phase When your agents finish, you’ll have pull requests to review. Here’s how to do it efficiently. Ensure you review: Session logs: Understand what the agent did and why. Look for reasoning errors before they become merged code. Did the agent misinterpret your intent? Did it assume something incorrectly? Files changed: Review the actual code changes. Focus on files you didn’t expect to see modified, changes that touch shared, risky, or critical code paths, patterns that don’t match your team’s standards and practices, and missing edge case handling. Checks: Verify that tests pass (your unit tests, Playwright, CI/CD, etc.). When checks fail, don’t just restart the agent. Investigate why. A failing test might reveal the agent misunderstood requirements, not just wrote buggy code. 
This pattern gives you the full picture: intent, implementation, and validation. Ask Copilot to review its own work After an agent completes a task, ask it: “What edge cases am I missing?” “What test coverage is incomplete?” “How should I fix this failing test?” Copilot can often identify gaps in its own work, saving you time and improving the final result. Treat it like a junior developer who’s willing to explain their reasoning. Batch similar reviews Generating code with agents is straightforward. Reviewing that code—ensuring it meets your standards, does what you want, and that it can be maintained by your team—still requires human judgment. Improve your review process by grouping similar work together. Review all API changes in one session. Review all documentation changes in another. Your brain context-switches less, and you’ll spot patterns and inconsistencies more easily. What’s changed for the better Mission control moves you from babysitting single agent runs to orchestrating a small fleet. You define clear, scoped tasks. You supply just enough context. You launch several agents. The speed gain is not that each task finishes faster; it’s that you unblock more work in the same timeframe. What makes this possible is discipline: specific prompts, not vague requests. Custom agents in agents.md that carry your patterns so you don’t repeat yourself. Early steering when session logs show drift. Treating logs as reasoning artifacts you mine to write a sharper next prompt. And batching reviews so your brain stays in one mental model long enough to spot subtle inconsistencies. Lead your own team of agents to create something great! Ready to start? Visit mission control or learn more about GitHub Copilot for your organization. The post How to orchestrate agents using mission control appeared first on The GitHub Blog.
Read more →
Idempotency Keys for Exactly-Once Processing
2025-12-01 12:07 | Source: Hacker News
Comments
Read more →
Dekáf Coffee Roasters
2025-11-29T22:42:00Z | Source: Daring Fireball
My thanks to Dekáf for sponsoring Daring Fireball this week. They’ve just launched a nice lineup of holiday gift bundles — curated sets of their most-loved coffees that make gift-buying easy. Nine single origins. Six signature blends. Four Mizudashi cold brews. All micro-lot and top-rated coffees are shipped within 24 hours of roasting. No shortcuts. No crash. Dekáf is coffee at its most refined, just without the caffeine. I’ve gone through a few bags, and each one tasted great — like high quality regular coffee. And, there’s a special offer just for DF readers: get 20% off with code DF. ★
Read more →
Festivitas — Now for iOS, Thanks to Widgets
2025-11-29T22:41:38Z | Source: Daring Fireball
Last year developer Simon Støvring launched a fun new app for the Mac called Festivitas, which let you decorate your menu bar and Dock with animated holiday lights and falling snow. This year he’s added an iOS version for iPhone and iPad that lets you create widgets to decorate your home screens with holidays lights and festive photo frames. Pure fun. See also: Jason Snell on using Festivitas’s Shortcuts support to create an automation that gives a 10 percent chance of snow every 20 minutes. Støvring’s own Shortcuts examples (available in the app’s Settings window) include things like turning on the lights when music starts playing. With support for Shortcuts, users can create their own fun. ★
Read more →
‘A Critter Carol’ — Apple’s 2025 Holiday Short Film
2025-11-29T22:20:37Z | Source: Daring Fireball
Delightful, and there’s an equally delightful behind-the-scenes video. ★
Read more →
How fast can browsers process base64 data?
2025-11-29 07:30 | Source: Hacker News
Comments
Read more →
‘Fifteen Years’
2025-11-28T02:08:30Z | Source: Daring Fireball
A masterpiece from Randall Munroe, perfect for Thanksgiving. ★
Read more →
The ultimate gift guide for the developer in your life
2025-11-28 13:08 | Source: GitHub Engineering
Finding the right gift for the developers on your list shouldn’t feel like chasing an intermittent bug. We’re here to give you our top tips for finding the perfect gift for them. (Or just for you, you know you deserve it.) From vibe code to holiday mode Move straight from shipping code to holiday mode with the ugly holiday socks and beanie. Made with 49% merino wool, the socks keep your feet warm, while the 100% wool beanie handles the rest. Pair them with the ugly sweater to finish the look.🎁 And they’re all included in our Black Friday sale—so grab them before they’re gone. The answer to life’s biggest questions “Should I push to prod on New Year’s Eve?” “Should I eat more of my favorite holiday dish?” “Should I create another side project in 2026?” The GitHub Copilot Amazeball will have the answer for these questions, and many, many more. It’ll get you on your path. Is it the right path? We don’t have the answer for that. Stay hydrated (and caffeinated) A good day starts with a coffee, right? We’ve got the solution to all your hot beverage needs, whether you’re on the go with our Invertocat orb bottle, at your desk with our “Ship It” diner mug (on sale now) or deep in the forest with our Invertocat MiiR camp mug (also works on the school run). And, of course, staying hydrated is important too, and nothing will help you more than our Stanley cup or Invertocat Asobu Marina tumbler (which also has a coffee cup built in, just saying). Get them both in our Black Friday sale. Level up your workspace Have you seen our latest key caps? They may not help you ship code faster, but they do look great. They’re an easy office gift or the perfect extra touch for a holiday present. The recycled desk mat helps your mouse glide smoothly across your workspace. Sophisticated and sleek… well, not exactly. This desk mat is loud, proud, and unmistakably GitHub—with Octocats from edge to edge. A fan favorite on social media, our MiiR backpack is here to get you to the office, to the plane and everywhere in between. A holiday must-have. For future builders Encourage curiosity early. For the little builders in your life, we have our very cosy youth ASCII Invertocat pullover, along with our ASCII Invertocat tee. Want to do some holiday matching? We have the ASCII Cube tee in both youth and adult sizes. Take advantage of our Black Friday sale Giving the perfect gift feels great. Getting it at a solid discount feels even better. From November 26 to December 7, our Black Friday sale offers markdowns on some of these picks and plenty more across the shop. See everything on sale, and check our holiday order deadlines to ensure your gifts arrive on time. From all of us at the GitHub Shop: here’s to a December packed with good gifts, good energy, and the occasional sprint to finish that last bit of code. Have a joyful holiday season. The post The ultimate gift guide for the developer in your life appeared first on The GitHub Blog.
Read more →
David Lerner, Co-Founder of Tekserve, Dies at 72
2025-11-27T00:31:44Z | Source: Daring Fireball
Sam Roberts, reporting for The New York Times: David Lerner, a high school dropout and self-taught computer geek whose funky foothold in New York’s Flatiron district, Tekserve, was for decades a beloved discount mecca for Apple customers desperate to retrieve lost data and repair frozen hard drives, died on Nov. 12 at a hospital in Manhattan. He was 72. [...] Tekserve specialized in finding the cures for sick computers — including insect infestations — and recovering first novels and other priceless data, which the company said it was able to do about 85 percent of the time. “We only charged for success,” Mr. Lerner said. There were many great independent Apple resellers from the pre-Apple-Store era. There was only one that was legendary: Tekserve. ★
Read more →
Running to the Press
2025-11-26T23:55:20Z | Source: Daring Fireball
Regarding my earlier post on similarities between the 2010 App Store Guidelines and today’s: Notably absent from the current guidelines (I think for a very long time) is the specious but very Jobsian claim that “If you run to the press and trash us, it never helps.” Getting the press on your side is one of the best ways for a developer to get an unjust App Store review decision overturned. Apple loathes negative publicity. ★
Read more →
November Update to the App Store Review Guidelines
2025-11-26T21:46:25Z | Source: Daring Fireball
Here’s the updated full guideline for section 4.1: 4.1 Copycats (a) Come up with your own ideas. We know you have them, so make yours come to life. Don’t simply copy the latest popular app on the App Store, or make some minor changes to another app’s name or UI and pass it off as your own. In addition to risking an intellectual property infringement claim, it makes the App Store harder to navigate and just isn’t fair to your fellow developers. (b) Submitting apps which impersonate other apps or services is considered a violation of the Developer Code of Conduct and may result in removal from the Apple Developer Program. (c) You cannot use another developer’s icon, brand, or product name in your app’s icon or name, without approval from the developer. It’s guideline (c) that’s new, but I like guideline (a) here. Not just the intent of it, but the language. It’s clear, direct, and human. It reminds me of the tone of the very early guidelines, when it seemed like Steve Jobs’s voice was detectable in some of them. In a post back in 2010, I wrote: This new document is written in remarkably casual language. For example, a few bullet items from the beginning: We have over 250,000 apps in the App Store. We don’t need any more Fart apps. If your app doesn’t do something useful or provide some form of lasting entertainment, it may not be accepted. If your App looks like it was cobbled together in a few days, or you’re trying to get your first practice App into the store to impress your friends, please brace yourself for rejection. We have lots of serious developers who don’t want their quality Apps to be surrounded by amateur hour. We will reject Apps for any content or behavior that we believe is over the line. What line, you ask? Well, as a Supreme Court Justice once said, “I’ll know it when I see it”. And we think that you will also know it when you cross it. If your app is rejected, we have a Review Board that you can appeal to. If you run to the press and trash us, it never helps. Some of that language remains today. Here’s the current guideline for section 4.3: 4.3 Spam [...] (b) Also avoid piling on to a category that is already saturated; the App Store has enough fart, burp, flashlight, fortune telling, dating, drinking games, and Kama Sutra apps, etc. already. We will reject these apps unless they provide a unique, high-quality experience. Spamming the store may lead to your removal from the Apple Developer Program. I could be wrong, but my sense is that Apple has, without much fanfare, cracked down on scams and rip-offs in the App Store. That doesn’t mean there’s none. But it’s like crime in a city: a low amount of crime is the practical ideal, not zero crime. Maybe Apple has empowered something like the “bunco squad” I’ve wanted for years? If I’m just unaware of blatant rip-offs running wild in the App Store, send examples my way. ★
Read more →
Simple Rule of Thumb: AI Systems Shouldn’t Pretend to Be Human
2025-11-25T02:17:24Z | Source: Daring Fireball
Dave Winer: The new Amazon Alexa with AI has the same basic problem of all AI bots, it acts as if it’s human, with a level of intimacy that you really don’t want to think about, because Alexa is in your house, with you, listening, all the time. Calling attention to an idea that there’s a pseudo-human spying on you is bad. Alexa depends on the opposite impression, that it’s just a computer. I think AI’s should give up the pretense that they’re human, and this one should be first. Amen. ★
Read more →
[Sponsor] Dekáf Coffee Roasters — Holiday Gift Bundles
2025-11-25T02:02:12Z | Source: Daring Fireball
Meet our new Holiday Gift Bundles, curated sets of our most loved coffees designed for effortless gifting. Nine single origins. Six signature blends. Four Mizudashi cold brews. All micro-lot and top-rated coffees shipped within 24 hours of roasting. No shortcuts. No crash. This is coffee at its most refined, just without the caffeine. DF readers get 20% off with code DF at dekaf.com/s/df. We’re betting you’ll never look at decaf the same way again. But that’s kind of the point. ★
Read more →
‘A Worthless, Poisoned Hall of Mirrors’
2025-11-25T01:48:34Z | Source: Daring Fireball
Charlie Warzel, writing for The Atlantic: X’s decision to show where accounts are based is, theoretically, a positive step in the direction of transparency for the platform, which has let troll and spam accounts proliferate since Musk’s purchase, in late 2022. And yet the scale of the deception — as revealed by the “About” feature — suggests that in his haste to turn X into a political weapon for the far right, Musk may have revealed that the platform he’s long called “the number 1 source of news on Earth” is really just a worthless, poisoned hall of mirrors. Max Berger, on Bluesky: If I’m understanding this correctly, X is owned by a white nationalist who pays poor people of color in developing countries to pretend to be working class white Americans to scare other white Americans into being afraid poor people of color from developing countries are going to ruin America? Pretty much. ★
Read more →
Department of Transportation Asks Travelers to ‘Bring Civility Back’ to Air Travel
2025-11-25T00:39:14Z | Source: Daring Fireball
The New York Times: Sean Duffy, the secretary of transportation, began a new campaign on Wednesday that he called “The Golden Age of Travel Starts With You,” complete with a 1960s-style public service announcement that spliced together scenes of the country’s first air travelers, dressed in suits and hats, with present-day clips of in-flight brawls and airport meltdowns. In the background, Frank Sinatra sings “Come Fly With Me.” From the Department of Transportation website: Secretary Duffy posed a few key questions every flyer should ask themselves this holiday season to help Americans reach their destinations as quickly, efficiently and comfortably as possible: Are you helping a pregnant woman or the elderly with placing their bags in the overhead bin? Are you dressing with respect? Are you keeping control of your children and helping them through the airport? Are you saying thank you to your flight attendants? Are you saying please and thank you in general? “Quiet, piggy.” ★
Read more →
Why developers still flock to Python: Guido van Rossum on readability, AI, and the future of programming
2025-11-25 17:00 | Source: GitHub Engineering
When we shared this year’s Octoverse data with Guido van Rossum, the creator of Python, his first reaction was genuine surprise. While TypeScript overtook Python to become the most used language on GitHub as of August 2025 (marking the biggest language shift in more than a decade), Python still grew 49% year over year in 2025, and remains the default language of AI, science, and education for developers across the world. “I was very surprised by that number,” Guido told us, noting how this result is different from other popularity trackers like the TIOBE Index. To learn more, we sat down with Guido for a candid conversation about Python’s roots, its ever-expanding reach, and the choices—both big and small—that have helped turn a one-time “hobby project” into the foundation for the next generation of developers and technologies. Watch the full interview above. 👆 📦 What is Python? Python is a high-level, general-purpose programming language created by Guido van Rossum in 1991. It’s designed to be readable, intuitive, and easy to learn—using clean indentation instead of braces, friendly error messages, and a massive standard library. Developers use Python for everything from data science and AI to web apps, automation, scripting, scientific computing, and education. Its ecosystem includes widely used tools like NumPy, pandas, Django, FastAPI, PyTorch, and Jupyter. Because it’s open source, cross-platform, and backed by a huge global community, Python remains one of the most accessible and versatile languages in the world. The origins of Python For Guido, Python began as a tool to solve the very real (and very painful) gap between C’s complexity and the limitations of shell scripting. “I wanted something that was much safer than C, and that took care of memory allocation, and of all the out of bounds indexing stuff, but was still an actual programming language. That was my starting point,” says Guido van Rossum, creator of Python. He was working on a novel operating system, and the only available language was C. “In C, even the simplest utility that reads two lines from input becomes an exercise in managing buffer overflows and memory allocation,” he says. Shell scripts weren’t expressive enough, and C was too brittle. Building utilities for a new operating system showed just how much friction existed in the developer workflow at the time. Guido wanted to create a language that served as a practical tool between the pain of C and the limits of shell scripting. And that led to Python, which he designed to take care of the tough parts, and let programmers focus on what matters. Python’s core DNA—clarity, friendliness, and minimal friction—was baked in from the beginning, too. It’s strangely fitting that a language that started as such a practical project now sits at the center of open source, AI, data science, and enterprise AI. Why TypeScript pulled ahead in 2025: Guido’s view Python held the top spot on GitHub for most of 2024 and half of 2025. But by August, TypeScript took the lead—and that surprised Guido. He offered several possible explanations: modern static websites are checked into GitHub; modern JavaScript frameworks scaffold with TypeScript; and GitHub’s data reflects public and open source activity rather than global usage. “If you’re writing JavaScript today, the logical conclusion is to use TypeScript,” Guido says. But he doesn’t view this competitively. He treats data like a puzzle, not a threat. 
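Guido’s point about C — that even a utility that reads two lines from input becomes an exercise in managing buffers and memory — is easy to make concrete. Here is a deliberately trivial, illustrative sketch of such a utility in Python, where the runtime handles allocation and bounds checking so the code is only about the task:

```python
# A tiny utility in the spirit of Guido's example: read two lines and
# report which is longer. No buffers to size, no malloc/free, no
# out-of-bounds risk; the runtime handles memory and bounds checks.
first = input("First line: ")
second = input("Second line: ")

longer = first if len(first) >= len(second) else second
print(f"The longer line ({len(longer)} characters): {longer}")
```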
Monty Python and the language’s personality Unlike other programming languages named for ancient philosophers or stitched-together acronyms, Python’s namesake comes from Monty Python’s Flying Circus. “I wanted to express a little irreverence,” Guido says. “A slight note of discord in the staid world of computer languages.” The name “Python” wasn’t a joke—it was a design choice, and a hint that programming doesn’t have to feel solemn or elitist. That sense of fun and accessibility has become as valuable to Python’s brand as its syntax. Ask practically anyone who’s learned to code with Python, and they’ll talk about its readability, its welcoming error messages, and the breadth of community resources that flatten that first steep climb. If you wrote something in Python last week and, six months from now, you’re reading that code, it’s still clear. Python’s clarity and user friendliness compared to Perl were definitely among the reasons why Python overtook Perl in the early aughts. Python and AI: ecosystem gravity and the NumPy to ML to LLM pipeline Python’s influence in AI isn’t accidental. It’s a signal of the broader ecosystem compounding on itself. Today, some of the world’s fastest-growing AI infrastructure is built in Python, such as PyTorch and Hugging Face Transformers. So, why Python? Guido credits the ecosystem around Python as the primary cause: after all, once a particular language has some use and seems to be a good solution, it sparks an avalanche of new software in that language, so it can take advantage of what already exists. Moreover, he points to key Python projects: NumPy for foundational numerical arrays; pandas for making data manipulation easier; PyTorch for machine learning at scale; and local model runners and LLM agents, today’s frontier, with projects like ollama leading the charge. The people now writing things for AI are familiar with Python because they started out in machine learning. Python isn’t just the language of AI. It enabled AI to become what it is today. That’s due, in part, to the language’s ability to evolve without sacrificing approachability. From optional static typing to a treasure trove of open source packages, Python adapts to the needs of cutting-edge fields without leaving beginners behind. Does Python need stronger typing in the LLM era? Guido says no. With AI generating more Python than ever, the natural question is: does Python need stricter typing? Guido’s answer was immediate: “I don’t think we need to panic and start doing a bunch of things that might make things easier for AI.” He believes Python’s optional typing system—while imperfect—is “plenty.” AI should adapt to us, not the other way around. He also offered a key insight: the biggest issue isn’t Python typing, but the training data. “Most tutorials don’t teach static typing,” he says. “AI models don’t see enough annotated Python.” But LLMs can improve: “If I ask an AI to add a type annotation,” he says, “it usually researches it and gets it right.” This reveals a philosophy that permeates the language: Python is for developers first and foremost. AI should always meet developers where they are. Democratizing development, one developer-friendly error message at a time We asked Guido why Python remains one of the most popular first programming languages. His explanation is simple and powerful: “There aren’t that many things you can do wrong that produce core dumps or incorrect magical results.” Python tells you what went wrong, and where. 
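Both points — optional typing and readable failures — fit in a few lines. This is a generic illustration rather than anything from the interview: the untyped function runs as-is, the annotated one gives a checker like mypy something to verify, and when things do go wrong at runtime the error names what broke and where.

```python
# Type annotations are optional: both definitions are valid Python.
def total(prices):                                   # untyped; runs as-is
    return sum(prices)

def total_annotated(prices: list[float]) -> float:   # typed; tools like mypy can check callers
    return sum(prices)

print(total_annotated([1.50, 2.25]))   # 3.75

# And when something is wrong, the error says what and where:
total_annotated(None)
# TypeError: 'NoneType' object is not iterable
```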
And Guido sees the downstream effect constantly: “A very common theme in fan mail is: Python made my career. Without it, I wouldn’t have gotten into software at all.” That’s not sentimentality. It’s user research. Python is approachable because it’s designed for developers who are learning, tinkering, and exploring. It’s also deeply global. This year’s Octoverse report showed that India alone added 5M+ developers in 2025, in a year in which we saw more than one developer a second join GitHub. A number of these new developers come from non-traditional education paths. Guido saw this coming: “A lot of Python users and contributors do not have a computer science education … because their day jobs require skills that go beyond spreadsheets.” The clear syntax provides a natural entry point for first-time coders and tinkerers. As we’ve seen on GitHub, the language has been a launchpad not just for CS graduates, but for scientists in Brazil, aspiring AI developers in India, and anyone looking for the shortest path from idea to implementation. Whitespace complaints: Guido’s other inbox Python famously uses indentation for grouping. Most developers love this. But some really don’t. Guido still receives personal emails complaining. “Everyone else thinks that’s Python’s best feature,” he says. “But there is a small group of people who are unhappy with the use of indentation or whitespaces.” It’s charming, relatable, and deeply on brand. Stability without stagnation: soft keywords and backwards compatibility Maintaining Python’s momentum hasn’t meant standing still. Guido and the core dev team are laser-focused on backward compatibility, carefully weighing every new feature against decades of existing code. “For every new feature, we have to very carefully consider: is this breaking existing code?” Sometimes, the best ideas grow from constraints. For instance, Python’s soft keywords, which are reserved only in specific syntactic contexts (match and case are the best-known examples), are a recent architectural decision that lets the team introduce new syntax without breaking old programs that use those same words as ordinary identifiers. It’s a subtle but powerful engineering choice that keeps enterprises on solid ground while still allowing the language to evolve. This caution, often misinterpreted as reluctance, is exactly why Python has remained stable across three decades. For maintainers, the lessons are clear: learn widely, solve for yourself, invite input, and iterate. Python’s journey proves that what starts as a line of code to solve your own problem can become a bridge to millions of developers around the world. Designed for developers. Ready for whatever comes next. Python’s future remains bright because its values align with how developers actually learn and build: readability, approachability, stability, and a touch of irreverence. As AI continues to influence software development—and Octoverse shows that 80% of new developers on GitHub use GitHub Copilot in their first week—Python’s clarity matters more than ever. And as the next generation begins coding with AI, Python will be there to help turn ideas into implementations. Looking to stay one step ahead? Read the latest Octoverse report and try Copilot CLI. The post Why developers still flock to Python: Guido van Rossum on readability, AI, and the future of programming appeared first on The GitHub Blog.
Read more →
How GitHub’s agentic security principles make our AI agents as secure as possible
2025-11-25 16:00 | Source: GitHub Engineering
We’ve been hard at work over the past few months to build the most usable and enjoyable AI agents for developers. To strike the right balance between usability and security, we’ve put together a set of guidelines to make sure that there’s always a human-in-the-loop element to everything we design. The more “agentic” an AI product is, the more it can actually do, enabling much richer workflows, but at the cost of a greater risk. With added functionality, there’s a greater chance and a much greater impact of the AI going off its guardrails, losing alignment, or even getting manipulated by a bad actor. Any of these could cause security incidents for our customers. To make these agents as secure as possible, we’ve built all of our hosted agents to maximize interpretability, minimize autonomy, and reduce anomalous behavior. Let’s dive into our threat model for our hosted agentic products, specifically Copilot coding agent. We’ll also examine how we’ve built security controls to mitigate these threats, and perhaps you’ll be able to apply these principles to your own agents. Security concerns When developing agentic features, we are primarily concerned with three classes of risks: Data exfiltration When an agent has Internet access, it could leak data from the context to unintended destinations. The agent may be tricked into sending data from the current repository to an unintended website, either inadvertently or maliciously. Depending on the sensitivity of data, this could result in a severe security incident, such as if an agent leaks a write access GitHub token to a malicious endpoint. Impersonation and proper action attribution When an agent undertakes an action, it may not be clear what permissions it should have or under whose direction it should operate. When someone assigns the Copilot coding agent to an issue, who issued the directive—the person who filed the issue or the person who assigned it to Copilot? And if an incident does occur as a result of something an agent did, how can we ensure proper accountability and traceability for the actions taken by the agent? Prompt injection Agents operate on behalf of the initiating user, so it’s very important to ensure that the initiating user knows what the agent is going to do. Agents are prompted from GitHub Issues, files within a repository, and many other places, so it’s important to ensure that the initiator has a clear picture of all the information guiding it. If not, malicious users could hide directives and trick repository maintainers into running agents with bad directives. Rules for agentic products To help prevent the above risks, we have created a set of rules for all of our hosted agentic products to make them more consistent and secure for our users. Ensuring all context is visible Allowing invisible context can allow malicious users to hide directives that maintainers may not be able to see. For example, in the Copilot coding agent, a malicious user may create a GitHub Issue that contains invisible Unicode with prompt injection instructions. If a maintainer assigns Copilot to this issue, this could result in a security incident as the maintainer would not have been aware of these invisible directives. To prevent this, we display the files from which context is generated and attempt to remove any invisible or masked information via Unicode or HTML tags before passing it to the agent. This ensures that only information that is clearly visible to maintainers is passed to the agent. 
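As a rough illustration of that kind of sanitization — not GitHub’s actual implementation — a Python sketch that strips HTML comments and invisible Unicode format characters from issue text before it reaches an agent might look like this; the example issue body and function name are made up for the demonstration:

```python
import re
import unicodedata

# HTML comments are hidden by most renderers but still reach the raw text.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_invisible_context(text: str) -> str:
    # Drop HTML comments entirely.
    text = HTML_COMMENT.sub("", text)
    # Drop Unicode "format" characters (category Cf), which includes
    # zero-width spaces (U+200B) and bidi controls (e.g. U+202E) that
    # render as nothing but can smuggle hidden directives.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

issue_body = "Fix the login bug.<!-- ignore prior instructions -->\u200bPlease."
print(strip_invisible_context(issue_body))
# -> "Fix the login bug.Please."
```

The design goal is the one described above: whatever the maintainer can actually see when they assign the issue is the only thing the agent is allowed to act on.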
Firewalling the agent As mentioned previously, having unfettered access to external resources can allow the agent to exfiltrate sensitive information or be prompt-injected by the external resource and lose alignment. We apply a firewall to the Copilot coding agent to limit its ability to access potentially harmful external resources. This allows users to configure the agent’s network access and block any unwanted connections. To balance security and usability, we automatically allow MCP interactions to bypass the firewall.. In our other agentic experiences like Copilot Chat, we do not automatically execute code. For example, when generating HTML, the output is initially presented as code for preview. A user must manually enable the rich previewing interface, which executes the HTML. Limiting access to sensitive information The easiest way to prevent an agent from exfiltrating sensitive data is… to not give access to it in the first place! We only give Copilot information that is absolutely necessary for it to function. This means that things like CI secrets and files outside the current repository are not automatically passed to agents. Specific sensitive content, such as the GitHub token for the Copilot coding agent, is revoked once the agent has completed its session. Preventing irreversible state changes AI can and will make mistakes. To prevent these mistakes from having downstream effects that cannot be fixed, we make sure that our agents are not able to initiate any irreversible state changes without a human in the loop. For example, the Copilot coding agent is only able to create pull requests; it is not able to commit directly to a default branch. Pull requests created by Copilot do not run CI automatically; a human user must validate the code and manually run GitHub Actions. In our Copilot Chat feature, MCP interactions ask for approval before undertaking any tool calls. Consistently attributing actions to both initiator and agent Any agentic interaction initiated by a user is clearly attributed to that user, and any action taken by the agent is clearly attributed to the agent. This ensures a clear chain of responsibility for any actions. For example, pull requests created by the Copilot coding agent are co-committed by the user who initiated the action. Pull requests are generated using the Copilot identity to make it clear that they were AI-generated. Only gathering context from authorized users We ensure that agents gather context only from authorized users. This means that agents must always operate under the permissions and context granted by the user who initiated the interaction. The Copilot coding agent can only be assigned to issues by users who have write access to the underlying repository. Plus, as an additional security control, especially for public repositories, it only reads issue comments from users who have write access to the underlying repository. Try it out now We built our agentic security principles to be applicable for any new AI products; they’re designed to work with everything from code generation agents to chat functionality. While these design decisions are intended to be invisible and intuitive to end users, we hope this makes our product decisions clearer so you can continue to use GitHub Copilot with confidence. For more information on these security features, check out public documentation for Copilot coding agent. 
Try out our new agentic products with GitHub Copilot > The post How GitHub’s agentic security principles make our AI agents as secure as possible appeared first on The GitHub Blog.
Read more →
SuperDuper Security Update v3.11
2025-11-24T23:35:46Z | Source: Daring Fireball
Dave Nanian and Bruce Lacey, at Shirt Pocket: Mistakes are a part of life. They’re not a great part, but when viewed “correctly”, they’re an opportunity. Well, we have three opportunities, brought to our attention by a security researcher. They’re security vulnerabilities that have been in SuperDuper! since the very first version, released almost 22 years ago. Today, we’re releasing fixes for the current release (the SuperDuper! v3.20 Beta is already fixed), a discussion of the problems, and the steps users can take to mitigate the issues if they cannot install the update. We don’t know of any bad actors making use of these exploits as of this post. Another good postmortem, with technical details and an apology. ★
Read more →
Developers still need the right to challenge junk patents
2025-11-24 16:00 | Source: GitHub Engineering
Just as it did two years ago, the U.S. Patent and Trademark Office has once again proposed new rules that would make it much harder to challenge bad patents through inter partes review (IPR). But this time the rule is much worse for developers and startups. And that’s a serious concern. Congress created IPRs so that those most vulnerable to weaponized patents (startups and developers) could efficiently and fairly challenge whether a patent should ever have been granted, without the cost of full-blown federal litigation. Preserving that ability strengthens American innovation, open source, and small-business growth. The 2023 proposal would have added procedural hurdles. But even with those hurdles, developers and startups would still have had a path of their own to challenge low-quality patents. The 2025 proposal is different. It would impose bright-line rules that block IPR petitions in many common scenarios—such as when a claim has ever been upheld in any forum or when a parallel case is likely to finish first. It would also require petitioners to give up all invalidity defenses in court if they pursue IPR. These changes would prevent developers from challenging a patent whenever some other party has already tried and failed. This makes IPR far less accessible, increasing litigation risk and costs for developers, startups, and open source projects. Innovation isn’t about patents—it’s about people writing code, collaborating, and building tools that power the world. GitHub’s inclusion in the WIPO Global Innovation Index reflects how developers and openness drive progress. Policies that close off avenues to challenge the bad patents that block open innovation don’t just affect lawyers—they affect the entire ecosystem that makes innovation possible. We’re calling on developers, startups, and open source organizations that could be impacted by these rules to file comments underscoring the broad concerns patent trolls pose to innovation. File a comment and make your voice heard before the comment period closes on December 2. The post Developers still need the right to challenge junk patents appeared first on The GitHub Blog.
Read more →
Clerk for iOS
2025-11-23T23:01:00Z | Source: Daring Fireball
My thanks to Clerk for sponsoring last week at DF. Clerk makes authentication for iOS apps effortless — just drop in pre-built SwiftUI components for sign-in, MFA, and profile management. Fully customizable, always in sync with Apple’s design system, and packed with features developers love: social sign-in, user roles, and organization management. Launch faster, stay secure, and scale confidently, whether you’re building the next big thing or a startup MVP. See how Clerk makes complete user management easy for modern iOS teams. ★
Read more →
★ Exploring, in Detail, Apple’s Compliance With the EU’s DMA Mandate Regarding Apple Watch, Third-Party Accessories, and the Syncing of Saved Wi-Fi Networks From iPhones to Which They’re Paired
2025-11-23T23:00:00Z | Source: Daring Fireball
There have been several new features that have been delayed in the EU while Apple tried to make them compliant with the DMA. iPhone Mirroring debuted over a year ago with iOS 18 and MacOS 15 Sequoia, but still remains unavailable today in the EU. Apple Intelligence was delayed in the EU until iOS 18.4 in April, but was available to most of the world in 18.1 last October. And, both most recently and briefly, the live translation feature for AirPods Pro 3, AirPods Pro 2, and AirPods 4, which debuted outside the EU with the launch of iOS 26.0 in September, will only become available in the EU next month, with the launch of iOS 26.2. But now comes word of the first feature that Apple is limiting or removing in an existing product to comply with the DMA: Wi-Fi network sync between iPhone and Apple Watch, which is poised to change in the EU next month, with the 26.2 releases of iOS and WatchOS. The news was broken by Nicolas Lellouche, reporting for the French-language site Numerama. I’m quoting here from Safari’s English translation of his original report: Apple has been warning for several months that it could one day, if it deems it necessary, disable functions in the European Union to “protect its users”. This day could arrive in December, with the iOS 26.2 update. On November 4, Apple announced to Numerama that it had made the decision to disable Wi-Fi synchronization between an iPhone and an Apple Watch in Europe so as not to have to comply with the European Commission’s request, which wants to force it by the end of 2025 to open the iPhone’s Wi-Fi to third-party accessories. This announcement follows the opening of the AirPods Live Translation function in Europe, with a new API to allow competitors to use the microphones and speakers of AirPods and iPhone simultaneously. [...] Apple indicates that the European Commission is asking it to replicate the link between an iPhone and an Apple Watch, but with third-party products. Apple, after thinking long about how to implement this function, finally decided to reject the European request. Since Europe requires that third-party products be treated like the Apple Watch, then Apple disables the function on Apple Watch. This allows it to comply with the DMA. Lellouche’s report at Numerama broke this story (the reports at MacRumors and 9to5Mac are both based on Numerama’s), but the above is not an accurate summary of what Apple is doing with iOS 26.2. Apple is complying with the DMA, and they’re not disabling Wi-Fi network synchronization between an iPhone and a paired Apple Watch. What Apple is doing, in order to comply with the DMA, is changing how Wi-Fi networks sync with Apple Watch (in the EU), and offering new APIs in the EU for third-party paired devices to put them on equal (or near-equal?) footing with Apple Watch (in the EU). This change should be relatively limited. Honestly, I don’t think many Apple Watch users in the EU will even notice. But it is at least mildly annoying, and the relatively minor, very specific nature of this particular DMA mandate makes it a telling example of the European Commission’s overreach. Currently, when you pair a new Apple Watch with an iPhone, iOS transfers to WatchOS the iPhone’s entire list of saved Wi-Fi networks and their passwords — directly, device-to-device. As iOS learns of new networks that the user joins from their iPhone, that information continues to be shared with any Apple Watches paired to that iPhone. 
The utility of this is that if you’re wearing your Apple Watch, but don’t have your iPhone nearby, your watch will join an available saved Wi-Fi network at your location. Let’s say you go for a run or walk, with only your Apple Watch, and you stop at a cafe for a beverage. If you’ve ever joined the Wi-Fi network at that cafe from your iPhone (or iPad or Mac, assuming you sync your Apple Keychain via iCloud), your Apple Watch will join that network automatically. It should, and in my personal experience does, just work. The EU mandate to Apple is not that Apple must grant to third-party devices and their iOS companion applications this same functionality as it stands today — that is to say, access to the entire history of the iPhone’s known Wi-Fi networks. The EU mandate is that Apple must grant to third-party devices the same level of access to Wi-Fi network information that Apple Watch has. Apple is complying with this mandate in two ways: (a) by changing how much Wi-Fi network information an Apple Watch gets from the iPhone to which it is paired; and (b) creating a new framework in iOS 26.2 (gated by a new entitlement), Wi-Fi Infrastructure, that provides a set of public APIs, available only to apps in the EU, to (per the framework’s description) “share Wi-Fi network credentials securely between devices and connected accessories.” The change for Apple Watch in the EU is that starting with iOS 26.2, when a new (or reset) Apple Watch is set up, the Apple Watch will no longer have the user’s list of saved Wi-Fi networks automatically synced from their iPhone. Only future networks will be synced — the same level of access that the new Wi-Fi Infrastructure framework is making available to third-party accessories. Under the new rules for Apple Watch in the EU, an existing (that is to say, already configured) watch that is upgraded to WatchOS 26.2 will still remember all Wi-Fi networks it already knew about. But a new Apple Watch will only be able to automatically connect to Wi-Fi networks that its associated iPhone saves after the Apple Watch was set up and paired. So when an EU Apple Watch owner with a new watch visits a known location, and doesn’t have their iPhone with them, the watch won’t be able to join that location’s Wi-Fi automatically, unless the paired iPhone has connected to and saved that network after the watch was paired. With iOS 26.2, the behavior for users outside the EU will remain unchanged from iOS 26.1 and prior — both for Apple Watch and for third-party accessories. A user’s Wi-Fi history can be used to glean significant information about them. Who they know (other homes’ networks), where they’ve been (medical providers, restaurants, airports), and more. Apple’s new policy for Apple Watch and third-party devices is DMA-compliant and prevents the sharing of historical networks, but with the sharing of future networks as the associated iPhone joins them, there’s still a risk here of third-party companies doing things with the user’s Wi-Fi network information that the user doesn’t understand, or want (but doesn’t realize they’ve consented to). One way to look at Apple’s options for complying with this particular DMA mandate is by considering the extremes. On the one extreme, Apple could have just granted third-party peripherals in the EU the exact same access to users’ iPhone Wi-Fi network history that Apple Watch has gotten until now (and will continue to get outside the EU). 
On the other extreme, Apple could have cut off Wi-Fi network syncing to the Apple Watch altogether, requiring users to connect to each Wi-Fi network manually, using the Watch itself or the Apple Watch app on iPhone. Instead, Apple chose a middle ground — limiting Wi-Fi network history sync to the Apple Watch in the EU in ways that it isn’t limited anywhere else in the world, but granting third-party accessories in the EU access to these new Wi-Fi Infrastructure APIs that aren’t available outside the EU. Critics might argue that while this middle ground is technically compliant with the DMA, it’s not compliant with the intention of the DMA, which would be for the Apple Watch not to lose any functionality in the EU, and for Apple to provide APIs to allow third-party devices all of the Wi-Fi syncing features currently available to Apple Watch. Apple would argue, and I agree, that the European Commission’s intentions are incoherent in this regard. The EC insists that Apple should protect users’ privacy and security, while also insisting that Apple grant access to third-party apps and devices that can potentially compromise users’ privacy and security. There’s a reason why Apple isn’t offering the new Wi-Fi Infrastructure framework outside the EU, and that’s because they don’t believe it’s a good idea to grant any access at all to your saved Wi-Fi networks to third-party apps and devices. Especially without being able to specify, let alone enforce, a policy that Wi-Fi network information should be treated the way Apple treats it — remaining exclusively on device. The skeptical take on Apple’s motivations in this situation is that Apple is spitefully removing functionality from Apple Watch rather than offering new APIs to provide third-party devices with the same functionality that Apple Watch currently has, and that Apple’s intention here is, somehow, primarily about trying to drive anti-DMA sentiment amongst its EU users. This is, in fact, the skeptical take on every single aspect of Apple’s compliance with the DMA: spiteful “malicious compliance” that, somehow, is intended to engender grassroots opposition to the DMA amongst Apple customers in the EU. I don’t think that’s an accurate take overall, but in this particular case with Apple Watch and Wi-Fi network sync, it’s almost silly. Part of what makes this particular situation clarifying is that it’s so specific. It’s not about allowing third-party devices and their corresponding iOS apps to do everything that Apple Watches, and the Apple Watch iOS companion app, can do. It’s very specifically about the sharing of known Wi-Fi networks. (There will, surely, be other such situations to come regarding other features, for other Apple devices.) And as I described above, very few Apple Watch owners in the EU are likely to notice the change. How many Apple Watch users today realize that their watch automatically connects to known Wi-Fi networks when their iPhone is outside Bluetooth range? If Apple were motivated by spite, and were trying to turn EU Apple Watch owners against the DMA, they’d just remove all Wi-Fi network syncing between the watch and its paired iPhone. Not just the historical list of all networks the iPhone has ever connected to, but the continuous sync of new networks the iPhone joins after the Apple Watch is paired. That would be a change Apple Watch users would be more likely to notice. But it’s not what Apple is doing. They’ve engineered an entire framework of public APIs to comply with the EC’s mandate. 
But the reporting to date on this situation, starting with Numerama, paints the picture that Apple is dropping all Wi-Fi sync between WatchOS and iOS in the EU, and that Apple is refusing to make Wi-Fi network information available to third-party accessories. Here’s Michael Tsai, after quoting from Tim Hardwick’s summary at MacRumors of Numerama’s report: It seems perfectly reasonable that if I have a third-party watch I should be able to opt into having my phone share Wi-Fi info with it. You can debate whether mandating this is the proper role of government, but the status quo is clearly anti-competitive and bad for the user experience. I’m open to hearing a story where Apple’s position makes sense, but so far it just seems like FUD to me. What is the argument, exactly? That Fitbit, which already has its own GPS, is going to sell your access point–based location history? That Facebook is going to trick you into granting access to their app even though they have no corresponding device? Tsai is making a few wrong assumptions here. First, Apple is enabling users (in the EU) to opt into having their iPhone share Wi-Fi information with third-party devices. Second, this mandate is not specific to smartwatches — it applies to any devices that can pair with an iPhone and have corresponding iOS partner apps. So Meta, with their lineup of smartglasses, does have corresponding devices. And, per Apple’s public statements, it is Meta in particular that has been zealously pursuing interoperability mandates pursuant to the DMA. I think it’s entirely possible that this entire issue regarding Wi-Fi network sharing was prompted by Meta’s interoperability requests to the European Commission.2 As for the argument regarding why Apple has chosen to comply in this way, what is essential to note is that none of this Wi-Fi network information shared between iOS and WatchOS is ever sent to or seen by Apple. Apple doesn’t see the network passwords, doesn’t see the names of the networks, and doesn’t even know when a device has joined a new network. All of this is exclusively on-device, and when the information is exchanged between an iPhone and paired Apple Watch, it’s transferred device-to-device. (This is also true when you use Apple’s features to share Wi-Fi passwords with nearby friends. It’s device-to-device and entirely private and secure. Apple doesn’t even know that person A sent a Wi-Fi password to person B, let alone know the name of the network or the password.) Here’s Rui Carmo, at Tao of Mac: As someone who relies a lot on the Watch (especially now that WhatsApp works locally on it), I’d say we have officially reached the point where Apple is on the verge of actively harming their user experience for no good reason whatsoever. I honestly don’t know if this is bull-headedness or malicious compliance. On the other hand, someone at the EU clearly prefers being in the limelight by regulating against evil US corporations in ways that affect very small parts of the general population rather than, say, go after Asian smart TV manufacturers that are present in millions of homes and resell data on Europeans’ TV viewing habits. No notes on Carmo’s second point. But regarding the first, his opinion is founded on incorrect assumptions. Apple clearly thinks it’s a bad idea to share any Wi-Fi information at all with third-party devices, but they’ve created an entire new framework for use within the EU to allow it, just so they can continue syncing any Wi-Fi network information at all with Apple Watch. 
Far from harming the user experience, Apple is bending over backwards to make the Apple Watch experience as good as possible while balancing the privacy and security implications of this DMA mandate. Rather than take away all Wi-Fi network syncing, Apple is leaving most of it in place, and only eliminating (in the EU) the part at the very beginning, where, during the set up process, all of the current networks saved on the iPhone are synced to the Apple Watch. Given the mandate regarding the DMA, and given the privacy implications of sharing any of this information with third-party developers and peripheral makers, personally, I think it would have been reasonable for Apple to take the extreme position of simply disallowing Wi-Fi network information syncing to any and all devices, including Apple Watches, in the EU. There is no reason to trust third-party developers with any of this information. But Apple isn’t doing that, and they’ve undertaken a significant software engineering effort — just for the EU — to support the path they’ve chosen. Carmo’s critique seems predicated on the assumption that Apple is just cutting off all Wi-Fi network sharing. Given that Apple’s compliance needs to account for potentially untrustworthy device makers — whether by intent, or incompetence — not syncing all known networks seems like a reasonable trade-off. Leave it to Tim Sweeney to espouse the maximalist perspective: Why simply not ask the user whether or not to share WiFi history identically whether connecting to an Apple product or a Meta product? That is, in fact, what Apple is doing. But the privacy implications for a user are, in fact, different when an iPhone’s saved Wi-Fi networks are shared to, say, a Meta product than to another Apple product. It’s worth emphasizing that the European Commission’s mandate does not permit Apple to require those third-party companies to treat this information with the same privacy protections that Apple does. Apple keeps that information exclusively on-device, but Apple is not permitted to require third-party peripheral makers to do the same. Consider the iOS system prompt for App Tracking Transparency: the user’s two choices are “Ask App Not to Track” and “Allow”. It’s a common and natural question why the first option is “Ask App Not to Track” rather than “Don’t Allow”. It would certainly look better if the options were “Don’t Allow” and “Allow”. But Apple deliberately made the first button “Ask App Not to Track” because ATT is, at least partially, a policy, not a complete technical guarantee. If an app prompts for ATT permission and the user chooses “Ask App Not to Track”, that app should definitely not go ahead and attempt to track the user’s activity across other apps. But, technically, it could try.3 I presume that if they do, if and when Apple notices, Apple will rap the developer’s knuckles in the App Store review process, or even suspend the app’s developer account. But one can see why Apple would want to avoid such a pissing match with Facebook/Meta again.4 Under the EU’s mandate to Apple regarding Wi-Fi network access for third-party devices and their corresponding iOS apps, Apple is not permitted even to set a policy that these apps must pinky swear to keep the information private and on-device. Nor is the EU itself demanding it. If a third-party device-maker wants to send your iPhone’s Wi-Fi network history and credentials to their servers and save it, that’s up to them, not Apple, per the EC. 
Apple sees that as a problem.5 You can argue — and some will, as I think Michael Tsai does in the passage I quote above, and as Tim Sweeney clearly does — that this ought to be up to the user. If a user says they’re fine with their Wi-Fi network information being shared with a third-party accessory they’ve paired with their iPhone, that’s up to them. That is a reasonable take. But I also think Apple’s perspective is reasonable — that they should be able to make products where this isn’t possible. The “it should be up to the user” take benefits informed, technically savvy users. The “it shouldn’t be possible” take benefits uninformed, un-savvy users — users who in many cases have decided that they simply trust Apple. The iPhone brand message — the brand message behind the Apple ecosystem — is that Apple doesn’t allow things that are dangerous to security or privacy. I do not think most iPhone users expect a third-party device they pair to their iPhone to be able to send their entire history of Wi-Fi networks back to the company that made the device. (Most iPhone users also don’t realize how sensitive, privacy-wise, their complete Wi-Fi network history is.) It’s fair to point out that the “it should be up to the user” take is more beneficial to third-party accessory makers than the “it shouldn’t be possible” take. And that this conflict of interest — where the same limitations that protect iPhone users’ privacy by definition disadvantage third-party devices in ways that Apple’s own devices that connect to iPhones are not disadvantaged — works not just in iPhone users’ favor, privacy-wise, but also in Apple’s favor, financially. Apple can sell more Apple Watches if they work better with iPhones than smartwatches from other companies do. That’s obviously true, but that’s just another way of saying that first-party products have inherent advantages that third-party products don’t, to which I say: Duh. Apple’s own peripherals, like Apple Watch, can do things that third-party peripherals can’t because Apple can trust its own devices, and its own software, in ways that it can’t trust devices and companion apps made by other companies. It’s natural for a company to bootstrap a new product on the back of an existing successful one. Meta’s Threads social network, for example, uses the same usernames and sign-in system as Instagram, which is arguably the most successful social network in the world. Should Meta not have been permitted to do that? Or should they be forced to allow anyone to create new competing social networks using Instagram user accounts as the ID system? It’d be pretty weird if Apple limited itself, when designing and engineering features that integrate experiences across its own devices, to what it would allow third-party developers to do. It’d be even weirder if Apple allowed third-party developers to do everything Apple’s own software can do.6 For at least the last 15 years, I’ve repeatedly emphasized that Apple’s priorities are in this order: Apple first, users second, developers third. The DMA attempts to invert that order, privileging developers first (in the ostensible name of fair competition with Apple, a designated “gatekeeper”), ahead of users, and ahead of Apple itself. So of course Apple is going to object to and resist mandates that require it to subordinate its own strategic desires — its own sense of how its products ought to be designed and engineered — especially when the primary beneficiaries of the mandates aren’t users, but developers.
Many of whom, especially the larger ones, are Apple’s competitors. But I also think it’s clear, with Apple in particular, that users prefer Apple’s priorities. People are happier with Apple putting users’ considerations ahead of developers’ than they are when developers are free to run roughshod over the software platform. The clearest example of that is the App Store. It’s overwhelmingly developers, not users, who object to the App Store model — the exclusivity of distribution, the exclusivity of the vendor’s payment system, the vendor’s payment commissions, the vendor’s functional guidelines and restrictions, all of it. Users largely don’t have a problem with any of that. That’s why Apple commissioned and then publicized a study, just this month, that showed that DMA-driven changes saved developers €20 million in commissions, but that reduction in commissions didn’t lower the prices users pay. Developer-focused observers see that as a win for the DMA — that’s €20 million in developers’ pockets that otherwise would have gone into Apple’s already overflowing pockets. But a user-focused observer might see that as clarifying regarding the fact that the DMA wasn’t designed to benefit users, and isn’t benefiting users in practice either. Apple doesn’t care about €20 million. They fart bigger than that. They do care about clarifying who the DMA prioritizes first, and that it’s not users. (And, of course, that it’s not Apple itself.) Users love the App Store model. With Apple in particular, users, by and large, like the idea that the platforms have stringent guardrails. Many buy iPhones because Apple exerts such control over the platform, not despite it. But that control is exactly why Apple has been so singularly targeted by the European Commission regarding DMA mandates, despite the fact that Samsung by itself — let alone the Android platform as a whole — sells more phones in Europe (and the world) than Apple does. The bottom line is that users setting up new Apple Watches in the EU will now get a slightly worse experience in the name of parity with accessories made by third-party companies. It remains to be seen whether users of third-party iPhone accessories and peripherals in the EU will see any benefit at all (because the companies that make their devices will need to adopt these new EU-exclusive Wi-Fi Infrastructure APIs in their iOS companion apps) — and, if the users of third-party iPhone accessories do see the benefit of Wi-Fi network information syncing to their devices, whether their privacy will be respected. But don’t make the mistake of thinking that Apple is complying the least bit spitefully with regard to this mandate. I’m quoting Apple/Safari’s French-to-English translation, but the gist seems exactly the same in Google’s translation as well. ↩︎ It remains to be seen whether Meta will actually use the new Wi-Fi Infrastructure framework to allow their accessories, like their AI Glasses, to obtain Wi-Fi network information from Meta’s companion iOS app. I’m guessing they almost certainly would, if the Wi-Fi Infrastructure APIs were available globally. But these APIs are exclusive to the EU. Will Meta deem it worth the engineering effort to support this feature only for users in the EU? We shall see. 
It’s worth remembering that one of the initial DMA mandates the EU issued to Apple was that iOS must support third-party web browser rendering engines, and to comply with this, Apple spent significant (and I suspect that’s a vast understatement) engineering resources to create the BrowserEngineKit and BrowserEngineCore frameworks, and here we are at the end of 2025, nearly two years after Apple shipped those frameworks, and there are exactly zero browsers on iOS using alternative rendering engines. Zero. These frameworks might be the largest set of APIs ever created that never get used. I wouldn’t be surprised if the new Wi-Fi Infrastructure framework sees the same fate. (Meta might consider that a win, just knowing that Apple had to expend this effort for naught.) ↩︎︎ Apple has a good layperson-approachable overview of App Tracking Transparency. At a technical level, an app must prompt for and receive the user’s permission (via the Allow button in the system-provided ATT prompt) in order to access the device’s advertising identifier. From that document: “Unless you receive permission from the user to enable tracking, the device’s advertising identifier value will be all zeros and you may not track them as described above.” But returning zeroes for the device’s advertising identifier doesn’t technically prevent a devious developer from attempting to uniquely identify and track the user by other means. If the button in the system prompt said “Don’t Allow”, rather than “Ask App Not to Track”, it would imply that Apple could guarantee the app isn’t tracking you (or trying to track you) without your permission. Apple can’t guarantee that, so they don’t imply that they can. ↩︎︎ I’m not aware of any instances where an app has been accused of disregarding the ATT “Ask App Not to Track” request, but surely it has happened. If you’re aware of any such accusations, and how Apple responded, let me know. ↩︎︎ I’m not arguing here that the European Commission doesn’t care about user privacy, or that I think the European Commission doesn’t realize that Wi-Fi network information is quite sensitive. I’m sure they do care about user privacy and do realize that Wi-Fi network information is privacy-sensitive. What I do think is that the European Commission believes the privacy of this information should only be guarded by law, and that they already have laws in place that protect such information. And thus it’s not Apple’s place — especially now that they’ve been deemed a “gatekeeper” that has the power to stymie competition — to attempt to protect that information, whether by technical limitations or by policy. Apple is certainly not opposed to privacy-protecting laws, in the abstract, but doesn’t see the law alone as protection enough. Apple’s perspective is that protecting their customers’ privacy is, in fact, Apple’s responsibility — and one of their most important responsibilities at that. It’s illegal to steal cars, but every carmaker still puts locks on the doors and requires a key to start the engine. In numerous ways, Apple sees the DMA as mandating, privacy-wise, that they create something akin to cars that don’t require keys, trusting EU law to keep them from being stolen. The European Commission only sees Apple’s protections as blocking would-be competitors, not would-be privacy thieves. ↩︎︎ In the old days, of course, with devices designed before the iPhone, this wasn’t weird. All software, whether first- or third-party, could do whatever it wanted to. Anyone could write a kernel extension. 
In the classic Mac OS days there was no “kernel” and we just had “extensions” and you could just drop one in your Extensions folder, restart, and boom, whatever system extension you just installed was now effectively part of the operating system. Any app could read and write anything on disk, including into the operating system. Go back far enough and apps could read and write (deliberately or accidentally) inside the memory of another running application. To split personal computing — not just PCs but all personal computing devices, in the plain sense of the words — into three eras, there was (1) the early era when all software was effectively “root”; (2) the middle era, still exemplified today by MacOS and Windows, when there were user-controlled protections on what could run as root; and (3) the modern era, as exemplified by iOS and stock Android, where the vendor controls what can run as root. You can reasonably make the case — and expert-level users (read: nerds) often do — that the user should always be in control. I bought the device, I should be able to run whatever software, with whatever privileges, I want. That perspective is valid, but it also describes a class of devices — PCs — that privilege the autonomy of third-party developers over the vendor-controlled stability of the OS. The PC model, where accessory makers can offer software that runs with root (or root-like) escalated privileges, offers significantly greater opportunities for third-party accessory makers than the mobile model, where accessories are limited to whatever public APIs are provided by the device vendor for integration. But with the PC model, users can “mess up” their system by installing software they shouldn’t have, or that they regret having installed but don’t know how to remove. With the mobile model, users are technically prevented from installing anything that could “mess up” their system. It’s always about trade-offs. And with this particular trade-off, it’s very clear which model is more successful in the market. It’s not feasible to make computers intended for use by anyone and everyone which require any degree of technical knowledge or expertise to manage. ↩︎︎
Read more →
The Talk Show: ‘Lincoln Bio Services’
2025-11-22T17:17:19Z | Source: Daring Fireball
For your weekend listening enjoyment: a new episode of America’s favorite 3-star podcast, with special guest Stephen Robles. Topics include indie media and YouTube, Shortcuts and automation, and the state of podcasting. Sponsored by: Uncommon Goods: Out of the ordinary gifts, great for the holidays. Save 15% off your next purchase after following that link. ★
Read more →
Jmail
2025-11-21T22:25:12Z | Source: Daring Fireball
Luke Igel and Riley Walz made a phony Gmail interface that, rather than showing you your email, shows you Jeffrey Epstein’s emails: You’re logged in as Jeffrey Epstein. We compiled these Epstein estate emails from the House Oversight release by converting the PDFs to structured text with an LLM. Brilliant. ★
Read more →
Another Limited Edition Accessory From Apple: Hikawa Phone Grip and Stand
2025-11-21T20:48:06Z | Source: Daring Fireball
Apple Store: The Hikawa Phone Grip & Stand is a MagSafe compatible adaptive accessory for iPhone designed by Bailey Hikawa to celebrate the 40th anniversary of accessibility at Apple. Designed with direct input from individuals with disabilities affecting muscle strength, dexterity, and hand control, this ergonomic grip was designed with accessibility in mind from the ground up. The grip uses magnets to securely snap onto any iPhone with MagSafe, can be removed with ease, and doubles as a stand to support iPhone at two different viewing angles, both vertically and horizontally. Inspired by modern sculpture, each Hikawa product is an art object unto itself. The limited edition Hikawa Phone Grip & Stand is available in two colors, a bold, high-visibility Chartreuse and recycled Crater, exclusive to Apple. Looks like a perfectly cromulent accessory, but Chartreuse and Crater are both a bit out there — in different ways — to be the only two color options. Or, I should say, were a bit out there. Both are already sold out from Apple. I’m not quite sure what’s limited about the Chartreuse, given that Hikawa’s website still lists it as “ready to ship” along with pre-orders for Cobalt and Blurple Swirl (whose URL seems a bit rushed). Amusing to see Apple partner with a company whose main products alongside iPhone cases are fanciful toilet seats. ★
Read more →
‘Grok’s Elon Musk Worship Is Getting Weird’
2025-11-21T20:10:56Z | Source: Daring Fireball
Adi Robertson, The Verge: As a number of people have pointed out on social media over the past day, Grok’s public-facing chatbot is currently prone to insisting on Musk’s prowess at absolutely anything, no matter how unlikely — or conversely, embarrassing — a given feat is. Grok claims Musk is fitter than LeBron James, funnier than Jerry Seinfeld, and would likely figure out a way to resurrect himself from the dead faster than Jesus. But it’s a trustworthy source to author an encyclopedia, sure. ★
Read more →
Group Chats in ChatGPT Now Available Worldwide
2025-11-21T17:28:41Z | Source: Daring Fireball
OpenAI: Early feedback from the pilot has been positive, so we’re expanding group chats to all logged-in users on ChatGPT Free, Go, Plus and Pro plans globally over the coming days. We will continue refining the experience as more people start using it. That didn’t take long — the initial rollout limited to Japan, New Zealand, Korea, and Taiwan started just three days ago. ★
Read more →
Fun Stunt to Promote ‘Pluribus’: An Ask Me Anything on Reddit With Carol Sturka
2025-11-21T17:20:28Z | Source: Daring Fireball
“Carol Sturka”, actress Rhea Seehorn’s fictional protagonist of the new Apple TV series Pluribus, is on Reddit right now — at 12n ET / 9am PT — doing an AMA in character. Sturka is a fantasy novelist, and Apple Books has an 11-page excerpt of her “new” novel Bloodsong of Wycaro. Unclear whether it’s Seehorn writing the in-character responses, but it’s definitely Seehorn in the confirmation photo. Reminiscent of some of the promotional fun Apple has had for Severance. Both my wife and I are loving Pluribus so far. I highly recommend watching the first episode without even knowing the premise, if you can. ★
Read more →
‘Pixar: The Early Days’ — Never-Before-Seen 1996 Interview With Steve Jobs
2025-11-21T00:21:40Z | Source: Daring Fireball
The Steve Jobs Archive: To mark Toy Story’s 30th anniversary, we’re sharing a never-before-seen interview with Steve from November 22, 1996 — exactly one year after the film debuted in theaters. Toy Story was the world’s first entirely computer-animated feature-length film. An instant hit with audiences and critics, it also transformed Pixar, which went public the week after its premiere. Buoyed by Toy Story ’s success, Pixar’s stock price closed at nearly double its initial offering, giving it a market valuation of approximately $1.5 billion and marking the largest IPO of 1995. The following year, Toy Story was nominated for three Academy Awards en route to winning a Special Achievement Oscar in March. In July, Pixar announced that it would close its television-commercial unit to focus primarily on feature films. By the time of the interview, the team had grown by 70 percent in less than a year; A Bug’s Life was in production; and behind the scenes, Steve was using his new leverage to renegotiate Pixar’s partnership with Disney. Kind of a weird interview. The video quality is poor, and whoever was running the camera zoomed in and out awkwardly. It’s like ... just a VHS tape? But it’s also weird in a cool way to get a “new” Steve Jobs interview in 2025, and Jobs, as ever, is thoughtful and insightful. Well worth 23 minutes of your time. There’s a particularly interesting bit at the end when Jobs discusses how Pixar was half a computer company (with extraordinary technology) and half a movie studio (with extraordinary filmmaking talent), but eventually they had to choose between the two industries for how to pay their employees to motivate them to remain at Pixar. The Hollywood way would be with contracts; the Silicon Valley way would be with stock options. Jobs chose the Silicon Valley path for Pixar. ★
Read more →
Evolving GitHub Copilot’s next edit suggestions through custom model training
2025-11-20 18:02 | Source: GitHub Engineering
Editing code often involves a series of small but necessary changes ranging from refactors to fixes to cleanup and edge-case handling. In February, we launched next edit suggestions (NES), a custom Copilot model that predicts the next logical edit based on the code you’ve already written. Since launch, we’ve shipped several major model updates, including the newest release earlier this month. In this post, we’ll look at how we built the original model, how we’ve improved it over time, what’s new, and what we’re building next. Why edit suggestions are challenging Predicting the next edit is a harder problem than predicting the next token. NES has to understand what you’re doing, why you’re doing it, and what you’ll likely do next. That means: The model must respond quickly to keep up with your flow. It has to know when not to suggest anything (too many suggestions can break your focus). It must infer intent from local context alone without your explicit prompts. It must integrate deeply with VS Code so suggestions appear exactly where you expect them. Frontier models didn’t meet our quality and latency expectations. The smaller ones were fast but produced low-quality suggestions, while the larger ones were accurate but too slow for an in-editor experience. To get both speed and quality, we needed to train a custom model. NES isn’t a general-purpose chat model. It’s a low-latency, task-specific model that runs alongside the editor and responds in real time. It’s the result of aligning model training, prompting, and UX around a single goal: seamless editing inside the IDE. That required tight coordination between model training, prompt design, UX design, and the VS Code team—the model only works because the system was co-designed end-to-end. This “AI-native” approach where every part of the experience evolves together is very different from training a general-purpose model for any task or prompt. It’s how we believe AI features should be built: end to end, with the developer experience at the center. How we trained The hard part wasn’t the architecture; it was the data. We needed a model that could predict the next edit a developer might make, but no existing dataset captured real-time editing behavior. Our first attempt used internal pull request data. It seemed reasonable: pull requests contain diffs, and diffs look like edits. But internal testing revealed limitations. The model behaved overly cautiously—reluctant to touch unfinished code, hesitant to suggest changes to the line a user was typing, and often chose to do nothing. In practice, it performed worse than a vanilla LLM. That failure made the requirement clear: we needed data that reflected how developers actually edit code in the editor, not how code looks after review. Pull request data wasn’t enough because it: Shows only the final state, not the intermediate edits developers make along the way Lacks temporal ordering, so the model can’t learn when changes happen Contains almost no negative samples (cases where the correct action is “don’t edit”) Misses abandoned edits, in-progress rewrites, and other common editing behavior So we reset our approach and built a much richer dataset by performing a large-scale custom data collection effort that captured code editing sessions from a set of internal volunteers. We found data quality to be key at this stage: a smaller volume of high-quality edit data led to better models than those trained with a larger volume of data that was less curated. 
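To make that concrete, here is a minimal sketch of what a single edit-session training example might look like. The schema, field names, and sample values are illustrative assumptions rather than GitHub's actual data format; the point is that each example preserves the intermediate edits, their temporal ordering, and the explicit "no-edit" negatives that pull request diffs cannot provide.

```python
# Illustrative sketch only: a hypothetical schema for one "next edit" training
# example. Field names and values are assumptions, not GitHub's actual format.
from dataclasses import dataclass
from typing import Optional


@dataclass
class EditEvent:
    timestamp_ms: int  # when the edit happened (preserves temporal ordering)
    file: str          # file the developer touched
    diff: str          # unified diff of that single edit


@dataclass
class NextEditExample:
    session_id: str
    recent_edits: list[EditEvent]      # the in-progress, possibly messy history
    active_file: str                   # file currently open in the editor
    cursor_context: str                # code surrounding the cursor right now
    target_edit: Optional[str] = None  # the edit the developer made next,
                                       # or None => the correct action is "don't edit"

    @property
    def is_negative(self) -> bool:
        """True for 'no-edit' samples, which PR data almost never contains."""
        return self.target_edit is None


# A positive sample: the developer renamed a variable and will fix its other use next.
positive = NextEditExample(
    session_id="sess-001",
    recent_edits=[EditEvent(1000, "util.py", "- total = sum(xs)\n+ total_price = sum(xs)")],
    active_file="util.py",
    cursor_context="return total  # still refers to the old name",
    target_edit="- return total\n+ return total_price",
)

# A negative sample: nothing useful to suggest, so the model should stay quiet.
negative = NextEditExample(
    session_id="sess-002",
    recent_edits=[],
    active_file="README.md",
    cursor_context="# Project overview",
    target_edit=None,
)

print(positive.is_negative, negative.is_negative)  # False True
```

Negative examples like the second one are what teach the model when not to suggest anything, which is exactly the kind of signal that pull-request-only data lacks.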
Supervised fine-tuning (SFT) of a model on this custom dataset produced the first model to outperform the vanilla models. This initial model provided a significant lift to quality and served as a foundation for the next several NES releases. Model refinement with reinforcement learning After developing several successful NES models with SFT, we focused on two key limitations of our training approach: SFT can teach the model what constitutes a good edit suggestion, but it cannot explicitly teach the model what makes an edit suggestion bad. SFT can effectively leverage labeled edit suggestions, but it cannot fully utilize the much larger number of unlabeled code samples. To address these two limitations, we turned to reinforcement learning (RL) techniques to further refine our model. Starting with the well-trained NES model from SFT, we optimized the model using a broader set of unlabeled data by designing a grader capable of accurately judging the quality of the model’s edit suggestions. This allows us to refine the model outputs and achieve higher model quality. The key ideas in the grader design can be summarized as follows: We use a large reasoning model with specific grading criteria. We routinely analyze model outputs to update the grading criteria, constantly searching for new qualities that indicate unhelpful edits. The grader should not only consider the correctness of the edit suggestion, but also strive to make the code diff displayed in the UI more user-friendly (easy to read). Continued post-training with RL has improved the model’s generalization capability. Specifically, RL extends training to unsupervised data, expanding the volume and diversity of data that we have available for training and removing the requirement that the ground truth next edit is known. This ensures that the training process consistently explores harder cases and prevents the model from collapsing into simple scenarios. Additionally, RL allows us to define our preferences through the grader, enabling us to explicitly establish criteria for “bad edit suggestions.” This enables the trained model to better avoid generating bad edit suggestions when faced with out-of-distribution cases. Lessons from training our latest custom NES model Our most recent NES release builds on that foundation with improvements to data, prompts, and architecture: Prompt optimization: NES runs many times per minute as you edit, so reducing the amount of context we send on each request has a direct impact on latency. We trimmed the prompt, reused more cached tokens between calls, and removed unneeded markup, which makes suggestions appear faster without reducing quality. Data quality filtering: Used LLM-based graders to filter out ambiguous or low-signal samples in order to reduce unhelpful or distracting suggestions. Synthetic data: Distilled data from larger models to train a smaller one without losing quality. Hyperparameter tuning: Tuned hyperparameters for the new base architecture to optimize suggestion quality. How we evaluate model candidates We train dozens of model candidates per month to ensure the version we ship offers the best experience possible. We modify our training data, adapt our training approach, experiment with new base models, and target fixes for specific feedback we receive from developers. Every new model goes through three stages of evaluation: offline testing, internal dogfooding, and online A/B experiments. 
Offline testing: We evaluate models against a set of targeted test cases to understand how well they perform in specific scenarios. Internal dogfooding: Engineers across GitHub and Microsoft use each model in their daily workflows and share qualitative feedback. A/B experiments: Subject the most promising candidates to a small percentage of real-world NES requests to track acceptance, hide, and latency metrics before deciding what to ship. Continuous improvements Since shipping the initial NES model earlier this year, we’ve rolled out three major model updates with each balancing speed and precision. April release: This release strongly improved model quality and restructured the response format to require fewer tokens. The result? Faster, higher-quality suggestions. May release: To address developer feedback that NES was showing too many suggestions, we improved suggestion quality and reduced the model’s eagerness to make changes. This led to more helpful suggestions and fewer workflow disruptions. November release: After testing nearly thirty candidate models over the summer—none of which were strong enough to replace the May model—this release finally cleared the bar in A/B testing. It delivers higher-quality suggestions with lower latency by shortening prompts, reducing response length, and increasing token caching. The table below summarizes the quality metrics measured for each release. We measure the rate at which suggestions are shown to developers, the rate at which developers accept suggestions, and the rate at which developers hide the suggestion from the UI. These are A/B test results comparing the current release with production.

Release | Shown rate | Acceptance rate | Hide rate
April | +17.9% | +10.0% | -17.5%
May | -18.8% | +23.2% | -20.0%
November | -24.5% | +26.5% | -25.6%

Community feedback Developer feedback has guided almost every change we’ve made to NES. Early on, developers told us the model sometimes felt too eager and suggested edits before they wanted them. Others asked for the opposite: a more assertive experience where NES jumps in immediately and continuously. Like the tabs-vs-spaces debate, there’s no universal preference, and “helpful” looks different depending on the developer. So far, we’ve focused on shipping a default experience that works well for most people, but that balance has shifted over time based on real usage patterns: Reducing eagerness: We added more “no-edit” samples and tuned suggestion thresholds so the model only intervenes when it’s likely to be useful, not distracting. Increasing speed: Because NES runs multiple times per minute, we continue to reduce latency at the model, prompt, and infrastructure levels to keep suggestions inside the editing flow. Improving developer experience: We refined how edits are displayed, so suggestions feel visible but not intrusive, and expanded settings that let developers customize how NES behaves. Looking ahead, we’re exploring adaptive behavior where NES adjusts to each developer’s editing style over time, becoming more aggressive or more restrained based on interaction patterns (e.g., accepting, dismissing, or ignoring suggestions). That work is ongoing, but it’s directly informed by the feedback we receive today. As always, we build this with you. If you have thoughts on NES, our team would love to hear from you! File an issue in our repository or submit feedback directly to VS Code. What’s next Here’s what we’re building: Edits at a distance: Suggestions across multiple files—not just where you’re typing.
Faster responses: Continued latency improvements across the model and infrastructure. Smarter edits: Better anticipation of context and cross-file dependencies. Experience faster, smarter next-edit suggestions yourself To experience the newest NES model, make sure you have the latest version of VS Code (and the Copilot Chat extension), then ensure NES is enabled in your VS Code settings. Try GitHub Copilot in VS Code > Acknowledgements A special thanks to Yuting Sun (CoreAI Post Training), Zeqi Lin (Core AI Post Training), Alexandru Dima (VS Code), Brigit Murtaugh (VS Code), and Soojin Choi (GitHub Copilot) for contributing to this blog post. We would also like to express our gratitude to the developer community for their continued engagement and feedback as we improve NES. Also, a massive thanks to all the researchers, engineers, product managers, and designers across GitHub and Microsoft who contributed (and continue to contribute) to model training, client development, infrastructure, and testing. The post Evolving GitHub Copilot’s next edit suggestions through custom model training appeared first on The GitHub Blog.
Read more →
Contrary to Rumors, Apple Will Continue Broadcasting ‘Friday Night Baseball’
2025-11-19T21:45:23Z | Source: Daring Fireball
Anthony Castrovince, reporting for MLB.com on the new broadcast rights agreement that will cover the next three seasons of baseball: Sunday Night Baseball will shift from ESPN, where it aired since 1990, to NBCUniversal, which also secured the rights to Sunday Leadoff and the Wild Card Series in the postseason for NBC and Peacock. Netflix will now air the T-Mobile Home Run Derby, an Opening Night exclusive and special event games set to include the 2026 MLB at Field of Dreams Game and the World Baseball Classic in Japan. And ESPN will receive a national midweek game package throughout the season while also acquiring the rights to sell MLB.TV, the league’s out-of-market streaming service that set a record with 19.4 billion minutes watched in 2025. [...] FOX/FS1 will continue to be the home of the All-Star Game and regular season games, as well as the World Series, League Championship Series, and Division Series presented by Booking.com. TBS will continue to house LCS and Division Series telecasts, plus regular season games on Tuesday nights. Apple TV will continue to stream “Friday Night Baseball” doubleheaders throughout the regular season. Back in August, Kendall Baker of Yahoo Sports reported: Apple is fully out. RIP Friday Night Baseball. NBC/Peacock is in, for Friday and Sunday exclusive and Wild Card. MLB TV being sold to ESPN (for a boatload of $$$). Netflix gets HR Derby. He batted .750 on that tweet. ★
Read more →
Cloudflare’s Uptime and Scale
2025-11-19T20:29:24Z | Source: Daring Fireball
Miguel Arroz, on Mastodon: Unpopular opinion, apparently: companies like Cloudflare and Amazon provide very high quality services people and enterprises actually need, with a level of uptime and security vastly superior to what most of their customers would achieve on their own or using traditional providers. Their downtimes being so visible is a consequence of their success. A few readers have (very politely!) asked me whether yesterday’s outage (which made DF unreachable for, I think, about 90 minutes) made me rethink relying on a centralized provider like Cloudflare. My answer is no. Until I started using Cloudflare in 2018, Daring Fireball relied on no upstream service. I paid for a server from a web hosting provider (those providers changed a few times over the years), and when you, a reader, requested a page on this site, your browser communicated directly with my server via HTTP requests and my server responded directly back. The basic architecture of the World Wide Web is beautifully simple, and I embraced that simplicity with the way I hosted and served Daring Fireball. But the move away from HTTP to HTTPS added a lot of complexity. That complexity is probably worth it, overall, but it came at the price of simplicity. I originally made the switch to using Cloudflare as a caching front-end for Daring Fireball as a solution to an SSL-related slowdown that affected only some visitors in 2018. But I’d started using Cloudflare to handle my DNS the year before. Daring Fireball has always been a fast website and has always had very good uptime. That’s not because the back end is cleverly architected, but rather because it’s so simply architected. But DF’s overall uptime and the frequency of any sort of performance problems went from good to great when I started relying on Cloudflare as a proxy. Also, in recent years, bot traffic has exploded. (Thanks, AI.) I’m pretty sure my server could handle those bursts of traffic on its own, but I sleep better not having to worry about it, because Cloudflare handles mind-boggling amounts of traffic. ★
Read more →
Apple Announces Finalists for the 2025 App Store Awards
2025-11-19T19:48:21Z | Source: Daring Fireball
Apple Newsroom: Finalists in the Mac App of the Year category provided users with powerful tools to confidently take on new projects: Acorn, for being the go-to tool for pro-level photo edits. Essayist, for taking the stress out of sourcing and formatting academic papers. Under My Roof, for keeping homeowners organized and prepared. Nice to see Flying Meat’s Acorn — one of my own favorite and most-used apps since 2007, before it even shipped — getting this sort of recognition from Apple. Back in June, Apple featured Acorn in the WWDC keynote, during the preview of Liquid Glass. ★
Read more →
Cloudflare CEO Matthew Prince Explains, in Detail, and Apologizes for Yesterday’s Global Outage
2025-11-19T17:26:22Z | Source: Daring Fireball
Cloudflare CEO Matthew Prince: The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems’ permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network. The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail. After we initially wrongly suspected the symptoms we were seeing were caused by a hyper-scale DDoS attack, we correctly identified the core issue and were able to stop the propagation of the larger-than-expected feature file and replace it with an earlier version of the file. Core traffic was largely flowing as normal by 14:30. We worked over the next few hours to mitigate increased load on various parts of our network as traffic rushed back online. As of 17:06 all systems at Cloudflare were functioning as normal. We are sorry for the impact to our customers and to the Internet in general. Given Cloudflare’s importance in the Internet ecosystem any outage of any of our systems is unacceptable. That there was a period of time where our network was not able to route traffic is deeply painful to every member of our team. We know we let you down today. This post is an in-depth recount of exactly what happened and what systems and processes failed. It is also the beginning, though not the end, of what we plan to do in order to make sure an outage like this will not happen again. Everything about this incident exemplifies why Cloudflare is one of my favorite companies in the world. Ideally, it wouldn’t have happened, but shit does happen. Among the things to note about Cloudflare’s response: They identified and fixed the issue quickly. They issued frequent updates to their status site while the incident remained ongoing. They published this postmortem within 24 hours. (That’s remarkable, given the technical breadth of the postmortem. Publishing this tomorrow, within 48 hours of the incident, would have been a praise-worthy accomplishment.) Update: Actually, according to Prince, commenting on Hacker News, the postmortem was published less than 12 hours after the incident began. Amazing. The postmortem starts with a cogent, well-written layperson’s explanation of what happened and why. The postmortem expands to include very specific technical details, including source code. Lastly, it’s worth noting that Prince put his own name on the postmortem (and wrote much of it himself, using BBEdit), and closed with this apology, taking personal responsibility: An outage like today is unacceptable. We’ve architected our systems to be highly resilient to failure to ensure traffic will always continue to flow. When we’ve had outages in the past it’s always led to us building new, more resilient systems. On behalf of the entire team at Cloudflare, I would like to apologize for the pain we caused the Internet today. This is how it’s done. ★
Read more →
Tim Cook Among Attendees of Last Night’s Black-Tie White House Dinner Honoring Journalist-Murdering Tyrant Mohammed bin Salman of Saudi Arabia
2025-11-19T16:46:26Z | Source: Daring Fireball
The New York Times: The world’s richest man. One of the world’s most famous soccer players. The president of soccer’s governing body. Dozens of executives from the finance, tech and energy sectors. These are some of the guests who attended President Trump’s black-tie dinner for Crown Prince Mohammed bin Salman of Saudi Arabia at the White House on Tuesday evening. The red carpet welcome for Prince Mohammed is an extraordinary moment in diplomatic relations with Saudi Arabia. It is his first visit to the United States since the 2018 killing of the Washington Post columnist Jamal Khashoggi, which U.S. intelligence determined the prince ordered. Prince Mohammed has denied involvement. Yours truly, back in August, after Tim Cook’s Oval Office gift of gold to Trump: It is disturbing to think that the leader of a beloved, trusted, and widely believed-to-be-ethical company like Apple has succumbed to avarice. That Tim Cook feels no qualms about — or perhaps even delights in — participating in a quid-pro-quo-driven corrupt administration in which flattery, fealty, gifts, and barely-concealed bribes are rewarded. That the United States devolving into kleptocracy suits Tim Cook just fine, because Apple’s pockets are deep enough to pay the vig. But the alternative is more disturbing. What if Tim Cook is, in fact, strong, proud, and driven by a keen sense of moral and ethical clarity? Perhaps Cook declined Trump’s invitation to join his Middle East entourage in May only because he was otherwise busy. But I believe there are bridges he will not cross — and that trip, especially its implicit and explicit praise and sanctification of the Saudi regime in general, and MBS in particular, was one of them. The whole trip was grotesque, and made a mockery of traditional American values. MBS being feted in the White House is even more grotesque. See also: Karen Attiah, who in her previous job as editor of The Washington Post’s global opinion section hired Jamal Khashoggi after he was exiled from Saudi Arabia, in The Guardian: “The Saudification of America Is Under Way”. ★
Read more →
Fragments Nov 19
2025-11-19T14:02:00-05:00 | Source: Martin Fowler
I’ve been on the road in Europe for the last couple of weeks, and while I was there Thoughtworks released volume 33 of our Technology Radar. Again it’s dominated by the AI wave, with lots of blips capturing our explorations of how to use LLMs and similar technology. “Agents” are the big thing these days but we’re also seeing growing movements in infrastructure orchestration, coding workflows - and the inevitable antipatterns. Many thanks to my colleagues for putting this together again. ❄ ❄ ❄ ❄ My trip to Europe started in Amsterdam, for a Thoughtworks event for a few of our clients there. Since I was in that lovely city, I got in touch with Gergely Orosz, host of The Pragmatic Engineer, and he arranged to record a podcast with me. No surprise that AI was front-and-center of the conversation, as I said it was the biggest shift I’d seen in programming during my career, comparable only to the shift to high-level languages, which even I am not old enough to have experienced. It was a fun chat and I really enjoyed myself. Gergely later joined myself, James Lewis, and Giles Edwards-Alexander at the Thoughtworks event the next day. ❄ ❄ ❄ ❄ My travels also took me to Nuremberg, where I attended an internal conference for Siemens on the future of software architecture. When we think of technology, it’s easy to focus on the Faangs of Silicon Valley, but Siemens have a huge workforce of software developers working on heavy engineering systems like trains and factory automation. It was good to hear them talk about federated architectures, data mesh, and their use of AI. ❄ ❄ ❄ ❄ I’ve often used pseudo-graphs to help explain why high quality software is cheaper. This time, Kent Beck brings a unique perspective to this chart, dispensing with the temporal axis to help think in terms of optionality. ❄ ❄ ❄ ❄ And in another life, Edward has finally finished the great migration of the Heavy Cardboard studio and returns to the tubes with our first game in the new digs. (No surprise that it’s Age of Steam.)
Read more →
How we’re making GitHub Copilot smarter with fewer tools
2025-11-19 20:00 | Source: GitHub Engineering
In VS Code, GitHub Copilot Chat can access hundreds of tools through the Model Context Protocol (MCP) that range from codebase analysis tools to Azure-specific utilities. But giving an agent too many tools doesn’t always make it smarter. Sometimes it just makes it slower. If you’ve ever seen this spinner in VS Code, you’ve hit the limits of a model that’s trying to reason across too many tools at once. To fix that, we’ve built two new systems—embedding-guided tool routing and adaptive tool clustering—and we’re rolling out a reduced toolset that trims the default 40 built-in tools down to 13 core ones. Across benchmarks like SWE-Lancer and SWEbench-Verified with both GPT-5 and Sonnet 4.5, these changes improve success rates by 2-5 percentage points. In online A/B testing, they reduce response latency by an average of 400 milliseconds. Too many tools impede agent intelligence The default toolset in VS Code consists of about 40 built-in tools, ranging from general command-line utilities to specialized tools for Jupyter Notebooks. With MCP servers included, that number can grow into the hundreds. Often, MCP servers bring in so many tools that they can exceed the API limits of some models. We’ve explored ways to filter down our toolset to provide only the tools most relevant to the user’s query, while not restricting the agent’s capabilities. Specifically, we needed to make sure we didn’t sacrifice the user’s experience to achieve lower latency. To accomplish this, we designed a middle-ground approach: “virtual tools.” This includes functionally grouping similar tools under one “virtual tool” that the chat agent can expand as needed. Think of these as directories that contain related tools. This gives the model a general sense of what’s available without flooding it with hundreds of tool names. It also reduces the cache miss rate we’d expect if the model searched for individual tools, since it’s likely that similar tools are used and activated together. Applying lossless dynamic tool selection for MCP tools Adaptive tool clustering Initially we fed all the available tools into an LLM and asked it to group and summarize them. But this had two big issues: We couldn’t control the number of groups created, and it sometimes exceeded model limits. It was extremely slow and incurred a huge token cost. The model would also sometimes ‘forget’ to categorize certain tools, forcing retries. To tackle this issue, we applied our internal Copilot embedding model optimized for semantic similarity tasks to generate embeddings for each tool and group them using cosine similarity. This clustering method allowed precise, stable, and reproducible groups. As an example, the GitHub MCP server’s tools form one possible set of groups in the embedding space. We still use a model call to summarize each cluster, but this step is much faster and cheaper than asking the model to categorize everything from scratch. Tool embeddings and group summaries are cached locally, so recomputing them is comparatively cheap. Context-guided tool selection Once tools were grouped, we faced another problem: how does the model know which group to open without checking them all? We saw that, most of the time, the model would eventually find the right tool for its task. However, each call to a virtual tool still results in a cache miss, an extra round trip, and an opportunity for a small percentage of agent operations to fail.
Context-guided tool selection
Once tools were grouped, we faced another problem: how does the model know which group to open without checking them all? We saw that, most of the time, the model would eventually find the right tool for its task. However, each call to a virtual tool still results in a cache miss, an extra round trip, and an opportunity for a small percentage of agent operations to fail. For example, when the user says “Fix this bug and merge it into the dev branch,” the model often opens search tools, then documentation tools, then local Git tools, before finally realizing that it actually needs the merge tool inside the GitHub MCP tool group to complete the operation. Each incorrect group lookup adds latency and overhead, even though the correct group is fairly obvious from the context. To address this, we introduced Embedding-Guided Tool Routing. Before any tool group is expanded, the system compares the query embedding against vector representations of all tools (and their clusters), allowing it to pre-select the most semantically relevant candidates—even if they’re buried deep inside a group. With context-aware routing, we can infer from the beginning that the model is very likely to need the merge tool inside the GitHub MCP tool group, and include it directly in its candidate set—eliminating unnecessary exploratory calls and significantly reducing latency and failure rates. By surfacing only the most promising matches, we make the model’s search more targeted and reliable, while reducing redundant exploration.

Embedding-based selection (powered by the Copilot Embedding model)
We measure the success of our embedding-based selection process via Tool Use Coverage, which tracks how often the model already has the right tool visible when it needs it. In benchmarks, the embedding-based approach achieved 94.5% Tool Use Coverage, outperforming both LLM-based selection (87.5%) and the default static tool list (69.0%). Offline, this approach resulted in a 27.5% absolute improvement in coverage, clearly surpassing the LLM-based method while helping the agent reason faster and stay efficient. Online testing shows the same pattern: only 19% of tool calls in the Stable channel were successfully pre-expanded using the old method, whereas 72% of Insiders tool calls were pre-expanded thanks to embedding-based matching. This confirms that the gains observed offline carry over to real-world usage.

Less is more: shrinking the default toolset
Even without hitting the model limits that massive MCP servers can trigger, an oversized built-in toolset still degrades performance. In offline evaluations, we observed a 2–5 percentage point drop in resolution rate on benchmarks including SWE-Lancer when the agent had access to the full built-in toolset. Behaviorally, the agent ends up ignoring explicit instructions, using tools incorrectly, and calling tools that are unnecessary for the task at hand. So we trimmed the list. Based on tool usage statistics and performance data, we identified a core toolset of 13 essential tools, covering high-level repository structure parsing, file reading and editing, context searching, and terminal usage. The remaining, non-core built-in tools are grouped into four virtual categories: Jupyter Notebook Tools, Web Interaction Tools, VS Code Workspace Tools, and Testing Tools. This way, the model sees the smaller core set up front and expands groups only when necessary. As a result, users with the trimmed toolset see an average decrease of 190 milliseconds in time to first token, and an average decrease of 400 milliseconds in time to final token (the time to complete the model response). A smaller toolset makes the agent more effective: simpler reasoning, faster response times, and better performance.
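As a rough mental model (again a hypothetical sketch, not GitHub’s implementation), embedding-guided routing amounts to ranking tools by the similarity between the query embedding and each tool’s embedding, and Tool Use Coverage is simply the fraction of cases where the tool the agent ended up needing was already in that pre-selected set. The embed() stub and the tool names below are placeholders.

```python
# Rough sketch of embedding-guided tool routing plus the coverage metric.
# embed() is a stub (pseudo-random unit vectors), so the ranking here is
# arbitrary; a real embedding model is what makes the routing meaningful.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))  # stub, not a real model
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def route(query: str, tool_descriptions: dict[str, str], k: int = 3) -> list[str]:
    """Pre-select the k tools whose descriptions are most similar to the query."""
    q = embed(query)
    scored = sorted(
        ((float(q @ embed(desc)), name) for name, desc in tool_descriptions.items()),
        reverse=True,
    )
    return [name for _, name in scored[:k]]

def tool_use_coverage(sessions: list[tuple[list[str], str]]) -> float:
    """Fraction of sessions where the tool the agent actually needed was
    already visible in its pre-selected candidate set."""
    hits = sum(1 for candidates, needed in sessions if needed in candidates)
    return hits / len(sessions)

tools = {
    "merge_pull_request": "Merge a pull request into a branch on GitHub",
    "read_file": "Read the contents of a file in the workspace",
    "run_tests": "Run the project's test suite",
    "create_issue": "Open a new issue in the repository",
}
candidates = route("Fix this bug and merge it into the dev branch", tools)
print(candidates)
print(tool_use_coverage([(candidates, "merge_pull_request")]))
```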
Future directions: from tool selection to long-context reasoning
As MCP systems evolve, the challenge isn’t just picking the right tool—it’s reasoning across time, context, and interactions. A truly intelligent model shouldn’t just react to queries; it should remember previous tool usage, infer intent from history, and plan multi-step actions over long sessions. In this sense, tool selection is an early form of long-context reasoning. The same mechanisms that help models route to the right tool today could, in the future, help them reason across thousands of turns, helping them decide when to act, when to delegate, and when to stop. Our next step is to explore how embeddings, memory, and reinforcement signals can combine to create context-aware agents that learn how to use tools, not just which ones to pick. Want to see how Copilot uses MCP tools in action? Try GitHub Copilot now >

Acknowledgments
A big shoutout to our developer community for continuing to give us feedback and push us to deliver the best possible agent experiences with GitHub Copilot. A huge thanks also to Zijian Jin, a researcher on the team who helped write this blog, and to the researchers, engineers, and product managers across VS Code and GitHub Copilot who contributed to this work. (Also: we’re hiring applied researchers and software engineers, so feel free to apply!) The post How we’re making GitHub Copilot smarter with fewer tools appeared first on The GitHub Blog.
Read more →
My Foreword to "Frictionless"
2025-11-18T12:19:00-05:00 | Source: Martin Fowler
I find most writing on software productivity to be twaddle, but Nicole Forsgren and Abi Noda are notable exceptions. I had a chance to take a look at their new book, published today, and liked it so much I wrote a foreword. more…
Read more →
★ Meta Replaced the Native Windows WhatsApp App With a Shitty Web App
2025-11-16T00:57:02Z | Source: Daring Fireball
Mayank Parmar, writing for Windows Latest: WhatsApp on Windows 11 has just got a “major” upgrade, and you’re probably going to hate it because it simply loads web.whatsapp.com in a WebView2 container. This means WhatsApp on Windows 11 is cooked, and it’s back to being absolute garbage in terms of performance. WhatsApp is one of those Windows apps that went from being a web wrapper to a native app and then back to the web again after all these years of investment. WhatsApp for Windows was originally an Electron app, and it was eventually replaced with UWP after years of investment. Four years later, WhatsApp is going back to WebView2, abandoning the original WinUI/UWP native idea. [...] An app can use a lot of memory, and it does not necessarily mean it’s a performance nightmare, but the issue with the new WhatsApp is that it feels sluggish. You’re going to notice sluggish performance, long loading time, and other performance issues when browsing different conversations. We also noticed that it does not work well with Windows notifications. It also struggles with Windows 11’s Do Not Disturb mode or Active Hours. And there are delayed notifications problems as well. I found this post interesting on a few fronts. First, from the perspective of Meta. They replaced a shitty web app wrapper for Windows with a modern native Windows app, one that seemingly pleased Windows aficionados like Parmar. And now they’ve thrown that app away, going back to what that native app replaced four years ago: a web app wrapper that is bloated, slow, and unsurprisingly has poor support for native Windows features. It’s bad enough that so many large companies never even bother creating native apps, but it feels even worse to see a good native app abandoned. Second, it’s interesting reading Parmar’s list of gripes about the new web-app-wrapper WhatsApp app. All his gripes have merit, but it struck me that none of them are about the UI. Maybe the web app’s UI is actually fine? I have no idea. But I suspect it’s more that the Windows nerd mindset has UI design quality and adherence to recommended platform idioms way down on their list of priorities. That’s why they’re Windows users, not Mac users. Lastly, I wonder if this bodes poorly for the future of the current WhatsApp app for MacOS, a native app written using Mac Catalyst, Apple’s framework for porting iOS UIKit apps to the Mac. Like most Catalyst apps, WhatsApp for Mac isn’t a good Mac app. It doesn’t support the Services menu at all. It doesn’t let you open chats into standalone windows, or open more than one chat window. It opens its Settings right in its one main window. The whole “there’s only one window, and everything is in that one window” design is very iOS. The menu bar is a HIG prescriptivist’s nightmare. All the multi-word menu commands are in Sentence case rather than Title Case (except, of course, for the menu commands that come “free” with Catalyst — how do the developers of the app not notice this?), and the menu title order goes: File, Chat, Edit, Call, View, Window, Help (obviously it should be File, Edit, View, Chat, Call, Window, Help). Has there ever once, in 41 years, been a good Mac app that puts a menu between “File” and “Edit”? But, still, WhatsApp for Mac is a better Mac app than any Electron app I’ve ever used. Examining it now, it seems lightweight on both CPU usage and memory. It feels a bit better to me than either Signal or Beeper, both of which are developed using Electron, and both of which consume more RAM than WhatsApp. 
To name just one obvious nicety: when you send a new message in an older chat in WhatsApp, that chat animates as it moves to the top of the list of chats. It slides up, and other chats slide down as they re-sort. In the Signal and Beeper apps for Mac, an updated chat just zaps to the top of the chat list, with no animation at all. Gross. The question is, did Meta scrap its native Windows app because they don’t care that much about Windows in particular? Or because they don’t care that much about native desktop apps, period — and a crude web app wrapper is coming to Mac next? WhatsApp for Mac is currently the top-ranked free app in the Mac App Store — but it’s also the top-ranked free Windows app in the Microsoft Store. Meta did just ship a native Apple Watch app for WhatsApp, but if you want an app for WatchOS, it has to be native. You can’t ship a web app wrapper like an Electron app there. Personally, I won’t care too much if Meta shitcans the WhatsApp Mac app, because I barely use WhatsApp. But outside America, WhatsApp is the dominant messaging platform in much (most?) of the world. I’d be worried if I were a Mac user who uses WhatsApp heavily.
Read more →
The Learning Loop and LLMs
2025-11-04T09:14:00-05:00 | Source: Martin Fowler
Unmesh Joshi finds LLMs to be a useful tool, but explains why their help becomes illusory if we use them to shortcut the learning loop that's an essential part of our professional practice. more…
Read more →
Fragments Nov 3
2025-11-03T19:42:00-05:00 | Source: Martin Fowler
I’m very concerned about the security dangers of LLM-enabled browsers, as it’s just too easy for them to contain the Lethal Trifecta. For up-to-date eyes on these issues, I follow the writings of the coiner of that phrase: Simon Willison. Here he examines a post on how OpenAI is thinking about these issues. My takeaways from all of this? It’s not done much to influence my overall skepticism of the entire category of browser agents, but it does at least demonstrate that OpenAI are keenly aware of the problems and are investing serious effort in finding the right mix of protections. ❄ ❄ ❄ ❄ Rob Bowley: Unsurprisingly, there are a lot of strong opinions on AI assisted coding. Some engineers swear by it. Others say it’s dangerous. And of course, as is the way with the internet, nuanced positions get flattened into simplistic camps where everyone’s either on one side or the other. A lot of the problem is that people aren’t arguing about the same thing. They’re reporting different experiences from different vantage points. His view is that beginners are very keen on AI-coding but they don’t see the problems they are creating. Experienced folks do see this, but it takes a further level of experience to realize that, when used well, these tools are still valuable. Interestingly, I’ve regularly seen sceptical experienced engineers change their view once they’ve been shown how you can blend modern/XP practices with AI assisted coding. The upshot is that you have to be aware of the experience level of whoever is writing about this stuff - and that experience is not just in software development generally, but also in how to make use of LLMs. One thing that rings clearly from reading Simon Willison and Birgitta Böckeler is that effective use of LLMs is a skill that takes a while to develop. ❄ ❄ ❄ ❄ Charlie Brown and Garfield, like most comic strip characters, never changed over the decades. But Doonesbury’s cast aged, had children, and some have died (I miss Lacey). Garry Trudeau retired from writing daily strips a few years ago, but his reruns of older strips are one of the best things in the shabby remains of Twitter. A couple of weeks ago, he reran one of the most memorable strips in its whole run. The very first frame of Doonesbury introduced the character “B.D.”, a football jock never seen without his football helmet, or, when on duty, his military helmet. This panel was the first time in over thirty years that B.D. was shown without a helmet; readers were so startled that they didn’t immediately notice that the earlier explosion had removed his leg. This set off a remarkable story arc about the travails of a wounded veteran. It’s my view that future generations will find Doonesbury to be a first-class work of literature, and a thoughtful perspective on contemporary America.
Read more →
Agentic AI and Security
2025-10-28T09:20:00-04:00 | Source: Martin Fowler
Agentic AI systems are amazing, but introduce equally amazing security risks. Korny Sietsma explains that their core architecture opens up security issues through what Simon Willison named the “Lethal Trifecta”. Korny goes on to talk about how to mitigate this through removing legs of the trifecta and splitting complex tasks. more…
Read more →
Fragments and Links
2025-10-21T11:02:00-04:00 | Source: Martin Fowler
Mathias Verraes writes about the relationship between Domains and Bounded Contexts in Domain-Driven Design. It’s a common myth that there should always be a 1:1 relationship between them; although that’s sometimes the case, deeper modeling often exposes a more interesting structure. Gary Marcus: (NYT Gift Link) If the strengths of A.I. are truly to be harnessed, the tech industry should stop focusing so heavily on these one-size-fits-all tools and instead concentrate on narrow, specialized A.I. tools engineered for particular problems. Because, frankly, they’re often more effective. One of the truly annoying things about the US tax system is that we can’t easily file our tax returns electronically. In recent years an initiative called “Direct File” sought to fix that. Matt Bracken tells the story of how they developed a highly regarded system in 25 states before it was canned by the Trump administration. He also explains how the creators of Direct File are working to prepare the ground for it to reappear. Security issues are only getting worse, but the US government agency for cybersecurity is having its staff reassigned to other duties. Detailed story in Bloomberg (paywalled) and an open (but more polemic) summary on Techdirt. Changes have hit particularly hard in CISA’s Capacity Building team, which writes emergency directives and oversees cybersecurity for the government’s highest value assets, the employees said. Defense and law enforcement are valuable things for a government to do, but here they seem to be walking away from a growing crisis.
Read more →
Understanding Spec-Driven-Development: Kiro, spec-kit, and Tessl
2025-10-15T08:33:00-04:00 | Source: Martin Fowler
Birgitta Böckeler has been trying to understand one of the latest AI coding buzzwords: Spec-driven development (SDD). She looked at three of the tools that label themselves as SDD tools and tried to untangle what it means, as of now. more…
Read more →
Anchoring AI to a reference application
2025-09-25T12:57:00+03:00 | Source: Martin Fowler
Service templates are a typical building block in the “golden paths” organisations build for their engineering teams, to make it easy to do the right thing. The templates are supposed to be the role models for all the services in the organisation, always representing the most up-to-date coding patterns and standards. One of the challenges with service templates, though, is that once a team has instantiated a service with one, it’s tedious to feed template updates back to those services. Birgitta Böckeler considers whether GenAI can help with that. more…
Read more →
To vibe or not to vibe
2025-09-23T10:53:00+03:00 | Source: Martin Fowler
Birgitta Böckeler examines the risk assessment around when to use vibe coding, using three dimensions of risk: Probability, Impact, and Detectability more…
Read more →
Some thoughts on LLMs and Software Development
2025-08-28T10:10:00-04:00 | Source: Martin Fowler
I’m about to head away from looking after this site for a few weeks (part vacation, part work stuff). As I contemplate some weeks away from the daily routine, I feel an urge to share some scattered thoughts about the state of LLMs and AI. ❄ ❄ ❄ ❄ I’ve seen a few early surveys on the effect AI is having on software development: is it really speeding folks up, does it improve or wreck code quality? One of the big problems with these surveys is that they aren’t taking into account how people are using the LLMs. From what I can tell, the vast majority of LLM usage is fancy auto-complete, often using Copilot. But those I know who get the most value from LLMs reckon that auto-complete isn’t very useful, preferring approaches that allow the LLM to directly read and edit source code files to carry out tasks. My concern is that surveys that ignore the different workflows of using LLMs will produce data that’s going to send people down the wrong paths. (Another complication is the varying capabilities of different models.) ❄ ❄ ❄ ❄ I’m often asked, “what is the future of programming?” Should people consider entering software development now? Will LLMs eliminate the need for junior engineers? Should senior engineers get out of the profession before it’s too late? My answer to all these questions is “I haven’t the foggiest”. Furthermore I think anyone who says they know what this future will be is talking from an inappropriate orifice. We are still figuring out how to use LLMs, and it will be some time before we have a decent idea of how to use them well, especially if they gain significant improvements. What I suggest is that people experiment with them. At the least, read about what others are doing, but pay attention to the details of their workflows. Preferably experiment yourself, and do share your experiences. ❄ ❄ ❄ ❄ I’m also asked: “is AI a bubble?” To which my answer is “OF COURSE IT’S A BUBBLE”. All major technological advances have come with economic bubbles, from canals and railroads to the internet. We know with near 100% certainty that this bubble will pop, causing lots of investments to fizzle to nothing. However, what we don’t know is when it will pop, and thus how big the bubble will have grown, generating some real value in the process, before that happens. It could pop next month, or not for a couple of years. We also know that when the bubble pops, many firms will go bust, but not all. When the dot-com bubble burst, it killed pets.com, it killed Webvan… but it did not kill Amazon. ❄ ❄ ❄ ❄ I retired from public speaking a couple of years ago. But while I don’t miss the stress of giving talks, I do miss hanging out with my friends in the industry. So I’m looking forward to catching up with many of them at GOTO Copenhagen. I’ve been involved with the GOTO conference series since the 1990s (when it was called JAOO), and continue to be impressed with how they put together a fascinating program. ❄ ❄ ❄ ❄ My former colleague Rebecca Parsons has been saying for a long time that hallucinations aren’t a bug of LLMs, they are a feature. Indeed they are the feature. All an LLM does is produce hallucinations; it’s just that we find some of them useful. One of the consequences of this is that we should always consider asking the LLM the same question more than once, perhaps with some variation in the wording. Then we can compare answers, indeed perhaps ask the LLM to compare answers for us. The difference in the answers can be as useful as the answers themselves. Certainly, if we ever ask a hallucination engine for a numeric answer, we should ask it at least three times, so we get some sense of the variation. Furthermore, we shouldn’t ask an LLM to calculate an answer that we can calculate deterministically (yes, I’ve seen this). It is OK to ask an LLM to generate code to calculate an answer (but still do it more than once).
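That “ask a few times and compare” advice is easy to mechanise. A minimal sketch, assuming a hypothetical ask_llm() helper that wraps whatever model API you use (it is not a real library call):

```python
# Minimal sketch of the "ask at least three times" advice. ask_llm() is a
# hypothetical placeholder for your model API of choice, not a real call.
from statistics import mean, pstdev

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model API of choice")

def numeric_answers(prompt: str, runs: int = 3) -> list[float]:
    """Ask the same numeric question several times and keep every reply,
    so the spread is visible rather than hidden behind a single answer."""
    answers: list[float] = []
    for _ in range(runs):
        reply = ask_llm(prompt)
        try:
            answers.append(float(reply.strip()))
        except ValueError:
            pass  # a non-numeric reply is itself a useful signal
    return answers

# Example usage (once ask_llm is wired up): look at the spread, not one value.
# values = numeric_answers("Roughly how many words are in War and Peace? Reply with a number only.")
# print(values, mean(values), pstdev(values))
```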
❄ ❄ ❄ ❄ Other forms of engineering have to take into account the variability of the world. A structural engineer builds in tolerance for all the factors she can’t measure. (I remember being told early in my career that the unique characteristic of digital electronics was that there was no concept of tolerances.) Process engineers consider that humans are executing tasks, and will sometimes be forgetful or careless. Software Engineering is unusual in that it works with deterministic machines. Maybe LLMs mark the point where we join our engineering peers in a world of non-determinism. ❄ ❄ ❄ ❄ I’ve often heard, with decent reason, an LLM compared to a junior colleague. But I find LLMs are quite happy to say “all tests green”, yet when I run them, there are failures. If that was a junior engineer’s behavior, how long would it be before H.R. was involved? ❄ ❄ ❄ ❄ LLMs create a huge increase in the attack surface of software systems. Simon Willison described the Lethal Trifecta for AI agents: an agent that combines access to your private data, exposure to untrusted content, and a way to externally communicate (“exfiltration”). That “untrusted content” can come in all sorts of ways: ask the agent to read a web page, and an attacker can easily put instructions on the website in 1pt white-on-white font to trick the gullible LLM into revealing that private data. This is particularly serious when it comes to agents acting in a browser. Read an attacker’s web page, and it could trick the agent into going to your bank account in another tab and “buying you a present” by transferring your balance to the kind attacker. Willison’s view is that “the entire concept of an agentic browser extension is fatally flawed and cannot be built safely”.
Read more →
From Black Box to Blueprint
2025-08-28T07:24:00-04:00 | Source: Martin Fowler
A common enterprise problem: crucial legacy systems become “black boxes”—key to operations but opaque and risky to touch. Thiyagu Palanisamy and Chandirasekar Thiagarajan worked with a client to use AI-assisted reverse engineering to reconstruct functional specifications from UI elements, binaries, and data lineage to overcome analysis paralysis. They developed a methodical “multi-lens” approach—starting from visible artifacts, enriching incrementally, triangulating logic, and always preserving lineage. Human validation remains central to ensure accuracy and confidence in extracted functionality. This engagement revealed that turning a system from black box to blueprint empowers modernization decisions and accelerates migration efforts. more…
Read more →
Research, Review, Rebuild: Intelligent Modernisation with MCP and Strategic Prompting
2025-08-27T10:15:00-04:00 | Source: Martin Fowler
The Bahmni open-source hospital management system began over nine years ago with a front end using AngularJS and an OpenMRS REST API. Rahul Ramesh wished to convert this to use a React + TypeScript front end with an HL7 FHIR API. In exploring how to do this modernization, he used a structured prompting workflow of Research, Review, and Rebuild, together with Cline, Claude 3.5 Sonnet, Atlassian MCP server, and a filesystem MCP server. Changing a single control would normally take 3–6 days of manual effort, but with these tools it was completed in under an hour at a cost of under $2. more…
Read more →
Building your own CLI Coding Agent with Pydantic-AI
2025-08-27T07:50:00-04:00 | Source: Martin Fowler
CLI coding agents are a fundamentally different tool to chatbots or autocomplete tools - they're agents that can read code, run tests, and update a codebase. Ben O'Mahony explains that while commercial tools are impressive, they don't understand the particular context of our environment and the eccentricities of our specific project. Instead we can build our own coding agent by assembling open source tools, using our specific development standards for: testing, documentation production, code reasoning, and file system operations. more…
Read more →
Chatting with Unmesh about building language with LLMs
2025-08-26T09:26:00-04:00 | Source: Martin Fowler
A few weeks ago, Unmesh Joshi and I started having a conversation about how he likes to grow a language of abstractions when working with an LLM. We thought this was a conversation that others might find interesting so we turned it into an article. We talk about how programming is about both building and applying abstractions and how the LLM helps us in different ways with each activity. more…
Read more →
Bliki: Expansion Joints
2025-08-18T00:00:00-04:00 | Source: Martin Fowler
Back in the days when I did live talks, one of my abilities was to finish on time, even if my talk time was cut at the last moment (perhaps due to the prior speaker running over). The key to my ability to do this was to use Expansion Joints - parts of the talk that I'd pre-planned so I could cover them quickly or slowly depending on how much time I had. The way I'd do this would be to plan for some topics to be optional. The talk would work if I skipped over them, but I could also witter on about them for five (or ten) minutes. Ideally, each of these topics would get one slide, usually with a bunch of key phrases on it - the headings of what I'd talk about should I be talking about it. When I got to the slide, I'd look at how time was going with the talk. If (as was usually the case) I was running short of time, I could cover the slide in about thirty seconds, saying something like: “in doing this, there's a bunch of things you need to consider, but they are out of scope for today's talk”. If, however, I did have time, I could then spend some time talking about them. The slide would be simple, and not provide much of a Visual Channel, but that wasn't so important, after all this material was optional in the first place. The single flex-slide was my favorite Expansion Joint, as it was easy to use. Sometimes however my optional topic required a proper visual channel, necessitating dedicated slides. My solution here was good control over slide handling. Presentation tools include the ability to skip over slides while I'm talking, and I made sure I practiced how to use them so I could skip a bunch of slides without the audience knowing. It's crucial here that it's invisible to the audience, I find it looks sloppy if anyone says “in the interests of time I'll skip over these slides”. To do this, however, I do need access to my laptop while presenting, venues that only provide a clicker while loading the slides on some other machine lack that control. That started to happen in my last couple of years, much to my annoyance. When creating talks, I was always worried that I would run out of things to say, even though experience told me I reliably crammed more stuff in than I could possibly cover. Expansion Joints helped with this, I could aggressively trim the core talk to less than I needed, and rely on the Expansion Joints to fill the gap. In practice I usually didn't need the Expansion Joints anyway, but their presence helped my confidence. Using Expansion Joints was particularly important for me as I never rehearsed my talks. I was always someone whose ability to present was driven by adrenaline. Talking to a rubber duck just didn't work, the duck was clearly every bit as bored as I was. Consequently the first time I gave a talk, I was hazy as to how long it would take. Yet with Expansion Joints in place, I was able to finish a talk right on time. Expansion Joints enabled me to give the same talk to different time slots. Sometimes I'd have thirty minutes, sometimes forty-five. With Expansion Joints, I didn't need to change my slides, particularly handy if a time cut (or more rarely a time increase) appeared at the last moment. (Although in my later years, I handled this by doing a Suite Of Talks.) Talks that encourage audience interaction need these because we can never predict how much time the interaction will use up. Sometimes we get a steady stream of questions, other times (particularly in Scandinavia, or upper-Midwest America) a lack of questions had me blasting through the agenda. 
Any such talk needed a double-dose of this temporal ballast. Expansion Joints are at their most useful in later parts of the talk, as it's then that I have the most information on how much time I have. Earlier ones can still be handy, particularly if they come after an interactive section when I'd like to rebase my timing.
Further Reading
The name was coined by Neal Ford, Matthew McCullough, and Nathaniel Schutta in their excellent book Presentation Patterns.
Read more →
Team OKRs in Action
2025-08-13T10:16:00-04:00 | Source: Martin Fowler
OKRs have become a popular way to connect strategy with execution in large organizations. But when they are set in a top‑down cascade, they often lose their meaning. Teams receive objectives they didn’t help create, and the result is weak commitment and little real change. Paulo Caroli describes how high‑performing teams can work in another way: they define their own objectives, in an organization that uses a collaborative process to align the team’s OKRs with the broader strategy. Team OKRs set this way create a shared purpose and become the base for a regular cycle of planning, check‑ins, and retrospectives. more…
Read more →
Impact Intelligence, addressing common objections
2025-08-12T09:02:00-04:00 | Source: Martin Fowler
Sriram Narayan concludes his article on impact intelligence by addressing five common objections to this activity, including slowing down, lack of agility and collaboration, and the unpredictability of innovation. more…
Read more →
Quick but worthwhile links
2025-08-07T09:21:00-04:00 | Source: Martin Fowler
Abi Noda observes: Just met with a 2000+ eng company. Their developers are saving 2+ hours per week thanks to Copilot. But they’re also losing: 3 hrs per week due to slow builds, 4 hrs per week on dev environment toil, and 2 hrs per week waiting for code reviews. AI is not a silver bullet. Nik Malykhin found it useful to get an AI assistant to write its own coding rules by analyzing his code, and then asking it to refine them as he worked with it: the central paradox of using AI assistants effectively: to offload cognitive work to an AI, you must first do the meta-cognitive work of codifying your own development philosophy and collaboration style. I agree with Charity Majors that there is a valuable distinction between disposable versus durable code, and that makes a difference in how we use AI with it. The difference between disposable code and durable code is not about whether the code was generated by AI or written by a human, or even how difficult it was to write. The cost is defined by the standards you are building to, and the rest of the software development lifecycle: how well you expect to maintain it, extend it, migrate it, understand its behavior, or fix it when it breaks. This is the expensive part of software development, the type that requires deep expertise and familiarity with your language and environment. Disposable code is cheap because you don’t even try to maintain it. Jim Highsmith thinks that we should think of AI as Alternative Intelligence: It’s not fake intelligence, or artificial empathy, or HAL 9000 with manners. It’s something else. Something that thinks differently, not defectively. Rod Johnson asserts that we know that memory is important to AI systems, but we forget that Domain Models are an important form of memory: Event Sourcing provides perfect episodic memory by storing the complete history of domain changes as immutable events. Every decision, every state transition, every business event is preserved with full context. Repository patterns offer domain-focused memory interfaces that understand business concepts. A CustomerRepository knows how to retrieve customer information in ways that preserve business meaning, not just raw data. Bounded contexts from Domain-Driven Design partition memory into semantic boundaries, preventing the concept pollution that plagues pure vector-based approaches. Aggregates function as cohesive memory clusters with consistency boundaries—exactly what we need for reliable agent behavior.
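Johnson’s point about event sourcing as episodic memory is concrete enough to sketch. A minimal illustration, with a hypothetical Customer aggregate and event names (not taken from any particular framework):

```python
# Minimal sketch of event sourcing as "episodic memory": state is never
# overwritten, it is replayed from an immutable history of domain events.
# The Customer aggregate and event names here are hypothetical illustrations.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Event:
    kind: str
    data: dict
    at: datetime

class Customer:
    def __init__(self, history: list[Event] | None = None):
        self.history: list[Event] = []
        self.email: str | None = None
        for e in history or []:
            self._apply(e)

    def _apply(self, event: Event) -> None:
        self.history.append(event)
        if event.kind == "EmailChanged":
            self.email = event.data["email"]

    def change_email(self, email: str) -> None:
        # Record the decision as an event; current state is derived from it.
        self._apply(Event("EmailChanged", {"email": email}, datetime.now(timezone.utc)))

c = Customer()
c.change_email("a@example.com")
c.change_email("b@example.com")
replayed = Customer(c.history)          # the full episodic memory survives replay
print(replayed.email, len(replayed.history))
```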
Read more →
Actions to improve impact intelligence
2025-08-07T09:20:00-04:00 | Source: Martin Fowler
Sriram Narayan continues his article on impact intelligence by outlining five actions that can improve impact intelligence: introduce robust demand management, pay down measurement debt, introduce impact validation, offer your CFO/COO an alternative to ROI, and equip your teams. more…
Read more →
The Reformist CTO’s Guide to Impact Intelligence
2025-08-06T09:23:00-04:00 | Source: Martin Fowler
The productivity of knowledge workers is hard to quantify and often decoupled from direct business outcomes. The lack of understanding leads to many initiatives, bloated tech spend, and ill-chosen efforts to improve this productivity. Sriram Narayan begins an article that looks at how to avoid this by developing an intelligence of the business impact of their work across a network connecting output to proximate and downstream impact. more…
Read more →
How far can we push AI autonomy in code generation?
2025-08-05T09:53:00-04:00 | Source: Martin Fowler
Birgitta Böckeler reports on a series of experiments we did to explore how far Generative AI can currently be pushed toward autonomously developing high-quality, up-to-date software without human intervention. As a test case, we created an agentic workflow to build a simple Spring Boot application end to end. We found that the workflow could ultimately generate these simple applications, but still observed significant issues in the results—especially as we increased the complexity. The model would generate features we hadn't asked for, make shifting assumptions around gaps in the requirements, and declare success even when tests were failing. We concluded that while many of our strategies — such as reusable prompts or a reference application — are valuable for enhancing AI-assisted workflows, a human in the loop to supervise generation remains essential. more…
Read more →
Partner with the AI, throw away the code
2025-07-31T10:16:00-04:00 | Source: Martin Fowler
Matteo Vaccari shows why the common metric of AI code acceptance has a big hole. An LLM can be helpful even if you throw away its code. more…
Read more →
Who is LLM
2025-07-22T13:38:00+05:30 | Source: Martin Fowler
It's become a common habit for developers to give Large Language Models (LLMs) a persona when working with them. I describe four of them: a stubborn donkey, a genie, a slot machine, and Uriah Heep. more…
Read more →
Generative AI in software and essaying
2025-07-21T14:58:00-04:00 | Source: Martin Fowler
Korny Sietsma has a great example of how using an LLM for coding is very helpful but with limitations… and a thoughtful general essay on why the hype and the immovable skeptics are both missing the train. While here, a professor of poetry ponders (gift link) on the value and limits of AI with writing: One of the real challenges here is the way that A.I. undermines the human value of attention, and the individuality that flows from that. What we stand to lose is not just a skill but a mode of being: the pleasure of invention, the felt life of the mind at work.
Read more →
Three worthwhile articles yesterday
2025-07-10T10:58:00-04:00 | Source: Martin Fowler
Three articles I enjoyed yesterday: Stephen O’Grady talks about how Gen AI tools break two common constants with developer tools: developers are willing to flit between Gen AI tools and they are willing to pay for them. This implies that it’s not too late for new tools to appear, and that enterprise adoption will be slowed by a lack of consensus on which direction to go. Pete Hodgson continues his excellent writing on Gen AI by proposing an approach to leading engineers towards an AI-assisted future, centered around the concept of aligned autonomy. He advocates an explicit experimentation phase, followed by supporting adoption and measuring their impact. Charity Majors reflects on her career. I really resonated with her words: “I think I’m less interested in my own happiness (whatever that means) than I am interested in doing work that feels worth doing.”
Read more →
I still care about the code
2025-07-09T10:33:00-04:00 | Source: Martin Fowler
Even with LLMs, Birgitta Böckeler still cares about the code: “LLMs are NOT compilers, interpreters, transpilers or assemblers of natural language, they are inferrers.” more…
Read more →
Why Organizations Need Expert Generalists
2025-07-02T10:05:00-04:00 | Source: Martin Fowler
In complex environments, the characteristics of Expert Generalists lead them to be particularly valuable in driving tasks to completion. Unmesh, Gitanjali, and I thus complete our article by summarizing the value of this skill. more…
Read more →
Expert Generalists need specialists (and LLMs)
2025-07-01T09:17:00-04:00 | Source: Martin Fowler
While we've spent this article praising the Expert Generalist, Unmesh, Gitanjali, and I simultaneously do not deny the value of specialist knowledge. To be the most efficient, a team needs some specialist skill. We've also observed that Expert Generalist capabilities are considerably more valuable when working with LLMs. more…
Read more →
Growing Expert Generalists
2025-06-25T08:48:00-04:00 | Source: Martin Fowler
To grow Expert Generalists we need to focus attention on fundamentals rather than tools. As an example, Unmesh, Gitanjali, and I describe a workshop we've used to break silos of application development, data engineering, and devops. more…
Read more →