20 Comments
Ethan Maxwell

The edge-ready Gemma 3n sounds perfect for AR try-ons; any early data on conversion lift versus server-side models?

Shawn Reddy

Early pilots are promising. A fashion retailer I work with saw a 12 % lift in add-to-cart when Gemma 3n ran on-device; the sub-50 ms latency kept users engaged even on spotty networks. Still a small sample, but two other brands testing footwear and cosmetics are trending the same direction, with single-digit gains so far. I’ll share fuller numbers once the data set grows.

Ava Thompson

Appreciate the practical lens. With Safe-RL Playground exposing failure modes, do you anticipate clients demanding transparency reports on agent testing?

Shawn Reddy

Absolutely—several enterprise clients already ask for a “test log” before sign-off. I expect transparency reports that detail scenario coverage, failure rates, and mitigation steps to become table stakes, much like security audits did a few years ago.

Ashley Martinez

Super helpful as always. How are you budgeting for tool experimentation now that each week brings a new must-try model?

Shawn Reddy

I peel off 10 % of the martech budget for two-week pilots. Tools that move a core KPI graduate to the main stack; the rest get shut off. Vendor credits and rev-share deals usually offset half the trial cost, so the hit stays light.

Sofia Gray

Kimi’s 26.9 % Pass@1 is wild, but I keep thinking about that 96 % sabotage stat. Do you throttle agent autonomy in stages or launch fully and monitor?

Shawn Reddy

We phase autonomy in tiers: sandboxed research first, then limited-write actions, and full execution rights only after a red-team sprint shows failure rates below our threshold. It’s slower up front but cheaper than cleaning up a live sabotage incident.
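
If it helps to picture the gating, here’s a rough sketch of the promotion logic (the tier names and the 2 % threshold are illustrative, not our actual config):

```python
# Rough sketch of tier promotion, not production code; tier names and
# the 2% threshold are illustrative.
from enum import Enum

class Tier(Enum):
    SANDBOX = 1        # research only, no side effects
    LIMITED_WRITE = 2  # can draft and update, human approves pushes
    FULL_EXEC = 3      # acts autonomously, with random audits

def next_tier(current: Tier, red_team_failure_rate: float,
              threshold: float = 0.02) -> Tier:
    """Promote one tier at a time, and only when the latest red-team
    sprint's failure rate sits below the agreed threshold."""
    if red_team_failure_rate >= threshold:
        return current  # hold here until failures drop
    return Tier(min(current.value + 1, Tier.FULL_EXEC.value))

# Example: a sandboxed agent with a 1% failure rate earns limited-write access.
print(next_tier(Tier.SANDBOX, red_team_failure_rate=0.01))  # Tier.LIMITED_WRITE
```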

Nathalie Morgan

AlphaGenome’s precision claims are huge. For health brands, what compliance steps would you take before weaving genomic insights into copy?

Shawn Reddy

Start with a medical-legal review: tie every claim to peer-reviewed data and map it against FDA/FTC rules (structure/function, device, or drug—whichever applies). Add HIPAA-grade data safeguards and have a clinical advisory board sign off before copy leaves draft. It’s slower, but it keeps the brand off the warning-letter list.

Olivia Rose

Spot-on summary. When Meta’s new talent could reshape ad ranking overnight, how do you future-proof creative testing so sudden algorithm shifts don’t tank ROAS?

Shawn Reddy

We keep a rolling control group—10 % budget on manual placements—to benchmark against Meta’s auto-ranked feed, and we refresh creative variants weekly instead of monthly. That way, if an algorithm tweak hits, we see the delta in near-real time and can swap winners back in before ROAS drifts.
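
Concretely, the weekly check looks something like this (just a sketch; the 15 % tolerance and the sample numbers are made up for illustration):

```python
# Illustrative drift check; the 15% tolerance and the example figures are
# assumptions for the sketch.
def roas(revenue: float, spend: float) -> float:
    return revenue / spend if spend else 0.0

def algorithm_shift_alert(auto: dict, control: dict,
                          tolerance: float = 0.15) -> bool:
    """Benchmark the auto-ranked feed against the manual-placement control
    group; alert when auto ROAS falls more than `tolerance` below it."""
    auto_roas = roas(auto["revenue"], auto["spend"])
    control_roas = roas(control["revenue"], control["spend"])
    return auto_roas < control_roas * (1 - tolerance)

# Example week: auto-ranked ROAS 2.1 vs control 2.8 -> the alert fires.
print(algorithm_shift_alert({"revenue": 21_000, "spend": 10_000},
                            {"revenue": 2_800, "spend": 1_000}))  # True
```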

Logan Hayes

Impressive lineup this week. Curious: before adding Gemma 3n to a mobile funnel, how do you measure whether on-device personalization outweighs potential privacy concerns?

Shawn Reddy

We run a two-part test: A/B the funnel with on-device inference versus our cloud baseline, and pair it with a data-protection impact assessment. If conversions rise by >5 % and opt-out rates stay under 15 %, we green-light; miss either threshold and the privacy trade-off isn’t worth it.
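
The gate itself is simple enough to write down (a minimal sketch; the 5 % and 15 % cutoffs are the ones above, the metric names are illustrative):

```python
# Minimal sketch of the go/no-go rule; metric names are illustrative.
def green_light(conversion_lift: float, opt_out_rate: float,
                min_lift: float = 0.05, max_opt_out: float = 0.15) -> bool:
    """Ship on-device inference only if conversions rise by more than 5%
    and opt-out rates stay under 15%; missing either threshold means the
    privacy trade-off isn't worth it."""
    return conversion_lift > min_lift and opt_out_rate < max_opt_out

print(green_light(conversion_lift=0.07, opt_out_rate=0.09))  # True  -> ship
print(green_light(conversion_lift=0.07, opt_out_rate=0.18))  # False -> hold
```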

Liam Parker

Loved the “intern-tier vs. strategy-analyst” framing. With sabotage risks still high, do you see red-teaming becoming a standard part of marketing toolkits?

Shawn Reddy

It’s heading that way—any brand letting agents touch live customer data should budget for an internal or third-party red-team cycle before launch and again after major model updates.

Lucas Bennett

The newsletter nails the tension between speed and risk. When an agent like Kimi can outpace a team, what’s your first checkpoint to keep messaging on-brand and error-free?

Shawn Reddy

We lock it behind a brand-style layer: every draft from the agent runs through a static prompt with voice, tone, and claim rules, then a human editor spot-checks the first 20 outputs before we scale. If it passes those two gates, we let it push automatically with random audits.
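
For anyone wiring this up, here’s the rough shape of the flow (check_style_rules, the banned-claims list, and the 10 % audit rate are placeholders, not a real library or our actual stack):

```python
# Sketch of the two-gate flow; the style check and 10% audit rate are
# placeholders, not a specific product.
import random

def check_style_rules(draft: str) -> bool:
    """Stand-in for the static brand prompt: voice, tone, and claim rules."""
    banned_claims = ["guaranteed results", "clinically proven"]
    return not any(claim in draft.lower() for claim in banned_claims)

def publish_pipeline(drafts: list[str], human_review_first_n: int = 20,
                     audit_rate: float = 0.10) -> list[str]:
    approved = []
    for i, draft in enumerate(drafts):
        if not check_style_rules(draft):      # gate 1: brand-style layer
            continue
        if i < human_review_first_n:          # gate 2: editor spot-checks first 20
            print(f"queue draft {i} for editor review")
        elif random.random() < audit_rate:    # steady state: random audits
            print(f"flag draft {i} for random audit")
        approved.append(draft)
    return approved
```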

Emily Carson

Great breakdown, Shawn. Kimi’s research leap is tempting, but Anthropic’s stress-test findings feel like a caution flag—how do you test agents for brand-safety before handing them customer data?

Shawn Reddy

We run staged drills: first, red-team prompts to probe for policy violations; next, simulate real traffic with synthetic PII to watch for leaks or hallucinated offers; finally, route early live queries through a shadow mode where the agent suggests but a human approves. Only after it clears those three gates do we let it touch production data.
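
Roughly, the drill harness looks like this (the agent interface and the 5 % rejection bar are assumptions for the sketch, not a specific vendor API):

```python
# Illustrative three-gate drill harness; the agent methods and the 5%
# rejection bar are assumptions, not a real product's API.
from typing import Callable

def run_drills(agent, red_team_prompts: list[str],
               synthetic_sessions: list[dict], shadow_queries: list[str],
               human_approves: Callable[[str], bool]) -> str:
    # Gate 1: red-team prompts probing for policy violations.
    if any(not agent.respects_policy(p) for p in red_team_prompts):
        return "blocked at gate 1: policy violation found"

    # Gate 2: synthetic traffic seeded with fake PII; watch for leaks
    # or hallucinated offers.
    if any(agent.leaks_pii(s) or agent.invents_offer(s) for s in synthetic_sessions):
        return "blocked at gate 2: PII leak or hallucinated offer"

    # Gate 3: shadow mode -- the agent drafts replies, a human approves each.
    rejected = sum(not human_approves(agent.draft(q)) for q in shadow_queries)
    if rejected / max(len(shadow_queries), 1) > 0.05:  # assumed 5% bar
        return "blocked at gate 3: too many shadow-mode drafts rejected"

    return "cleared all three gates: OK to touch production data"
```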
