Multi-turn conversations with Action-Based Contrastive Self-Training

June 6, 2025

Are action-based preferences necessary? One of the key factors of ACT is that its contrastive pairs highlight differences between conversational actions. In “ACT w/ Random Actions”, we further examine the importance of action selection by randomly sampling both the winning and losing actions when constructing each preference pair, and observe that this variant underperforms standard ACT.
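To make the ablation concrete, the sketch below contrasts the two ways of building a preference pair. It assumes a hypothetical two-action space (`CLARIFY` vs `ANSWER`) and a hypothetical `responses_by_action` mapping that already holds one candidate response per action. In the random variant we assume the two sampled actions are distinct, so roughly half the pairs end up inverted (the wrong action is marked as preferred). This is an illustration of the idea, not the authors' implementation.

```python
import random
from dataclasses import dataclass

# Hypothetical action space; the real action inventory may differ per task.
ACTIONS = ("CLARIFY", "ANSWER")


@dataclass
class PreferencePair:
    context: str
    chosen: str    # "winning" response
    rejected: str  # "losing" response


def action_based_pair(context, correct_action, responses_by_action):
    """Chosen response takes the correct action; rejected takes the other one."""
    wrong_action = next(a for a in ACTIONS if a != correct_action)
    return PreferencePair(
        context=context,
        chosen=responses_by_action[correct_action],
        rejected=responses_by_action[wrong_action],
    )


def random_action_pair(context, responses_by_action, rng):
    """Ablation: winning and losing actions are drawn at random (assumed distinct),
    ignoring which action is actually correct for this context."""
    chosen_action, rejected_action = rng.sample(ACTIONS, 2)
    return PreferencePair(
        context=context,
        chosen=responses_by_action[chosen_action],
        rejected=responses_by_action[rejected_action],
    )
```

Because the random variant discards the label of which action is appropriate, the contrast it teaches carries no information about when to clarify, which is one plausible reading of why it underperforms.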

Do we need on-policy sampling? In “ACT w/o on-policy sampling”, we examine the importance of on-policy sampling by evaluating standard off-policy DPO on the dataset constructed in Phase 1. While we do observe some improvement over SFT (e.g., Macro F1 rises from 69.0 to 74.8), the gains are much larger when using on-policy sampling, as in full ACT. This may be because the off-policy negative responses are not guaranteed to lie in the language manifold of the policy model, and the resulting distribution shift may be too difficult to overcome with off-policy learning.
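For reference, the per-pair DPO objective is the same in both settings; what the ablation changes is where the chosen and rejected responses come from (a fixed Phase 1 dataset vs samples drawn from the current policy). The sketch below computes the standard DPO loss from summed response log-probabilities. The `beta=0.1` default is an arbitrary illustrative value, not one taken from the source.

```python
import math


def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x)) = log(1 + exp(-x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return neg_log_sigmoid(beta * (chosen_margin - rejected_margin))
```

With off-policy data, `policy_logp_rejected` may be evaluated on text the policy would almost never generate, which is one way to read the distribution-shift argument above.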

Is trajectory simulation necessary? ACT is better aligned with multi-turn conversations because of its trajectory simulation. Without multi-turn simulation, our approach can be viewed as similar to on-policy DPO variants such as IRPO, but with a conversation-specific reward signal that accounts for conversational actions and task heuristics. In “ACT w/ sampling w/o simulation”, we find that trajectory-level simulation is critical to improving multi-turn performance, especially the policy model’s ability to reason about its own clarification questions.
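A rough sketch of what trajectory-level simulation can look like: the policy's reply is appended to the conversation, and if it is a clarification question, a simulated user answers and the policy continues until it commits to an answer; the whole rollout is then scored. `policy`, `simulated_user`, `is_clarification`, and `answer_matches` are hypothetical callables standing in for the models and heuristics, and the binary outcome reward is an assumption made for illustration only.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (speaker, text)


def simulate_trajectory(
    history: List[Turn],
    policy: Callable[[List[Turn]], str],
    simulated_user: Callable[[List[Turn]], str],
    is_clarification: Callable[[str], bool],
    max_turns: int = 3,
) -> List[Turn]:
    """Roll out the policy, letting a simulated user answer its clarifying questions."""
    trajectory = list(history)
    for _ in range(max_turns):
        reply = policy(trajectory)
        trajectory.append(("assistant", reply))
        if not is_clarification(reply):
            break  # the policy committed to an answer; stop the rollout
        trajectory.append(("user", simulated_user(trajectory)))
    return trajectory


def trajectory_reward(trajectory, gold_answer, answer_matches):
    """Score the rollout by whether its last assistant turn is a correct answer."""
    final = next(
        (text for speaker, text in reversed(trajectory) if speaker == "assistant"),
        "",
    )
    return 1.0 if answer_matches(final, gold_answer) else 0.0
```

Scoring the final answer rather than the single clarifying turn is what lets a question be credited only when it actually leads somewhere useful.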

Is ACT model agnostic? The base model in our main experiments, Zephyr, is obtained by aligning Mistral. In “ACT with unaligned foundation models”, we observe a performance gap of 6.5 Action F1 and 4.3 Trajectory F1 between the two models (Zephyr and Mistral) after ACT tuning. However, our results demonstrate that ACT can improve performance regardless of pre-existing alignment with human feedback, although such alignment can help by providing a better model initialization. Overall, we find that ACT's improvements over the base model are model agnostic.
