Language Models Learn to Autonomously Enhance Themselves

Using human preference data, improvement criteria can be learned implicitly rather than hand-distilled into prompts.

In a groundbreaking development, researchers from the University of Illinois and Google have proposed a novel approach called Preference Implicit Training (PIT) that enables large language models (LLMs) to learn self-improvement from human preference data. This method allows LLMs to internalize desirable behaviours through their own internal representations, reducing the reliance on explicit prompts at inference time.

The key insight behind PIT is the utilisation of human feedback signals embedded in training data. Instead of manually distilling criteria into prompts, this implicit information can be leveraged to teach the LLM what constitutes an improvement in quality.
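To make this concrete, here is a minimal, illustrative sketch (not the authors' code) of how pairwise preference data can train a scalar quality signal without any hand-written criteria. The module names, embedding size, and pooled-embedding inputs are assumptions made for the example.

```python
# Minimal sketch: learning a quality signal directly from preference pairs,
# so no improvement criteria need to be written into prompts.
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Toy reward model: maps a pooled response embedding to a scalar score."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, pooled_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(pooled_embedding).squeeze(-1)

def preference_loss(r_good: torch.Tensor, r_bad: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the preferred response to score
    # higher than the dispreferred one; the preference pair itself encodes
    # the dimension of improvement.
    return -torch.nn.functional.logsigmoid(r_good - r_bad).mean()

# Illustrative usage with random embeddings standing in for encoded responses.
reward_model = RewardHead()
good_emb, bad_emb = torch.randn(4, 768), torch.randn(4, 768)
loss = preference_loss(reward_model(good_emb), reward_model(bad_emb))
loss.backward()
```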

PIT employs curriculum reinforcement learning, starting with easy-to-improve references, such as human-labeled bad responses, and then switching to samples drawn from the LLM itself. The training data that indicates human preferences between good and bad responses already provides implicit guidance on the dimension of improvement, allowing the training of a reward model to judge quality gaps without hand-engineering criteria into prompts.
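A rough sketch of that curriculum schedule follows; the function names, data layout, and switch point are invented for illustration and are not taken from the paper.

```python
# Illustrative curriculum: early training pairs each prompt with an easy,
# human-labeled bad response as the reference to improve; later training
# switches to references sampled from the policy itself.
from typing import Callable, Dict, List

def build_curriculum_batch(
    examples: List[Dict[str, str]],            # each: {"prompt", "bad_response"}
    sample_from_policy: Callable[[str], str],  # hypothetical policy sampler
    step: int,
    switch_step: int = 10_000,
) -> List[Dict[str, str]]:
    batch = []
    for ex in examples:
        if step < switch_step:
            # Stage 1: easy-to-improve reference from labeled preference data.
            reference = ex["bad_response"]
        else:
            # Stage 2: harder references drawn from the current model itself.
            reference = sample_from_policy(ex["prompt"])
        batch.append({"prompt": ex["prompt"], "reference": reference})
    return batch

# Example usage with a stand-in policy sampler.
demo = [{"prompt": "Explain RLHF.", "bad_response": "It is a thing."}]
print(build_curriculum_batch(demo, lambda p: "(model sample for: " + p + ")", step=0))
```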

Experimental results validate PIT's capabilities on two real-world dialog datasets and one synthetic instruction-following dataset. Across conditions, PIT improved response quality by 7-34% compared to the original LLM samples as measured by third-party evaluator models. Moreover, PIT significantly outperforms prompting methods, demonstrating its advantages in self-improvement.

The results strongly demonstrate PIT's capabilities for self-improvement and its potential advantages over prompting approaches. This development could be crucial as these models increase in capabilities and are deployed in sensitive real-world applications. PIT provides a way forward to learn nuanced goals like improving helpfulness, harmlessness, and accuracy by tapping into the implicit guidance within training data.

Further human evaluations reveal that PIT outperforms the prompting method Self-Refine. The standard RLHF objective optimises a language model policy to maximise the expected quality of generated responses, while PIT maximises the gap in quality between the original response and an improved response conditioned on having the original as a reference point.
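Written out as formulas, this contrast looks roughly as follows; the notation is a paraphrase chosen here, not the paper's.

```latex
% Standard RLHF: maximize the expected quality of a response y to prompt x.
\[
  J_{\mathrm{RLHF}}(\theta)
    = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
      \big[ r(x, y) \big]
\]
% PIT: condition on a reference response y_ref and maximize the quality gap
% between the improved response y and that reference.
\[
  J_{\mathrm{PIT}}(\theta)
    = \mathbb{E}_{(x,\, y_{\mathrm{ref}}) \sim \mathcal{D},\;
                  y \sim \pi_\theta(\cdot \mid x,\, y_{\mathrm{ref}})}
      \big[ r(x, y) - r(x, y_{\mathrm{ref}}) \big]
\]
```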

The autonomous self-improvement facilitated by PIT will be vital as these models expand access to niche domains or under-served use cases that lack resources for oversight. By reducing reliance on human intervention, PIT opens the door to LLMs that continuously align better with human values as they learn from experience.

Analysis of the impact of sampling temperature during generation finds that lower temperatures, around 0.4-0.6, work best for PIT, while prompting methods need higher sampling diversity to avoid simply re-generating the original response.
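As a purely illustrative example of those settings, the sketch below uses a generic Hugging Face generation call with a placeholder model and temperatures chosen to echo the reported ranges; it is not the authors' evaluation code.

```python
# Illustrative sampling settings: lower temperature for PIT-style improvement,
# higher temperature for prompting-based refinement. Model is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Improve the following answer:\n...", return_tensors="pt")

# PIT-style improvement: low temperature keeps edits focused on the reference.
pit_output = model.generate(**inputs, do_sample=True, temperature=0.5, max_new_tokens=64)

# Prompting-based refinement: more diversity to avoid re-generating the original.
prompt_output = model.generate(**inputs, do_sample=True, temperature=1.0, max_new_tokens=64)

print(tokenizer.decode(pit_output[0], skip_special_tokens=True))
```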

In conclusion, the PIT approach represents a significant step forward in the development of LLMs. By enabling these models to learn self-improvement from human preference data, PIT offers a promising avenue for creating more intelligent and responsive AI systems that can adapt to the nuances of human interaction and improve over time without constant human intervention.

Artificial intelligence systems such as large language models (LLMs) can leverage the Preference Implicit Training (PIT) method to learn self-improvement from human preference data. This approach uses the human feedback signals within training data to teach the LLM what counts as an improvement in quality, reducing the need for explicit prompts.

Experimental results validate PIT's capabilities: it outperforms prompting methods and improves response quality by 7-34%. This innovation could be pivotal for LLMs, as it allows them to continuously align with human values while reducing reliance on human intervention, especially in niche domains or under-served use cases.
