Value learning

Value learning is a research area within artificial intelligence (AI) and AI alignment that focuses on building systems capable of inferring, acquiring, or learning human values, goals, and preferences from data, behavior, and feedback. The aim is to ensure that advanced AI systems act in ways that are beneficial and aligned with human well-being, even in the absence of explicitly programmed instructions.[1][2]

Motivation

The motivation for value learning stems from the observation that humans are often inconsistent about, unaware of, or imprecise in articulating their own values. Hand-coding a complete ethical framework into an AI is considered infeasible due to the complexity of human norms and the unpredictability of future scenarios. Value learning offers a dynamic alternative, allowing an AI to infer and continually refine its understanding of human values from indirect sources such as behavior, approval signals, and comparisons.[3][4]

Key approaches

One central technique is inverse reinforcement learning (IRL), which aims to recover a reward function that explains observed behavior. IRL assumes that the observed agent acts (approximately) optimally and infers the underlying preferences from its choices.[5][6]
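
A minimal sketch of this idea, assuming the common simplification that the reward is linear in hand-designed state features (as in early IRL and apprenticeship-learning work): the learner adjusts reward weights so that its discounted feature expectations move toward those of the demonstrator. The function names and update rule below are illustrative, not a specific published algorithm.

```python
import numpy as np

# Illustrative sketch of feature-matching IRL under a linear reward model
# R(s) = w . phi(s); the names and the simple update rule are hypothetical,
# not the algorithm of any particular paper.

def feature_expectations(trajectories, phi, gamma=0.99):
    """Average discounted feature counts over a set of state trajectories."""
    mu = np.zeros_like(phi(trajectories[0][0]), dtype=float)
    for traj in trajectories:
        for t, state in enumerate(traj):
            mu += (gamma ** t) * phi(state)
    return mu / len(trajectories)

def update_reward_weights(w, expert_mu, learner_mu, lr=0.1):
    """Shift reward weights toward features the demonstrator visits more often."""
    return w + lr * (expert_mu - learner_mu)
```

In practice the learner's own feature expectations are re-estimated by planning with the current reward, and the loop repeats until the two sets of expectations approximately match.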

Cooperative inverse reinforcement learning (CIRL) extends IRL to model the AI and human as cooperative agents with asymmetric information. In CIRL, the AI observes the human to learn their hidden reward function and chooses actions that support mutual success.[7][8]
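
Hadfield-Menell et al. formalize this as a two-player game in which both agents optimize the same reward, parameterized by a value θ that only the human observes. The following is a rough paraphrase of that formulation, written as a math sketch rather than the paper's exact notation:

```latex
% Sketch of the CIRL setting, paraphrasing Hadfield-Menell et al. (2016):
% a two-player game with a shared reward whose parameter only the human knows.
\[
  M = \big\langle S,\ \{A^{H}, A^{R}\},\ T,\ \{\Theta, R\},\ P_0,\ \gamma \big\rangle,
  \qquad R : S \times A^{H} \times A^{R} \times \Theta \to \mathbb{R},
\]
\[
  \text{both agents maximize } \sum_{t} \gamma^{t}\, R\!\left(s_t, a^{H}_t, a^{R}_t; \theta\right)
  \text{, but only the human observes } \theta .
\]
```

Because the robot is rewarded according to the human's hidden θ rather than its own estimate of it, it has an incentive to gather information about the human's preferences before acting.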

Another approach is preference learning, in which humans compare pairs of AI-generated behaviors or outputs and the AI learns which outcomes are preferred. This method underpins applications in training language models and in robotics.[9][10]
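
In the pairwise setup popularized by Christiano et al. (2017), a reward model is trained so that the segment humans prefer tends to receive the higher predicted return, typically via a Bradley–Terry-style cross-entropy loss. The sketch below is a minimal, hypothetical version of that loss, not any particular library's implementation.

```python
import numpy as np

# Minimal sketch of a Bradley-Terry preference loss for reward-model training.
# `return_a` and `return_b` are the summed predicted rewards of two behavior
# segments shown to a human rater; the names are illustrative, not a real API.

def preference_loss(return_a, return_b, human_prefers_a):
    """Cross-entropy loss for one human comparison between segments A and B."""
    p_a = 1.0 / (1.0 + np.exp(return_b - return_a))  # P(A preferred | predicted returns)
    return -np.log(p_a if human_prefers_a else 1.0 - p_a)
```

Minimizing this loss over many comparisons yields a reward model that can then be optimized with standard reinforcement learning.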

Concept alignment

A major challenge in value learning is ensuring that AI systems interpret human behavior using conceptual models similar to those of humans. Recent research distinguishes between "value alignment" and "concept alignment," the latter referring to the internal representations that humans and machines use to describe the world. Misalignment between conceptual models can lead to serious errors even when value-inference mechanisms are accurate.[11]

Challenges

Value learning faces several difficulties:

  • Ambiguity of human behavior – Human actions are noisy, inconsistent, and context-dependent.[12]
  • Reward misspecification – The inferred reward may not fully capture human intent, particularly under imperfect assumptions; a toy illustration follows this list.[13]
  • Scalability – Methods that work in narrow domains often struggle with generalization to more complex or ethical environments.[14]
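
As a toy illustration of the misspecification problem, with entirely made-up numbers: the outcome that maximizes an inferred proxy reward need not be the outcome the true (human) reward favors.

```python
# Toy, hypothetical illustration of reward misspecification: the proxy reward
# credits "looking tidy", so hiding the mess scores highest under the proxy,
# even though the true reward only values the room actually being clean.

true_reward  = {"tidy_room": 1.0, "hide_mess_in_closet": 0.0}
proxy_reward = {"tidy_room": 0.8, "hide_mess_in_closet": 1.0}

best_under_proxy = max(proxy_reward, key=proxy_reward.get)
print(best_under_proxy, "-> true reward:", true_reward[best_under_proxy])
# prints: hide_mess_in_closet -> true reward: 0.0
```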

Hybrid and cultural approaches

Recent work highlights the importance of integrating diverse moral perspectives into value learning. One framework, HAVA (Hybrid Approach to Value Alignment), incorporates explicit (e.g., legal) and implicit (e.g., social norm) values into a unified reward model.[15] Another line of research explores how inverse reinforcement learning can adapt to culturally specific behaviors, as with "culturally-attuned moral machines" trained on different societal norms.[16]
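
One way to picture such a hybrid scheme is a reward that mixes a codified, explicit signal with a norm model learned from data. The sketch below is illustrative only and is not HAVA's actual formulation; the function names and weighting are assumptions.

```python
# Hypothetical sketch of a hybrid reward combining explicit (codified, e.g. legal)
# and implicit (learned social-norm) value signals; names and weights are
# illustrative, not taken from the HAVA paper.

def hybrid_reward(state, explicit_rules, implicit_norm_model, alpha=0.5):
    r_explicit = explicit_rules(state)       # e.g. -1.0 if a codified rule is violated
    r_implicit = implicit_norm_model(state)  # e.g. norm score learned from behavior data
    return alpha * r_explicit + (1.0 - alpha) * r_implicit
```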

Applications

Value learning is being applied in:

  • Robotics – Teaching robots to cooperate with humans in household or industrial tasks.[17]
  • Large language models – Aligning chatbot behavior with user intent using preference feedback and reinforcement learning.[18]
  • Policy decision-making – Informing AI-assisted decisions in governance, healthcare, and safety-critical environments.[19]

References

  1. Russell, Stuart (2019). Human Compatible: Artificial Intelligence and the Problem of Control. Viking.
  2. "Reward Models in Deep Reinforcement Learning: A Survey". June 2025.
  3. Ng, Andrew Y.; Russell, Stuart (2000). "Algorithms for Inverse Reinforcement Learning" (PDF). Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000). Stanford, CA, USA: Morgan Kaufmann. pp. 663–670.
  4. Christiano, Paul F.; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario (2017). "Deep Reinforcement Learning from Human Preferences". Advances in Neural Information Processing Systems 30 (NeurIPS 2017). Curran Associates, Inc. pp. 4299–4307.
  5. Ng, Andrew Y.; Russell, Stuart (May 2000). "Algorithms for Inverse Reinforcement Learning" (PDF). Proceedings of the Seventeenth International Conference on Machine Learning (ICML). Stanford, CA: Morgan Kaufmann. pp. 663–670.
  6. "Advances and applications in inverse reinforcement learning: a comprehensive review". Neural Computing and Applications. 37: 11071–11123. 26 March 2025. doi:10.1007/s00521-025-11100-0. Retrieved 24 June 2025.
  7. Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart (5 December 2016). "Cooperative Inverse Reinforcement Learning". Proceedings of the 30th International Conference on Neural Information Processing Systems (NeurIPS 2016). Curran Associates, Inc. pp. 3916–3924. Retrieved 24 June 2025.
  8. Malik, Asma; et al. (2018). "Efficient Bellman Updates for Cooperative Inverse Reinforcement Learning". AAAI.
  9. Christiano, Paul F.; et al. (2017). "Deep reinforcement learning from human preferences". NeurIPS.
  10. "Reward Models in Deep Reinforcement Learning: A Survey".
  11. Rane, Aditya; et al. (2023). "Concept Alignment as a Prerequisite for Value Alignment". arXiv preprint.
  12. Skalse, Tobias (2025). "Misspecification in IRL". AI Alignment Forum.
  13. Zhou, Ke; Li, Min (2024). "Rethinking IRL: From Data Alignment to Task Alignment".
  14. Cheng, Wei; et al. (2025). "Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment".
  15. Varys, L.; et al. (2025). "HAVA: Hybrid Approach to Value Alignment".
  16. Oliveira, R.; et al. (2023). "Culturally-Attuned Moral Machines".
  17. "Advances and Applications in Inverse Reinforcement Learning". 2025.
  18. "Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment". 2025.
  19. "HAVA: Hybrid Approach to Value Alignment". 2025.
