Mathematical Insights into Temperature Scaling and Hallucination in Large Language Models

Danny Butvinik
Apr 3, 2024 · 13 min read

Part 1 of the Series: Analyzing Hallucinations: From Attention Mechanisms to Error Propagation in LLMs

Hallucination occurs when models generate outputs that neither align with factual information nor follow logical coherence. This issue represents a significant challenge for generative AI and Large Language Models (LLMs), which must navigate the fine line between groundbreaking innovation and potential error.

This study divides hallucinations into factual (intrinsic and extrinsic) and non-factual groups. It also discusses "Silver Lining" hallucinations in more detail and examines the dynamics of temperature scaling and its substantial effect on neural network fidelity.

Silver lining refers to a positive, beneficial, or creatively valuable aspect of otherwise inaccurate or illogical model outputs, known as hallucinations, in LLMs. Unlike typical hallucinations, which are generally seen as errors or flaws due to their deviation from factual accuracy or logical coherence, "silver lining" hallucinations encompass outputs that, while not entirely accurate or logical, offer imaginative, insightful, or novel perspectives that can be advantageous or enriching in certain contexts. These outputs demonstrate the model's ability to generate creative, metaphorical, or thought-provoking content. They can spark new ideas or offer unique viewpoints, adding value even though they do not strictly adhere to reality or logic.

Factual Intrinsic Hallucination: An LLM claims, “The Eiffel Tower is located in Rome.” This reflects a factual error in the model’s internal knowledge representation.

Factual Extrinsic Hallucination: An LLM, drawing on outdated external databases, states, "The current President of the United States is John Doe." This attribution of the presidency to a fictional figure stems from reliance on incorrect external information.

Non-Factual Intrinsic Hallucination: The model constructs a narrative where “a dog enrolls in and attends high school,” an imaginative yet logically incoherent scenario that underscores the model’s internal logic misfiring.

Non-Factual Extrinsic Hallucination: In response to a query about climate change, an LLM unexpectedly shifts the topic, commenting, “Let’s explore the evolution of French cuisine,” illustrating how unrelated external prompts can lead to illogical and irrelevant outputs.

Silver Lining Intrinsic Hallucination: The LLM offers a creative perspective, stating, “Quantum computing is akin to weaving the fabric of the universe,” a metaphorically rich but scientifically inaccurate reflection of its creative reasoning.

Silver Lining Extrinsic Hallucination: An LLM creates a story about "ancient gods traveling the galaxy in their celestial ships," blending accurate astronomical knowledge with an imaginative narrative shaped by varied external inputs.

Figure 1: Spectrum of Hallucination Types in LLMs.

These examples show that hallucinations in LLMs can take many forms. They also lay the groundwork for our study of how temperature scaling shapes this behavior, balancing creativity, accuracy, and error propagation in neural networks.

Note: Formulas, equations, and notations in this article are original works by the author.

The Mathematical Basis of Temperature Scaling

The nuanced mechanism of temperature scaling is a double-edged sword within the architecture of LLMs. It subtly influences gradient flow and error propagation through the softmax function, the crux of decision-making in neural networks. At its core, temperature scaling modifies the softmax function, transforming logits (z) into a probability distribution over possible outputs. Mathematically, the softmax function at temperature T is defined as:

p_i(T) = \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}}

Eq. 1: The temperature-scaled softmax function.

where p_i(T) is the probability of the i-th class, z_i is the logit corresponding to class i, and K is the total number of classes.

Note 1: A logit z is the raw output of the last layer of a neural network before the softmax transformation is applied; the term comes up constantly when discussing softmax functions and temperature scaling in models like LLMs. Logits are the unnormalized predictions a model produces in its final linear layer. They can be read as scores reflecting the model's preference for each class or output before being converted into probabilities.

For a given input x, the logit z_i corresponding to class i can be calculated as:

z_i = w_i \cdot x + b_i

Eq. 2: The formula calculates the logit for class i based on the input x.

where

  • w_i represents the weight vector associated with class i
  • x is the input feature vector
  • b_i is the bias term for class i
  • z_i is the logit for class i

Note 2: In LLMs, K is the vocabulary size; each word or token is treated as a class.
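To make Eqs. 1 and 2 concrete, here is a minimal NumPy sketch; the weight matrix, bias, input vector, and the choice of K = 5 classes are random toy values used purely for illustration, not parameters of any real model.

```python
import numpy as np

def temperature_softmax(logits, T=1.0):
    """Temperature-scaled softmax (Eq. 1): p_i(T) = exp(z_i / T) / sum_j exp(z_j / T)."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Toy logits computed as in Eq. 2: z_i = w_i . x + b_i (random weights, illustrative only)
rng = np.random.default_rng(0)
K, d = 5, 8                           # K classes (tokens), d input features
W, b = rng.normal(size=(K, d)), rng.normal(size=K)
x = rng.normal(size=d)
z = W @ x + b

for T in (0.5, 1.0, 2.0):
    print(f"T={T}: {np.round(temperature_softmax(z, T), 3)}")
```

Lower T sharpens the distribution around the largest logit, while higher T flattens it toward uniform.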

Temperature scaling serves as a dial controlling the model’s exploratory behavior. Higher temperatures promote exploration, enabling the model to generate more diverse and novel outputs. Conversely, lower temperatures bias the model towards exploitation, focusing on the most probable or previously seen patterns. This dynamic is crucial for tailoring model outputs to specific tasks where the need for novelty or accuracy varies.

Temperature variations influence the diversity of model outputs and the propagation of errors within the model’s layers. Higher temperatures, by encouraging exploration of less likely options, can introduce noise and potential errors in the learning process, affecting the model’s ability to learn and replicate patterns accurately. Lower temperatures may reduce these errors but at the cost of model flexibility and adaptability.

Given the temperature-scaled softmax function, let’s compute the partial derivative of p_i(T) with respect to T, which indicates how the probability of selecting class i changes as T varies:

\frac{\partial p_i(T)}{\partial T} = \frac{\partial}{\partial T} \left( \frac{e^{z_i / T}}{\sum_{j=1}^{K} e^{z_j / T}} \right)

Eq. 3: The partial derivative of p_i(T) with respect to T measures how the probability of selecting class i changes as the temperature T varies.

Applying the quotient rule for differentiation, we have:

u(T) = e^{z_i / T}, \qquad v(T) = \sum_{j=1}^{K} e^{z_j / T}, \qquad \frac{\partial p_i(T)}{\partial T} = \frac{u'(T)\, v(T) - u(T)\, v'(T)}{v(T)^2}

Eq. 4: Writing p_i(T) = u(T) / v(T) and applying the quotient rule for differentiation.

Derivative of u(T) with respect to T:

\frac{\partial u(T)}{\partial T} = \frac{\partial}{\partial T} e^{z_i / T}

Eq. 5: To find the rate of change of u(T) with respect to T, we differentiate e^{z_i / T}, where z_i is the logit corresponding to class i.

Using the chain rule, this derivative simplifies to:

\frac{\partial u(T)}{\partial T} = -\frac{z_i}{T^2}\, e^{z_i / T}

Eq. 6: By the chain rule, the derivative of u(T) with respect to T reduces to a negative fraction times the original exponential, where z_i is the logit corresponding to class i.

Derivative of v(T) with respect to T:

\frac{\partial v(T)}{\partial T} = \frac{\partial}{\partial T} \sum_{j=1}^{K} e^{z_j / T}

Eq. 7: To differentiate v(T) with respect to T, we differentiate each exponential term in the sum, where z_j is the logit for class j.

Applying the derivative inside the sum yields:

\frac{\partial v(T)}{\partial T} = -\sum_{j=1}^{K} \frac{z_j}{T^2}\, e^{z_j / T}

Eq. 8: Applying the derivative inside the summation brings down a factor of -z_j / T^2 for each term.

Substituting u'(T) and v'(T) back into the quotient rule (Eq. 4):

\frac{\partial p_i(T)}{\partial T} = \frac{-\frac{z_i}{T^2} e^{z_i / T} \sum_{j=1}^{K} e^{z_j / T} + e^{z_i / T} \sum_{j=1}^{K} \frac{z_j}{T^2} e^{z_j / T}}{\left( \sum_{j=1}^{K} e^{z_j / T} \right)^2}

Eq. 9: The derivative after substituting u'(T) and v'(T) into the quotient rule.

Simplifying, we get:

\frac{\partial p_i(T)}{\partial T} = \frac{p_i(T)}{T^2} \left( \sum_{j=1}^{K} z_j\, p_j(T) - z_i \right)

Eq. 10: After simplification, the derivative with respect to T is the product of p_i(T) / T^2 and a bracketed term comparing the probability-weighted average of the logits with z_i.

This expression shows how the probability of the i-th class changes with temperature: probabilities of tokens whose logits sit above the probability-weighted average fall as T rises, while probabilities of below-average tokens grow. It thus quantifies how temperature affects the sensitivity of the softmax output and, in turn, how model predictions shift.

Using these deductions, we learn more about the mathematics behind temperature scaling, which gives us a precise way to control how neural network models explore and exploit the output space. A quick numerical check of the result is sketched below.
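As a sanity check (not part of the original derivation), the sketch below compares the closed-form derivative in Eq. 10 with a central finite-difference approximation on arbitrary toy logits.

```python
import numpy as np

def softmax_T(z, T):
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

def dp_dT_closed_form(z, T):
    """Eq. 10: dp_i/dT = (p_i(T) / T^2) * (sum_j z_j p_j(T) - z_i)."""
    p = softmax_T(z, T)
    return p / T**2 * (np.dot(z, p) - z)

def dp_dT_finite_diff(z, T, eps=1e-6):
    return (softmax_T(z, T + eps) - softmax_T(z, T - eps)) / (2 * eps)

z = np.array([2.0, 1.0, 0.1, -0.5])   # arbitrary toy logits
T = 1.5
print(np.allclose(dp_dT_closed_form(z, T), dp_dT_finite_diff(z, T), atol=1e-6))  # True
```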

Notation: The quotient rule is a method used in calculus to find the derivative of a function that is the quotient (ratio) of two differentiable functions. It is especially useful when you are dealing with ratios of functions and need to compute the rate of change or slope of that ratio.

Given a function

f(x) = \frac{g(x)}{h(x)}

where both g(x) and h(x) are differentiable functions, the quotient rule states that the derivative of f(x) with respect to x is

f'(x) = \frac{g'(x)\, h(x) - g(x)\, h'(x)}{h(x)^2}
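As an aside, the quotient rule can be confirmed symbolically; the sketch below uses SymPy with an arbitrary pair of differentiable functions chosen purely for illustration.

```python
import sympy as sp

x = sp.symbols('x')
g = sp.sin(x)            # arbitrary differentiable numerator
h = x**2 + 1             # arbitrary differentiable denominator (nonzero everywhere)

lhs = sp.diff(g / h, x)                                    # derivative of the quotient
rhs = (sp.diff(g, x) * h - g * sp.diff(h, x)) / h**2       # quotient-rule formula
print(sp.simplify(lhs - rhs))                              # prints 0
```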

Figure 2: This graph demonstrates how altering the softmax function's temperature parameter (T) influences the probability of selecting a specific class. It highlights the mechanism behind temperature scaling’s role in modulating a model’s exploratory behavior and output diversity.

Negative values of T are not typically considered in the context of temperature scaling for softmax functions. The operation involves dividing by T, and negative temperatures would invert the relationships between logits in a way that does not align with the intended use of temperature scaling: smoothly adjusting between sharpening and softening the probability distribution.

Mathematically, while you can compute values for negative temperatures, they do not have a meaningful interpretation in the standard application of temperature scaling for LLMs or neural networks.

p_1(T) = \frac{e^{z_1 / T}}{e^{z_1 / T} + e^{z_2 / T} + e^{z_3 / T}}

Eq. 11: The formula used in the visualization simplifies the scenario to demonstrate the effect of temperature scaling on a single class's probability in a hypothetical situation with only three classes.

In contrast, the article discusses temperature scaling in a more general context, applicable to any number of classes (K) in LLMs. The visualization aims to provide an intuitive understanding of how temperature affects softmax probabilities rather than precisely modeling a specific LLM scenario.
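The behavior shown in Figure 2 can be reproduced qualitatively with a short sweep over T; the three logit values below are assumptions, since the article does not list the exact numbers behind the plot.

```python
import numpy as np

z = np.array([2.0, 1.0, 0.5])   # hypothetical logits for three classes

for T in (0.25, 0.5, 1.0, 2.0, 5.0):
    p = np.exp(z / T) / np.exp(z / T).sum()
    print(f"T={T:<5} p_1={p[0]:.3f}  p_2={p[1]:.3f}  p_3={p[2]:.3f}")
```

As T falls toward zero, p_1 approaches 1; as T grows, all three probabilities approach 1/3, matching the qualitative shape of the curve in Figure 2.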

Interpreting the Derivative

This derivative shows how the probability of picking any given token i changes as a function of T. It gives us a mathematical way to examine the model's exploratory dynamics.

The interpretation of these derivatives sheds light on a critical aspect of neural network behavior under temperature scaling. As T increases, high z_i values lose some of their grip on the probability distribution, making it "softer"; this mathematically supports the model's tendency to consider less likely options. With smaller effective differences between the logits, the model is more likely to move beyond the most probable outcomes. This encourages exploration, which makes the generated content more creative and varied.

Conversely, the model’s behavior becomes more conservative as T decreases, accentuating the differences among z_i values. It gravitates towards the most likely outcomes with increased confidence. However, this mechanism, pivotal for tasks requiring high precision and reliability, curtails the model’s exploratory behavior, potentially stifling creativity.

In essence, the mathematical elucidation of temperature scaling's effect on the softmax function underscores the profound influence of T on the model's output dynamics. By changing T, you change the model's tendency to consider less likely options. This property makes LLMs more creative and novel, but it must be carefully calibrated to strike a good balance between exploration and exploitation.

We’re figuring out more about how neural networks function and using this math to help LLMs create what we want, whether exploring new ideas or being accurate.

Exploration Increase as T Rises: The negative sign in front of z_i / T² means that, for large z_i (highly probable options), an increase in T incrementally diminishes their relative probability, leveling the playing field for less likely classes. This encourages exploration by softening the dominance of the most probable options.

Sharpening Focus as T Decreases: Conversely, as T decreases, the model’s focus sharpens on the most likely options, reinforcing exploitation. This is because the probability distribution becomes steeper, making high z_i values (more probable options) more dominant.

Balanced Behavior at T = 1: When T is set to 1, the model exhibits a balanced behavior, neither too exploratory nor too exploitative, reflecting the original softmax distribution without temperature scaling.

This mathematical insight into the derivative of p_i(T) with respect to T shows how temperature affects the exploration-exploitation dynamics of LLMs. By strategically adjusting T, practitioners can steer the model towards generating more diverse, creative outputs or prioritizing coherence and factual accuracy, thereby tailoring the model's performance to specific application requirements. The regimes described above are illustrated in the short sketch that follows.
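The three regimes can be seen directly in a toy distribution; the logits in this sketch are arbitrary and chosen only to make the limiting behaviors visible.

```python
import numpy as np

def softmax_T(z, T):
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

z = np.array([3.0, 1.0, 0.0])              # arbitrary toy logits
print(np.round(softmax_T(z, 0.1), 3))      # ~[1, 0, 0]: exploitation, sharp focus
print(np.round(softmax_T(z, 1.0), 3))      # the unscaled softmax: balanced behavior
print(np.round(softmax_T(z, 10.0), 3))     # near-uniform: exploration dominates
```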

Quantifying Hallucination Risks through Entropy

Entropy quantifies the unpredictability or randomness of the model’s predictions in the context of information theory applied to LLM outputs. For a given temperature-scaled softmax output distribution p(T), the entropy H(T) is mathematically defined as:

H(T) = -\sum_{i=1}^{K} p_i(T) \log p_i(T)

Eq. 12: Entropy H(T) quantifies the unpredictability or randomness of the model's predictions within the framework of information theory, particularly when applied to LLM outputs.

where p_i(T) represents the probability of the i-th class (or token) at temperature T, and K is the total number of possible classes (or tokens in the vocabulary).

Temperature’s Impact on Entropy:

  • As T increases (T > 1), the softmax function's output distribution becomes more uniform, effectively increasing H(T). This heightened entropy signifies a broader exploration of the vocabulary space, enhancing diversity but also elevating the risk of incorporating less relevant or incorrect tokens, that is, hallucinations. The sketch after this list illustrates the trend numerically.
  • Conversely, as T decreases (T < 1), the distribution focuses more sharply on a subset of highly probable tokens, reducing H(T). Lower entropy indicates a constrained, more predictable output that may prioritize accuracy but at the potential cost of creative diversity.
  • At T = 1, the entropy reflects the model's intrinsic distribution based on its training and serves as a baseline for comparison.
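A small sketch of Eq. 12, using hypothetical logits over a five-token vocabulary, shows how H(T) rises toward its maximum of log K as T grows and falls toward zero as T shrinks.

```python
import numpy as np

def entropy_T(z, T):
    """H(T) = -sum_i p_i(T) * log p_i(T)  (Eq. 12), in nats."""
    p = np.exp(z / T - np.max(z / T))
    p /= p.sum()
    return -np.sum(p * np.log(p))

z = np.array([4.0, 2.0, 1.0, 0.5, 0.0])    # hypothetical logits over K = 5 tokens
for T in (0.5, 1.0, 2.0, 5.0):
    print(f"T={T:<4} H(T)={entropy_T(z, T):.3f} nats (max = log 5 ~ {np.log(5):.3f})")
```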

Mathematical Exploration of Hallucination Risks

Understanding and managing the risk of hallucination in large language models is crucial for enhancing their reliability and utility across various applications. By establishing a mathematical model that links the entropy of the output distribution—a measure of diversity—to the probability of hallucination, we can begin to quantify and thereby strategically mitigate this risk. This section explores a theoretical framework for this purpose and examines the implications of adjusting temperature (T) as a mechanism for controlling hallucinatory outputs.

Given the entropy formula, we can propose a hypothetical model to link entropy with the probability of hallucination, P_hallucination:

Eq. 13: We propose a hypothetical model to link entropy with the probability of hallucination, where α and β are coefficients.

where α and β are parameters that could be determined through empirical analysis, capturing the sensitivity of hallucination risk to changes in entropy. This model suggests that as entropy increases (implying greater output diversity and less predictability), so does the likelihood of the model generating hallucinations.

Empirical Validation and Calibration:

  • Data Collection: This involves gathering model outputs at different temperature (T) settings and annotating them for instances of hallucinated content. This step is crucial for building a dataset that can be used to analyze how changes in (T) affect the frequency and types of hallucinations.
  • Parameter Estimation: Here, regression analysis is employed to estimate the parameters (α and β) of a model that predicts the probability of hallucination based on entropy. This process fits the model to the observed relationship between entropy levels (which vary with T) and the incidence of hallucinations, providing a quantitative tool for understanding and predicting hallucination risks. A toy version of this fitting step is sketched after this list.
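The article does not specify the functional form behind Eq. 13, so the sketch below assumes a logistic link, P_hallucination = sigmoid(αH + β), purely for illustration. The entropy values, labels, and ground-truth coefficients are synthetic; the point is only to show how α and β could be recovered by regression on annotated data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic "annotated" data: entropy of each sampled output and a 0/1 hallucination label.
rng = np.random.default_rng(42)
H = rng.uniform(0.2, 3.0, size=500)                        # entropy per output (nats)
alpha_true, beta_true = 2.0, -3.0                          # assumed ground-truth coefficients
p_halluc = 1.0 / (1.0 + np.exp(-(alpha_true * H + beta_true)))
y = rng.binomial(1, p_halluc)                              # simulated annotator labels

# Fit P_hallucination = sigmoid(alpha * H + beta) and read off the estimated coefficients.
model = LogisticRegression().fit(H.reshape(-1, 1), y)
print(f"alpha ~ {model.coef_[0, 0]:.2f}, beta ~ {model.intercept_[0]:.2f}")
```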

Integrating Temperature Scaling Strategies for LLM Deployment

Optimizing Temperature for Balanced Model Outputs: Selecting an optimal temperature T is central to achieving the desired balance in model outputs and should be driven by specific deployment goals.

  • Defining Application Goals: Before selecting T, it is essential to define the balance between novelty and accuracy desired for the application. Higher T values suit tasks requiring creativity (like story generation), while lower T values should be used for tasks where accuracy is paramount.
  • Empirical Benchmarking: This involves using quantitative and qualitative measures to assess the impact of different T settings on model performance. Quantitative metrics could include BLEU scores for creative tasks or precision and recall for tasks requiring factual accuracy. Qualitative assessments might involve user surveys or expert reviews, providing a comprehensive view of how well different T settings meet the application's objectives. A schematic benchmarking sweep is sketched after this list.
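A benchmarking sweep might be organized as in the schematic below. The functions generate_samples, accuracy_metric, and diversity_metric are hypothetical stand-ins (the stubs exist only so the sketch runs); a real deployment would plug in its own generation pipeline and metrics such as BLEU or precision and recall.

```python
import random

# --- Hypothetical stand-ins so the sketch runs; replace with a real pipeline and metrics ---
def generate_samples(prompt, temperature):
    # Placeholder for an actual LLM call sampled at the given temperature
    return [f"{prompt} (sampled at T={temperature})"]

def accuracy_metric(outputs):
    # Placeholder score; in practice, e.g., precision/recall against references
    return random.random()

def diversity_metric(outputs):
    # Placeholder score; in practice, e.g., distinct-n or a self-BLEU-based measure
    return random.random()
# -------------------------------------------------------------------------------------------

def select_temperature(prompts, candidate_Ts=(0.3, 0.7, 1.0, 1.3),
                       accuracy_weight=0.7, diversity_weight=0.3):
    """Pick the candidate T that maximizes a weighted blend of accuracy and diversity."""
    best_T, best_score = None, float("-inf")
    for T in candidate_Ts:
        outputs = [s for p in prompts for s in generate_samples(p, temperature=T)]
        score = (accuracy_weight * accuracy_metric(outputs)
                 + diversity_weight * diversity_metric(outputs))
        if score > best_score:
            best_T, best_score = T, score
    return best_T

print(select_temperature(["Summarize the report.", "Write a short poem."]))
```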

Adaptive Temperature Control and User Feedback

Implementing dynamic T adjustment mechanisms enables the model to respond to varying contexts and user interactions, optimizing output diversity and reliability in real time. Machine learning algorithms, particularly reinforcement learning, can automate these adjustments by iteratively refining T based on direct feedback and performance metrics; a minimal heuristic sketch follows the list below.

  • Incorporating User Feedback: Direct user feedback and engagement metrics are invaluable for iteratively improving model outputs. Feedback loops facilitate continuous adaptation, aligning model behavior with evolving user expectations and application demands.
  • Complementary Risk-Reduction Measures: To further lower the risk of hallucinations, temperature scaling should be combined with additional methods, such as post-processing filters or external knowledge validation, to ensure that the generated content remains relevant and correct.
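As one deliberately simple illustration of dynamic adjustment (a heuristic sketch, not the reinforcement-learning approach mentioned above), the controller below nudges T down when feedback flags hallucinations and up when feedback flags dull or repetitive outputs. The thresholds, step size, and bounds are arbitrary assumptions.

```python
def adjust_temperature(T, hallucination_rate, dullness_rate,
                       halluc_threshold=0.05, dull_threshold=0.30,
                       step=0.05, T_min=0.2, T_max=1.5):
    """Heuristic controller: cool down on hallucination feedback, warm up on dull outputs."""
    if hallucination_rate > halluc_threshold:
        T = max(T_min, T - step)          # too many hallucinations: sharpen the distribution
    elif dullness_rate > dull_threshold:
        T = min(T_max, T + step)          # outputs too repetitive: encourage exploration
    return T

# Example: feedback gathered for the latest batch of outputs
T = adjust_temperature(T=1.0, hallucination_rate=0.08, dullness_rate=0.10)
print(T)  # 0.95: temperature lowered in response to hallucination feedback
```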

Conclusion

This article has explored the critical role of temperature scaling in managing the output behavior of large language models (LLMs). By adjusting temperature, we can fine-tune a model’s balance between generating novel content and maintaining coherence and factual accuracy, directly impacting the occurrence of hallucinations. Through mathematical insights and practical strategies, we’ve highlighted how different temperature settings can encourage models to explore more diverse options or exploit known patterns more reliably.

Key takeaways include the importance of empirical benchmarking and user feedback in selecting an optimal temperature that aligns with the specific goals of LLM deployment. Additionally, adaptive temperature control is a vital strategy for dynamically adjusting to real-time feedback, optimizing the model’s performance across various tasks.

Ultimately, understanding and applying temperature scaling allows for more nuanced control over LLMs, ensuring their outputs meet the desired standards of creativity and accuracy and paving the way for their effective use in various applications.

NICE Actimize

Using innovative technology to protect institutions and safeguard consumers' and investors’ assets, NICE Actimize detects and prevents financial crimes and provides regulatory compliance. Artificial Intelligence (AI) and automation in scalable production have seen a significant surge within the financial crime domain, with NICE Actimize playing a pivotal role in driving this advancement. Aligned with its long-term vision of proactively preventing fraud through real-time automation in scalable production, Actimize aims to provide robust analytical capabilities in a time-sensitive manner. NICE Actimize recognizes the potential utilization of GenAI, our latest endeavor in harnessing the power of Generative AI and Large Language Models (LLMs), to address complex challenges, unlocking unique capabilities that complement our commitment to advancing financial crime prevention solutions.

