askvity

What is the PSI function in Python?

Published in Data Analysis 5 mins read

The term "PSI function" in Python is ambiguous: "psi" can also denote the digamma function, which SciPy exposes as scipy.special.psi. In data analysis, though, it most likely refers to the Population Stability Index (PSI), a metric that quantifies how far the distribution of a variable has shifted between two samples (typically an "expected" or "base" sample and an "actual" or "current" sample). PSI is not a built-in Python function; it is usually implemented with libraries such as NumPy and pandas.

Here's a breakdown of what PSI is and how you can calculate it in Python:

Understanding the Population Stability Index (PSI)

PSI is commonly used in risk management, especially in credit scoring, to monitor the stability of model inputs over time. A significant shift in a variable's distribution could indicate changes in the population being modeled, potentially impacting the model's accuracy and reliability. Unstable features can arise due to policy changes, economic shifts (like a recession), or other external factors.

Calculating PSI in Python

Here's a Python example demonstrating how to calculate PSI:

import numpy as np
import pandas as pd

def calculate_psi(expected, actual, buckettype='bins', buckets=10, eps=1e-4):
    """
    Calculates the Population Stability Index (PSI) between two 1-D samples.

    Args:
        expected: A pandas Series or NumPy array holding the expected (base) sample.
        actual: A pandas Series or NumPy array holding the actual (current) sample.
        buckettype: 'bins' for equal-width buckets or 'quantiles' for
            equal-frequency buckets; both are derived from the expected sample.
        buckets: Number of buckets to use.
        eps: Small floor applied to each proportion to avoid log(0) and division by zero.

    Returns:
        The PSI value as a float.
    """
    # Work on flat float arrays and drop missing values.
    expected = np.asarray(expected, dtype=float).ravel()
    actual = np.asarray(actual, dtype=float).ravel()
    expected = expected[~np.isnan(expected)]
    actual = actual[~np.isnan(actual)]

    if buckettype == 'bins':
        # Equal-width edges spanning the range of the expected sample.
        edges = np.linspace(expected.min(), expected.max(), buckets + 1)
    elif buckettype == 'quantiles':
        # Equal-frequency edges; np.unique removes duplicate edges caused by ties.
        edges = np.unique(np.quantile(expected, np.linspace(0, 1, buckets + 1)))
    else:
        raise ValueError('buckettype must be "bins" or "quantiles"')

    # Only the interior edges are needed: values beyond the expected range
    # then land in the first or last bucket instead of being dropped.
    inner_edges = edges[1:-1]

    def bucket_proportions(values):
        idx = np.searchsorted(inner_edges, values, side='right')
        counts = np.bincount(idx, minlength=len(inner_edges) + 1)
        return np.clip(counts / len(values), eps, None)

    expected_props = bucket_proportions(expected)
    actual_props = bucket_proportions(actual)

    return float(np.sum((actual_props - expected_props) * np.log(actual_props / expected_props)))


# Example usage:
expected_distribution = np.random.normal(0, 1, 1000) # Example: normally distributed expected data
actual_distribution = np.random.normal(0.2, 1, 1000)   # Example: slightly shifted actual data

psi = calculate_psi(pd.Series(expected_distribution), pd.Series(actual_distribution), buckets=10)
print(f"PSI Value: {psi}")

Explanation:

  1. Bucketing: The calculate_psi function first divides the expected distribution into a set of buckets (e.g., 10 equal-width bins or quantiles).

  2. Counting: It then counts the number of observations falling into each bucket for both the expected and actual distributions.

  3. Proportions: It calculates the proportion of observations in each bucket for both distributions.

  4. PSI Calculation: The PSI is calculated using the formula:

    PSI = Σ (Actual % - Expected %) * ln(Actual % / Expected %)

    where the summation is over all buckets. A small constant (e.g., 0.0001) might be added to both percentages to avoid division by zero or taking the logarithm of zero.
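
To make the formula concrete, here is a minimal worked sketch using only the standard library. The helper name psi_from_proportions and the three bucket proportions are hypothetical, chosen purely for illustration. Note that this sketch floors each proportion at a small constant rather than adding the constant to both, which is one of several common variants.

```python
import math

def psi_from_proportions(expected_props, actual_props, eps=1e-4):
    """Sum (a - e) * ln(a / e) over buckets, flooring each proportion at eps."""
    total = 0.0
    for e, a in zip(expected_props, actual_props):
        e = max(e, eps)  # guard against division by zero
        a = max(a, eps)  # guard against log(0)
        total += (a - e) * math.log(a / e)
    return total

# Hypothetical proportions for three buckets (each list sums to 1).
expected_props = [0.5, 0.3, 0.2]
actual_props = [0.4, 0.3, 0.3]

psi = psi_from_proportions(expected_props, actual_props)
print(round(psi, 4))  # prints 0.0629
```

The middle bucket contributes nothing because its share did not change; the first and last buckets contribute about 0.0223 and 0.0405 respectively.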

Interpreting PSI Values

PSI values are generally interpreted as follows:

  • PSI < 0.1: No significant change.
  • 0.1 <= PSI < 0.2: Small shift in distribution.
  • PSI >= 0.2: Significant shift in distribution, requiring investigation.

These thresholds are guidelines, and the specific interpretation might vary depending on the context and the specific application.
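
The rule of thumb above can be sketched as a small helper. The function name interpret_psi and its parameter names are hypothetical; the default cut-offs of 0.1 and 0.2 are the guideline values listed above and can be overridden where a different convention applies.

```python
def interpret_psi(psi, minor=0.1, major=0.2):
    """Map a PSI value to the guideline categories (thresholds are adjustable)."""
    if psi < minor:
        return "no significant change"
    if psi < major:
        return "small shift"
    return "significant shift"

for value in (0.05, 0.15, 0.30):
    print(value, "->", interpret_psi(value))
```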

Considerations

  • Number of Buckets: The choice of the number of buckets can influence the PSI value. Too few buckets might mask important shifts, while too many buckets might lead to unstable results. A common starting point is 10-20 buckets.
  • Bucket Type: Using bins creates equal-width buckets, whereas using quantiles creates buckets containing approximately the same number of observations in the expected distribution. Quantiles are generally preferred if the underlying distribution is skewed.
  • Missing Values: Handle missing values appropriately before calculating PSI (e.g., imputation or removal).
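
As one possible way to handle missing values by removal, the sketch below drops NaN entries and also reports what share of each sample was missing, so that a change in the missing rate itself does not go unnoticed. The helper name drop_missing and the sample values are hypothetical, and removal is only one option; imputation may suit some use cases better.

```python
import numpy as np

def drop_missing(values):
    """Return (non-missing values, share of values that were NaN)."""
    arr = np.asarray(values, dtype=float)
    mask = np.isnan(arr)
    return arr[~mask], float(mask.mean())

# Hypothetical samples containing missing values.
expected_raw = [0.1, 0.4, np.nan, 0.7, 0.9]
actual_raw = [0.2, np.nan, np.nan, 0.8, 1.1]

expected_clean, expected_missing = drop_missing(expected_raw)
actual_clean, actual_missing = drop_missing(actual_raw)

print(expected_missing, actual_missing)  # prints 0.2 0.4
```

The cleaned arrays can then be passed to a PSI calculation, while the two missing shares can be compared separately.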

In conclusion, PSI is a useful tool for monitoring the stability of the variables that feed a model, particularly in risk management and credit scoring, because it flags distribution shifts that could degrade model performance. The Python code above shows one practical way to compute it with common data science libraries.
