The correlation ratio (often denoted by eta, η) is a measure of the association between a categorical column and a numeric column. It quantifies how much the mean of the numeric variable varies across the categories of the categorical variable.
Specifically, the correlation ratio:
- Measures Correlation: It specifically assesses the strength of the relationship or association when you have one variable that is categorical (like city, gender, product type) and another that is numeric (like salary, age, sales amount).
- Focuses on Means: It measures how much the per-category means of the numeric column vary, relative to the overall spread of that column. In simpler terms, it looks at how much the average value of the numeric variable changes from one category to another.
Unlike Pearson correlation, which assesses the linear relationship between two numeric variables, the correlation ratio is designed for the case of one categorical and one numeric variable.
Why Use the Correlation Ratio?
- Assessing Group Differences: It helps gauge how much the average value of a metric variable differs across the categories of a nominal variable. For example, how strongly does the average salary differ between employees in different departments? It is a descriptive measure of effect size; judging statistical significance requires a separate test such as one-way ANOVA.
- Feature Selection: In machine learning or data analysis, it can be used to assess the predictive power of a categorical feature for a numeric target variable. A high correlation ratio indicates that the categorical feature is likely a good predictor of the numeric outcome.
- Understanding Variance: Its square, η², is the proportion of the total variance in the numeric variable that is explained by the differences between the group means defined by the categorical variable.
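The "proportion of variance" reading can be written out explicitly. The notation below is an assumption introduced for illustration: y_i are the numeric values, ȳ is their overall mean, and category k has n_k observations with mean ȳ_k.

```latex
\eta^2
  = \frac{\sum_{k} n_k \,(\bar{y}_k - \bar{y})^2}{\sum_{i} (y_i - \bar{y})^2}
  = \frac{SS_{\text{between}}}{SS_{\text{total}}},
\qquad
SS_{\text{total}} = SS_{\text{between}} + SS_{\text{within}},
\qquad
\eta = \sqrt{\eta^2}
```

Because the between-group sum of squares can never exceed the total, η² and η both fall between 0 and 1.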
How It Works (Simplified)
Imagine you have data on employee salaries (numeric) and their department (categorical). The correlation ratio compares:
- Variance within groups: How much salaries vary within each department.
- Variance between groups: How much the average salary varies between different departments.
A high correlation ratio means that the variance between the group means is large compared to the variance within the groups. This suggests a strong association: knowing someone's department tells you a lot about their salary.
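The comparison above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not a reference implementation: the function name `correlation_ratio`, the department labels, and the salary figures are all made up for the example, and returning 0 for a constant numeric column is an assumed convention.

```python
import numpy as np

def correlation_ratio(categories, values):
    """Return eta for a categorical array and a numeric array of equal length."""
    categories = np.asarray(categories)
    values = np.asarray(values, dtype=float)

    grand_mean = values.mean()
    # Total sum of squares: spread of all values around the overall mean.
    ss_total = ((values - grand_mean) ** 2).sum()
    if ss_total == 0:
        # Assumed convention: a constant numeric column carries no association.
        return 0.0

    # Between-group sum of squares: spread of the group means around the
    # overall mean, weighted by group size.
    ss_between = 0.0
    for label in np.unique(categories):
        group = values[categories == label]
        ss_between += group.size * (group.mean() - grand_mean) ** 2

    return float(np.sqrt(ss_between / ss_total))

# Hypothetical data: salaries (in thousands) for two departments.
departments = ["eng", "eng", "eng", "sales", "sales", "sales"]
salaries = [100, 110, 120, 60, 70, 80]

eta = correlation_ratio(departments, salaries)
print(round(eta, 3))  # roughly 0.926
```

Here the department means (110 and 70) sit far apart compared with the spread inside each department, so most of the total variance is between groups and η is close to 1.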
Key Points
- The correlation ratio ranges from 0 to 1.
- A value of 0 indicates no relationship (the mean of the numeric variable is the same in every category).
- A value of 1 indicates a perfect relationship (the numeric variable's value is fixed within each category, although it can differ between categories).
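The two extremes can be checked on toy data. The sketch below uses a compact, hypothetical helper named `eta` (it assumes the numeric column is not constant) and invented values chosen so that the group means are either identical or the only source of variation.

```python
import numpy as np

def eta(categories, values):
    """Compact correlation ratio; assumes the values are not all identical."""
    categories = np.asarray(categories)
    values = np.asarray(values, dtype=float)
    grand_mean = values.mean()
    # Size-weighted squared distance of each group mean from the overall mean.
    ss_between = sum(
        values[categories == c].size
        * (values[categories == c].mean() - grand_mean) ** 2
        for c in np.unique(categories)
    )
    ss_total = ((values - grand_mean) ** 2).sum()
    return float(np.sqrt(ss_between / ss_total))

groups = ["a", "a", "b", "b"]

# Both groups have mean 2, so the categories explain none of the variance.
no_assoc = eta(groups, [1, 3, 1, 3])

# Values are constant within each group, so the categories explain all of it.
perfect = eta(groups, [5, 5, 9, 9])

print(no_assoc, perfect)  # 0.0 1.0
```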
In essence, the correlation ratio provides a single metric for the strength of the association between a nominal group variable and a quantitative variable, based on how much the mean of the quantitative variable differs between groups.