Calculate Cohen's Kappa using either two arrays of ratings or a confusion matrix.
Understanding Cohen's Kappa
💡 Cohen's Kappa (\(\kappa\)) is a statistical measure of inter-rater agreement for qualitative (categorical) items. Because it accounts for the agreement expected by chance, it is a more robust measure than a simple percentage agreement.
Formula for Cohen's Kappa
The formula for Cohen's Kappa is:
\[ \kappa = \frac{P_o - P_e}{1 - P_e} \]
where:
- \(P_o\): Observed agreement, the proportion of cases where both raters agree.
- \(P_e\): Expected agreement by chance, calculated based on the marginal totals of the rating categories.
Key Concepts
- Perfect Agreement (\(\kappa = 1\)): When the raters are in complete agreement.
- No Agreement (\(\kappa = 0\)): When the agreement is equivalent to chance.
- Negative Agreement (\(\kappa < 0\)): When the agreement is worse than chance.
Note: Cohen's Kappa assumes that the ratings are independent and the categories are mutually exclusive.
Interpreting Kappa Values
The following ranges are commonly used to interpret Cohen's Kappa values:
- \( > 0.80 \): Almost perfect agreement
- \( 0.60 - 0.80 \): Substantial agreement
- \( 0.40 - 0.60 \): Moderate agreement
- \( 0.20 - 0.40 \): Fair agreement
- \( < 0.20 \): Poor agreement
Step-by-Step Calculation of Cohen's Kappa
💡 Let's calculate Cohen's Kappa using the two arrays:
- Rater 1: [1, 2, 2, 3, 3, 4, 4, 5]
- Rater 2: [1, 2, 3, 3, 3, 4, 4, 5]
Step 1: Create the Confusion Matrix
The confusion matrix counts how often each pair of ratings occurred. For the given arrays:
| Rater 1 \ Rater 2 | 1 | 2 | 3 | 4 | 5 | Total |
|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 | 0 | 2 |
| 3 | 0 | 0 | 2 | 0 | 0 | 2 |
| 4 | 0 | 0 | 0 | 2 | 0 | 2 |
| 5 | 0 | 0 | 0 | 0 | 1 | 1 |
| Total | 1 | 1 | 3 | 2 | 1 | 8 |
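For reference, the same matrix can be built programmatically. Here is a minimal Python sketch using scikit-learn's confusion_matrix (any equivalent cross-tabulation works just as well):

from sklearn.metrics import confusion_matrix

# Ratings from the example above
rater1 = [1, 2, 2, 3, 3, 4, 4, 5]
rater2 = [1, 2, 3, 3, 3, 4, 4, 5]

# Rows correspond to Rater 1, columns to Rater 2
cm = confusion_matrix(rater1, rater2, labels=[1, 2, 3, 4, 5])
print(cm)
# [[1 0 0 0 0]
#  [0 1 1 0 0]
#  [0 0 2 0 0]
#  [0 0 0 2 0]
#  [0 0 0 0 1]]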
Step 2: Calculate Observed Agreement (\(P_o\))
The observed agreement is the proportion of times both raters agreed. This is the sum of the diagonal elements (agreements) divided by the total number of ratings.
- Diagonal elements: \(1 + 1 + 2 + 2 + 1 = 7\)
- Total ratings: \(8\)
- \[ P_o = \frac{\text{Diagonal Sum}}{\text{Total}} = \frac{7}{8} = 0.875 \]
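As a quick numeric check of this step, here is a minimal Python sketch (assuming NumPy is available) that reproduces \(P_o\) from the confusion matrix in Step 1:

import numpy as np

# Confusion matrix from Step 1 (rows = Rater 1, columns = Rater 2)
cm = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 2, 0, 0],
    [0, 0, 0, 2, 0],
    [0, 0, 0, 0, 1],
])

# Observed agreement: diagonal sum divided by the total number of ratings
p_o = np.trace(cm) / cm.sum()
print(p_o)  # 0.875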
Step 3: Calculate Expected Agreement (\(P_e\))
The expected agreement is calculated from the marginal proportions: for each category, multiply Rater 1's row proportion by Rater 2's column proportion, then sum across categories.
- Category 1: \(P(1) = \frac{1}{8} \times \frac{1}{8} = 0.015625\)
- Category 2: \(P(2) = \frac{2}{8} \times \frac{1}{8} = 0.03125\)
- Category 3: \(P(3) = \frac{2}{8} \times \frac{3}{8} = 0.09375\)
- Category 4: \(P(4) = \frac{2}{8} \times \frac{2}{8} = 0.0625\)
- Category 5: \(P(5) = \frac{1}{8} \times \frac{1}{8} = 0.015625\)
- \[ P_e = 0.015625 + 0.03125 + 0.09375 + 0.0625 + 0.015625 = 0.21875 \]
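The same check works for \(P_e\), using only the marginal totals from the table in Step 1 (a minimal sketch; variable names are illustrative):

import numpy as np

# Marginal proportions from Step 1: row totals (Rater 1) and column totals (Rater 2)
row_props = np.array([1, 2, 2, 2, 1]) / 8
col_props = np.array([1, 1, 3, 2, 1]) / 8

# Expected agreement: sum over categories of (row proportion x column proportion)
p_e = np.sum(row_props * col_props)
print(p_e)  # 0.21875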
Step 4: Calculate Cohen's Kappa
Using the formula: \[ \kappa = \frac{P_o - P_e}{1 - P_e} \] Substituting the values:
\[ \kappa = \frac{0.875 - 0.21875}{1 - 0.21875} = \frac{0.65625}{0.78125} \approx 0.84 \]
Step 5: Interpret the Result
Based on the calculated Kappa value of \(0.84\), we can interpret the agreement as: Almost perfect agreement.
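Putting the pieces together reproduces the hand calculation (values taken from Steps 2 and 3):

# Observed and expected agreement from Steps 2 and 3
p_o = 0.875
p_e = 0.21875

# Cohen's Kappa
kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 2))  # 0.84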
Real-Life Applications
Cohen's Kappa is widely used in various fields to assess inter-rater reliability:
- Healthcare: Assessing the consistency of diagnoses between doctors.
- Education: Measuring agreement in grading assignments or exams.
- Market Research: Evaluating the consistency of customer feedback classifications.
- Psychology: Determining agreement in categorizing behavioral observations.
Factors Affecting Cohen's Kappa
- Prevalence of Categories: Kappa is sensitive to imbalances in category frequencies.
- Number of Categories: More categories generally lower the chance agreement \(P_e\), which can raise Kappa for the same level of observed agreement.
- Marginal Distributions: Unequal distributions of ratings between raters can affect Kappa.
Limitations of Cohen's Kappa
- Prevalence Paradox: High observed agreement can still yield a low (or even negative) Kappa when one category is far more prevalent than the others; see the example after this list.
- Simplistic Assumptions: Assumes raters are equally reliable, which may not always be true.
- Category Independence: Assumes categories are mutually exclusive and exhaustive.
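The prevalence paradox is easiest to see with a small constructed example (the data below are hypothetical, chosen only to make the effect visible): both raters agree on 90% of items, yet Kappa comes out near zero because one category dominates.

from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings for 100 items; category 0 dominates heavily
rater1 = [0] * 95 + [1] * 5
rater2 = [0] * 90 + [1] * 5 + [0] * 5

# The raters agree on 90 of 100 items...
observed = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)
print(f"Observed agreement: {observed:.2f}")

# ...but chance agreement is about 0.905, so Kappa ends up slightly negative
print(f"Cohen's Kappa: {cohen_kappa_score(rater1, rater2):.3f}")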
Reducing Bias in Kappa Calculations
To address potential biases and limitations:
- Ensure balanced categories to reduce the prevalence effect.
- Use weighted Kappa for ordinal data to account for the degree of disagreement, as shown in the sketch after this list.
- Conduct sensitivity analyses to explore the effect of category imbalances.
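For ordinal scales such as the 1–5 ratings in the worked example, a weighted Kappa can be computed with scikit-learn's weights argument. This is a minimal sketch (the unweighted score is shown for comparison; the full implementations follow in the next sections):

from sklearn.metrics import cohen_kappa_score

# Ordinal ratings from the worked example
rater1 = [1, 2, 2, 3, 3, 4, 4, 5]
rater2 = [1, 2, 3, 3, 3, 4, 4, 5]

# Unweighted Kappa treats every disagreement as equally severe
print(cohen_kappa_score(rater1, rater2))

# Linear and quadratic weights penalise near-misses (e.g. 2 vs 3) less than large gaps
print(cohen_kappa_score(rater1, rater2, weights="linear"))
print(cohen_kappa_score(rater1, rater2, weights="quadratic"))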
Python Implementation
from sklearn.metrics import cohen_kappa_score

# Example data: Ratings by two raters
rater1 = [1, 2, 2, 3, 3, 4, 4, 5]
rater2 = [1, 2, 3, 3, 3, 4, 4, 5]

# Calculate Cohen's Kappa
kappa = cohen_kappa_score(rater1, rater2)
print(f"Cohen's Kappa: {kappa:.5f}")

# Interpretation
if kappa > 0.8:
    print("Almost perfect agreement.")
elif kappa > 0.6:
    print("Substantial agreement.")
elif kappa > 0.4:
    print("Moderate agreement.")
elif kappa > 0.2:
    print("Fair agreement.")
else:
    print("Poor agreement.")
import numpy as np

def calculate_kappa_from_matrix(conf_matrix):
    """
    Calculate Cohen's Kappa from a confusion matrix.

    Parameters:
        conf_matrix (numpy.ndarray): Confusion matrix where rows represent Rater 1 categories
                                     and columns represent Rater 2 categories.

    Returns:
        float: Cohen's Kappa score.
    """
    # Total number of observations
    total = np.sum(conf_matrix)

    # Observed agreement (Po): sum of the diagonal over the total
    observed_agreement = np.trace(conf_matrix) / total

    # Expected agreement (Pe): sum of products of the marginal proportions
    row_totals = np.sum(conf_matrix, axis=1) / total
    col_totals = np.sum(conf_matrix, axis=0) / total
    expected_agreement = np.sum(row_totals * col_totals)

    # Calculate kappa
    kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement)
    return kappa

# Example confusion matrix (rows = Rater 1, columns = Rater 2)
confusion_matrix = np.array([
    [1, 0, 0, 0, 0],  # Rater 1: Category 1
    [0, 1, 1, 0, 0],  # Rater 1: Category 2
    [0, 0, 2, 0, 0],  # Rater 1: Category 3
    [0, 0, 0, 2, 0],  # Rater 1: Category 4
    [0, 0, 0, 0, 1]   # Rater 1: Category 5
])

# Calculate Cohen's Kappa
kappa_score = calculate_kappa_from_matrix(confusion_matrix)
print(f"Cohen's Kappa: {kappa_score:.5f}")

# Interpretation
if kappa_score > 0.8:
    print("Almost perfect agreement.")
elif kappa_score > 0.6:
    print("Substantial agreement.")
elif kappa_score > 0.4:
    print("Moderate agreement.")
elif kappa_score > 0.2:
    print("Fair agreement.")
else:
    print("Poor agreement.")
R Implementation
# Function to calculate Cohen's Kappa
calculate_kappa <- function(observed_matrix) {
  # Calculate total observations
  total <- sum(observed_matrix)

  # Calculate row and column marginals (proportions)
  row_totals <- rowSums(observed_matrix) / total
  col_totals <- colSums(observed_matrix) / total

  # Calculate observed agreement
  observed_agreement <- sum(diag(observed_matrix)) / total

  # Calculate expected agreement
  expected_agreement <- sum(row_totals * col_totals)

  # Calculate Cohen's Kappa
  kappa <- (observed_agreement - expected_agreement) / (1 - expected_agreement)
  return(kappa)
}
# Example usage
# Confusion matrix (rows = Rater 1, columns = Rater 2)
observed_matrix <- matrix(c(
  1, 0, 0, 0, 0,  # Rater 1: Category 1
  0, 1, 1, 0, 0,  # Rater 1: Category 2
  0, 0, 2, 0, 0,  # Rater 1: Category 3
  0, 0, 0, 2, 0,  # Rater 1: Category 4
  0, 0, 0, 0, 1   # Rater 1: Category 5
), nrow = 5, byrow = TRUE)
# Calculate Cohen's Kappa
kappa <- calculate_kappa(observed_matrix)
cat(sprintf("Cohen's Kappa: %.5f\n", kappa))
# Interpretation
if (kappa > 0.8) {
  cat("Almost perfect agreement.\n")
} else if (kappa > 0.6) {
  cat("Substantial agreement.\n")
} else if (kappa > 0.4) {
  cat("Moderate agreement.\n")
} else if (kappa > 0.2) {
  cat("Fair agreement.\n")
} else {
  cat("Poor agreement.\n")
}
# Function to calculate Cohen's Kappa from two rater arrays
calculate_kappa <- function(rater1, rater2) {
  # Check that the input arrays are the same length
  if (length(rater1) != length(rater2)) {
    stop("The two arrays must have the same length.")
  }

  # Generate the confusion matrix
  # Note: table() aligns the diagonal correctly when both raters use the same set of categories
  confusion_matrix <- table(rater1, rater2)

  # Total number of observations
  total <- sum(confusion_matrix)

  # Observed agreement (Po)
  observed_agreement <- sum(diag(confusion_matrix)) / total

  # Expected agreement (Pe)
  row_totals <- rowSums(confusion_matrix) / total
  col_totals <- colSums(confusion_matrix) / total
  expected_agreement <- sum(row_totals * col_totals)

  # Calculate Cohen's Kappa
  kappa <- (observed_agreement - expected_agreement) / (1 - expected_agreement)
  return(kappa)
}
# Example data: Ratings by two raters
rater1 <- c(1, 2, 2, 3, 3, 4, 4, 5)
rater2 <- c(1, 2, 3, 3, 3, 4, 4, 5)
# Calculate Cohen's Kappa
kappa_score <- calculate_kappa(rater1, rater2)
cat(sprintf("Cohen's Kappa: %.5f\n", kappa_score))
# Interpretation
if (kappa_score > 0.8) {
  cat("Almost perfect agreement.\n")
} else if (kappa_score > 0.6) {
  cat("Substantial agreement.\n")
} else if (kappa_score > 0.4) {
  cat("Moderate agreement.\n")
} else if (kappa_score > 0.2) {
  cat("Fair agreement.\n")
} else {
  cat("Poor agreement.\n")
}
JavaScript Implementation
function calculateKappa(matrix) {
  const total = matrix.flat().reduce((sum, val) => sum + val, 0);

  // Calculate row and column marginals (proportions)
  const rowTotals = matrix.map(row => row.reduce((sum, val) => sum + val, 0) / total);
  const colTotals = matrix[0].map((_, colIndex) =>
    matrix.reduce((sum, row) => sum + row[colIndex], 0) / total
  );

  // Calculate observed agreement
  const observedAgreement = matrix.reduce(
    (sum, row, rowIndex) => sum + row[rowIndex] / total,
    0
  );

  // Calculate expected agreement
  const expectedAgreement = rowTotals.reduce(
    (sum, rowProp, rowIndex) => sum + rowProp * colTotals[rowIndex],
    0
  );

  // Calculate Cohen's Kappa
  return (observedAgreement - expectedAgreement) / (1 - expectedAgreement);
}

// Example usage
const observedMatrix = [
  [1, 0, 0, 0, 0], // Rater 1: Category 1
  [0, 1, 1, 0, 0], // Rater 1: Category 2
  [0, 0, 2, 0, 0], // Rater 1: Category 3
  [0, 0, 0, 2, 0], // Rater 1: Category 4
  [0, 0, 0, 0, 1]  // Rater 1: Category 5
];

const kappa = calculateKappa(observedMatrix);
console.log(`Cohen's Kappa: ${kappa.toFixed(5)}`);

// Interpretation
if (kappa > 0.8) {
  console.log("Almost perfect agreement.");
} else if (kappa > 0.6) {
  console.log("Substantial agreement.");
} else if (kappa > 0.4) {
  console.log("Moderate agreement.");
} else if (kappa > 0.2) {
  console.log("Fair agreement.");
} else {
  console.log("Poor agreement.");
}
/**
 * Calculate Cohen's Kappa from two arrays of ratings.
 * @param {Array} rater1 - Ratings by Rater 1.
 * @param {Array} rater2 - Ratings by Rater 2.
 * @returns {number} - Cohen's Kappa score.
 */
function calculateKappa(rater1, rater2) {
  if (rater1.length !== rater2.length) {
    throw new Error("The two arrays must have the same length.");
  }

  // Generate the confusion matrix (rows = Rater 1, columns = Rater 2)
  const uniqueLabels = Array.from(new Set(rater1.concat(rater2))).sort();
  const matrix = Array(uniqueLabels.length).fill(0).map(() => Array(uniqueLabels.length).fill(0));
  const labelIndex = Object.fromEntries(uniqueLabels.map((label, i) => [label, i]));
  rater1.forEach((label, i) => {
    matrix[labelIndex[label]][labelIndex[rater2[i]]]++;
  });

  // Total number of observations
  const total = rater1.length;

  // Observed agreement (Po)
  const observedAgreement = matrix.reduce((sum, row, i) => sum + row[i], 0) / total;

  // Expected agreement (Pe)
  const rowTotals = matrix.map(row => row.reduce((sum, val) => sum + val, 0) / total);
  const colTotals = matrix[0].map((_, colIndex) =>
    matrix.reduce((sum, row) => sum + row[colIndex], 0) / total
  );
  const expectedAgreement = rowTotals.reduce(
    (sum, rowProp, i) => sum + rowProp * colTotals[i], 0
  );

  // Calculate Cohen's Kappa
  const kappa = (observedAgreement - expectedAgreement) / (1 - expectedAgreement);
  return kappa;
}

// Example data: Ratings by two raters
const rater1 = [1, 2, 2, 3, 3, 4, 4, 5];
const rater2 = [1, 2, 3, 3, 3, 4, 4, 5];

// Calculate Cohen's Kappa
const kappaScore = calculateKappa(rater1, rater2);
console.log(`Cohen's Kappa: ${kappaScore.toFixed(5)}`);

// Interpretation
if (kappaScore > 0.8) {
  console.log("Almost perfect agreement.");
} else if (kappaScore > 0.6) {
  console.log("Substantial agreement.");
} else if (kappaScore > 0.4) {
  console.log("Moderate agreement.");
} else if (kappaScore > 0.2) {
  console.log("Fair agreement.");
} else {
  console.log("Poor agreement.");
}
Further Reading
- Wikipedia: Cohen's Kappa – A detailed overview of the concept, formula, and examples.
- National Library of Medicine: Interrater reliability: the kappa statistic – Explores Cohen's kappa as a robust measure of interrater reliability, addressing chance agreement and emphasizing its importance in research validity.
- The Research Scientist Pod Calculators – Explore a variety of statistical calculators, including Type II error calculators.
Attribution
If you found this guide helpful, feel free to link back to this post for attribution and share it with others!
Suf is a senior advisor in data science with deep expertise in Natural Language Processing, Complex Networks, and Anomaly Detection. Formerly a postdoctoral research fellow, he applied advanced physics techniques to tackle real-world, data-heavy industry challenges. Before that, he was a particle physicist at the ATLAS Experiment of the Large Hadron Collider. Now, he’s focused on bringing more fun and curiosity to the world of science and research online.