There are two types of supervised learning algorithms: regression and classification. Classification problems require a categorical or discrete response variable (the y variable). If you try to train a classification model imported from scikit-learn on a continuous response variable, you will encounter the error ValueError: Unknown label type: ‘continuous’.
To solve this error, you can encode the continuous y variable into categories using scikit-learn’s preprocessing.LabelEncoder, or, if the problem is really a regression problem, use a regression model suited to the data.
This tutorial will detail the error and how to solve it with code examples.
ValueError: Unknown label type: ‘continuous’
In Python, a value is information stored within a particular object. You will encounter a ValueError in Python when you use a built-in operation or function that receives an argument with the right type but an inappropriate value. In this case, the y variable data has continuous values instead of discrete or categorical values.
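As a quick illustration outside of scikit-learn (a made-up snippet purely for demonstration), calling the built-in int() with a string of the correct type but an unparseable value raises a ValueError:

# The argument has the right type (str), but its value cannot be parsed as an integer
number = int("hello")
# ValueError: invalid literal for int() with base 10: 'hello'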
What Does Continuous Mean?
There are two categories of data:
- Discrete data: categorical data, such as True/False, Pass/Fail, or 0/1, or count data, such as the number of students in a class.
- Continuous data: Data that we can measure on an infinite scale; it can take any value between two numbers, no matter how small. For example, the length of a string can be 1.00245 centimetres.
However, you cannot have 1.5 of a student in a class; count is a discrete measure. Measures of time, height, and temperature are all examples of continuous data.
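As a rough sketch (the arrays below are invented purely for illustration), discrete labels form a small set of repeated categories, whereas continuous measurements can take any real value:

import numpy as np

# Discrete (categorical) labels: a small set of repeated values
y_discrete = np.array([0, 1, 1, 0, 1, 0])
print(np.unique(y_discrete))    # [0 1] -- two categories

# Continuous measurements: values on an infinite scale that rarely repeat
y_continuous = np.array([1.00245, 3.7, 2.158, 0.92, 4.4431, 2.0])
print(np.unique(y_continuous))  # six distinct real-valued measurements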
What is the Difference Between Regression and Classification?
We can classify supervised learning algorithms into two types: Regression and Classification. For regression, the response variable or label is continuous, for example, weight, height, price, or time. In each case, a regression model seeks to predict a continuous quantity.
For classification, the response variable or label is categorical, for example, Pass or Fail, True or False. A classification model seeks to predict a class label.
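As a minimal sketch of this difference in scikit-learn (the data below is invented purely for illustration), a regression estimator accepts a continuous response variable, whereas a classification estimator expects discrete class labels:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4]])

# Regression: the response variable is a continuous quantity
reg = LinearRegression().fit(X, np.array([1.5, 3.1, 4.4, 6.2]))

# Classification: the response variable is a discrete class label
clf = LogisticRegression().fit(X, np.array([0, 0, 1, 1]))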
Example #1: Evaluating the Data
Let’s look at an example of training a Logistic Regression model to perform classification on arrays of integers. First, let’s look at the data. We will import numpy to create our explanatory variable data X and response variable y. Note that the data used here has no real relationship and is for demonstration purposes only.
import numpy as np

# Values for Predictor and Response variables
X = np.array([[2, 4, 1, 7], [3, 5, 9, 1], [5, 7, 1, 2], [7, 4, 2, 8], [4, 2, 3, 8]])
y = np.array([0, 1.02, 1.02, 0, 0])
Next, we will import the LogisticRegression class and create an object of this class, our logistic regression model. We will then fit the model using the values for the predictor and response variables.
from sklearn.linear_model import LogisticRegression

# Attempt to fit Logistic Regression Model
cls = LogisticRegression()
cls.fit(X, y)
Let’s run the code to see what happens:
ValueError: Unknown label type: 'continuous'
The error occurs because logistic regression is a classification algorithm that requires the values of the response variable to be categorical or discrete, such as “Yes” or “No”, “True” or “False”, 0 or 1. In the above code, our response variable contains the continuous value 1.02.
Solution
To solve this error, we can convert the continuous values of the response variable y to categorical values using the LabelEncoder class from sklearn.preprocessing. Let’s look at the revised code:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing

# Values for Predictor and Response variables
X = np.array([[2, 4, 1, 7], [3, 5, 9, 1], [5, 7, 1, 2], [7, 4, 2, 8], [4, 2, 3, 8]])
y = np.array([0, 1.02, 1.02, 0, 0])

# Create label encoder object
labels = preprocessing.LabelEncoder()

# Convert continuous y values to categorical
y_cat = labels.fit_transform(y)
print(y_cat)
[0 1 1 0 0]
We have encoded the original values as 0 or 1. Now, we can fit the logistic regression model and perform a prediction on test data:
# Fit the Logistic Regression model on the encoded labels
cls = LogisticRegression()
cls.fit(X, y_cat)

# Predict on unseen data
X_pred = np.array([5, 6, 9, 1])
X_pred = X_pred.reshape(1, -1)
y_pred = cls.predict(X_pred)
print(y_pred)
Let’s run the code to get the result:
[1]
We successfully fit the model and used it to predict unseen data.
Example #2: Evaluating the Model
Let’s look at an example where we want to train a k-Nearest Neighbours classifier on some data. The data, which we will store in a file called regression_data.csv, looks like this:
Avg.Session Length,TimeonApp,TimeonWebsite,LengthofMembership,Yearly Amount Spent
34.497268,12.655651,39.577668,4.082621,587.951054
31.926272,11.109461,37.268959,2.664034,392.204933
33.000915,11.330278,37.110597,4.104543,487.547505
34.305557,13.717514,36.721283,3.120179,581.852344
33.330673,12.795189,37.536653,4.446308,599.406092
33.871038,12.026925,34.476878,5.493507,637.102448
32.021596,11.366348,36.683776,4.685017,521.572175
Next, we will import the data into a DataFrame. We will define four columns as the explanatory variables and the last column as the response variable. Then, we will split the data into training and test data:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('regression_data.csv')

X = df[['Avg.Session Length', 'TimeonApp', 'TimeonWebsite', 'LengthofMembership']]
y = df['Yearly Amount Spent']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Next, we will define a KNeighborsClassifier model and fit it to the data:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
Let’s run the code to see what happens:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-889312abc571> in <module>
----> 1 knn.fit(X_train,y_train)

~/opt/anaconda3/lib/python3.8/site-packages/sklearn/neighbors/_classification.py in fit(self, X, y)
    196         self.weights = _check_weights(self.weights)
    197 
--> 198         return self._fit(X, y)
    199 
    200     def predict(self, X):

~/opt/anaconda3/lib/python3.8/site-packages/sklearn/neighbors/_base.py in _fit(self, X, y)
    418             self.outputs_2d_ = True
    419 
--> 420         check_classification_targets(y)
    421         self.classes_ = []
    422         self._y = np.empty(y.shape, dtype=int)

~/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
    195         "multilabel-sequences",
    196     ]:
--> 197         raise ValueError("Unknown label type: %r" % y_type)
    198 
    199 

ValueError: Unknown label type: 'continuous'
The error occurs because the k-nearest neighbours classifier is a classification algorithm and therefore requires categorical data for the response variable. The data we provide in the df['Yearly Amount Spent'] series is continuous.
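As a quick optional check (not part of the original example), we can inspect the series to confirm it is continuous:

# Every value is a float and, in this small sample, every value is unique,
# which is characteristic of a continuous response variable
print(df['Yearly Amount Spent'].dtype)      # float64
print(df['Yearly Amount Spent'].nunique())  # 7 distinct values in 7 rows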
Solution
We can interpret this problem as a regression problem, not a classification problem, because the response variable is continuous and it is not intuitive to encode “Yearly Amount Spent” into categories. To solve this error, we need to use the regression algorithm KNeighborsRegressor instead of KNeighborsClassifier. Let’s look at the revised code:
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(X_train, y_train)
Once we have fit the model to the data, we can get our predictions for the test data.
y_pred = knn.predict(X_test)
print(y_pred)
Let’s run the code to see the result:
[599.406092 487.547505 521.572175]
We successfully predicted three “Yearly Amount Spent” values for the test data.
Summary
Congratulations on reading to the end of this tutorial! The ValueError: Unknown label type: ‘continuous’ occurs when you try to use continuous values for the response variable in a classification problem. Classification requires categorical or discrete values of the response variable. To solve this error, you can re-evaluate the response variable data and encode it as categorical values. Alternatively, you can re-evaluate the model and use a regression model instead of a classification model.
Although “regression” is in the name, logistic regression is a classification algorithm that attempts to classify observations from a dataset into discrete categories. To perform logistic regression, ensure the response variable data is categorical.
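If you are unsure which case applies, scikit-learn’s type_of_target helper (in sklearn.utils.multiclass, the same module that raises the error) reports how a response variable will be interpreted. A minimal sketch, reusing the y array from Example #1:

import numpy as np
from sklearn import preprocessing
from sklearn.utils.multiclass import type_of_target

y = np.array([0, 1.02, 1.02, 0, 0])
print(type_of_target(y))       # 'continuous' -- a classifier will reject this

y_cat = preprocessing.LabelEncoder().fit_transform(y)
print(type_of_target(y_cat))   # 'binary' -- suitable for classification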
For further reading on Scikit-learn, go to the article: How to Solve Python ValueError: input contains nan, infinity or a value too large for dtype(‘float64’).
- How to Solve ModuleNotFoundError: No module named ‘sklearn’
- How to Solve Python ModuleNotFoundError: No module named ‘sklearn.datasets.samples_generator’
Go to the online courses page on Python to learn more about coding in Python for data science and machine learning.
Have fun and happy researching!
Suf is a senior advisor in data science with deep expertise in Natural Language Processing, Complex Networks, and Anomaly Detection. Formerly a postdoctoral research fellow, he applied advanced physics techniques to tackle real-world, data-heavy industry challenges. Before that, he was a particle physicist at the ATLAS Experiment of the Large Hadron Collider. Now, he’s focused on bringing more fun and curiosity to the world of science and research online.