There are two types of supervised learning algorithms, regression and classification. Classification problems require categorical or discrete response variables (y variable). If you try to train a scikit-learn imported classification model with a continuous variable, you will encounter the error ValueError: Unknown label type: ‘continuous’.

To solve this error, you can encode the continuous y variable into categories using Scikit-learn’s preprocessing.LabelEncoder or if it is a regression problem use a regression model suitable for the data.

This tutorial will go through the error in detail and how to solve it with code examples.


ValueError: Unknown label type: ‘continuous’

In Python, a value is a piece of information stored within a particular object. You will encounter a ValueError in Python when you use a built-in operation or function that receives an argument with the right type but an inappropriate value. In this case, the y variable data has continuous values instead of discrete or categorical values.

What Does Continuous Mean?

There are two categories of data:

  • Discrete data: categorical data, for example, True/False, Pass/Fail, 0/1 or count data, for example, number of students in a class.
  • Continuous data: Data that we can measure on an infinite scale; it can take any value between two numbers, no matter how small. For example, the length of a string can be 1.00245 centimetres.

However, you cannot have 1.5 of a student in a class; count is a discrete measure. Measures of time, height, and temperature are all examples of continuous data.

What is the Difference Between Regression and Classification?

We can classify supervised learning algorithms into two types: Regression and Classification. For regression, the response variable or label is continuous, for example, weight, height, price, or time. In each case, a regression model seeks to predict a continuous quantity.

For classification, the response variable or label is categorical, for example, Pass or Fail, True or False. A classification model seeks to predict a class label.

Example #1: Evaluating the Data

Let’s look at an example of training a Logistic Regression model to perform classification on arrays of integers. First, let’s look at the data. We will import numpy to create our explanatory variable data X and our response variable data y. Note that the data used here has no real relationship and is only for explaining purposes.

import numpy as np

# Values for Predictor and Response variables
X = np.array([[2, 4, 1, 7], [3, 5, 9, 1], [5, 7, 1, 2], [7, 4, 2, 8], [4, 2, 3, 8]])
y = np.array([0, 1.02, 1.02, 0, 0])

Next, we will import the LogisticRegression class and create an object of this class, our logistic regression model. We will then fit the model using the values for the predictor and response variables.

from sklearn.linear_model import LogisticRegression

# Attempt to fit Logistic Regression Model
cls = LogisticRegression()
cls.fit(X, y)

Let’s run the code to see what happens:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-556cca8758bd> in <module>
      3 # Attempt to fit Logistic Regression Model
      4 cls = LogisticRegression()
----> 5 cls.fit(X, y)

~/opt/anaconda3/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py in fit(self, X, y, sample_weight)
   1514             accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
   1515         )
-> 1516         check_classification_targets(y)
   1517         self.classes_ = np.unique(y)
   1518 

~/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
    195         "multilabel-sequences",
    196     ]:
--> 197         raise ValueError("Unknown label type: %r" % y_type)
    198 
    199 

ValueError: Unknown label type: 'continuous'

The error occurs because logistic regression is a classification problem that requires the values of the response variable to be categorical or discrete such as: “Yes” or “No”, “True” or “False”, 0 or 1. In the above code, our response variable values contain continuous values 1.02.

Solution

To solve this error, we can convert the continuous values of the response variable y to categorical values using the LabelEncoder class under sklearn.preprocessing. Let’s look at the revised code:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing

# Values for Predictor and Response variables
X = np.array([[2, 4, 1, 7], [3, 5, 9, 1], [5, 7, 1, 2], [7, 4, 2, 8], [4, 2, 3, 8]])

y = np.array([0, 1.02, 1.02, 0, 0])

# Create label encoder object
labels = preprocessing.LabelEncoder()

# Convert continous y values to categorical
y_cat = labels.fit_transform(y)

print(y_cat)
[0 1 1 0 0]

We have encoded the original values as 0 or 1. Now, we can fit the logistic regression model and perform a prediction on test data:

# Attempt to fit Logistic Regression Model
cls = LogisticRegression()
cls.fit(X, y_cat)

X_pred = np.array([5, 6, 9, 1])

X_pred = X_pred.reshape(1, -1)

y_pred = cls.predict(X_pred)

print(y_pred)

Let’s run the code to get the result:

[1]

We successfully fit the model and used it to predict unseen data.

Example #2: Evaluating the Model

Let’s look at an example where we want to train a k-Nearest Neighbours classifier to fit on some data. The data, which we will store in a file called regression_data.csv looks like this:

Avg.Session Length,TimeonApp,TimeonWebsite,LengthofMembership,Yearly Amount Spent
34.497268,12.655651,39.577668,4.082621,587.951054
31.926272,11.109461,37.268959,2.664034,392.204933
33.000915,11.330278,37.110597,4.104543,487.547505
34.305557,13.717514,36.721283,3.120179,581.852344
33.330673,12.795189,37.536653,4.446308,599.406092
33.871038,12.026925,34.476878,5.493507,637.102448
32.021596,11.366348,36.683776,4.685017,521.572175

Next, we will import the data into a DataFrame. We will define four columns as the explanatory variables and the last column as the response variable. Then, we will split the data into training and test data:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('regression_data.csv')

X = df[['Avg.Session Length', 'TimeonApp','TimeonWebsite', 'LengthofMembership']]

y = df['Yearly Amount Spent']

 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Next, we will define a KNeighborsClassifier model and fit to the data:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=1)

knn.fit(X_train,y_train)

Let’s run the code to see what happens:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-889312abc571> in <module>
----> 1 knn.fit(X_train,y_train)

~/opt/anaconda3/lib/python3.8/site-packages/sklearn/neighbors/_classification.py in fit(self, X, y)
    196         self.weights = _check_weights(self.weights)
    197 
--> 198         return self._fit(X, y)
    199 
    200     def predict(self, X):

~/opt/anaconda3/lib/python3.8/site-packages/sklearn/neighbors/_base.py in _fit(self, X, y)
    418                     self.outputs_2d_ = True
    419 
--> 420                 check_classification_targets(y)
    421                 self.classes_ = []
    422                 self._y = np.empty(y.shape, dtype=int)

~/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/multiclass.py in check_classification_targets(y)
    195         "multilabel-sequences",
    196     ]:
--> 197         raise ValueError("Unknown label type: %r" % y_type)
    198 
    199 

ValueError: Unknown label type: 'continuous'

The error occurs because the k-nearest neighbors classifier is a classification algorithm and therefore requires categorical data for the response variable. The data we provide in the df['Yearly Amount Spent'] series is continuous.

Solution

We can interpret this problem as a regression problem, not a classification problem because the response variable is continuous and it is not intuitive to encode “Length of membership” into categories. We need to use the regression algorithm KNeighborsRegressor instead of KNeighborsClassifier to solve this error. Let’s look at the revised code:

from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=1)

knn.fit(X_train,y_train)

Once we have fit to the data we can get our predictions with the test data.

y_pred = knn.predict(X_test)
print(y_pred)

Let’s run the code to see the result:

[599.406092 487.547505 521.572175]

We successfully predicted three “Yearly Amount Spent” values for the test data.

Summary

Congratulations on reading to the end of this tutorial! The ValueError: Unknown label type: ‘continuous’ occurs when you try to use continuous values for your response variable in a classification problem. Classification requires categorical or discrete values of the response variable. To solve this error, you can re-evaluate the response variable data and encode it to categorical. Alternatively, you can re-evaluate the model and use a regression model instead of a classification model.

Although “regression” is in the name, logistic regression is a classification algorithm that attempts to classify observations from a dataset into discrete categories. Whenever you want to perform logistic regression, ensure the response variable data is categorical.

For further reading on Scikit-learn, go to the article: How to Solve Python ValueError: input contains nan, infinity or a value too large for dtype(‘float64’).

Go to the online courses page on Python to learn more about coding in Python for data science and machine learning.

Have fun and happy researching!