When using a dataset for analysis, you must check your data to ensure it only contains finite numbers and no NaN values (Not a Number). If you try to pass a dataset that contains NaN or infinity values to a function for analysis, you will raise the error: ValueError: input contains nan, infinity or a value too large for dtype(‘float64’).
To solve this error, you can check your dataset for NaN values using numpy.isnan() and confirm that every value is finite using numpy.isfinite(). You can replace NaN values using numpy.nan_to_num() if your data is in a NumPy array, or with Scikit-Learn's SimpleImputer.
This tutorial will go through the error in detail and how to solve it with the help of code examples.
Python ValueError: input contains nan, infinity or a value too large for dtype(‘float64’)
What is a ValueError?
In Python, a value is the information stored within a particular object. You will encounter a ValueError in Python when you use a built-in operation or function that receives an argument with the right type but an inappropriate value.
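As a small illustration of the "right type, wrong value" idea, the sketch below passes a string to int(). The example strings are made up for this illustration; the second call shows that an unsupported type raises a TypeError instead.

```python
# int() accepts strings, so the type is fine, but this string is not a number.
try:
    int("not a number")
except ValueError as err:
    message = str(err)
    print("ValueError:", message)

# Passing an unsupported type raises TypeError rather than ValueError.
try:
    int([1, 2, 3])
except TypeError as err:
    print("TypeError:", err)
```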
What is a NaN in Python?
In Python, a NaN stands for Not a Number and represents undefined entries and missing values in a dataset.
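The following sketch shows a few properties of NaN that matter when cleaning data. The variable names are chosen for this illustration only.

```python
import math
import numpy as np

nan_value = float("nan")

# NaN is the only float that is not equal to itself.
print(nan_value == nan_value)   # False
print(math.isnan(nan_value))    # True

# NaN propagates through arithmetic.
print(nan_value + 1.0)          # nan

# np.isnan works element-wise on arrays.
values = np.array([1.0, np.nan, 3.0])
nan_mask = np.isnan(values)
print(nan_mask.tolist())        # [False, True, False]
```

Because NaN never compares equal to anything, a test such as values == np.nan will not find it; np.isnan() or math.isnan() is the reliable check.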
What is inf in Python?
Infinity in Python is a value that is greater than every other numeric value (positive infinity) or smaller than every other numeric value (negative infinity). Most arithmetic operations on an infinite value produce another infinite value, but some, such as subtracting infinity from infinity or multiplying infinity by zero, produce NaN. Infinity is a float value; there is no way to represent infinity as an integer. We can use float() to represent infinity as follows:
pos_inf = float('inf')
neg_inf = -float('inf')
print('Positive infinity: ', pos_inf)
print('Negative infinity: ', neg_inf)
Positive infinity:  inf
Negative infinity:  -inf
We can also use the math, decimal, sympy, and numpy modules to represent infinity in Python.
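A short sketch of the math, decimal, and numpy representations mentioned above, together with a few arithmetic results worth knowing. sympy is left out here only to keep the example free of an extra dependency.

```python
import math
from decimal import Decimal
import numpy as np

# Three equivalent ways to get positive infinity as a float.
print(math.inf == float("inf") == np.inf)   # True

# The decimal module has its own infinity value.
dec_inf = Decimal("Infinity")
print(dec_inf > Decimal("1e1000"))          # True

# Adding to or scaling infinity keeps it infinite ...
print(math.inf + 1, math.inf * 2)           # inf inf

# ... but undefined combinations produce NaN, and dividing by infinity gives 0.
inf_minus_inf = math.inf - math.inf
print(inf_minus_inf)                        # nan
print(1 / math.inf)                         # 0.0
```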
Let’s look at some examples where we want to clean our data of NaN and infinity values.
Example #1: Dataset with NaN Values
In this example, we will generate a dataset consisting of random numbers and then randomly populate the dataset with NaN values. We will try to cluster the values in the dataset using the AffinityPropagation in the Scikit-Learn library.
Note: The use of the AffinityPropagation to cluster on random data is just an example to demonstrate the source of the error. The function you are trying to use may be completely different to AffinityPropagation, but the data preprocessing described in this tutorial will still apply.
The data generation looks as follows:
# Import numpy and AffinityPropagation
import numpy as np
from sklearn.cluster import AffinityPropagation
# Number of NaN values to put into data
n = 4
data = np.random.randn(20)
# Get random indices in the data
index_nan = np.random.choice(data.size, n, replace=False)
# Replace data with NaN
data.ravel()[index_nan] = np.nan
print(data)
Let’s look at the data:
[-0.0063374 -0.974195 0.94467842 0.38736788 0.84908087 nan 1.00582645 nan 1.87585201 -0.98264992 -1.64822932 1.24843544 0.88220504 -1.4204208 0.53238027 nan 0.83446561 nan -0.04655628 -1.09054183]
The data consists of twenty random values, four of which are NaN, and the rest are numerical values. Let’s try to fit the data using the AffinityPropagation() class.
af= AffinityPropagation(random_state=5).fit([data])
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
The error is raised because AffinityPropagation.fit() cannot handle NaN, infinity or extremely large values. Our data contains NaN values, and we need to preprocess the data to replace them with suitable values.
Solution #1: using nan_to_num()
To check if a dataset contains NaN values, we can use the isnan() function from NumPy. If we pair this function with any(), we will check if there are any instances of NaN. We can replace the NaN values using the nan_to_num() function, which sets each NaN to zero by default. Let’s look at the code and the clean data:
print(np.any(np.isnan(data)))
data = np.nan_to_num(data)
print(data)
True
[-0.0063374 -0.974195 0.94467842 0.38736788 0.84908087 0. 1.00582645 0. 1.87585201 -0.98264992 -1.64822932 1.24843544 0.88220504 -1.4204208 0.53238027 0. 0.83446561 0. -0.04655628 -1.09054183]
The np.any() part of the code returns True because our dataset contains at least one NaN value. The clean data has zeros in place of the NaN values. Let’s fit on the clean data:
af= AffinityPropagation(random_state=5).fit([data])
This code will execute without any errors.
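Before replacing anything, it can be useful to know how many NaN values there are and where they sit. The sketch below uses a small hand-written array (a stand-in for the random data above, not the tutorial's actual values) to count and locate them.

```python
import numpy as np

# Hypothetical sample array standing in for the tutorial's random data.
sample = np.array([0.5, np.nan, -1.2, np.nan, 2.0])

nan_mask = np.isnan(sample)
nan_count = int(nan_mask.sum())
nan_positions = np.flatnonzero(nan_mask)

print(nan_count)                # 2
print(nan_positions.tolist())   # [1, 3]

# nan_to_num returns a new array by default; the original is left unchanged.
cleaned = np.nan_to_num(sample)
print(cleaned.tolist())         # [0.5, 0.0, -1.2, 0.0, 2.0]
```

Replacing NaN with zero is a modelling choice, so it is worth checking whether zero is a sensible stand-in for your data before fitting.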
Solution #2: using SimpleImputer
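Scikit-Learn provides a class for imputation called SimpleImputer. We can use the SimpleImputer to replace NaN values. Because we pass the one-dimensional array to the imputer as a single row ([data]), each value is treated as its own column, and a column whose only entry is NaN has no mean to impute from. We therefore set the strategy parameter to constant and supply a fill_value. First, we will generate the data: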
import numpy as np
n = 4
data = np.random.randn(20)
index_nan = np.random.choice(data.size, n, replace=False)
data.ravel()[index_nan] = np.nan
print(data)
The data looks like this:
[ 1.4325319 0.61439789 0.3614522 1.38531346 nan 0.6900916 0.50743745 0.48544145 nan nan 0.17253557 nan -1.05027802 0.09648188 1.15971533 0.29005307 2.35040023 0.44103513 -0.03235852 -0.78142219]
We can use the SimpleImputer class to fit and transform the data as follows:
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
imputer = imp_mean.fit([data])
data = imputer.transform([data])
print(data)
The clean data looks like this:
[[ 1.4325319 0.61439789 0.3614522 1.38531346 0. 0.6900916 0.50743745 0.48544145 0. 0. 0.17253557 0. -1.05027802 0.09648188 1.15971533 0.29005307 2.35040023 0.44103513 -0.03235852 -0.78142219]]
And we can pass the clean data to the AffinityPropagation clustering method as follows:
af= AffinityPropagation(random_state=5).fit(data)
We can also use the SimpleImputer class on multi-dimensional data to replace NaN values using the mean along each column. We have to set the imputation strategy to “mean”, and using the mean is only valid for numeric data. Let’s look at an example of a 3×3 nested list that contains NaN values:
from sklearn.impute import SimpleImputer
data = [[7, 2, np.nan], [4, np.nan, 6], [10, 5, 9]]
We can replace the NaN values as follows:
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(data)
data = imp_mean.transform(data)
print(data)
[[ 7.   2.   7.5]
 [ 4.   3.5  6. ]
 [10.   5.   9. ]]
We replaced the np.nan values with the mean of the real numbers along the columns of the nested list. For example, in the third column, the real numbers are 6 and 9, so the mean is 7.5, which replaces the np.nan value in the third column.
We can also use the other imputation strategies, median and most_frequent.
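A minimal sketch of the two other strategies, assuming scikit-learn is installed. The matrix below is made up for this illustration so that the median and the most frequent value differ in each column with a gap.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative matrix (values chosen for this sketch, not from the tutorial).
data = [[1, 2, np.nan],
        [3, np.nan, 1],
        [5, 2, 1],
        [7, 6, 4],
        [9, 8, 10]]

median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
median_filled = median_imputer.fit_transform(data)

mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
mode_filled = mode_imputer.fit_transform(data)

# Second column holds 2, 2, 6, 8: median 4.0, most frequent 2.0.
print(median_filled[1][1], mode_filled[1][1])
# Third column holds 1, 1, 4, 10: median 2.5, most frequent 1.0.
print(median_filled[0][2], mode_filled[0][2])
```

The median is less sensitive to outliers than the mean, while most_frequent can also be applied to categorical columns.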
Example #2: Dataset with NaN and inf Values
This example will generate a dataset consisting of random numbers and then randomly populate the dataset with NaN and infinity values. We will try to cluster the values in the dataset using the AffinityPropagation in the Scikit-Learn library. The data generation looks as follows:
import numpy as np
from sklearn.cluster import AffinityPropagation
n = 4
data = np.random.randn(20)
index_nan = np.random.choice(data.size, n, replace=False)
index_inf = np.random.choice(data.size, n, replace=False)
data.ravel()[index_nan] = np.nan
data.ravel()[index_inf] = np.inf
print(data)
[-0.76148741 inf 0.10339756 nan inf -0.75013509 1.2740893 nan -1.68682986 nan 0.57540185 -2.0435754 0.99287213 inf 0.5838198 inf -0.62896815 -0.45368201 0.49864775 -1.08881703]
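The data consists of twenty values: random numbers with NaN and infinity entries mixed in. Because the two sets of indices are drawn independently, they can overlap, and an infinity assignment overwrites a NaN at the same position; in the output above there are three NaN values and four infinity values. Let’s try to fit the data using the AffinityPropagation() class.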
af= AffinityPropagation(random_state=5).fit([data])
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
We raise the error because the dataset contains NaN values and infinity values.
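Since the error message does not say which kind of value caused it, a quick count of each kind can help when choosing a fix. This is a sketch on a small hand-written array, not the tutorial's random data.

```python
import numpy as np

# Hypothetical sample with both kinds of problem values.
sample = np.array([1.0, np.inf, np.nan, -np.inf, 0.3, np.nan])

nan_count = int(np.isnan(sample).sum())
inf_count = int(np.isinf(sample).sum())        # counts +inf and -inf
finite_count = int(np.isfinite(sample).sum())  # neither NaN nor infinite

print(nan_count, inf_count, finite_count)   # 2 2 2
```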
Solution #1: Using nan_to_num
To check if a dataset contains NaN values, we can use the isnan() function from NumPy. If we pair this function with any(), we will check if there are any instances of NaN.
To check that a dataset contains only finite values, we can use the isfinite() function from NumPy, which returns False for NaN, positive infinity and negative infinity. If we pair this function with all(), the result is True only when every value is finite.
We can replace the NaN and infinity values using the nan_to_num() function. By default, the function sets NaN values to zero and infinity values to a very large number. Let’s look at the code and the clean data:
print(np.any(np.isnan(data)))
print(np.all(np.isfinite(data)))
data = np.nan_to_num(data)
print(data)
True
False
[-7.61487414e-001 1.79769313e+308 1.03397556e-001 0.00000000e+000 1.79769313e+308 -7.50135085e-001 1.27408930e+000 0.00000000e+000 -1.68682986e+000 0.00000000e+000 5.75401847e-001 -2.04357540e+000 9.92872128e-001 1.79769313e+308 5.83819800e-001 1.79769313e+308 -6.28968155e-001 -4.53682014e-001 4.98647752e-001 -1.08881703e+000]
We replaced the NaN values with zeroes and the infinity values with 1.79769313e+308. We can fit on the clean data as follows:
af= AffinityPropagation(random_state=5).fit([data])
This code will execute without any errors. If we do not want to replace infinity with a very large number but with zero, we can convert the infinity values to NaN before calling nan_to_num() (use np.isinf(data) as the mask to catch negative infinity as well):
data[data==np.inf] = np.nan
And then pass the data to the nan_to_num() function, which converts all the NaN values to zeroes.
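As an alternative sketch, assuming NumPy 1.17 or later, nan_to_num() accepts nan, posinf and neginf keyword arguments, so the replacement values can be set in one call. The sample array is made up for this illustration. The first part also shows that the default "very large number" is the largest representable float64.

```python
import numpy as np

sample = np.array([1.5, np.nan, np.inf, -np.inf])

# Default behaviour: NaN -> 0.0, +inf -> largest float64, -inf -> most negative float64.
default_clean = np.nan_to_num(sample)
print(default_clean[2] == np.finfo(np.float64).max)   # True
print(default_clean[3] == np.finfo(np.float64).min)   # True

# Explicit replacement values (keyword arguments added in NumPy 1.17).
zero_clean = np.nan_to_num(sample, nan=0.0, posinf=0.0, neginf=0.0)
print(zero_clean.tolist())   # [1.5, 0.0, 0.0, 0.0]
```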
Solution #2: Using fillna()
We can use Pandas to convert our dataset to a DataFrame and replace the NaN and infinity values using the Pandas fillna() method. First, let’s look at the data generation:
import numpy as np
import pandas as pd
from sklearn.cluster import AffinityPropagation
n = 4
data = np.random.randn(20)
index_nan = np.random.choice(data.size, n, replace=False)
index_inf = np.random.choice(data.size, n, replace=False)
data.ravel()[index_nan] = np.nan
data.ravel()[index_inf] = np.inf
print(data)
[ 0.41339801 inf nan 0.7854321 0.23319745 nan 0.50342482 inf -0.82102161 -0.81934623 0.23176869 -0.61882322 0.12434801 -0.21218049 inf -1.54067848 nan 1.78086445 inf 0.4881174 ]
The data consists of twenty values: in the output above, three are NaN, four are infinity (one infinity index overlapped a NaN index), and the rest are numerical values. We can convert the numpy array to a DataFrame as follows:
df = pd.DataFrame(data)
Once we have the DataFrame, we can use the replace() method to replace the infinity values with NaN values. Then, we will call the fillna() method to replace all NaN values in the DataFrame.
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df = df.fillna(0)
We can use the to_numpy() method to convert the DataFrame back to a numpy array as follows:
data = df.to_numpy()
print(data)
[[ 0.41339801] [ 0. ] [ 0. ] [ 0.7854321 ] [ 0.23319745] [ 0. ] [ 0.50342482] [ 0. ] [-0.82102161] [-0.81934623] [ 0.23176869] [-0.61882322] [ 0.12434801] [-0.21218049] [ 0. ] [-1.54067848] [ 0. ] [ 1.78086445] [ 0. ] [ 0.4881174 ]]
We can now fit on the clean data using the AffinityPropagation class as follows:
af = AffinityPropagation(random_state=5).fit(data)
print(af.cluster_centers_)
The clustering algorithm gives us the following cluster centres:
[[ 0. ] [ 0.50342482] [-0.81934623] [-1.54067848] [ 1.78086445]]
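Note that the array coming back from the DataFrame is a column of shape (20, 1), so fit(data) treats each value as a separate sample with one feature, whereas fit([data]) in the earlier solutions treated the whole array as a single sample with twenty features. The sketch below shows the two layouts on a short stand-in array.

```python
import numpy as np

values = np.arange(6, dtype=float)    # stand-in for the cleaned 1-D data

one_sample = values.reshape(1, -1)    # same layout as passing [values]
many_samples = values.reshape(-1, 1)  # same layout as pd.DataFrame(values).to_numpy()

print(one_sample.shape)     # (1, 6)
print(many_samples.shape)   # (6, 1)
```

Scikit-Learn estimators expect input of shape (n_samples, n_features), which is why only the column layout yields several cluster centres here.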
We can also use Pandas to drop rows (or, with axis=1, columns) that contain NaN values using the dropna() method. For further reading on using Pandas for data preprocessing, go to the article: Introduction to Pandas: A Complete Tutorial for Beginners.
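A minimal sketch of dropna() on a small DataFrame. The column names and values are made up for this illustration. Infinity is converted to NaN first, because dropna() only treats NaN (and None) as missing.

```python
import numpy as np
import pandas as pd

# Illustrative frame; column names are hypothetical.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [4.0, 5.0, np.inf],
                   "c": [7.0, 8.0, 9.0]})

# Treat infinities as missing so that dropna() can see them.
df = df.replace([np.inf, -np.inf], np.nan)

rows_kept = df.dropna()         # default axis=0 drops rows with any NaN
cols_kept = df.dropna(axis=1)   # axis=1 drops columns with any NaN

print(rows_kept.shape)              # (1, 3)
print(cols_kept.columns.tolist())   # ['c']
```

Dropping is the simplest option but discards data, so imputation is often preferable when missing values are common.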
Solution #3: using SimpleImputer
Let’s look at an example of using the SimpleImputer to replace NaN and infinity values. First, we will look at the data generation:
import numpy as np
n = 4
data = np.random.randn(20)
index_nan = np.random.choice(data.size, n, replace=False)
index_inf = np.random.choice(data.size, n, replace=False)
data.ravel()[index_nan] = np.nan
data.ravel()[index_inf] = np.inf
print(data)
[-0.5318616 nan 0.12842066 inf inf nan 1.24679674 0.09636847 0.67969774 1.2029146 nan 0.60090616 -0.46642723 nan 1.58596659 0.47893738 1.52861316 inf -1.36273437 inf]
The data consists of twenty random values, four of which are NaN, four are infinity, and the rest are numerical values. Let’s try to use the SimpleImputer to clean our data:
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
imputer = imp_mean.fit([data])
data = imputer.transform([data])
print(data)
ValueError: Input contains infinity or a value too large for dtype('float64').
The error is raised because SimpleImputer does not support infinite values. To solve this error, you can replace the np.inf values with np.nan as follows:
data[data==np.inf] = np.nan
imp_mean = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
imputer = imp_mean.fit([data])
data = imputer.transform([data])
print(data)
With all infinity values replaced with NaN values, we can use the SimpleImputer to transform the data. Let’s look at the clean dataset:
[[-0.5318616 0. 0.12842066 0. 0. 0. 1.24679674 0.09636847 0.67969774 1.2029146 0. 0.60090616 -0.46642723 0. 1.58596659 0.47893738 1.52861316 0. -1.36273437 0. ]]
Consider the case where we have multi-dimensional data with NaN and infinity values, and we want to use the SimpleImputer. In that case, we can replace the infinite values with NaN by using the Pandas replace() method as follows:
from sklearn.impute import SimpleImputer
data = [[7, 2, np.nan], [4, np.nan, 6], [10, 5, np.inf]]
df = pd.DataFrame(data)
df.replace([np.inf, -np.inf], np.nan, inplace=True)
data = df.to_numpy()
Then we can use the SimpleImputer to fit and transform the data. In this case, we will replace the missing values with the mean along the column where each NaN value occurs.
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(data)
data = imp_mean.transform(data)
print(data)
The clean data looks like this:
[[ 7.   2.   6. ]
 [ 4.   3.5  6. ]
 [10.   5.   6. ]]
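For comparison, here is a NumPy-only sketch that reaches the same result on the same 3×3 data without Pandas or SimpleImputer. It is an alternative technique, not what SimpleImputer does internally: it masks every non-finite value with np.isfinite() and fills each gap with its column mean from np.nanmean().

```python
import numpy as np

data = np.array([[7, 2, np.nan],
                 [4, np.nan, 6],
                 [10, 5, np.inf]], dtype=float)

# Mark every non-finite entry (NaN, +inf, -inf) as missing.
data[~np.isfinite(data)] = np.nan

# Column means that ignore NaN, then fill each gap with its column's mean.
column_means = np.nanmean(data, axis=0)
rows, cols = np.where(np.isnan(data))
data[rows, cols] = column_means[cols]

print(data.tolist())
# [[7.0, 2.0, 6.0], [4.0, 3.5, 6.0], [10.0, 5.0, 6.0]]
```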
Summary
Congratulations on reading to the end of this tutorial! If you pass a NaN or an infinite value to a function, you may raise the error: ValueError: input contains nan, infinity or a value too large for dtype(‘float64’). This commonly occurs as a result of not preprocessing data before analysis. To solve this error, check your data for NaN and inf values and either remove them or replace them with real numbers.
SimpleImputer can only replace NaN values. If you pass it data containing infinity values, it will raise the ValueError. Ensure that you convert all positive and negative infinity values to NaN before using the SimpleImputer.
For further reading on ValueErrors, go to the article: How to Solve Python ValueError: I/O operation on closed file.
For further reading on Scikit-learn, go to the article: How to Solve Sklearn ValueError: Unknown label type: ‘continuous’.
Go to the online courses page on Python to learn more about coding in Python for data science and machine learning.
Have fun and happy researching!
Suf is a senior advisor in data science with deep expertise in Natural Language Processing, Complex Networks, and Anomaly Detection. Formerly a postdoctoral research fellow, he applied advanced physics techniques to tackle real-world, data-heavy industry challenges. Before that, he was a particle physicist at the ATLAS Experiment of the Large Hadron Collider. Now, he’s focused on bringing more fun and curiosity to the world of science and research online.