How to Solve ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data)


This error occurs when you try to fit a regression model to data but do not convert categorical variables to dummy variables before fitting.

You can solve the error by using the pandas function get_dummies() to convert the categorical variable to a dummy variable. For example,

import statsmodels.api as sm
import numpy as np
import pandas as pd

# Create DataFrame

df = pd.DataFrame({'tree': ['X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y', 'Y'],
                   'age': [118, 484, 664, 1004, 1231, 118, 484, 664, 1582, 200],
                   'circumference': [30, 58, 87, 115, 142, 111, 156, 172, 203, 230],
                   'rings': [10, 12, 18, 20, 25, 19, 20, 23, 28, 29]
                   })

# Convert 'tree' to dummy variable

df = pd.get_dummies(df, columns=['tree'], drop_first=True)

# View DataFrame

print(df)

This tutorial will go through how to solve the error with code examples.

Example

Let’s look at an example to reproduce the error. First, we will create a pandas DataFrame containing tree data.

import statsmodels.api as sm
import numpy as np
import pandas as pd

# Create DataFrame containing tree data

df = pd.DataFrame({'tree': ['X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y', 'Y'],
                   'age': [118, 484, 664, 1004, 1231, 118, 484, 664, 1582, 200],
                   'circumference': [30, 58, 87, 115, 142, 111, 156, 172, 203, 230],
                   'rings': [10, 12, 18, 20, 25, 19, 20, 23, 28, 29]
                   })

# Print DataFrame to console

print(df)
  tree   age  circumference  rings
0    X   118             30     10
1    X   484             58     12
2    X   664             87     18
3    X  1004            115     20
4    X  1231            142     25
5    Y   118            111     19
6    Y   484            156     20
7    Y   664            172     23
8    Y  1582            203     28
9    Y   200            230     29

Next, we will try to fit a multiple linear regression model using the tree type, circumference, and the number of rings as the predictor variables and age as the response variable.

# Define predictor variables

x = df[['tree', 'circumference','rings']]

# Define response variable

y = df['age']

# Add constant to predictor variables

x = sm.add_constant(x)

# Attempt to fit multiple linear regression model

model = sm.OLS(y, x).fit()

Let’s run the code to see what happens:

ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).

The error occurs because the predictor variable is categorical, and we need to convert it to a dummy variable prior to fitting the regression model.

The ValueError is telling us explicitly that the dtype of the values in the ‘tree‘ column is object, the dtype pandas uses for columns holding arbitrary Python objects such as strings. The dtype needs to be numeric before statsmodels can cast the data to a NumPy array.
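You can confirm which column is responsible by checking the dtypes before fitting, as the error message hints. Here is a minimal sketch using a small two-column frame:

```python
import pandas as pd

df = pd.DataFrame({'tree': ['X', 'X', 'Y', 'Y'],
                   'age': [118, 484, 118, 484]})

# The 'tree' column holds strings, so pandas stores it with dtype object
print(df.dtypes)
```

Any column listed as object here must be encoded numerically before being passed to sm.OLS.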

Solution

We can solve the error by converting the ‘tree’ variable to a dummy variable using the pandas.get_dummies() function. Let’s look at the revised code:

import statsmodels.api as sm
import numpy as np
import pandas as pd

# Create DataFrame

df = pd.DataFrame({'tree': ['X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y', 'Y'],
                   'age': [118, 484, 664, 1004, 1231, 118, 484, 664, 1582, 200],
                   'circumference': [30, 58, 87, 115, 142, 111, 156, 172, 203, 230],
                   'rings': [10, 12, 18, 20, 25, 19, 20, 23, 28, 29]
                   })

# Convert 'tree' to dummy variable

df = pd.get_dummies(df, columns=['tree'], drop_first=True)

# Print df to console

print(df)
    age  circumference  rings  tree_Y
0   118             30     10       0
1   484             58     12       0
2   664             87     18       0
3  1004            115     20       0
4  1231            142     25       0
5   118            111     19       1
6   484            156     20       1
7   664            172     23       1
8  1582            203     28       1
9   200            230     29       1

The tree column has been replaced by tree_Y, which is 1 where the tree is 'Y' and 0 where it is 'X'. Setting drop_first=True drops the redundant tree_X column, since its value is fully determined by tree_Y.
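Note that in pandas 2.0 and later, get_dummies() returns boolean columns (True/False) by default rather than 0/1 integers. If you prefer explicit integer dummies, you can pass the dtype argument, as in this sketch:

```python
import pandas as pd

df = pd.DataFrame({'tree': ['X', 'X', 'Y', 'Y'],
                   'age': [118, 484, 118, 484]})

# Request integer dummy columns explicitly via the dtype argument
df = pd.get_dummies(df, columns=['tree'], drop_first=True, dtype=int)

print(df)
```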

Next, we will fit the multiple linear regression model using the ‘tree_Y‘ variable.

# Define predictor variables

x = df[['tree_Y', 'circumference','rings']]

# Define response variable

y = df['age']

# Add constant to predictor variables

x = sm.add_constant(x)

# Fit multiple linear regression model

model = sm.OLS(y, x).fit()

# View model summary

print(model.summary())

Let’s run the code to see the result:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    age   R-squared:                       0.473
Model:                            OLS   Adj. R-squared:                  0.209
Method:                 Least Squares   F-statistic:                     1.794
Date:                Sun, 07 Aug 2022   Prob (F-statistic):              0.248
Time:                        16:59:02   Log-Likelihood:                -72.393
No. Observations:                  10   AIC:                             152.8
Df Residuals:                       6   BIC:                             154.0
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const          -605.9168    802.842     -0.755      0.479   -2570.401    1358.568
tree_Y         -400.4425    527.041     -0.760      0.476   -1690.066     889.181
circumference    -3.9784     12.130     -0.328      0.754     -33.660      25.703
rings            97.0499    101.534      0.956      0.376    -151.395     345.494
==============================================================================
Omnibus:                        1.840   Durbin-Watson:                   2.385
Prob(Omnibus):                  0.399   Jarque-Bera (JB):                0.115
Skew:                          -0.088   Prob(JB):                        0.944
Kurtosis:                       3.495   Cond. No.                         902.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We successfully fit the regression model without error.
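As an aside, the statsmodels formula API can build the dummy encoding for you: wrapping a column name in C() tells the formula machinery to treat it as categorical, so the original string column works without a manual get_dummies() step. A minimal sketch with the same data:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'tree': ['X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y', 'Y'],
                   'age': [118, 484, 664, 1004, 1231, 118, 484, 664, 1582, 200],
                   'circumference': [30, 58, 87, 115, 142, 111, 156, 172, 203, 230],
                   'rings': [10, 12, 18, 20, 25, 19, 20, 23, 28, 29]})

# C(tree) marks the column as categorical; the formula API creates
# the dummy variable and the intercept automatically
model = smf.ols('age ~ C(tree) + circumference + rings', data=df).fit()

print(model.params)
```

This fits the same model as the get_dummies() approach above; the categorical term appears in the output as C(tree)[T.Y].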

Summary

Congratulations on reading to the end of this tutorial!

To recap: the ValueError occurs when statsmodels cannot cast a pandas column with dtype object, such as a column of strings, to a numeric NumPy array. Convert categorical columns to dummy variables with pandas.get_dummies() before fitting the regression model.

Have fun and happy researching!


Suf is a senior advisor in data science with deep expertise in Natural Language Processing, Complex Networks, and Anomaly Detection. Formerly a postdoctoral research fellow, he applied advanced physics techniques to tackle real-world, data-heavy industry challenges. Before that, he was a particle physicist at the ATLAS Experiment of the Large Hadron Collider. Now, he’s focused on bringing more fun and curiosity to the world of science and research online.
