This error occurs when you try to fit a regression model to data but do not convert categorical variables to dummy variables before fitting.
You can solve the error by using the pandas function get_dummies() to convert the categorical variable to a dummy variable. For example:

    import statsmodels.api as sm
    import numpy as np
    import pandas as pd

    # Create DataFrame
    df = pd.DataFrame({
        'tree': ['X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y', 'Y'],
        'age': [118, 484, 664, 1004, 1231, 118, 484, 664, 1582, 200],
        'circumference': [30, 58, 87, 115, 142, 111, 156, 172, 203, 230],
        'rings': [10, 12, 18, 20, 25, 19, 20, 23, 28, 29]
    })

    # Convert 'tree' to a 0/1 dummy variable
    # (dtype=int keeps the column numeric on pandas versions that default to bool)
    df = pd.get_dummies(df, columns=['tree'], drop_first=True, dtype=int)

    # View DataFrame
    print(df)
This tutorial will go through how to solve the error with code examples.
Example
Let’s look at an example to reproduce the error. First, we will create a pandas DataFrame containing tree data.
    import statsmodels.api as sm
    import numpy as np
    import pandas as pd

    # Create DataFrame containing tree data
    df = pd.DataFrame({
        'tree': ['X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y', 'Y'],
        'age': [118, 484, 664, 1004, 1231, 118, 484, 664, 1582, 200],
        'circumference': [30, 58, 87, 115, 142, 111, 156, 172, 203, 230],
        'rings': [10, 12, 18, 20, 25, 19, 20, 23, 28, 29]
    })

    # Print DataFrame to console
    print(df)

       tree   age  circumference  rings
    0     X   118             30     10
    1     X   484             58     12
    2     X   664             87     18
    3     X  1004            115     20
    4     X  1231            142     25
    5     Y   118            111     19
    6     Y   484            156     20
    7     Y   664            172     23
    8     Y  1582            203     28
    9     Y   200            230     29
Next, we will try to fit a multiple linear regression model using the tree type, circumference, and the number of rings as the predictor variables and age as the response variable.
    # Define predictor variables ('tree' still holds the strings 'X' and 'Y')
    x = df[['tree', 'circumference', 'rings']]

    # Define response variable
    y = df['age']

    # Add constant to predictor variables
    x = sm.add_constant(x)

    # Attempt to fit multiple linear regression model
    model = sm.OLS(y, x).fit()
Let’s run the code to see what happens:
ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).
The error occurs because the predictor variable is categorical, and we need to convert it to a dummy variable prior to fitting the regression model.
The ValueError tells us that the predictor data was cast to a NumPy array with dtype object. This happens because the 'tree' column holds Python str values, which pandas stores with dtype object, so the combined array cannot be numeric. Every predictor column needs a numeric dtype.
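The check suggested by the error message can be run directly as a quick diagnostic. The sketch below uses a small stand-in DataFrame rather than the full tree data, and the name non_numeric is chosen here purely for illustration. It lists the columns that are not numeric and shows the dtype NumPy falls back to when strings and numbers are mixed.

```python
import numpy as np
import pandas as pd

# Small stand-in for the predictor columns used in this tutorial (illustrative data)
x = pd.DataFrame({
    'tree': ['X', 'X', 'Y', 'Y'],
    'circumference': [30, 58, 111, 156],
    'rings': [10, 12, 19, 20],
})

# Columns that are not numeric are the ones that need encoding
non_numeric = x.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)  # ['tree']

# Mixing a string column with numeric columns forces NumPy to use dtype object
print(np.asarray(x).dtype)  # object
```

Any column that appears in the list is a candidate for dummy encoding before the model is fit.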
Solution
We can solve the error by converting the 'tree' variable to a dummy variable using the pandas.get_dummies() function. Let’s look at the revised code:
    import statsmodels.api as sm
    import numpy as np
    import pandas as pd

    # Create DataFrame
    df = pd.DataFrame({
        'tree': ['X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y', 'Y'],
        'age': [118, 484, 664, 1004, 1231, 118, 484, 664, 1582, 200],
        'circumference': [30, 58, 87, 115, 142, 111, 156, 172, 203, 230],
        'rings': [10, 12, 18, 20, 25, 19, 20, 23, 28, 29]
    })

    # Convert 'tree' to a 0/1 dummy variable
    # (dtype=int keeps the column numeric on pandas versions that default to bool)
    df = pd.get_dummies(df, columns=['tree'], drop_first=True, dtype=int)

    # Print df to console
    print(df)

        age  circumference  rings  tree_Y
    0   118             30     10       0
    1   484             58     12       0
    2   664             87     18       0
    3  1004            115     20       0
    4  1231            142     25       0
    5   118            111     19       1
    6   484            156     20       1
    7   664            172     23       1
    8  1582            203     28       1
    9   200            230     29       1
The tree column is now tree_Y, and the values 'X' and 'Y' are now 0 and 1: a value of 1 means the tree is of type Y, and a value of 0 means it is of type X.
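To make the encoding easier to see, here is a minimal sketch on a four-row stand-in column (the names trees, full, and reduced are illustrative, not from the tutorial's data). It compares the result with and without drop_first=True. It also passes dtype=int, since pandas 2.0 and later return boolean dummy columns by default, and a boolean column mixed with integer columns can trigger the same object-dtype error when the model is fit.

```python
import pandas as pd

# Illustrative stand-in for the categorical column
trees = pd.DataFrame({'tree': ['X', 'X', 'Y', 'Y']})

# Without drop_first, each category gets its own indicator column
full = pd.get_dummies(trees, columns=['tree'], dtype=int)
print(full.columns.tolist())  # ['tree_X', 'tree_Y']

# With drop_first=True, the first category ('X') becomes the baseline
reduced = pd.get_dummies(trees, columns=['tree'], drop_first=True, dtype=int)
print(reduced['tree_Y'].tolist())  # [0, 0, 1, 1]
```

Dropping the first category leaves a single indicator column. Because the model already includes a constant, keeping both tree_X and tree_Y would make the predictors perfectly collinear.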
Next, we will fit the multiple linear regression model using the 'tree_Y' variable.

    # Define predictor variables
    x = df[['tree_Y', 'circumference', 'rings']]

    # Define response variable
    y = df['age']

    # Add constant to predictor variables
    x = sm.add_constant(x)

    # Fit multiple linear regression model
    model = sm.OLS(y, x).fit()

    # View model summary
    print(model.summary())
Let’s run the code to see the result:
                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                    age   R-squared:                       0.473
    Model:                            OLS   Adj. R-squared:                  0.209
    Method:                 Least Squares   F-statistic:                     1.794
    Date:                Sun, 07 Aug 2022   Prob (F-statistic):              0.248
    Time:                        16:59:02   Log-Likelihood:                -72.393
    No. Observations:                  10   AIC:                             152.8
    Df Residuals:                       6   BIC:                             154.0
    Df Model:                           3
    Covariance Type:            nonrobust
    =================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
    ---------------------------------------------------------------------------------
    const          -605.9168    802.842     -0.755      0.479   -2570.401    1358.568
    tree_Y         -400.4425    527.041     -0.760      0.476   -1690.066     889.181
    circumference    -3.9784     12.130     -0.328      0.754     -33.660      25.703
    rings            97.0499    101.534      0.956      0.376    -151.395     345.494
    ==============================================================================
    Omnibus:                        1.840   Durbin-Watson:                   2.385
    Prob(Omnibus):                  0.399   Jarque-Bera (JB):                0.115
    Skew:                          -0.088   Prob(JB):                        0.944
    Kurtosis:                       3.495   Cond. No.                         902.
    ==============================================================================

    Notes:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We successfully fit the regression model without error.
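The statsmodels formula interface offers an alternative that is not part of the walkthrough above: it can encode the categorical column itself. The sketch below assumes statsmodels.formula.api is available in your installation and wraps 'tree' in C() to mark it as categorical. The DataFrame keeps the original string values, so no call to get_dummies() is needed.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Same tree data as above, with 'tree' left as strings
df = pd.DataFrame({
    'tree': ['X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y', 'Y', 'Y'],
    'age': [118, 484, 664, 1004, 1231, 118, 484, 664, 1582, 200],
    'circumference': [30, 58, 87, 115, 142, 111, 156, 172, 203, 230],
    'rings': [10, 12, 18, 20, 25, 19, 20, 23, 28, 29]
})

# C(tree) tells the formula parser to treat 'tree' as categorical;
# an intercept is added automatically
model = smf.ols('age ~ C(tree) + circumference + rings', data=df).fit()

# One coefficient each for the intercept, the tree indicator, circumference, and rings
print(model.params)
```

This approach keeps the encoding inside the model specification, which can be convenient when a dataset has several categorical predictors.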
Summary
Congratulations on reading to the end of this tutorial!
For further reading on errors involving Pandas, go to the articles:
- How to Solve Pandas TypeError: empty ‘dataframe’ no numeric data to plot.
- How to Solve Python ValueError: Can only compare identically-labeled DataFrame objects
- How to Solve Python ValueError: Cannot mask with non-boolean array containing NA/NaN values
Have fun and happy researching!
Suf is a senior advisor in data science with deep expertise in Natural Language Processing, Complex Networks, and Anomaly Detection. Formerly a postdoctoral research fellow, he applied advanced physics techniques to tackle real-world, data-heavy industry challenges. Before that, he was a particle physicist at the ATLAS Experiment of the Large Hadron Collider. Now, he’s focused on bringing more fun and curiosity to the world of science and research online.