Apr-19-2020, 06:12 PM
Hey guys! First of all, let me say that I am completely new to this. I am working on my capstone and have been trying to learn Python, but things are going downhill, haha. I need to train a model to create a demand forecast based on previous sales. I am using Spyder (via Anaconda) and I am getting an error that I have no idea how to fix.
The error is: "ValueError: Found array with 0 sample(s) (shape=(0, 4)) while a minimum of 1 is required."
It seems that the error happens in the "# SEGUNDO TREINO DE ERRO" ("second error training") part. In that part I need to "train" the model to decrease the RMSLE.
Here is my code:
# IMPORT LIBRARIES
import pandas as pd
import numpy as np
from IPython import get_ipython
ipy = get_ipython()
if ipy is not None:
    ipy.run_line_magic('matplotlib', 'inline')
from sklearn.metrics import mean_squared_log_error
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor

# IMPORT FILE
data = pd.read_csv(r"C:\Users\Marcella\Documents\FEI\9 ciclo\TCC1\Banco de dados\Empresa Leo\SKU_csv2.csv", sep=';')
df = pd.DataFrame(data)

# CREATE "PERIOD" COLUMN FROM "YEAR" AND "MONTH"
data["Period"] = data["Year"].astype(str) + "-" + data["Month"].astype(str)
# We use the datetime formatting to make sure format is consistent
data["Period"] = pd.to_datetime(data["Period"]).dt.strftime("%Y-%m")
data3 = data.filter(regex=r'Code|Timeline|Quantity')
data3.head()

# REORDER THE TABLE (SORT BY CODE)
df = pd.DataFrame(data3)
dfOrdenado = df.sort_values(by='Code', ascending=True)
dfOrdenado.head()

# VOLUME DIFFERENCE BETWEEN CURRENT AND PREVIOUS TIMELINE (CURRENT MONTH - PREVIOUS MONTH)
data2 = dfOrdenado.copy()
data2['Last_Month_Quantity'] = data2.groupby(['Code'])['Quantity'].shift(-1)
data2['Last_Month_Diff'] = data2.groupby(['Code'])['Last_Month_Quantity'].diff()
data2 = data2.dropna()
data2.head()

# PRIMEIRO TREINO DE ERRO (first error training)
def rmsle(ytrue, ypred):
    return np.sqrt(mean_squared_log_error(ytrue, ypred))

mean_error = []
for Timeline in range(1, 36):
    train = data2[data2['Timeline'] < Timeline]
    val = data2[data2['Timeline'] == Timeline]
    p = val['Last_Month_Quantity'].values
    error = rmsle(val['Quantity'].values, p)
    print('Timeline %d - Error %.5f' % (Timeline, error))
    mean_error.append(error)
print('Mean Error = %.5f' % np.mean(mean_error))

# ERROR HISTOGRAM
data2['Quantity'].hist(bins=20, figsize=(10, 5))

# SEGUNDO TREINO DE ERRO (second error training)
mean_error = []
for Timeline in range(1, 36):
    train = data2[data2['Timeline'] < Timeline]
    val = data2[data2['Timeline'] == Timeline]
    xtr, xts = train.drop(['Quantity'], axis=1), val.drop(['Quantity'], axis=1)
    ytr, yts = train['Quantity'].values, val['Quantity'].values
    mdl = RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=0)
    mdl.fit(xtr, ytr)
    p = mdl.predict(xts)
    error = rmsle(yts, p)
    print('Timeline %d - Error %.5f' % (Timeline, error))
    mean_error.append(error)
print('Mean Error = %.5f' % np.mean(mean_error))

And here is the output:
IPython 7.12.0 -- An enhanced Interactive Python.
Timeline 1 - Error 2.70350
Timeline 2 - Error 1.61701
Timeline 3 - Error 3.18454
Timeline 4 - Error 2.40659
Timeline 5 - Error 1.45284
Timeline 6 - Error 0.69815
Timeline 7 - Error 1.02462
Timeline 8 - Error 1.93734
Timeline 9 - Error 0.48172
Timeline 10 - Error 1.87422
Timeline 11 - Error 2.91395
Timeline 12 - Error 2.15465
Timeline 13 - Error 2.24474
Timeline 14 - Error 1.58562
Timeline 15 - Error 1.24788
Timeline 16 - Error 0.20848
Timeline 17 - Error 0.72884
Timeline 18 - Error 0.10210
Timeline 19 - Error 0.55287
Timeline 20 - Error 2.73459
Timeline 21 - Error 1.87676
Timeline 22 - Error 3.05041
Timeline 23 - Error 0.97720
Timeline 24 - Error 1.62730
Timeline 25 - Error 1.85567
Timeline 26 - Error 2.42298
Timeline 27 - Error 0.91488
Timeline 28 - Error 0.88662
Timeline 29 - Error 2.16283
Timeline 30 - Error 1.81922
Timeline 31 - Error 1.46269
Timeline 32 - Error 0.53905
Timeline 33 - Error 0.27669
Timeline 34 - Error 1.87140
Timeline 35 - Error 1.87198
Mean Error = 1.58486
Traceback (most recent call last):
  File "<ipython-input-1-587546307fe9>", line 70, in <module>
    mdl.fit(xtr, ytr)
  File "C:\Users\Marcella\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py", line 295, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)
  File "C:\Users\Marcella\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 586, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 4)) while a minimum of 1 is required.

Could anyone help me with this? Thank you so much in advance!
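
One possible reading of the traceback, offered as a guess rather than a confirmed diagnosis: on the first pass of the second loop, `data2['Timeline'] < Timeline` with `Timeline = 1` would select no rows if the smallest Timeline value in the data is 1 (an assumption, since the CSV is not shown). That would leave a training set with 0 rows and 4 feature columns, which matches `shape=(0, 4)`. The sketch below uses plain Python lists instead of the real DataFrame; the function name `walk_forward_splits` and the sample values are hypothetical, and it only illustrates skipping periods that have nothing to train or validate on.

```python
def walk_forward_splits(timelines, first, last):
    """Yield (period, train_idx, val_idx) for each period in first..last.

    train_idx holds the positions of rows from earlier periods and
    val_idx the positions of rows from the current period. Periods
    where either list would be empty are skipped.
    """
    for period in range(first, last + 1):
        train_idx = [i for i, t in enumerate(timelines) if t < period]
        val_idx = [i for i, t in enumerate(timelines) if t == period]
        if not train_idx or not val_idx:
            # Nothing to fit on (or nothing to score), so skip this period
            continue
        yield period, train_idx, val_idx


# Hypothetical Timeline column: two rows per period, periods 1 to 3
timelines = [1, 1, 2, 2, 3, 3]
for period, train_idx, val_idx in walk_forward_splits(timelines, 1, 3):
    print(period, train_idx, val_idx)
```

In the original loop the same idea would be a guard such as `if train.empty or val.empty: continue` placed before `mdl.fit(xtr, ytr)`, or starting the range at the second period. Whether either is appropriate depends on the data.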
