Towards Better Keras Modeling – Part VII

The Alpha Scientist demonstrates the Relative Importance of Features. See the previous installment in this series to learn about Multivariate Effects.

Relative Importance of Features

With so many findings, where do we start? I’ll run a quick random forest regression model and test the relative significance of each hyperparameter in overall model performance:

In [70]:

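import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler

# Features: the swept hyperparameters; target: validation-loss improvement scaled to [0, 1]
X = df[['first_neuron', 'hidden_neuron', 'hidden_layers', 'dropout']]
scaler = MinMaxScaler()
y = scaler.fit_transform(df[['val_loss_improvement']]).ravel()  # 1-D target avoids a shape warning

# Shallow forest: enough to rank the hyperparameters by impurity-based importance
reg = RandomForestRegressor(max_depth=3, n_estimators=100)
reg.fit(X, y)
pd.Series(reg.feature_importances_, index=X.columns).\
    sort_values(ascending=True).plot.barh(color='grey', title='Feature Importance of Hyperparameters')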

Out [70]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f4231951710>

It appears that the number of hidden layers is by far the most important factor, followed by the size of the hidden layers. Of course, if we drop to zero hidden layers, then the size of the first layer becomes supremely important.
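As a sanity check on what this kind of ranking measures, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the rows are randomly generated rather than taken from the sweep, and the target is built so that `hidden_layers` drives it almost entirely. If the forest is behaving as expected, it should rank that column first.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for sweep results: one row per trained model (not the real data).
X = pd.DataFrame({
    "first_neuron": rng.choice([10, 20, 40, 80], size=n),
    "hidden_neuron": rng.choice([10, 20, 40, 80], size=n),
    "hidden_layers": rng.choice([0, 1, 2, 4], size=n),
    "dropout": rng.choice([0.0, 0.25, 0.5], size=n),
})

# By construction, the target depends strongly on hidden_layers and weakly on dropout.
y = -1.0 * X["hidden_layers"] + 0.2 * X["dropout"] + rng.normal(0, 0.1, size=n)

reg = RandomForestRegressor(max_depth=3, n_estimators=100, random_state=0)
reg.fit(X, y)

# Impurity-based importances are relative: they are normalized to sum to 1.
importances = pd.Series(reg.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Because the importances are normalized, they only say how the forest split its attention among the columns it was given, not how large any effect is in absolute terms.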

For the next hyperparameter sweep, I’ll focus on larger layer sizes – and fewer layers.

In [2]:

## Experiment 2:
import talos as ta
from keras.models import Sequential
from keras.layers import Dropout, Dense
from keras.callbacks import TensorBoard
from talos.model.early_stopper import early_stopper

# track performance on tensorboard
tensorboard = TensorBoard(log_dir='./logs',
                          histogram_freq=0, batch_size=10000,
                          write_graph=False,
                          write_images=False)

# (1) Define dict of parameters to try
p = {'first_neuron': [100, 200, 400, 800, 1600, 3200],
     'hidden_neuron': [100, 200, 400, 800],
     'hidden_layers': [0, 1],
     'batch_size': [10000],
     'kernel_initializer': ['uniform'],  # 'normal'
     'epochs': [100],  # increased in case larger dimensions take longer to train
     'dropout': [0.0, 0.25],
     'optimizer': ['adam'],  # assumption: missing from the source dict, but the model reads params['optimizer']
     'last_activation': ['sigmoid']}

# (2) create a function which constructs a compiled keras model object
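def numerai_model(x_train, y_train, x_val, y_val, params):
    print(params)

    model = Sequential()

    ## initial layer
    # NOTE: the start of each model.add(...) call is missing from the source;
    # the Dense/Dropout openings below are reconstructed from the params in use.
    model.add(Dense(params['first_neuron'], input_dim=x_train.shape[1],
                    activation='relu',
                    kernel_initializer=params['kernel_initializer']))
    model.add(Dropout(params['dropout']))  # placement assumed

    ## hidden layers
    for i in range(params['hidden_layers']):
        model.add(Dense(params['hidden_neuron'], activation='relu',
                        kernel_initializer=params['kernel_initializer']))
        model.add(Dropout(params['dropout']))  # placement assumed

    ## final layer
    model.add(Dense(1, activation=params['last_activation'],
                    kernel_initializer=params['kernel_initializer']))

    model.compile(loss='binary_crossentropy',
                  optimizer=params['optimizer'],
                  metrics=['acc'])

    history = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        batch_size=params['batch_size'],
                        epochs=params['epochs'],
                        callbacks=[tensorboard, early_stopper(params['epochs'], patience=10)],  # ,ta.live(),
                        verbose=0)
    return history, model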

# (3) Run a "Scan" using the params and function created above

t = ta.Scan(x=X_train.values,
            y=y_train.values,
            model=numerai_model,
            params=p,
            grid_downsample=1.00,
            dataset_name='numerai_example',
            experiment_no='2')
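Before launching a sweep like this, it can help to count how many models the grid implies. The sketch below uses only the standard library and the swept values from the dictionary above. The `architecture_key` helper is hypothetical, and it assumes that `hidden_neuron` is only used inside the hidden-layer loop, so it has no effect when `hidden_layers` is 0.

```python
from itertools import product

# Swept values from the parameter dictionary above.
# Single-valued entries such as batch_size do not change the count.
p = {
    'first_neuron': [100, 200, 400, 800, 1600, 3200],
    'hidden_neuron': [100, 200, 400, 800],
    'hidden_layers': [0, 1],
    'batch_size': [10000],
    'kernel_initializer': ['uniform'],
    'epochs': [100],
    'dropout': [0.0, 0.25],
    'last_activation': ['sigmoid'],
}

# Full Cartesian product of all parameter values.
keys = list(p)
combos = [dict(zip(keys, values)) for values in product(*(p[k] for k in keys))]
print(len(combos))  # 6 * 4 * 2 * 2 = 96


def architecture_key(c):
    """Hypothetical helper: ignore hidden_neuron when there are no hidden layers."""
    hidden = c['hidden_neuron'] if c['hidden_layers'] > 0 else None
    return (c['first_neuron'], c['hidden_layers'], hidden, c['dropout'])


# Combinations that would build a genuinely different network.
distinct = {architecture_key(c) for c in combos}
print(len(distinct))  # 12 with no hidden layer + 48 with one = 60
```

Under that assumption, 36 of the 96 grid points repeat an architecture that is already covered, which is worth knowing when each run trains for up to 100 epochs.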

There we have it. Is this optimal? Almost certainly not. But I now have a much better understanding of how the model performs at various geometries – and have spent relatively little time performing plug-and-chug parameter tweaking.

At this point, I’ll build and train a single model using the parameter values that proved most successful in the hyperparameter sweeps.
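One way to pull those values out of a table of sweep results is sketched below. The `results` frame is a placeholder with invented numbers, and the `val_loss` column name is an assumption; in practice the table would come from the scan's own log.

```python
import pandas as pd

# Placeholder sweep results; the numbers are invented for illustration only.
results = pd.DataFrame({
    'first_neuron': [100, 400, 1600],
    'hidden_neuron': [100, 200, 400],
    'hidden_layers': [1, 0, 1],
    'dropout': [0.25, 0.0, 0.25],
    'val_loss': [0.6931, 0.6925, 0.6928],  # assumed column name
})

# Row with the lowest validation loss.
best_row = results.loc[results['val_loss'].idxmin()]

# Convert back to plain Python types so the dict can be passed to a model builder.
int_cols = ['first_neuron', 'hidden_neuron', 'hidden_layers']
best_params = {c: int(best_row[c]) for c in int_cols}
best_params['dropout'] = float(best_row['dropout'])
print(best_params)
```

Picking the single best row is the simplest choice. Since neighboring configurations often score within noise of each other, it may be safer to look at the top few rows before committing.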

There are infinite possibilities for further optimizations, which I won’t explore here. For instance:

• ReLU vs. ELU unit types
• Various network geometries (funnel-shaped, etc.)
• Initializer type
• Optimizer type
• Feature extraction/selection methods (e.g., PCA)

https://alphascientist.com/hyperparameter_optimization_with_talos.html