Caching Activation Function Is Not Worth It
Let's continue with our NeuroEvolution series.
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil
© Donald Knuth
The statement above applies even to seemingly hi-tech areas like forecasting, data mining or neural networks.
Here's a simple situation.
The function below is called an activation function and it is the piece of code that eats up the most CPU time in old-fashioned neural networks:
static double sigmaActivation(double value)
{
return 1d / (1 + Math.Exp(-value));
}
One may be tempted to optimize this "bottleneck". Can you?
Caching
One of the options was to create a complete cache of all the function arguments and outputs to save the cost of the computations. The assumption is that this table will perform better and will have reasonably small footprint.
Let's prove that 4k is a mild understatement for size of this table.
We'll use this method to create float values within the range between -10 and +10:
static float CreateRandomFloat()
{
return (float)((Rand.NextDouble() - 0.5)*Rand.NextDouble()*20);
}
Note, that we a dropping number of values to be hashed by a couple of orders of magnitude here (normally synapse outputs are double and they can easily get out of the (-10;+10) range.
public static void Main()
{
var hash = new HashSet<float>();
ThreadStart a = () => {
while(true)
{
hash.AddRange(Range.Create(100, i => CreateRandomFloat()));
}
};
var thread = new Thread(a);
using (new DisposableAction(thread.Abort))
{
thread.Start();
do
{
Console.WriteLine("Press Esc to quit or any other key to poll hash");
var pureSize = hash.Count*sizeof (float)*2;
Console.WriteLine("Pure table size is {0} Mb", pureSize/1.Mb());
} while (Console.ReadKey(true).Key != ConsoleKey.Escape);
}
}
In the code snippet above we simply launch secondary thread to look up for new floats and put them into the HashSet. The HashSet will obviously keep only unique values.
Then, whenever user presses non-ESC key, we output current raw size of our imaginary look-up table.
After running this code for 20 seconds I've already hit 250 megabytes, and it kept growing.
Now, given the size of that thing and the amount of index walking that has to be done, it does not really make a lot of sense to use hashing, does not it?
And even if it would have a small performance improvement, I'd personally still stay away from this optimization, since the increased code complexity is just not worth it.
Lookup Table
Another possible way is to use lookup tables to break the input into a set of predefined intervals and then output some rough estimation instead.
Unfortunately, this option does not work for me in some situations (especially when using NEAT algorithm derivatives for complex tasks), since this approach messes up the search space for the training algorithm (by introducing local extrema all over the place) and makes it even more CPU intensive in the long run (also lowering the probability of success).
But it might work for your scenario. Here's the snippet posted by Henrik Gustafsson (with some performance tweaks). And kudos go to him for setting me straight on the lookup tables.
const float Scale = 320.0f;
const int Resolution = 2047;
const float Min = -Resolution/Scale;
const float Max = Resolution/Scale;
static readonly float[] _lut = InitLut();
static float[] InitLut()
{
var lut = new float[Resolution + 1];
for (int i = 0; i < Resolution + 1; i++)
{
lut[i] = (float) (1.0/(1.0 + Math.Exp(-i/Scale)));
}
return lut;
}
static float Sigmoid1(double value)
{
return (float) (1.0/(1.0 + Math.Exp(-value)));
}
static float Sigmoid2(float value)
{
if (value <= Min) return 0.0f;
if (value >= Max) return 1.0f;
var f = value*Scale;
if (value >= 0) return _lut[(int) (f + 0.5f)];
return 1.0f - _lut[(int) (0.5f - f)];
}
static float TestError()
{
var emax = 0.0f;
for (var x = -10.0f; x < 10.0f; x += 0.000001f)
{
var v0 = Sigmoid1(x);
var v1 = Sigmoid2(x);
var e = Math.Abs(v1 - v0);
if (e > emax) emax = e;
}
return emax;
}
static double TestPerformancePlain()
{
var sw = new Stopwatch();
sw.Start();
for (int i = 0; i < 10; i++)
{
for (float x = -5.0f; x < 5.0f; x += 0.000001f)
{
Sigmoid1(x);
}
}
return sw.Elapsed.TotalMilliseconds;
}
static double TestPerformanceOfLut()
{
var sw = new Stopwatch();
sw.Start();
for (int i = 0; i < 10; i++)
{
for (float x = -5.0f; x < 5.0f; x += 0.000001f)
{
Sigmoid2(x);
}
}
return sw.Elapsed.TotalMilliseconds;
}
static void Main()
{
var emax = TestError();
var t0 = TestPerformancePlain();
var t1 = TestPerformanceOfLut();
Console.WriteLine("Max detected deviation using LUT: {0}", emax);
Console.WriteLine("10^7 iterations using Sigmoid1() took {0} ms", t0);
Console.WriteLine("10^7 iterations using Sigmoid2() took {0} ms", t1);
Console.ReadKey(true);
}
Here are the results on my machine (relase mode, no debugger attached):
Max detected deviation using LUT: 0,001663984
10^7 iterations using Sigmoid1() took 3899,1979 ms
10^7 iterations using Sigmoid2() took 411,4441 ms
Keep in mind, that increasing precision will not necessarily improve the overall performance, since it will result in larger lookup table that has lower chance to fit into the CPU cache.
By the way, from my experience even switching from double to float could be enough to turn training some complex model task from hard to unfeasible.
That's it for now. The series on NeuroEvolution will definitely continue till they hit Cloud Computing. Stay tuned.
Update: next post shows that F# version of this algorithm has better performance.
PS: Range.Create, DisposableAction, 1.Mb and other shortcuts come from the Lokad.Shared.dll helper library that is used in this snippet.
PPS: you may also want to check Some performance issues and caveats of using LINQ in math code.
Monday, January 5, 2009 at 19:18
Reader Comments (5)
Between this and the StackOverflow thread, you've helped lots. Thanks :)
My pleasure))
Yea, this was fun :)
Since I can't find your e-mail anywhere, mind dropping me a line?
Boris,
you've got it incoming.