Sigmoid and Tanh Activation Functions

May 21, 2019

Sigmoid and Tanh Activation Functions

This post will cover the sigmoid and hyperbolic tangent activation functions in detail. The applications, advantages and disadvantages of these two common activation functions will be covered, as well as finding their derivatives and visualizing each function.

Review
The Hyperbolic Tangent Function
Derivative of Tanh
The Sigmoid Function
Derivative of the Sigmoid Function
Plotting Activation Functions

Review

First off, a quick review of derivatives may be needed. A derivative is simply a fancy name for the slope. Everyone has heard of the slope of a function in middle school math class. When looking at any given point on a function’s graph, the derivative is the slope of the tangent line at that given point. Below is the code for a visual of this concept.

import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return np.sin(np.power(x, 2))

x = np.linspace(1.5,4,200)
y = f(x)

a = 3
h = 0.001 # h approaches zero
derivative = (f(a + h)- f(a))/h # difference quotient
tangent = f(a) + derivative * (x - a) # finding the tangent line

plt.figure(figsize=(10,6))

plt.ylim(-2,2)
plt.xlim(1.5, 4)

plt.plot(x, y, linewidth=4, color='g', label='Function')
plt.plot(x, tangent, color='m', linestyle='--', linewidth='3', label='Tangent Line')
plt.plot(a, f(a), 'o', color='r', markersize=10, label='Example Point')

plt.legend(fontsize=12)
plt.grid(True)
plt.show()

Everyone learns the formula for the slope of a line in middle school. The slope formula is commonly remembered as “rise over run”, or the change in y divided by the change in x. A derivative is the same thing, the change in y over the change in x, this can be written like this:

$Slope = \frac{rise}{run} = \frac{\Delta y}{\Delta x}$

To find the derivative of a function, we can use the difference quotient. The difference quotient calculates the slope of the secant line through two points on a function’s graph. The two points would be x and (x + h). As the change in x approaches zero, the secant line will get increasingly closer to the tangent line, therefore becoming closer to the instantaneous rate of change, or the slope of a single point. The difference quotient is written like this:

$f'(x) = \lim_{\Delta x \rightarrow 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}, \ \Delta x\neq 0$

Below is an example of how the difference quotient is used to find the derivative of a function. Note: “delta x” has been substituted with “h” for simplicity. The example function will be x².

$f(x) = x^2 \\ \newline \indent f'(x) = \frac{(x+h)^2-x^2}{h} = \frac{x^2+h^2+2xh - x^2}{h} = 2x+h = 2x$

To find the derivative of x², we would plug (x + h) into the function, then subtract f(x), which was given as x². The limit of the difference quotient is the instantaneous rate of change. In the difference quotient, “h” will approach zero, but will never become zero. In the example given below, we assume that “h” approaches close enough to zero to be treated as if it were zero, and therefore we can get rid of the final “h” as it would be the equivalent to adding zero to 2x. The example below uses the difference quotient to show that the derivative of x² is 2x.

The Hyperbolic Tangent Function

# EXAMPLE FUNCTION
def tanh(x, deriv = False):
    if deriv == True:
        return (1 - (tanh(x)**2)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
    # return np.tanh(x)

The hyperbolic tangent function is a nonlinear activation function commonly used in a lot of simpler neural network implementations. Nonlinear activation functions are typically preferred over linear activation functions because they can fit datasets better and are better at generalizing. The tanh function is typically a better choice than the sigmoid function (AKA logistic function) because while both share a sigmoidal curve or S-shaped curve, the sigmoid function ranges from [0 to 1], while the tanh function ranges from [-1 to 1]. This greater range allows the tanh function to map negative inputs to negative outputs, while near-zero inputs will be mapped to near-zero outputs. Additionally, the tanh function makes data more centered around zero, which creates stronger and more useful gradients. The hyperbolic tangent function is simply less restricted than the logistic function.

Similar to the tangent function (sine over cosine), the hyperbolic tangent function is equal to the hyperbolic sine over the hyperbolic cosine.

$tanh(x) = \frac{sinh(x)}{cosh(x)} \ = \ \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

The hyperbolic sine and the hyperbolic cosine can be written like this:

$sinh(x) = \frac{e^{x}-e^{-x}}{2} \ \ , \ cosh(x) = \frac{e^{x}+e^{-x}}{2}$

We can write the hyperbolic sine over the hyperbolic cosine. This will create a fraction divided by another fraction. This is the same as the numerator fraction multiplied by the denominator fraction’s reciprocal. This is commonly remembered as “Keep, Change, Flip”. After multiplying the two fractions, we will be able to factor out a two from the top and bottom. This will leave us with the hyperbolic tangent function.

$\frac{sinh(x)}{cosh(x)}= \frac{\frac{e^{x}-e^{-x}}{2}}{\frac{e^{x}+e^{-x}}{2}} = \frac{e^{x}-e^{-x}}{2} \times \frac{2}{e^{x}+e^{-x}} = \frac{2(e^{x}-e^{-x})}{2(e^{x}+e^{-x})} = \frac{\cancel 2(e^{x}-e^{-x})}{\cancel 2(e^{x}+e^{-x})} = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$

Derivative of Tanh

Now let’s find the derivative of the hyperbolic tangent function. We know that the hyperbolic tangent function is written as the hyperbolic sine over the hyperbolic cosine. When dealing with the derivative of a quotient of two functions, we can use the quotient rule. The quotient rule formula uses the function in the numerator as f(x) and the function in the denominator as g(x). The quotient rule formula is written as:

$\frac{\mathrm{d} }{\mathrm{d} x}\left [ \frac{f(x)}{g(x)} \right ] = \frac{g(x)f'(x) - f(x)g'(x)}{(g(x))^{2}}$

Using the quotient rule follows the steps below:

Take the product of the function f(x) and the derivative of g(x).
Take the product of the function g(x) and the derivatve of f(x).
Take the difference between the first product and the second product.
Divide by g(x) squared.

Now, if we apply the quotient rule to find the derivative of the hyperbolic tangent function, then it would look something like this: $\frac{\mathrm{d} }{\mathrm{d} x} tanh(x) = \frac{\mathrm{d} }{\mathrm{d} x}(\frac{sinh(x)}{cosh(x)}) = \frac{cosh(x)\cdot sinh'(x) - sinh(x)\cdot cosh'(x)}{cosh(x)^{2}}$

After doing this, we run into another problem. What are the derivatives of the sinh and cosh functions? To solve this, we will first need to find the deriative of e^x and e^-x. Recall, the functions sinh(x) and cosh(x) contain e^x and e^-x. e^x is known to be a very special function because its derivative is simply itself. Yes, the derivative of e^x equals the function value for all possible points. Why is this true though?

Derivative of e^x

Let’s start to take the derivative of e^x. First, we will use the difference quotient and plug in e^x for f(x). $\frac{\mathrm{d} }{\mathrm{d} x}e^{x} = \lim_{h \to 0}\frac{e^{(x+h)}-e^{x}}{h}$

Using the product rule of exponents, we can split up the e^(x+h) and make things a bit easier to deal with. After doing this, we will be able to factor out an e^x from the numerator and put it on the other side of the limit.

$\lim_{h \to 0}\frac{e^{(x+h)}-e^{x}}{h} = \lim_{h \to 0}\frac{e^{x}e^{h}-e^{x}}{h} = \lim_{h \to 0}\frac{e^{x}(e^{h}-1)}{h} = e^{x}\lim_{h \to 0}\frac{(e^{h}-1)}{h}$

Now things may start to look weird. Where do we go from here? Are we stuck? Before we progress any further, let’s take a look at the definition of Euler’s number (e). Euler’s number can be defined as:

$e = \lim_{n\to \infty}(1 + \frac{1}{n})^{n}$

Here comes the really cool part. If we rewrite the definition of e in terms of h (plug in 1/h for “n”), it would look like this:

$e = \lim_{h\to \infty}(1 + h)^{\frac{1}{h}}$

Looking back at where we last left off for the derivative of e^x, we notice that we have an e^h, so we can write our definition of “e” and raise it to the power of “h”. $[(1 + h)^{\frac{1}{h}}]^{h}$

Using the power rule, we can simplify this expression further. For reference, the power rule states that:

$(a^{b})^{c} = a^{bc}$

Now if we apply this to our definition of “e” raised to the power of “h”, all of the exponents would cancel each other out. Remember that anything raised to the power of one will always equal itself. This process will look like this:

$[(1 + h)^{\frac{1}{h}}]^{h} = (1 + h)^{\frac{1}{h}h} = (1+h)^{\frac{1}{h}\times \frac{h}{1}} = (1 + h)^{\frac{h}{h}} = (1+h)$

Now that we know what e^h equals, we can go back to finding the derivative of e^x and rewrite it by replacing e^h with its definition like so: $\frac{\mathrm{d} }{\mathrm{d} x}e^{x} = e^{x}\lim_{h \to 0}\frac{(e^{h}-1)}{h} = e^{x}\lim_{h \to 0}\frac{(1+h)-1}{h} = e^{x}\lim_{h \to 0}\frac{h}{h}$

We can see that the limit is equal to one. h/h equals one, and as h approaches zero, it will always equal one. The derivative of e^x becomes this:

$\frac{\mathrm{d} }{\mathrm{d} x}e^{x} = e^{x}\lim_{h \to 0}\frac{h}{h} = e^{x}(1) = e^{x}$

Derivative of e^-x

Finding the derivative of e^-x will be much easier now that we know the derivative of e^x. We could solve the derivative of e^-x the same way that we solved the derivative of e^x, but I want to show a different way where we use the quotient rule. First things first, we will need to rewrite e^-x like this:

$\frac{\mathrm{d} }{\mathrm{d} x}e^{-x} = \frac{\mathrm{d} }{\mathrm{d} x}(\frac{1}{e^{x}})$

Using the quotient rule that was mentioned above, the derivative of e^-x can be written like this:

$\frac{\mathrm{d} }{\mathrm{d} x}(\frac{1}{e^{x}}) = \frac{\frac{\mathrm{d} }{\mathrm{d} x}1(e^{x}) - (1)\frac{\mathrm{d} }{\mathrm{d} x}e^{x}}{(e^{x})^{2}} = \frac{-1(e^{x})}{(e^{x})^{2}} = \frac{-e^{x}}{(e^{x})^{2}}$

Now we can divide by e^x and get:

$\frac{-e^{x}}{(e^{x})^{2}} = \frac{-1}{e^{x}}$

To finish everything, we can bring the e^x back up into the numerator by making the exponent negative.

$\frac{-1}{e^{x}} = -1(e^{-x}) = -e^{-x}$

Now we know the derivatives of both e^x and e^-x. This means we can finally start to find the derivative of the hyperbolic sine and cosine functions.

Derivative of Tanh Continued

We are now fully equipped to find the derivatives of the hyperbolic sine and cosine functions. Since we know the derivatives of e^x and e^-x, this should be easy. The process of finding the derivative of sinh looks like this:

$sinh'(x) = \frac{\mathrm{d} }{\mathrm{d} x}(\frac{e^{x}-e^{-x}}{2}) = \frac{1}{2}(\frac{\mathrm{d} }{\mathrm{d} x}e^{x} - \frac{\mathrm{d} }{\mathrm{d} x}e^{-x})) \\\\ \indent \indent \indent \indent = \frac{1}{2}(e^{x} - (-e^{-x})) = \frac{e^{x}+e^{-x}}{2} = cosh(x)$

Next, the derivative of cosh can be found like this:

$cosh'(x) = \frac{\mathrm{d} }{\mathrm{d} x}(\frac{e^{x}+e^{-x}}{2}) = \frac{1}{2}(\frac{\mathrm{d} }{\mathrm{d} x}e^{x} + \frac{\mathrm{d} }{\mathrm{d} x}e^{-x})) \\\\ \indent \indent \indent \indent = \frac{1}{2}(e^{x} + (-e^{-x})) = \frac{e^{x}-e^{-x}}{2} = sinh(x)$

We can now go back to when we started to find the derivative of tanh and rewrite it to look this this:

$\frac{cosh(x)\cdot sinh'(x) - sinh(x)\cdot cosh'(x)}{cosh(x)^{2}} = \frac{cosh(x)\cdot cosh(x)- sinh(x)\cdot sinh(x) }{cosh(x)^{2}} \newline \newline \newline \indent = \frac{cosh(x)^{2} - sinh(x)^{2}}{cosh(x)^{2}} = \frac{cosh(x)^{2}}{cosh(x)^{2}} - \frac{sinh(x)^{2}}{cosh(x)^{2}} = 1 - tanh^{2}(x)$

After all of that work, we have finally found the derivative of the hyperbolic tangent function.

$\frac{\mathrm{d} }{\mathrm{d} x}tanh(x) = 1 - tanh^{2}(x)$

The Sigmoid Function

# EXAMPLE FUNCTION
def sigmoid(x, deriv = False):
    if deriv == True:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

The sigmoid function, also known as the logistic function, is often very helpful when predicting an output between 0 and 1, such as probabilities and binary classification problems. As mentioned earlier, the sigmoid function squeezes all values to be within a range of [0 to 1], this causes all negative inputs to be mapped at, or close to zero, which can cause some problems. The most commonly known issue with the sigmoid function is the vanishing gradient problem. The vanishing gradient problem is common amongst neural networks with many layers that utilize the sigmoid function. Because the sigmoid function’s derivatives are small, when they are multiplied during backpropagation they get smaller and smaller until eventually becoming useless. The smaller the gradients, the less effect the backpropagation is. Weights will not be updated at a useful rate, and the neural network will seem to be stuck. The sigmoid function is not as popular as it once was due to this problem, but for simple neural network architectures, it can still be a very fast and efficient activation function that gets the job done properly. The sigmoid function is written below:

$S(x) = \frac{1}{1 + e^{-x}}$

Derivative of the Sigmoid Function

It is time to find the derivative of the sigmoid function. The sigmoid function will be denoted as S(x) (as shown above). Just as we did with the tanh function, we will use the quotient rule to find the derivative of the sigmoid function. As a reminder, the quotient rule is written below:

$\frac{\mathrm{d} }{\mathrm{d} x}\left [ \frac{f(x)}{g(x)} \right ] = \frac{g(x)f'(x) - f(x)g'(x)}{(g(x))^{2}}$

Now, we can plug our values into the quotient rule’s formula. This will look like this:

$S'(x) = \frac{(1 + e^{-x})\times \frac{\mathrm{d} }{\mathrm{d} x}(1) - (1)\frac{\mathrm{d} }{\mathrm{d} x}(1 + e^{-x})}{(1 + e^{-x})^{2}}$

Now we can simplify and because anything multiplied by one is itself, we can simplify it like this:

$\frac{e^{-x}}{(1 + e^{-x})^{2}} = \frac{1}{1+e^{-x}}\cdot \frac{e^{-x}}{1+e^{-x}}$

To further progress through our derivation, we can add one and subtract one on the numerator. This will not change the function because one minus one is zero. We are essentially using a fancy method of adding zero to the numerator to help simplify. The above expression then becomes:

$\frac{1}{1+e^{-x}}\cdot \frac{1+e^{-x}-1}{1+e^{-x}} = \frac{1}{1+e^{-x}} \cdot (\frac{1+e^{-x}}{1+e^{-x}} - \frac{1}{1+e^{-x}})$

From here on out we just simplify the above expression to find the derivative of the sigmoid function.

$\frac{1}{1+e^{-x}} \cdot (1 - \frac{1}{1+e^{-x}})$

If the sigmoid function is written as S(x), then the derivative of the sigmoid function is:

$S(x)\cdot (1 - S(x))$

Plotting Activation Functions

Sometimes having a visual can make the process a bit more intuitive. Below will be a comparison of two activation functions: the sigmoid function (logistic function), and the hyperbolic tangent function. This visual comparison may help to understand the differences and similarities between the two activation functions. To create this visual, we will have to use Matplotlib, a useful python plotting library. We will create two subplots, one will compare the two activation functions (left) and the other will compare their derivatives (right).

The code for this visualization is below:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(20,5), facecolor='#F0EEEE')

# Functions Axes Lines
ax[0].axvline(linewidth=1.3, color='k')
ax[0].axhline(linewidth=1.3, color='k')

# Derivatives Axes Lines
ax[1].axvline(linewidth=1.3, color='k')
ax[1].axhline(linewidth=1.3, color='k')

x = np.arange(-5,5,.0001)
# y = tanh, z = sigmoid
y = np.tanh(x)
z = 1 / (1 + np.exp(-x))

# Plot Activation Functions
ax[0].plot(x, y, linewidth=4, color='r', label='Tanh')
ax[0].plot(x, z, linewidth=4, color='b', label='Sigmoid')

# Derivative Equations
tanh_deriv = (1 - tanh(x)**2)
sig_deriv = np.exp(-x) / (1 + np.exp(-x)**2)

# Plot Derivatives
ax[1].plot(x, tanh_deriv,linewidth=4, color='g', label='Tanh Derivative')
ax[1].plot(x, sig_deriv,linewidth=4, color='m', label='Sigmoid Derivative')

# Main Title
fig.suptitle("Sigmoid vs Tanh", fontsize=25)

# Grid
ax[0].grid(True)
ax[1].grid(True)

# Legend
ax[1].legend(fontsize=15)
ax[0].legend(fontsize=15)
# First Graph Limits (looks prettier)
ax[0].set_xlim(-5, 5)
ax[0].set_ylim(-1, 1)

plt.show()

As mentioned earlier, the hyperbolic tangent function shares an S-shape curve with the sigmoid function. The left plot will show the ranges of the two activation functions, and will also show that the hyperbolic tangent function is centered around zero, while the sigmoid function is not. Notice that the sigmoid function cannot go into the negatives. This means that all negative inputs will be mapped to zero because the sigmoid function is not able to go any lower than zero. The right plot will show the stronger derivative that the hyperbolic tangent function has over the sigmoids less pronounced derivative.

Sigmoid and Tanh Activation Functions

Sigmoid and Tanh Activation Functions

Review

The Hyperbolic Tangent Function

Derivative of Tanh

Derivative of ex

Derivative of e-x

Derivative of Tanh Continued

The Sigmoid Function

Derivative of the Sigmoid Function

Plotting Activation Functions

Derivative of e^x

Derivative of e^-x