# Symbolic Regression Problem: Introduction to GP

Symbolic regression is one of the best-known problems in GP (see Reference below). It is commonly used as a benchmark problem for new algorithms, but it is also widely used on real-life data, where other regression methods may not work. It is conceptually a simple problem, and therefore makes a good introductory example for the GP framework in DEAP.

All symbolic regression problems use an arbitrary data distribution, and try to fit the data with the most accurate symbolic formula available. Usually, measures like the RMSE (root mean square error) or MSE (mean squared error) are used to evaluate an individual's fitness.

In this example, we use a classical distribution, the quartic polynomial
\((x^4 + x^3 + x^2 + x)\), a one-dimensional distribution. *20* equidistant
points are generated in the range [-1, 1], and are used to evaluate the
fitness.
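As a plain-Python sketch (no DEAP involved), the target function and sampling scheme described above look like this:

```python
# Target function and the 20 equidistant sample points used for fitness
# evaluation (the same points the evaluation function below will use).
def target(x):
    return x**4 + x**3 + x**2 + x

points = [x / 10.0 for x in range(-10, 10)]  # -1.0, -0.9, ..., 0.9
values = [target(x) for x in points]
print(len(points))  # -> 20
```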

## Creating the primitives set

One of the most crucial aspects of a GP program is the choice of the primitive set. The primitives should make good building blocks for the individuals so that the evolution can succeed. In this problem, we use a classical set of primitives, which are basic arithmetic functions:

```
# Define new functions
def protectedDiv(left, right):
    try:
        return left / right
    except ZeroDivisionError:
        return 1

pset = gp.PrimitiveSet("MAIN", 1)
pset.addPrimitive(operator.add, 2)
pset.addPrimitive(operator.sub, 2)
pset.addPrimitive(operator.mul, 2)
pset.addPrimitive(protectedDiv, 2)
pset.addPrimitive(operator.neg, 1)
pset.addPrimitive(math.cos, 1)
pset.addPrimitive(math.sin, 1)
pset.addEphemeralConstant("rand101", lambda: random.randint(-1, 1))
```

The redefinition of the division is made to protect it against a zero division error (which would crash the program). The other functions are simply mapped from the Python `operator` and `math` modules. The number following the function is the *arity* of the primitive, that is, the number of arguments it takes.
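The protected division can be exercised on its own; this small check (plain Python) shows that ordinary division is untouched while division by zero returns the neutral value 1 instead of raising:

```python
def protectedDiv(left, right):
    # Same definition as above: fall back to 1 on division by zero.
    try:
        return left / right
    except ZeroDivisionError:
        return 1

print(protectedDiv(6, 3))  # -> 2.0
print(protectedDiv(1, 0))  # -> 1
```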

On the last line, we declare an ephemeral constant (with `addEphemeralConstant`). This is a special terminal type, which does not have a fixed value. When the program appends an ephemeral constant terminal to a tree, the function it contains is executed, and its result is inserted as a constant terminal. In this case, those constant terminals can take the values -1, 0, or 1.
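The mechanism can be mimicked in plain Python: each time a new ephemeral terminal is created, the wrapped function is called once and its result is frozen into the tree. A minimal sketch (the `rand101` name and the -1 to 1 range follow the example above):

```python
import random

random.seed(0)  # only for reproducibility of this sketch

# Each call simulates inserting one new ephemeral constant terminal:
# the generating function runs once and its result becomes a fixed constant.
rand101 = lambda: random.randint(-1, 1)
terminals = [rand101() for _ in range(10)]
print(sorted(set(terminals)))  # every value is -1, 0, or 1
```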

The second argument of `PrimitiveSet` is the number of inputs. Here, as we have a one-dimensional regression problem, there is only one input, but there could be as many as you want. By default, those inputs are named "ARGx", where "x" is a number, but you can easily rename them:

```
pset.renameArguments(ARG0='x')
```

## Creator

As with any evolutionary program, symbolic regression needs (at least) two object types: an individual containing the genotype and a fitness. We can easily create them with the creator:

```
creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
creator.create("Individual", gp.PrimitiveTree, fitness=creator.FitnessMin)
```

The first line creates the fitness object (this is a minimization problem, so the weight is negative). The weights argument must be an iterable of weights, even if there is only one fitness measure. The second line creates the individual object itself. Very straightforwardly, we can see that it will be based upon a tree, to which we add a fitness. If, for any reason, the user wanted to add another attribute (for instance, a file in which the individual will be saved), it would be as easy as adding this attribute of any type to this line. After this declaration, any individual produced will contain those attributes.

## Toolbox

Now, we want to register some parameters specific to the evolution process. In DEAP, this is done through the toolbox:

```
toolbox = base.Toolbox()
toolbox.register("expr", gp.genHalfAndHalf, pset=pset, min_=1, max_=2)
toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.expr)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("compile", gp.compile, pset=pset)

def evalSymbReg(individual, points):
    # Transform the tree expression in a callable function
    func = toolbox.compile(expr=individual)
    # Evaluate the mean squared error between the expression
    # and the real function: x**4 + x**3 + x**2 + x
    sqerrors = ((func(x) - x**4 - x**3 - x**2 - x)**2 for x in points)
    return math.fsum(sqerrors) / len(points),

toolbox.register("evaluate", evalSymbReg, points=[x/10. for x in range(-10, 10)])
toolbox.register("select", tools.selTournament, tournsize=3)
toolbox.register("mate", gp.cxOnePoint)
toolbox.register("expr_mut", gp.genFull, min_=0, max_=2)
toolbox.register("mutate", gp.mutUniform, expr=toolbox.expr_mut, pset=pset)

toolbox.decorate("mate", gp.staticLimit(key=operator.attrgetter("height"), max_value=17))
toolbox.decorate("mutate", gp.staticLimit(key=operator.attrgetter("height"), max_value=17))
```

First, a toolbox instance is created (in some problem types like coevolution, you may consider creating more than one toolbox). Then, we can register any parameters. The first lines register how to create an individual (by calling gp.genHalfAndHalf with the previously defined primitive set), and how to create the population (by repeating the individual initialization).

We may now introduce the evaluation function, which will receive an individual as input and return the corresponding fitness. This function uses the compile function previously defined to transform the individual into its executable form, that is, a program. After that, the evaluation is only simple maths: the differences between the values produced by the evaluated individual and the real values are squared and summed to compute the MSE (mean squared error), which is returned as the fitness of the individual.
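Stripped of the GP machinery, the arithmetic of the evaluation is just this (plain Python; `candidate` stands in for the compiled individual):

```python
import math

def target(x):
    return x**4 + x**3 + x**2 + x

def mse(candidate, points):
    # Mean of the squared differences between candidate and target.
    sqerrors = ((candidate(x) - target(x)) ** 2 for x in points)
    return math.fsum(sqerrors) / len(points)

points = [x / 10.0 for x in range(-10, 10)]
print(mse(target, points))             # a perfect candidate -> 0.0
print(mse(lambda x: 0.0, points) > 0)  # a poor candidate has positive error
```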

Warning

Even if the fitness only contains one measure, keep in mind that DEAP stores it as an iterable. Knowing that, you can understand why the evaluation function above must return a tuple value (even if it is a 1-tuple), hence the trailing comma in its `return` statement. Returning only the bare value would produce strange behaviors and errors, as the selection and stats functions rely on the fact that the fitness is always an iterable.
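The trailing comma is easy to miss; this tiny check (plain Python) shows the difference it makes:

```python
def with_comma(error):
    return error,      # 1-tuple: what DEAP expects

def without_comma(error):
    return error       # bare float: breaks selection and statistics

print(with_comma(0.25))     # -> (0.25,)
print(without_comma(0.25))  # -> 0.25
```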

Afterwards, we register the evaluation function. We also choose the selection method (a tournament of size 3), the mating method (one-point crossover with uniform probability over all the nodes), and the mutation method (a uniform probability mutation which may append a new full subtree to a node).

Then, we decorate the mate and mutate methods to limit the height of generated individuals. This is done to avoid an important drawback of genetic programming: bloat. Koza, in his book on genetic programming, suggests using a maximum depth of 17.
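The decorator used for this (DEAP's `gp.staticLimit`) works by reverting to the parent whenever an offspring exceeds the limit. A toy sketch of that idea, using flat lists in place of trees and `len` in place of tree height (all names here are illustrative, not DEAP's API):

```python
def static_limit(variation, max_height, height=len):
    # If an offspring exceeds the height limit, a copy of its parent
    # is returned instead (the same strategy gp.staticLimit uses).
    def wrapper(*parents):
        backups = [list(p) for p in parents]
        offspring = variation(*[list(p) for p in parents])
        return tuple(child if height(child) <= max_height else backup
                     for child, backup in zip(offspring, backups))
    return wrapper

# A "mutation" that doubles the individual, standing in for subtree growth.
grow = lambda ind: (ind + ind,)
limited_grow = static_limit(grow, max_height=4)

print(limited_grow([1, 2, 3]))  # child too tall -> parent kept: ([1, 2, 3],)
print(limited_grow([1, 2]))     # within limit -> child kept: ([1, 2, 1, 2],)
```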

At this point, any structure with access to the toolbox instance will also have access to all of those registered parameters. Of course, the user could register other parameters based on their needs.

## Statistics

Although optional, statistics are often useful in evolutionary programming. DEAP offers a simple class which can handle most of the "boring work". In this case, we want to compute the mean, standard deviation, minimum, and maximum of both the individuals' fitness and size. For that, we'll use a `MultiStatistics` object.

```
stats_fit = tools.Statistics(lambda ind: ind.fitness.values)
stats_size = tools.Statistics(len)
mstats = tools.MultiStatistics(fitness=stats_fit, size=stats_size)
mstats.register("avg", numpy.mean)
mstats.register("std", numpy.std)
mstats.register("min", numpy.min)
mstats.register("max", numpy.max)
```

Note that a simple `Statistics` object can be used, as in previous examples, when statistics over a single key are desired.
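What a `Statistics` object compiles each generation is essentially a dictionary of the registered functions applied to the gathered values; a stdlib-only sketch (the fitness values below are made up for illustration):

```python
import statistics

fitness_values = [0.50, 0.20, 0.90, 0.40]  # hypothetical fitnesses

record = {
    "avg": statistics.mean(fitness_values),
    "std": statistics.pstdev(fitness_values),
    "min": min(fitness_values),
    "max": max(fitness_values),
}
print(record["avg"])  # -> 0.5
```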

## Launching the evolution

At this point, DEAP has all the information needed to begin the evolutionary process, but nothing has been initialized. We can start the evolution by creating the population and then calling a complete algorithm. In this case, we'll use `eaSimple()`.

```
random.seed(318)

pop = toolbox.population(n=300)
hof = tools.HallOfFame(1)

pop, log = algorithms.eaSimple(pop, toolbox, 0.5, 0.1, 40, stats=mstats,
                               halloffame=hof, verbose=True)
```

The hall of fame is a specific structure which contains the *n* best individuals (here, only the best one).
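The idea can be sketched in a few lines of plain Python (a toy stand-in, not DEAP's `HallOfFame` API): keep the n best individuals ever seen, merged across generations:

```python
def update_hall_of_fame(hof, population, n=1):
    # Keep the n individuals with the lowest error seen so far.
    merged = sorted(hof + population, key=lambda ind: ind["error"])
    return merged[:n]

hof = []
hof = update_hall_of_fame(hof, [{"error": 0.9}, {"error": 0.3}])
hof = update_hall_of_fame(hof, [{"error": 0.5}])
print(hof)  # -> [{'error': 0.3}]
```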

The complete example can be found in examples/gp/symbreg.

## Reference

*John R. Koza, “Genetic Programming: On the Programming of Computers by Means
of Natural Selection”, MIT Press, 1992, pages 162-169.*