# Spambase Problem: Strongly Typed GP

This problem is a classification example using STGP (Strongly Typed Genetic
Programming). The evolved programs work on floating-point values and Boolean
values. Each program must return a Boolean value that is true if the e-mail
is spam and false otherwise. The example uses a database of e-mails (saved in
*spambase.csv*, see Reference), from which it randomly chooses 400
e-mails to evaluate each individual.

## Primitives set

Strongly typed GP is a more general form of GP in which each primitive, in addition to having an arity and a corresponding function, also has a specific return type and specific parameter types. In this way, each primitive is described somewhat like a pure C function: each parameter must be of the declared type, and the return type is specified before run time.

Note

When the user does not specify return or parameter types, DEAP selects a default type. In standard GP, because all primitives use this default type, this behaves as if there were no type requirement.
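To make the typing constraint concrete, here is a minimal pure-Python sketch, independent of DEAP, of how a typed primitive can be described by its argument types and return type, and how a candidate expression can be checked against those signatures. The `SIGNATURES` table, the nested-tuple expression encoding, and the `infer_type` helper are hypothetical names introduced only for illustration; DEAP's internal representation differs.

```python
# Hypothetical signature table: name -> (argument types, return type).
SIGNATURES = {
    "and_": ([bool, bool], bool),
    "lt": ([float, float], bool),
    "add": ([float, float], float),
    "if_then_else": ([bool, float, float], float),
}


def infer_type(expr):
    """Return the type of a nested-tuple expression, or raise TypeError.

    An expression is either a terminal (a Python bool or float) or a tuple
    whose first item is a primitive name and whose other items are its
    argument expressions.
    """
    if not isinstance(expr, tuple):
        # Terminals carry their own type.
        return type(expr)
    name, *args = expr
    arg_types, ret_type = SIGNATURES[name]
    if len(args) != len(arg_types):
        raise TypeError(f"{name} expects {len(arg_types)} arguments")
    for arg, expected in zip(args, arg_types):
        actual = infer_type(arg)
        if actual is not expected:
            raise TypeError(
                f"{name} expects {expected.__name__}, got {actual.__name__}"
            )
    return ret_type


# A well-typed tree: (1.0 + 2.0) < 5.0 returns a bool.
well_typed = ("lt", ("add", 1.0, 2.0), 5.0)
# An ill-typed tree: and_ cannot take a float as its second argument.
ill_typed = ("and_", True, 3.0)

print(infer_type(well_typed).__name__)
try:
    infer_type(ill_typed)
except TypeError as err:
    print("rejected:", err)
```

A strongly typed GP system enforces this kind of constraint while building and modifying trees, so ill-typed programs are never generated in the first place.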

We define a typed primitive set almost the same way as a normal one, but we have to specify the types used.

```
import csv
import itertools
import operator

from deap import gp

# Read the spam features into a list of lists of floats
with open("spambase.csv") as spambase:
    spamReader = csv.reader(spambase)
    spam = list(list(float(elem) for elem in row) for row in spamReader)

# Define a new primitive set for strongly typed GP
pset = gp.PrimitiveSetTyped("MAIN", itertools.repeat(float, 57), bool, "IN")

# Boolean operators
pset.addPrimitive(operator.and_, [bool, bool], bool)
pset.addPrimitive(operator.or_, [bool, bool], bool)
pset.addPrimitive(operator.not_, [bool], bool)

# Floating-point operators
# Define a protected division function
def protectedDiv(left, right):
    try:
        return left / right
    except ZeroDivisionError:
        return 1

pset.addPrimitive(operator.add, [float, float], float)
pset.addPrimitive(operator.sub, [float, float], float)
pset.addPrimitive(operator.mul, [float, float], float)
pset.addPrimitive(protectedDiv, [float, float], float)

# Logic operators
# Define a new if-then-else function
def if_then_else(input, output1, output2):
    if input:
        return output1
    else:
        return output2

pset.addPrimitive(operator.lt, [float, float], bool)
pset.addPrimitive(operator.eq, [float, float], bool)
pset.addPrimitive(if_then_else, [bool, float, float], float)

# terminals
```

The typed primitive set is declared with `PrimitiveSetTyped`. The first
argument remains the set name, but the next ones are the types of the inputs
and of the output (in this case, we have 57 float inputs and one Boolean
output; we could have written float 57 times, but it is quicker and more
readable to use the `itertools.repeat()` function). The last argument remains
the prefix of the input names.
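As a small illustration of the point about `itertools.repeat()`, the following sketch uses only the standard library to show that the iterator yields the same type object 57 times, which is equivalent to a hand-written list. The variable name `input_types` is illustrative.

```python
import itertools

# repeat(float, 57) lazily yields the float type 57 times.
input_types = list(itertools.repeat(float, 57))

# It is equivalent to writing the list out by hand, or to [float] * 57.
print(len(input_types))
print(input_types == [float] * 57)
```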

After that, we define the primitives themselves. The definition of a typed primitive takes two additional parameters: a list containing the parameter types, in order, and the return type.

The terminal set is then filled with at least one terminal of each type, which completes the primitive set declaration.

## Evaluation function

The evaluation function is very simple: it picks 400 e-mails at random from the spam database, then checks whether the prediction made by the individual matches the expected Boolean output. The number of correctly predicted e-mails is returned as the fitness of the individual (which is therefore at most 400).

```
def evalSpambase(individual):
    # Transform the tree expression in a callable function
    func = toolbox.compile(expr=individual)
    # Randomly sample 400 mails in the spam database
    spam_samp = random.sample(spam, 400)
    # Evaluate the sum of correctly identified mail as spam
    result = sum(bool(func(*mail[:57])) is bool(mail[57]) for mail in spam_samp)
    return result,
```
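The counting logic can be tried in isolation, without DEAP, by substituting a hand-written classifier for the compiled individual. In the sketch below, the miniature two-feature dataset, the `toy_classifier` rule, and the `count_correct` helper are all hypothetical stand-ins; only the sampling and comparison pattern mirrors the evaluation function.

```python
import random

# Hypothetical miniature dataset: two features followed by the label
# (1.0 for spam, 0.0 for non-spam), mimicking the layout of the real rows.
toy_mails = [
    [0.9, 0.1, 1.0],
    [0.8, 0.3, 1.0],
    [0.2, 0.7, 0.0],
    [0.1, 0.5, 0.0],
    [0.6, 0.2, 0.0],
]


def toy_classifier(f0, f1):
    """Hand-written stand-in for a compiled individual."""
    return f0 > 0.5


def count_correct(func, mails, sample_size, n_features):
    """Count how many sampled mails the classifier labels correctly."""
    sample = random.sample(mails, sample_size)
    return sum(
        bool(func(*mail[:n_features])) is bool(mail[n_features])
        for mail in sample
    )


# Sampling every mail makes the result deterministic for this toy data.
fitness = (count_correct(toy_classifier, toy_mails, len(toy_mails), 2),)
print(fitness)
```

The result is wrapped in a one-element tuple for the same reason as the trailing comma in `return result,`: DEAP fitness values are sequences, even for a single objective.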

## Toolbox

The toolbox used is very similar to the one presented in the symbolic regression example, but notice that we now use specific STGP operators for crossover and mutation:

```
toolbox.register("evaluate", evalSpambase)
toolbox.register("select", tools.selTournament, tournsize=3)
```

## Conclusion

Although it does not really differ from the other problems, it is interesting
to note how Python can decrease the programming time. Indeed, the spam
database is in CSV form: with many frameworks, you would have to read it
manually or use a non-standard library, but with Python, you can use the
built-in module `csv` and, within 2 lines, it is done! The data is now
in the matrix *spam* and can easily be used throughout the program.
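A minimal sketch of that loading step, assuming the file contains only numeric, comma-separated fields. The `load_rows` helper name is introduced here for illustration and is not part of the original example, and an in-memory string stands in for the real file.

```python
import csv
import io


def load_rows(csv_file):
    """Parse an open CSV file into a list of rows of floats."""
    return [[float(elem) for elem in row] for row in csv.reader(csv_file)]


# In-memory stand-in for the real file, so the sketch runs anywhere.
sample = io.StringIO("0.0,0.64,1\n0.21,0.28,0\n")
spam = load_rows(sample)
print(spam)
```

With the real data, the same helper could be called on the file object returned by `open("spambase.csv", newline="")`.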

The complete example is available in examples/gp/spambase.

## Reference

Data are from the Machine learning repository, http://www.ics.uci.edu/~mlearn/MLRepository.html