
Principles of Knowledge Discovery in Data

Fall 2007

Chapter 4: Data Classification

Dr. Osmar R. Zaïane
University of Alberta
© Dr. Osmar R. Zaïane, 1999, 2007

Course Content

• Introduction to Data Mining
• Association Analysis
• Sequential Pattern Analysis
• Classification and prediction
• Contrast Sets
• Data Clustering
• Outlier Detection
• Web Mining
• Other topics if time permits (spatial data, biomedical data, etc.)

Chapter 4 Objectives

• Learn basic techniques for data classification and prediction.
• Introduce techniques such as Neural Networks, Naïve Bayesian Classification, k-Nearest Neighbors, Decision Trees & Associative Classifiers.
• Realize the difference between supervised classification, prediction, and unsupervised classification of data.

What is Classification?

The goal of data classification is to organize and categorize data in distinct classes.
• A model is first created based on the data distribution.
• The model is then used to classify new data.
• Given the model, a class can be predicted for new data.

With classification, I can predict in which bucket to put the ball, but I can't predict the weight of the ball.
[Figure: balls sorted into buckets 1, 2, 3, 4, …, n]

Classification = Learning a Model

[Figure: a labeled Training Set is used to learn a Classification Model; the model then labels new unlabeled data — Labeling = Classification]

Applications

• Credit approval
• Target marketing
• Medical diagnosis
• Defective parts identification in manufacturing
• Crime zoning
• Treatment effectiveness analysis
• Etc.

Classification is a three-step process

1. Model construction (Learning):
• Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label.
• The set of all tuples used for construction of the model is called the training set.
• The model is represented in one of the following forms:
 – Classification rules (IF-THEN statements)
 – Decision tree
 – Mathematical formulae

2. Model Evaluation (Accuracy):
Estimate the accuracy rate of the model based on a test set.
• The known label of each test sample is compared with the classified result from the model.
• The accuracy rate is the percentage of test set samples that are correctly classified by the model.
• The test set is independent of the training set; otherwise over-fitting will occur.

3. Model Use (Classification):
The model is used to classify unseen objects.
• Give a class label to a new tuple.
• Predict the value of an actual attribute.

Classification with Holdout

[Figure: Training Data → Derive → Classifier (Model) → Estimate Accuracy on the Testing Data]

• Holdout (partition D in two: one part for training, one for testing)
• Random sub-sampling (repeated holdouts with different partitionings)
• K-fold cross-validation (K partitions of D; do K experiments, each with a different test partition; no overlap between test sets, unlike random sub-sampling; typically K = 10)
• Bootstrapping (N samples, typically of size 63% of D, drawn with replacement)
• Leave-one-out (|D| experiments, each with |D| - 1 samples for training and one for testing)
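The evaluation schemes above are easy to prototype. A minimal sketch of k-fold cross-validation (the helper name `k_fold_accuracy` and the plug-in classifier interface are illustrative assumptions, not from the slides):

```python
import random

def k_fold_accuracy(data, train_and_classify, k=10, seed=0):
    """Estimate accuracy by k-fold cross-validation.

    data: list of (attributes, label) pairs.
    train_and_classify: function(train_set, attributes) -> predicted label.
    """
    data = data[:]                          # copy so shuffling stays local
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k disjoint partitions of D
    accuracies = []
    for i in range(k):
        test = folds[i]                     # one partition held out as test set
        train = [t for j, f in enumerate(folds) if j != i for t in f]
        correct = sum(1 for x, label in test
                      if train_and_classify(train, x) == label)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k              # mean accuracy over the k experiments
```

With k = |D| this degenerates into leave-one-out; repeating the whole procedure with fresh random partitionings gives random sub-sampling.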

1. Classification Process (Learning)

[Figure: Training Data → Classification Algorithms → Classifier (Model)]

Training Data:

Name      Income   Age        Credit rating
Bruce     Low      <30        bad
Dave      Medium   [30..40]   good
William   High     <30        good
Marie     Medium   >40        good
Anne      Low      [30..40]   good

Learned model:
IF Income = 'High' OR Age > 30
THEN CreditRating = 'Good'

2. Classification Process (Accuracy Evaluation)

[Figure: Testing Data → Classifier (Model); how accurate is the model?]

Testing Data:

Name    Income   Age        Credit rating
Tom     Medium   <30        bad
Jane    High     <30        bad
Wei     High     >40        good
Hua     Medium   [30..40]   good
Chris   Medium   <30        bad
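Assuming the interleaved table rows above split into the training and testing tables as shown, the accuracy question can be answered by running the learned IF-THEN rule over the five test tuples:

```python
def credit_rating(income, age_over_30):
    # Learned rule: IF Income = 'High' OR Age > 30 THEN CreditRating = 'Good'
    return 'good' if income == 'High' or age_over_30 else 'bad'

# Test tuples: (name, income, age > 30?, true label)
test_set = [
    ('Tom',   'Medium', False, 'bad'),
    ('Jane',  'High',   False, 'bad'),
    ('Wei',   'High',   True,  'good'),
    ('Hua',   'Medium', True,  'good'),
    ('Chris', 'Medium', False, 'bad'),
]

correct = sum(1 for _, inc, old, label in test_set
              if credit_rating(inc, old) == label)
accuracy = correct / len(test_set)
print(accuracy)  # Jane is misclassified as 'good', so accuracy is 4/5 = 0.8
```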

3. Classification Process (Classification)

[Figure: New Data → Classifier (Model) → Credit Rating?]

New Data:

Name   Income   Age        Credit rating
Paul   High     [30..40]   ?

Improving Accuracy

[Figure: the data is fed to Classifier 1, Classifier 2, Classifier 3, …, Classifier n; their votes are combined into a Composite classifier]
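The composite classifier above can be sketched as a simple majority vote over n base classifiers (the lambda-based classifiers are toy stand-ins, not part of the slides):

```python
from collections import Counter

def composite_classify(classifiers, x):
    """Combine the votes of several classifiers on tuple x."""
    votes = [clf(x) for clf in classifiers]     # ask every base classifier
    return Counter(votes).most_common(1)[0][0]  # return the majority-voted class

# Three toy base classifiers over a single numeric attribute (illustrative only)
clf1 = lambda x: 'good' if x > 30 else 'bad'
clf2 = lambda x: 'good' if x > 25 else 'bad'
clf3 = lambda x: 'good' if x > 40 else 'bad'

print(composite_classify([clf1, clf2, clf3], 35))  # two of three vote 'good'
```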

Evaluating Classification Methods

• Predictive accuracy
 – Ability of the model to correctly predict the class label
• Speed and scalability
 – Time to construct the model
 – Time to use the model
• Robustness
 – Handling noise and missing values
• Scalability
 – Efficiency with large databases (data that is not memory-resident)
• Interpretability
 – The level of understanding and insight provided by the model
• Form of rules
 – Decision tree size
 – The compactness of classification rules

Framework (Supervised Learning)

[Figure: Labeled Data is split into Training Data, from which the Classifier (Model) is derived, and Testing Data, used to estimate its accuracy; unlabeled New Data is then fed to the classifier]


Classification Methods

• Neural Networks
• Bayesian Classification
• K-Nearest Neighbour
• Decision Tree Induction
• Associative Classifiers
• Support Vector Machines
• Case-Based Reasoning
• Genetic Algorithms
• Rough Set Theory
• Fuzzy Sets
• Etc.

[Figure: the supervised-learning framework, as on the previous slide]

Lecture Outline

Part I: Artificial Neural Networks (ANN) (1 hour)
• Introduction to Neural Networks
 – Biological Neural System
 – What is an artificial neural network?
 – Neuron model and activation function
 – Construction of a neural network
• Learning: Backpropagation Algorithm
 – Forward propagation of signal
 – Backward propagation of error
 – Example

Part II: Bayesian Classifiers (Statistical-based) (1 hour)
• What is Bayesian Classification
• Bayes theorem
• Naïve Bayes Algorithm
 – Using Laplace Estimate
 – Handling Missing Values and Numerical Data
• Belief Networks

Human Nervous System

• We have only just begun to understand how our neural system operates.
• A huge number of neurons and interconnections between them:
 – about 100 billion (i.e. 10^11) neurons in the brain
  • a full Olympic-sized swimming pool contains on the order of 10^11 raindrops; the number of stars in the Milky Way is of the same magnitude
 – about 10^4 connections per neuron
• Biological neurons are slower than computers:
 – neurons operate in about 10^-3 seconds, computers in 10^-9 seconds
 – the brain makes up for the slow rate of operation of a single neuron by the large number of neurons and connections (think about the speed of face recognition by a human, for example, and the time it takes fast computers to do the same task).

Biological Neurons

• The purpose of a neuron is to transmit information in the form of electrical signals:
 – it accepts many inputs, which are all added up in some way;
 – if enough active inputs are received at once, the neuron is activated and fires; if not, it remains in its inactive state.
• Structure of a neuron:
 – Cell body: contains the nucleus holding the chromosomes
 – Dendrites
 – Axon
 – Synapse: couples the axon with the dendrite of another cell; information is passed from one neuron to another through synapses; there is no direct linkage across the junction, it is a chemical one.

Operation of Biological Neurons

• Signals are transmitted between neurons by electrical pulses (action potentials, APs) traveling along the axon.
• When the potential at the synapse is raised sufficiently by the AP, it releases chemicals called neurotransmitters.
 – It may take the arrival of more than one AP before the synapse is triggered.
• The neurotransmitters diffuse across the gap and chemically activate gates on the dendrites, which allow charged ions to flow.
• The flow of ions alters the potential of the dendrite and provides a voltage pulse on the dendrite (the post-synaptic potential, PSP).
 – Some synapses excite the dendrite they affect, while others inhibit it.
 – The synapses also determine the strength of the new input signal.
• Each PSP travels along its dendrite and spreads over the soma (cell body).
• The soma sums the effects of thousands of PSPs; if the resulting potential exceeds a threshold, the neuron fires and generates another AP.

What is an Artificial Neural Network (NN)?

A neural network is a data structure that supposedly simulates the behaviour of neurons in a biological brain.

A neural network is composed of layers of interconnected units. Messages are passed along the connections from one unit to the other. Messages can change based on the weight of the connection and the value in the node.

[Figure: input vector x_i enters the input nodes, flows feedforward through the hidden nodes to the output nodes, producing the output vector]

What is an Artificial Neural Network (NN)? (cont.)

A network of many simple units (neurons, nodes):
• The units are connected by connections.
• Each connection has an associated numeric weight.
• Units receive inputs (from the environment or other units) via the connections. They produce output using their weights and the inputs (i.e. they operate locally).
• A NN can be represented as a directed graph.

NNs learn from examples and exhibit some capability for generalization beyond the training data:
• Knowledge is acquired by the network from its environment via learning and is stored in the weights of the connections.
• The training (learning) rule is a procedure for modifying the weights of the connections in order to perform a certain task.
• There are also some sophisticated techniques that allow learning by adding and pruning connections (between nodes).

A Neuron

[Figure: input vector x = (x_0, x_1, …, x_n) and weight vector w = (w_0, w_1, …, w_n), with a bias θ, feed a weighted sum Σ followed by an activation function f, producing output y; example weights shown: 0.7, 0.3, 0.2]

• The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping.

Neuron Model

• Each connection from unit i to unit j has a numeric weight w_ij associated with it, which determines the strength and the sign of the connection.
• Each neuron first computes the weighted sum of its inputs, and then applies an activation function f to derive its output (activation) a.
• A neuron may have a special weight called the bias weight b.
• NNs represent a function of their weights (parameters). By adjusting the weights, we change this function. This is done by using a learning rule.

Example: if there are 2 inputs p1 = 2 and p2 = 3, and if w11 = 3, w12 = 1, b = -1.5, then
 a = f(p1*w11 + p2*w12 + b) = f(2*3 + 3*1 - 1.5) = f(7.5)
What is f?

Activation Function

• Also called processing element, squashing function, firing rule, …
• It is applied by each neuron to its input values and weights (as well as the bias):
 S_j = θ_j + Σ_{i=1..n} x_i * w_ij
• Its output can be unipolar [0, 1] or bipolar [-1, 1].
• The function can be:
 – Linear: f(S) = cS
 – Thresholded (step): f(S) = 1 if S > T; 0 otherwise
 – Sigmoid: f(S) = 1 / (1 + e^(-cS))
 – Gaussian: f(S) = e^(-S^2 / v), etc.
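To make the worked example concrete, here is a small sketch that computes the weighted sum S = 7.5 for p1 = 2, p2 = 3, w11 = 3, w12 = 1, b = -1.5 and applies some of the activation functions listed above (the defaults T = 0, c = 1, v = 1 are illustrative assumptions):

```python
import math

def weighted_sum(inputs, weights, bias):
    # S = b + sum of x_i * w_i
    return bias + sum(x * w for x, w in zip(inputs, weights))

def step(S, T=0.0):
    return 1.0 if S > T else 0.0            # thresholded (step) function

def sigmoid(S, c=1.0):
    return 1.0 / (1.0 + math.exp(-c * S))   # unipolar, output in (0, 1)

def gaussian(S, v=1.0):
    return math.exp(-S**2 / v)              # peaks at S = 0

S = weighted_sum([2, 3], [3, 1], -1.5)
print(S)             # 7.5
print(step(S))       # 1.0: the neuron fires
print(sigmoid(S))    # very close to 1
```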

Correspondence Between Artificial and Biological Neurons

• How does this artificial neuron relate to the biological one?
 – input p (or input vector p): input signal (or signals) at the dendrites
 – weight w (or weight vector w): strength of the synapse (or synapses)
 – summer & transfer function: cell body
 – neuron output a: signal at the axon

Constructing the Network

• Number of input nodes: generally corresponds to the dimensionality of the input tuples. Input is converted into binary and concatenated to form a bitstream.
 – E.g. age 20-80 split into 6 intervals:
  • [20, 30) → 000001, [30, 40) → 000010, …, [70, 80) → 100000
  • [20, 30) → 001, [30, 40) → 010, …, [70, 80) → 110
• Number of hidden nodes: determined by an expert or, in some cases, adjusted during training.
• Number of output nodes: generally the number of classes.
 – E.g. 10 classes:
  • 0000000001 → C1, 0000000010 → C2, …, 1000000000 → C10
  • 0001 → C1, 0010 → C2, …, 1010 → C10
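The two age encodings above (one bit per interval vs. a compact binary code) can be sketched as follows (the function names are hypothetical):

```python
def one_hot_interval(age, low=20, high=80, n_intervals=6):
    """One bit per interval: [20,30) -> 000001, ..., [70,80) -> 100000."""
    width = (high - low) // n_intervals        # 10-year intervals
    index = (age - low) // width               # 0 .. n_intervals-1
    return format(1 << index, '0{}b'.format(n_intervals))

def binary_interval(age, low=20, high=80, n_intervals=6, n_bits=3):
    """Compact binary code: [20,30) -> 001, [30,40) -> 010, ..., [70,80) -> 110."""
    width = (high - low) // n_intervals
    index = (age - low) // width
    return format(index + 1, '0{}b'.format(n_bits))

print(one_hot_interval(25))   # 000001
print(one_hot_interval(72))   # 100000
print(binary_interval(35))    # 010
```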

Neural Networks - Pros and Cons

• Advantages
 – Prediction accuracy is generally high.
 – Robust: works even when training examples contain errors.
 – Output may be discrete, real-valued, or a vector of several discrete or real-valued attributes.
 – Fast evaluation of the learned target function.
• Criticism
 – Long training time.
 – Difficult to understand the learned function (weights).
 – Typically for numerical data.
 – Not easy to incorporate domain knowledge.
 – Design can be tedious and error-prone (too small: slow learning; too big: instability or poor performance).

Learning Paradigms

(1) Classification: adjust weights using Error = Desired - Actual
(2) Reinforcement: adjust weights using reinforcement
[Figure: training data supplies inputs and a label; the network produces an actual output, which is compared with the actual class]

Learning Algorithms

• Back propagation for classification
• Kohonen feature maps for clustering
• Recurrent back propagation for classification
• Radial basis function for classification
• Adaptive resonance theory
• Probabilistic neural networks

Major Steps for a Back Propagation Network

• Constructing the network
 – input data representation
 – selection of the number of layers and the number of nodes in each layer
• Training the network using training data
• Pruning the network
• Interpreting the results

Network Training

• The ultimate objective of training:
 – obtain a set of weights that makes almost all the tuples in the training data classified correctly.
• Steps:
 – Initial weights are set randomly.
 – Input tuples are fed into the network one by one.
 – Activation values for the hidden nodes are computed.
 – The output vector can be computed once the activation values of all hidden nodes are available.
 – Weights are adjusted using the error (desired output - actual output) and propagated backwards.

Network Pruning

• A fully connected network is hard to articulate.
• n input nodes, h hidden nodes and m output nodes lead to h(m+n) links (weights).
• Pruning: remove some of the links without affecting the classification accuracy of the network.
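A tiny sketch of the link count and the pruning idea: build the h(m+n) weights of a fully connected one-hidden-layer network, then drop links whose magnitude falls below a threshold. The magnitude criterion and the threshold value are illustrative assumptions, not the slides' method:

```python
import random

def build_weights(n_in, n_hidden, n_out, seed=0):
    """All input->hidden and hidden->output links: h*(n+m) weights in total."""
    rng = random.Random(seed)
    w = {}
    for h in range(n_hidden):
        for i in range(n_in):
            w[('in', i, 'hid', h)] = rng.uniform(-1, 1)   # small random init
        for o in range(n_out):
            w[('hid', h, 'out', o)] = rng.uniform(-1, 1)
    return w

def prune(weights, threshold=0.1):
    """Remove near-zero links; ideally without hurting accuracy."""
    return {link: v for link, v in weights.items() if abs(v) >= threshold}

w = build_weights(n_in=10, n_hidden=3, n_out=2)
print(len(w))         # 3 * (10 + 2) = 36 links
print(len(prune(w)))  # at most 36 after pruning
```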

Backpropagation Network - Architecture

• 1) A network with one or more hidden layers.
 [Figure: output neurons (one output neuron for each class), hidden neurons (one hidden layer), inputs (one input neuron for each attribute)]
• 2) Feedforward network: each neuron receives input only from the neurons in the previous layer.
• 3) Typically fully connected: all neurons in a layer are connected with all neurons in the next layer.
• 4) Weight initialization: small random values, e.g. in [-1, 1].

Backpropagation Network - Architecture 2

The weather data:

Outlook    Temperature  Humidity  Windy  Play
sunny      hot          high      false  No
sunny      hot          high      true   No
overcast   hot          high      false  Yes
rain       mild         high      false  Yes
rain       cool         normal    false  Yes
rain       cool         normal    true   No
overcast   cool         normal    true   Yes
sunny      mild         high      false  No
sunny      cool         normal    false  Yes
rain       mild         normal    false  Yes
sunny      mild         normal    true   Yes
overcast   mild         high      true   Yes
overcast   hot          normal    false  Yes
rain       mild         high      true   No

• 5) Neuron model: weighted sum of the input signals + a differentiable transfer function:
 a = f(wp + b)
• Any differentiable transfer function f can be used; most frequently the sigmoid and tan-sigmoid (hyperbolic tangent sigmoid) functions are used:
 a = 1 / (1 + e^(-n))   and   a = (e^n - e^(-n)) / (e^n + e^(-n))
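A quick numeric check of the two transfer functions above, including the fact that the tan-sigmoid coincides with the hyperbolic tangent:

```python
import math

def sigmoid(n):
    # a = 1 / (1 + e^-n), output in (0, 1)
    return 1.0 / (1.0 + math.exp(-n))

def tan_sigmoid(n):
    # a = (e^n - e^-n) / (e^n + e^-n), output in (-1, 1)
    return (math.exp(n) - math.exp(-n)) / (math.exp(n) + math.exp(-n))

print(sigmoid(0))       # 0.5, the midpoint of the unipolar range
print(tan_sigmoid(0))   # 0.0, the midpoint of the bipolar range
# tan-sigmoid is the same function as math.tanh
print(abs(tan_sigmoid(1.5) - math.tanh(1.5)) < 1e-12)
```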

Architecture - Number of Input Units

[Weather table as on the previous slide]

• Numerical data: typically 1 input unit for each attribute.
• Categorical data: 1 input unit for each attribute value.
• How many input units for the weather data?

[Figure: input units sunny/overcast/rainy, hot/mild/cool, high/normal, false/true, grouped under outlook, temperature, humidity and windy, feeding the hidden layer(s)]

• Encoding of the input examples: typically binary, depending on the value of the attribute (on and off).
 e.g. "sunny hot high false" → 100 100 10 01
• Other possibilities are also acceptable. For example, "Windy" could be coded with only one unit: true or false (1 or 0).

Number of Output Units

[Weather table as on the previous slide]

• Typically 1 neuron for each class.
 e.g. classes No and Yes; target class for ex. 1: 1 0

[Figure: output layer above the hidden layer(s); ex. 1 "sunny hot high false / No" is encoded as input 1 0 0 1 0 0 1 0 1 0 with target 1 0]

• Encoding of the targets (classes): typically binary,
 e.g. class 1 (no): 1 0, class 2 (yes): 0 1
• Another possibility is to code the target class with only one unit: Yes or No (1 or 0).

Number of Hidden Layers and Units in Them

• An art! Typically found by trial and error.
• The task constrains the number of input and output units, but not the number of hidden layers or the neurons in them.
 – Too many hidden layers and units (i.e. too many weights): overfitting.
 – Too few: underfitting, i.e. the NN is not able to learn the input-output mapping.
 – A heuristic to start with: 1 hidden layer with n hidden neurons, n = (inputs + output_neurons) / 2.

Learning in Backpropagation NNs

[Figure: the labeled data supplies an example p with desired output d; p is propagated forward through the network to produce the output a; a is compared with d to calculate the error; the error is propagated backward and the weights are adjusted]

Idea of backpropagation learning:
 For each training example p:
 – propagate p through the network and calculate the output a; compare the desired output d with the actual output a and calculate the error;
 – update the weights of the network to reduce the error;
 until the error over all examples < threshold.

• Why "backpropagation"? It adjusts the weights backwards (from the output to the input units) by propagating the weight change ∆w:
 w_pq^new = w_pq^old + ∆w_pq
How do we calculate the weight change?
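The propagate-forward / compare / propagate-backward loop above can be sketched end to end for a one-hidden-layer sigmoid network. This is a minimal illustrative implementation of the standard gradient-descent weight change (the learning rate eta, the epoch count, the network size, and the XOR sanity-check data are all assumptions, not values from the slides):

```python
import math, random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

class TinyNet:
    """1 hidden layer, sigmoid units, trained by backpropagation."""
    def __init__(self, n_in, n_hid, n_out, seed=0):
        rng = random.Random(seed)
        # small random initial weights; each unit also has a bias weight
        self.w1 = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                   for _ in range(n_hid)]
        self.w2 = [[rng.uniform(-1, 1) for _ in range(n_hid + 1)]
                   for _ in range(n_out)]

    def forward(self, x):
        xb = x + [1.0]                          # constant bias input
        self.h = [sigmoid(sum(w * xi for w, xi in zip(ws, xb)))
                  for ws in self.w1]            # hidden activations
        hb = self.h + [1.0]
        self.a = [sigmoid(sum(w * hi for w, hi in zip(ws, hb)))
                  for ws in self.w2]            # output activations
        return self.a

    def backward(self, x, d, eta=0.5):
        # output deltas: (desired - actual) * derivative of sigmoid
        delta_o = [(dk - ak) * ak * (1 - ak) for dk, ak in zip(d, self.a)]
        # hidden deltas: error propagated backward through w2
        delta_h = [hj * (1 - hj) *
                   sum(delta_o[k] * self.w2[k][j] for k in range(len(self.a)))
                   for j, hj in enumerate(self.h)]
        hb = self.h + [1.0]
        xb = x + [1.0]
        for k, ws in enumerate(self.w2):        # w_new = w_old + Δw
            for j in range(len(ws)):
                ws[j] += eta * delta_o[k] * hb[j]
        for j, ws in enumerate(self.w1):
            for i in range(len(ws)):
                ws[i] += eta * delta_h[j] * xb[i]

# Sanity check: learn XOR (2 inputs, 1 output unit)
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
net = TinyNet(2, 3, 1)
for _ in range(5000):                           # epochs
    for x, d in data:
        net.forward(x)
        net.backward(x, d)
print([round(net.forward(x)[0]) for x, _ in data])  # rounded outputs per pattern
```

The key point the slides build toward: the weight change ∆w is proportional to the unit's error delta and its input, with the error deltas computed layer by layer from the output back to the inputs.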