Overview of Data Mining and Machine Learning Tech Talk by Lee Harkness


Data mining is the search for hidden relationships in data sets.  Machine learning is the implementation of some form of artificial “learning”, where “learning” is the ability to alter an existing model based on new information. Businesses use data mining techniques to identify potentially useful information in their data, in order to aid business decision-making processes. Machine learning is used to improve these decision-making models.


Data mining techniques assume that the relationships which are to be discovered actually exist within the dataset being examined.  Machine learning techniques assume that it’s possible to create a model appropriate for the environment being studied.

Business Applications of Data Mining and Machine Learning

Many businesses have a substantial amount of data, often with volume growing at a rapid rate. This makes cost-effective manual data analysis virtually impossible. Therefore, businesses turn to data mining techniques to identify potentially useful information in their data, in order to aid business decision-making processes and enhance business intelligence in general. Machine learning leverages data mining and computational intelligence algorithms in order to improve decision-making models.

Example applications of data mining and machine learning to business uses include:

•    Software Engineering: Approaching certain software development and maintenance tasks as machine learning problems. This approach is especially useful in very large and complex software, software which may be used by or required to conform with many different organizations and/or systems, and software which must keep up with continually and rapidly changing environments. Example applications of data mining and machine learning to software engineering are software quality models, predicting the cost of software development, software development effort estimation, maintenance effort prediction, software defect prediction, improving software modularity, generating test data, project management rules, database schemas, and even in some rare cases software programs/scripts themselves.

•    Search Engines: Adapting search engine results to search behaviors and the preferences of search users. Determining the relevance of topics on a webpage to topics of a given keyword for which that webpage may be listed in the search engine result pages.

•    Customer Relationship Management (CRM):  Determining the probability a given customer will respond favorably to a certain interaction, typically sales and marketing activities, but also customer and technical support approaches.

•    Human Resources: Determining the probability that a given recruit will be a successful fit in an organization. Predicting what incentives and company policies in general are most likely to achieve the desired HR results.

•    Retail: Determining the probability that a given customer would prefer a certain product or has certain preferences, for example in the product placement and recommender systems used by many online retailers.

•    Fraud Analysis: Determining the probability that a given credit card transaction may be fraudulent.

•    Pharmaceuticals: Using bioinformatics to analyze life science data in order to aid in future drug discovery and development processes. Analyzing demographic and health data to predict profitability of a future drug if it were brought to market.

What is Data Mining?

As a prerequisite for data mining we need a set of data.  As noted in the assumptions above, this dataset must contain the relationships we are interested in.  Note, however, that the fact that we’re mining this data implies that we do not know the exact nature of these relationships, and often we don’t even know what the possible relationships are.  To give ourselves the best chance of meeting our assumption we need as large a dataset as possible, so we will often create a data warehouse which holds all the data we generate and mine that.

The data warehouse is typically a large, relatively unstructured collection of tables which contain large amounts of raw data.  Mining this dataset can be very time consuming and complicated, so the data is then preprocessed to make it easier to apply data mining techniques.  Standard preprocessing tasks involve throwing out incomplete, uninteresting or outlier data, a process called “cleaning”, and processing the remaining data in such a way as to reduce it to only the features deemed necessary to carry out the mining.  Each remaining entry is called a “feature vector”.
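The cleaning and feature selection steps above can be sketched roughly as follows. This is only an illustration: the records, field names, and the outlier cutoff are hypothetical and not taken from any real warehouse.

```python
# Hypothetical raw records; the field names are illustrative assumptions.
RAW_RECORDS = [
    {"customer": "a", "age": 34, "visits": 12, "spend": 250.0},
    {"customer": "b", "age": None, "visits": 3, "spend": 40.0},   # incomplete
    {"customer": "c", "age": 51, "visits": 7, "spend": 130.0},
    {"customer": "d", "age": 29, "visits": 9, "spend": 99999.0},  # outlier
]

FEATURES = ("age", "visits", "spend")  # the features deemed necessary
MAX_SPEND = 10000.0                    # assumed outlier cutoff


def clean(records):
    """Drop records with a missing feature value or an implausible spend."""
    kept = []
    for record in records:
        if any(record.get(name) is None for name in FEATURES):
            continue  # incomplete data
        if record["spend"] > MAX_SPEND:
            continue  # outlier data
        kept.append(record)
    return kept


def to_feature_vector(record):
    """Reduce a record to a tuple holding only the selected features."""
    return tuple(float(record[name]) for name in FEATURES)


feature_vectors = [to_feature_vector(r) for r in clean(RAW_RECORDS)]
print(feature_vectors)
```

In practice the cleaning rules depend heavily on the problem domain; the point is only that each surviving record ends up as one fixed-length feature vector.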

At this point we have a large collection of feature vectors which we can mine.  We make the determination of what we are interested in finding and proceed accordingly.  There are typically four kinds of things we are interested in finding:
1.    Clusters of data which are related in some way that is not found in the features
2.    Classifications of features and the ability to classify new data
3.    Statistical methods and/or mathematical functions which model the data
4.    Hidden relationships between features

Clustering involves separating a dataset into a set of clusters, such that elements of each cluster are similar in some fashion.  The first step in this process is to determine the number of clusters to use.  If this isn’t evident from the problem domain then there are techniques to determine a reasonable value, involving various levels of magic:
1.    Use the square root of ½ the number of feature vectors
2.    Graph the amount of variance found as a function of number of clusters and choose the number of clusters at which adding another cluster stops reducing the variance by much (the “elbow” of the graph)
3.    Apply Rate Distortion Theory and pick the number of clusters just after the jump
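The first two heuristics are simple enough to sketch in a few lines. The function names are hypothetical, and the way the bend in the variance graph is located here (largest change in the size of the drop) is just one assumed reading of that graph, not a standard definition.

```python
import math


def rule_of_thumb_k(n_vectors):
    """Square root of half the number of feature vectors, at least 1."""
    return max(1, round(math.sqrt(n_vectors / 2)))


def elbow_k(variances):
    """Pick the bend in a variance curve.

    variances[i] is the within-cluster variance measured with k = i + 1
    clusters. The bend is taken to be the k where the drop leading into k
    most exceeds the drop leading out of it.
    """
    best_k, best_bend = 1, float("-inf")
    for i in range(1, len(variances) - 1):
        drop_before = variances[i - 1] - variances[i]
        drop_after = variances[i] - variances[i + 1]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k


print(rule_of_thumb_k(200))                        # 200 feature vectors
print(elbow_k([100.0, 60.0, 20.0, 18.0, 17.0]))    # made-up variance curve
```

Note that simply choosing the lowest variance would always favor the largest number of clusters, which is why the sketch looks for the bend instead.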

There are other methods; it’s something of an art.  Once you’ve determined the number of clusters to use there are standard algorithms to run over the dataset.  Almost all these methods will calculate some distance measure between any two given points and then start assigning clusters appropriately.
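As one example of such an algorithm, here is a minimal sketch of k-means, which the talk does not name but which follows the pattern described: measure distances, assign each point to the nearest cluster center, and repeat. The naive choice of the first k points as starting centers and the sample points are assumptions made to keep the example short.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, iterations=20):
    centers = list(points[:k])  # naive initialization (assumption)
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: distance(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            new_centers.append(mean(members) if members else centers[c])
        if new_centers == centers:
            break  # nothing moved, so we are done
        centers = new_centers
    return centers, assignment


points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0),
          (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
centers, labels = k_means(points, k=2)
print(labels)
```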

Classification is different from clustering in that you know the classifications and you wish to “teach” the system how to classify incoming data.  The standard techniques for this problem include Bayesian Filtering, nearest neighbor and support vector machines.  All of them require a period of training where they are presented with data and told which classification it belongs to.  During this training period they are adjusting their internal models to yield the given results.  After being trained the hope is that their internal models are accurate enough to predict the class of new data.
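Nearest neighbor is the easiest of these to sketch. In the minimal version below, “training” just means remembering the labeled examples, and a new vector gets the label of the closest remembered example. The data and labels are made up for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train(examples):
    """For nearest neighbor, the internal model is the stored examples."""
    return list(examples)


def classify(model, vector):
    """Return the label of the training example closest to the vector."""
    _, label = min(model, key=lambda example: distance(example[0], vector))
    return label


# Hypothetical training data: (feature vector, known classification).
training_data = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 8.5), "high"),
]

model = train(training_data)
print(classify(model, (2.0, 1.0)))
print(classify(model, (7.5, 9.0)))
```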

Modeling typically consists of performing regression analysis in order to model the data with the least error. Regression modeling attempts to fit a mathematical formula to the data which can then be used to make predictions or forecasts.
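The simplest case is fitting a straight line by ordinary least squares, sketched below. The data points are invented, and a straight line is only one of many formulas a regression might fit.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up observations that happen to lie on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
forecast = slope * 5.0 + intercept  # predict y for an unseen x
print(slope, intercept, forecast)
```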

Finding hidden relationships between features is called Association Rule Learning.  The standard illustration is “if a person views x and y then they will most likely view z”.  These sorts of questions are applied to problems involving product placement and to recommender systems.
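One way to put a number on “most likely” is to count how often the rule holds. The sketch below uses the usual support and confidence measures, which the talk does not name; the viewing history is invented.

```python
def support(transactions, items):
    """Fraction of transactions that contain every one of the items."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    base = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base if base else 0.0


# Hypothetical viewing history, one set of viewed pages per person.
views = [
    {"x", "y", "z"},
    {"x", "y", "z"},
    {"x", "y"},
    {"x", "z"},
    {"y"},
]

# How strongly does "viewed x and y" suggest "will view z"?
rule_confidence = confidence(views, {"x", "y"}, {"z"})
print(rule_confidence)
```

Here three people viewed both x and y, and two of them also viewed z, so the rule holds about two thirds of the time.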

What is Machine Learning?

Machine Learning refers to techniques which allow an algorithm to modify itself based on observing its performance such that its performance increases.  There are several machine learning algorithms, but most of them follow this general sequence of events:
1.    Execute
2.    Determine how well you did
3.    Adjust parameters to do better
4.    Repeat until good enough

The ability to determine how well the algorithm did is a prerequisite for machine learning.  A machine learning algorithm will perform some task, examine the experience, assess its performance, adjust its parameters and repeat until some fitness threshold is met.
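The loop can be sketched with about the smallest model possible: a single weight that should learn to map x to y. Gradient descent on the mean squared error is used as the adjustment step here, which is an assumption of this example; the loop itself does not care how the adjustment is made.

```python
def learn(examples, learning_rate=0.05, good_enough=1e-6, max_rounds=1000):
    """Learn a weight so that weight * x approximates y."""
    weight = 0.0
    error = float("inf")
    for _ in range(max_rounds):
        # 1. Execute: run the current model over the examples.
        predictions = [weight * x for x, _ in examples]
        # 2. Determine how well we did: mean squared error.
        error = sum(
            (p - y) ** 2 for p, (_, y) in zip(predictions, examples)
        ) / len(examples)
        # 4. Repeat until good enough.
        if error < good_enough:
            break
        # 3. Adjust parameters to do better: step against the gradient.
        gradient = sum(
            2 * (p - y) * x for p, (x, y) in zip(predictions, examples)
        ) / len(examples)
        weight -= learning_rate * gradient
    return weight, error


# Made-up examples following y = 3x.
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
weight, error = learn(examples)
print(weight, error)
```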

There are two general categories of machine learning algorithms: supervised and unsupervised.  Supervised learning involves a training process in which the algorithm is shown examples along with the desired results.  Unsupervised learning algorithms are given no such answers and must find structure in the data on their own.  In the “What is Data Mining?” section, clustering would be an example of unsupervised machine learning, while classification would be an example of supervised machine learning.

Examples of machine learning algorithms:
1.    Neural networks for classification
2.    Apriori for rules discovery

Neural networks are loosely modeled on how the brain is wired up.  They are a collection of nodes, each of which has inputs, an output and a threshold value.  If the combined (typically weighted) value of a node’s inputs exceeds its threshold value then the output is activated, otherwise it is not.  Training a neural network involves adjusting the weights and threshold values such that a given input will produce the desired output.
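A single node is enough to show the idea. The sketch below trains one threshold node to behave like a logical AND, giving each input a weight and using the classic perceptron update rule; both the rule and the example task are choices made for illustration. A real network wires many such nodes together and is usually trained differently.

```python
def activate(weights, threshold, inputs):
    """Output 1 if the weighted inputs exceed the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0


def train_node(examples, rounds=20, step=1.0):
    """Adjust weights and threshold until every example is reproduced
    or the round limit is reached (perceptron rule)."""
    weights = [0.0] * len(examples[0][0])
    threshold = 0.0
    for _ in range(rounds):
        mistakes = 0
        for inputs, target in examples:
            delta = target - activate(weights, threshold, inputs)
            if delta != 0:
                mistakes += 1
                # Nudge the weights toward the target and move the
                # threshold the opposite way.
                weights = [w + step * delta * x
                           for w, x in zip(weights, inputs)]
                threshold -= step * delta
        if mistakes == 0:
            break
    return weights, threshold


# Desired behavior: output 1 only when both inputs are 1.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, threshold = train_node(AND_EXAMPLES)
for inputs, target in AND_EXAMPLES:
    print(inputs, activate(weights, threshold, inputs), target)
```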

Apriori first prunes out infrequent items, then looks at combinations of the remaining items and prunes out infrequent combinations, repeating with ever larger combinations and leaving us with frequent combinations of things.
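A compact sketch of that pruning idea follows. It counts single items first, keeps only the frequent ones, and then builds larger combinations only from pieces that were themselves frequent. The baskets and the minimum count are hypothetical, and this version favors clarity over the efficiency of a production implementation.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return a dict mapping each frequent itemset to its count."""
    transactions = [frozenset(t) for t in transactions]

    # Count single items and prune the infrequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    size = 2
    while frequent:
        items = sorted(set().union(*frequent))
        # Only consider a combination if all of its smaller subsets
        # survived the previous round.
        candidates = [
            frozenset(combo)
            for combo in combinations(items, size)
            if all(frozenset(sub) in frequent
                   for sub in combinations(combo, size - 1))
        ]
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        size += 1
    return result


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs", "bread"},
    {"tea"},
]

frequent_sets = apriori(baskets, min_count=3)
for itemset, count in frequent_sets.items():
    print(sorted(itemset), count)
```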
