Data Mining and Machine Learning
Data mining is the search for hidden relationships in data sets.
- Businesses use data mining techniques to identify potentially useful information in their data.
- Data mining aids business decision making processes.
- Data mining techniques assume that the relationships which are to be discovered exist within the dataset being examined.
Machine learning is the implementation of some form of artificial “learning”, where “learning” means the ability to alter an existing model based on new information.
- Machine learning is utilized to improve decision-making models.
- Machine learning techniques assume that it’s possible to create a model appropriate for the environment being studied.
Business Applications of Data Mining and Machine Learning
Many businesses have a substantial amount of data, often with volume growing at a rapid rate. This makes cost-effective manual data analysis virtually impossible. Therefore, businesses turn to data mining techniques to identify potentially useful information in their data, to aid business decision-making processes, and to enhance business intelligence in general.
Machine learning leverages data mining and computational intelligence algorithms to improve decision making models. Example applications of data mining and machine learning to business uses include:
- Search Engines: Adapting search engine results to the search behaviors and preferences of users. Determining the relevance of a webpage’s topics to a given keyword, for which that webpage may be listed in the search engine results pages.
- Customer Relationship Management (CRM): Determining the probability a given customer will respond favorably to a certain interaction, typically sales and marketing activities, but also customer and technical support approaches.
- Human Resources: Determining the probability that a given recruit will be a successful fit in an organization. Predicting what incentives and company policies in general are most likely to achieve the desired HR results.
- Retail: Determining the probability that a given customer will prefer a certain product, or inferring user preferences in general, as in the product placement and recommender systems utilized by many online retailers.
- Fraud Analysis: Determining the probability that a given credit card transaction may be fraudulent.
- Pharmaceuticals: Using bioinformatics to analyze life science data to aid in future drug discovery and development processes. Analyzing demographic and health data to predict profitability of a future drug if it were brought to market.
What is Data Mining?
As a prerequisite for data mining we need a set of data. This dataset must contain the relationships we are interested in. Note, however, that the very fact that we’re mining the data implies that we do not know the exact nature of these relationships; often we don’t even know what relationships might be present. To satisfy our assumption that the relationships exist in the data, we need as large a dataset as possible. We will often create a data warehouse which holds all the data we generate, and mine that.
The data warehouse is typically a large, relatively unstructured collection of tables which contain large amounts of raw data. Mining this dataset can be very time consuming and complicated, so the data is then preprocessed to make it easier to apply data mining techniques. Standard preprocessing tasks involve:
- Throwing out incomplete, uninteresting or outlier data, a process called “cleaning”
- Processing the remaining data to reduce it to only the features deemed necessary to carry out the mining
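As a rough sketch of these two steps, the Python below cleans a handful of made-up records and then reduces each to a numeric feature vector; the field names and thresholds are hypothetical, chosen only for illustration.

```python
# Minimal preprocessing sketch: "cleaning", then feature reduction.
# The record layout and thresholds are hypothetical.
raw_records = [
    {"age": 34, "income": 52000, "visits": 12, "notes": "loyal customer"},
    {"age": None, "income": 48000, "visits": 8, "notes": ""},      # incomplete
    {"age": 29, "income": 9999999, "visits": 3, "notes": "test"},  # outlier
    {"age": 45, "income": 61000, "visits": 20, "notes": "frequent"},
]

# Cleaning: throw out incomplete rows and obvious outliers.
cleaned = [r for r in raw_records
           if r["age"] is not None and r["income"] < 1000000]

# Feature reduction: keep only the fields we intend to mine,
# turning each record into a numeric feature vector.
FEATURES = ("age", "income", "visits")
feature_vectors = [tuple(r[f] for f in FEATURES) for r in cleaned]

print(feature_vectors)  # [(34, 52000, 12), (45, 61000, 20)]
```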
Each remaining entry is called a “feature vector”. We then decide what we are interested in finding and proceed accordingly. There are typically four kinds of things to look for:
- Clusters of data which are related in some way that is not explicit in the features themselves
- Classifications of features and the ability to classify new data
- Statistical methods and/or mathematical functions which model the data
- Hidden relationships between features
Clustering involves separating a dataset into a set of clusters, such that elements of each cluster are similar in some fashion. The first step in this process is to determine the number of clusters to use. If this isn’t evident from the problem domain then there are techniques to determine a reasonable value, involving various levels of magic:
- Use the square root of ½ the number of feature vectors
- Graph the amount of variance explained as a function of the number of clusters and choose the number at the “elbow”, beyond which adding clusters reduces the variance very little
- Apply Rate Distortion Theory and pick the number of clusters just after the largest jump in the transformed distortion curve
Once you’ve determined the number of clusters to use there are standard algorithms to run over the dataset. Almost all these methods will calculate some distance measure between any two given points and then start assigning clusters appropriately.
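For example, here is a minimal sketch of k-means, one such standard algorithm, run over made-up 2-D points. It uses Euclidean distance as the distance measure and the square-root rule of thumb from above to pick the number of clusters; real implementations add smarter initialization and convergence checks.

```python
import math
import random

def kmeans(points, k, iterations=20):
    """Minimal k-means: assign each point to the nearest centroid,
    then move each centroid to the mean of its cluster."""
    centroids = random.sample(points, k)  # naive random initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # The distance measure between two points: Euclidean distance.
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # keep the old centroid if a cluster goes empty
                centroids[i] = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return centroids, clusters

points = [(1, 1), (1.5, 2), (8, 8), (8.5, 9), (9, 8)]
k = max(1, round(math.sqrt(len(points) / 2)))  # sqrt of 1/2 the vector count
centroids, clusters = kmeans(points, k)
print(k, centroids)
```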
Classification is different from clustering in that you know the classifications in advance and you wish to “teach” the system how to classify incoming data. The standard techniques for this problem include Bayesian filtering, nearest neighbor, and support vector machines. All of them require a period of training during which they are presented with data items and told which classification each belongs to. During this training period they adjust their internal models to yield the given results. After training, the hope is that their internal models are accurate enough to predict the class of new data.
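To illustrate the train-then-predict pattern with the simplest of these techniques, here is a minimal nearest-neighbor sketch; the labeled training points are made up for illustration.

```python
import math

# "Training" for nearest neighbor is simply storing labeled examples.
training_data = [
    ((1.0, 1.0), "spam"),
    ((1.2, 0.8), "spam"),
    ((8.0, 9.0), "ham"),
    ((7.5, 8.5), "ham"),
]

def classify(point):
    """Predict the class of a new point from its nearest training example."""
    _, label = min(training_data, key=lambda ex: math.dist(point, ex[0]))
    return label

print(classify((1.1, 0.9)))  # -> spam
print(classify((8.2, 8.8)))  # -> ham
```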
Modeling typically consists of performing regression analysis to model the data with the least error. Regression modeling attempts to fit a mathematical formula to the data, which can then be used to make predictions or forecasts.
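For instance, here is a minimal least-squares fit of a line y = a·x + b using the closed-form formulas; the data points are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b, in closed form."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]    # roughly y = 2x
a, b = fit_line(xs, ys)
print(f"y = {a:.2f}x + {b:.2f}")  # the fitted formula, usable for forecasts
```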
Finding hidden relationships between features is called Association Rule Learning. The standard illustration is “if a person views x and y then they will most likely view z”. These sorts of rules are applied to problems involving product placement and to recommender systems.
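The strength of such a rule is typically measured by its support and confidence; here is a minimal sketch computing both for the rule above, over hypothetical viewing histories.

```python
# Support and confidence for the rule "if a person views x and y, they view z".
# The transactions are hypothetical viewing histories.
transactions = [
    {"x", "y", "z"},
    {"x", "y", "z"},
    {"x", "y"},
    {"y", "z"},
    {"x", "z"},
]

antecedent, consequent = {"x", "y"}, {"z"}
both = sum(1 for t in transactions if (antecedent | consequent) <= t)
ante = sum(1 for t in transactions if antecedent <= t)

print("support:", both / len(transactions))  # rule holds in 2 of 5 transactions
print("confidence:", both / ante)            # of the x,y viewers, 2/3 view z
```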
What is Machine Learning?
Machine Learning refers to techniques which allow an algorithm to modify itself based on observations of its own performance, such that its performance improves.
There are several machine learning algorithms, but most of them follow this general sequence of events (sketched in code after the list):
- Determine how well you did
- Adjust parameters to do better
- Repeat until good enough
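As a minimal sketch of that loop, the toy example below repeatedly nudges a single parameter toward a hypothetical target value; real algorithms differ mainly in how the “determine” and “adjust” steps are implemented.

```python
# Minimal learning loop: measure the error, nudge the parameter, repeat.
# The target value and learning rate are hypothetical.
target = 7.0
param = 0.0
learning_rate = 0.1

while True:
    error = param - target          # determine how well you did
    if abs(error) < 0.01:           # repeat until good enough
        break
    param -= learning_rate * error  # adjust the parameter to do better

print(round(param, 2))  # ~7.0
```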
There are two general categories of machine learning algorithms:
- Supervised learning involves an external training process which presents the algorithm with labeled examples.
- Unsupervised learning algorithms train themselves, finding structure in unlabeled data on their own.
In the section above, clustering would be an example of unsupervised machine learning, while classification would be an example of supervised machine learning.
Neural networks and Apriori are examples of machine learning algorithms.
Neural networks for classification
Neural networks simulate how the brain is wired up. They are collections of nodes, each of which has inputs, an output, and a threshold value. If the weighted sum of a node’s inputs exceeds its threshold value then the output is activated; otherwise it is not. Training a neural network involves adjusting the weights and threshold values such that a given input will produce the desired output.
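A single node of this kind is essentially a perceptron. Below is a minimal sketch that trains one node to compute logical AND using the classic perceptron update rule; the learning rate and the number of training passes are arbitrary choices.

```python
# One threshold node: it fires if the weighted inputs exceed the threshold.
# Trained on logical AND with the classic perceptron update rule.
weights = [0.0, 0.0]
threshold = 0.0
learning_rate = 0.1

training = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def activate(inputs):
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

for _ in range(20):  # several passes over the training data
    for inputs, desired in training:
        error = desired - activate(inputs)
        # Nudge the weights and the threshold toward the desired output.
        weights = [w + learning_rate * error * x
                   for w, x in zip(weights, inputs)]
        threshold -= learning_rate * error

print([activate(i) for i, _ in training])  # -> [0, 0, 0, 1]
```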
Apriori for rule discovery
Apriori first prunes out infrequent individual items, then looks at progressively larger combinations of items, pruning out the infrequent combinations at each step, leaving us with the frequent combinations of items.
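Here is a minimal sketch of that frequent-itemset search over hypothetical shopping baskets; the support threshold is arbitrary, and generating the actual “if x and y then z” rules from the itemsets is left out.

```python
# Minimal Apriori-style frequent-itemset search over hypothetical baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]
min_support = 2  # "frequent" means appearing in at least 2 baskets (arbitrary)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# First pass: prune out infrequent individual items.
items = {i for t in transactions for i in t}
frequent = [{frozenset({i}) for i in items if support({i}) >= min_support}]

# Later passes: combine frequent itemsets into larger candidates, then prune.
size = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1]
                  if len(a | b) == size}
    frequent.append({c for c in candidates if support(c) >= min_support})
    size += 1

for level in frequent:
    for itemset in sorted(level, key=sorted):
        print(sorted(itemset), support(itemset))
```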
This article has summarized the fundamentals of using data mining to glean useful insights from the information within an organization, and of using machine learning to enhance decision making based on that data. The combination has potentially powerful ramifications for businesses.