Data Clusters
- Clusters are collections of similar data
- Clustering is a type of unsupervised learning
- The Correlation Coefficient describes the strength and direction of a linear relationship
Clusters
Clusters are collections of data based on similarity.
Data points that lie close together in a graph can often be grouped into clusters.
In the graph below we can distinguish 3 different clusters:
Identifying Clusters
Clusters can hold a lot of valuable information, but clusters come in all sorts of shapes, so how can we recognize them?
The two main methods are:
- Using Visualization
- Using a Clustering Algorithm
Clustering
Clustering is a type of Unsupervised Learning.
Clustering tries to:
- Collect similar data in the same group
- Place dissimilar data in different groups
Clustering Methods
- Density Method
- Hierarchical Method
- Grid-based Method
- Partitioning Method
The Density Method treats points in a dense region as more similar to each other than to points in a lower-density region. It generally gives good accuracy and can also merge clusters.
Two common algorithms are DBSCAN and OPTICS.
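To make the density idea concrete, here is a minimal pure-Python sketch of DBSCAN. The function name `dbscan`, the parameters `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a dense region), and the sample points are assumptions chosen for this illustration; a real project would more likely use a library implementation.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may later be claimed as a border point
            continue
        labels[i] = cluster              # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # noise next to a core point becomes a border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:     # j is also a core point: keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels

# Hypothetical sample data: two dense groups and one isolated point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1),
          (9.0, 1.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point is labeled -1 (noise) because its neighborhood never reaches `min_pts`.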
The Hierarchical Method forms the clusters in a tree-type structure. New clusters are formed using previously formed clusters.
Two common algorithms are CURE and BIRCH.
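CURE and BIRCH are fairly involved, so the sketch below uses plain single-linkage agglomerative clustering instead, only to show how new clusters are built from previously formed ones. The function name `agglomerative`, the `n_clusters` stopping rule, and the sample points are assumptions made for this example.

```python
from math import dist

def agglomerative(points, n_clusters):
    """Repeatedly merge the two closest clusters (single linkage)."""
    clusters = [[i] for i in range(len(points))]   # start: one cluster per point
    merges = []                                    # merge history, i.e. the tree bottom-up

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges

# Hypothetical sample data: two pairs of nearby points and one far point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = agglomerative(points, n_clusters=3)
print(sorted(sorted(c) for c in clusters))  # [[0, 1], [2, 3], [4]]
```

Each entry in `merges` records which two existing clusters were joined, which is the information a tree diagram (dendrogram) would display.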
The Grid-based Method formulates the data into a finite number of cells that form a grid-like structure.
Two common algorithms are CLIQUE and STING.
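The grid idea can be sketched in a few lines. This is a simplified illustration, not CLIQUE or STING: points are binned into square cells, cells with enough points are kept as dense, and dense cells that touch are joined into one cluster. The names `grid_clusters`, `cell_size`, and `min_count`, and the sample points, are assumptions for this example.

```python
from collections import defaultdict

def grid_clusters(points, cell_size, min_count):
    """Bin points into square cells, keep dense cells, join adjacent dense cells."""
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(x // cell_size), int(y // cell_size))].append(idx)

    # A cell is dense if it holds at least min_count points.
    dense = {c for c, members in cells.items() if len(members) >= min_count}

    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        # Flood fill over dense cells that touch (8-neighbourhood).
        stack, group = [start], []
        seen.add(start)
        while stack:
            cx, cy = stack.pop()
            group.extend(cells[(cx, cy)])
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(sorted(group))
    return clusters

# Hypothetical sample data; the last point sits alone in a sparse cell.
points = [(0.1, 0.2), (0.4, 0.3), (0.8, 0.9), (1.2, 0.5), (1.4, 0.7),
          (7.1, 7.2), (7.5, 7.8), (4.0, 4.0)]
result = grid_clusters(points, cell_size=1.0, min_count=2)
print(result)  # [[0, 1, 2, 3, 4], [5, 6]]
```

Because the work is done per cell rather than per pair of points, the cost depends mainly on the number of occupied cells.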
The Partitioning Method divides the objects into k partitions, where each partition forms one cluster.
One common algorithm is CLARANS.
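As an illustration of partitioning into k clusters, the sketch below uses k-means (Lloyd's algorithm) rather than CLARANS, because it is shorter to write down. The function name `k_means`, the explicit starting centroids, and the sample points are assumptions for this example.

```python
from math import dist

def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm: assign to nearest centroid, then recompute centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the partition of its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its partition.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])   # keep the centroid of an empty partition
        if new_centroids == centroids:
            break                                    # converged: nothing moved
        centroids = new_centroids
    return labels, centroids

# Hypothetical sample data with k = 2 and hand-picked starting centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = k_means(points, centroids=[(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The number of starting centroids sets k, so every point ends up in exactly one of the k partitions.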
Correlation Coefficient
The Correlation Coefficient (r) describes the strength and direction of the linear relationship between the x and y variables on a scatterplot.
The value of r is always between -1 and +1:
r | Strength | Direction
-1.00 | Perfect downhill | Negative linear relationship
-0.70 | Strong downhill | Negative linear relationship
-0.50 | Moderate downhill | Negative linear relationship
-0.30 | Weak downhill | Negative linear relationship
0 | None | No linear relationship
+0.30 | Weak uphill | Positive linear relationship
+0.50 | Moderate uphill | Positive linear relationship
+0.70 | Strong uphill | Positive linear relationship
+1.00 | Perfect uphill | Positive linear relationship
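The values in the table can be reproduced with a short calculation. The sketch below computes Pearson's r directly from its definition (the covariance of x and y divided by the product of their standard deviations). The function name `pearson_r` and the sample data are assumptions for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations (covariance numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical sample data.
x = [1, 2, 3, 4, 5]
uphill = pearson_r(x, [2, 4, 6, 8, 10])     # y rises exactly with x
downhill = pearson_r(x, [10, 8, 6, 4, 2])   # y falls exactly as x rises
noisy = pearson_r(x, [2, 1, 4, 3, 5])       # y rises with x, but not perfectly
print(uphill, downhill, round(noisy, 2))    # 1.0 -1.0 0.8
```

Note that the function divides by zero if either variable is constant, since r is undefined in that case.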
Example scatterplots:
- Perfect Uphill: r = +1.00
- Perfect Downhill: r = -1.00
- Strong Uphill: r = +0.61
- No Relationship