What is clustering?
K-means clustering is an iterative algorithm that divides an unlabeled dataset into K clusters so that each data point belongs to exactly one cluster, together with the points most similar to it.
In simple words, clustering is the process of identifying groups in data. Sometimes we have a lot of data that we need to categorize, but we have no labels or predefined criteria to categorize it by. This is where K-means clustering comes in.
K-means clustering finds clusters on the basis of the distance between data points. This distance is usually the Euclidean distance.
How does K-means clustering work?
- Choose the number of clusters, K.
- Select K points at random to serve as the initial centroids.
- Assign each data point to the closest centroid. This forms K clusters.
- Compute the new centroid of each cluster and move the centroid there.
- Reassign each data point to the new closest centroid. If any reassignment takes place, go back to step 4; otherwise your model is ready.
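The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production implementation. The function name `kmeans`, its parameters, and the sample points are assumptions made for this example. It also assumes the initial centroids are picked from the data points themselves, which is one common choice.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 2: pick K starting centroids at random (here: K of the data points).
    centroids = rng.sample(points, k)
    labels = None

    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels

        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))

    return labels, centroids


# Hypothetical sample data: two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The first three points should end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random starting centroids.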
Step 1: Choose the number of clusters, K. For this dataset, we have chosen K = 2.
Step 2: Select K points at random as the centroids. For this dataset, we choose two centroids, one for each category. The centroids can be placed anywhere; they do not have to coincide with actual data points.
Step 3: Assign each data point to the closest centroid. This forms K clusters. Closeness is usually measured by the Euclidean distance between a data point and a centroid.
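To make the assignment step concrete, here is a small worked example. The centroid positions and data points are made up for illustration. `math.dist` (Python 3.8+) computes the Euclidean distance between two points.

```python
import math

centroids = [(2.0, 2.0), (8.0, 8.0)]            # hypothetical centroids
points = [(1.0, 3.0), (7.0, 9.0), (4.0, 5.0)]   # hypothetical data points

assignments = []
for p in points:
    # Euclidean distance from p to every centroid.
    distances = [math.dist(p, c) for c in centroids]
    # The point joins the cluster of the closest centroid.
    closest = distances.index(min(distances))
    assignments.append(closest)
    print(p, [round(d, 2) for d in distances], "-> cluster", closest)

# assignments == [0, 1, 0]
```

For example, the point (4.0, 5.0) is about 3.61 away from the first centroid and exactly 5.0 away from the second, so it joins cluster 0.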
Step 4: Recompute the centroid of each category, that is, the mean of the data points currently assigned to it, and move the centroid to that position.
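As a quick illustration of this step, the new centroid is simply the coordinate-wise mean of the points in the cluster. The points below are hypothetical.

```python
# Hypothetical points currently assigned to one cluster.
cluster_points = [(1.0, 2.0), (2.0, 4.0), (3.0, 3.0)]

# The new centroid is the coordinate-wise mean of the cluster's points.
n = len(cluster_points)
new_centroid = tuple(sum(axis) / n for axis in zip(*cluster_points))
print(new_centroid)  # (2.0, 3.0)
```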
Step 5: Reassign each data point to the new closest centroid. If any reassignment takes place, go back to step 4; otherwise your model is ready.
Here we can clearly see that some data points have been reassigned to the red category, so we have to go back to step 4 and recalculate the centroid of each cluster.
Step 4: Recalculate the centroid of each category and move the centroid to its new position.
Step 5: Reassign each data point to the new closest centroid. Again, some of the data points are reassigned, this time to the blue category, so we go back to step 4 once more.
We repeat these two steps until no data points are reassigned.
Now we can see that no reassignment takes place. Our clustering is finished, and we have two clusters.
How to choose the right value for K
We now know how the algorithm works, but to apply it to real datasets there is one more important thing to decide: the value of K. So the question is, how can we choose the right value of K for our dataset?
This is where the Elbow method comes in.
The Elbow method uses the WCSS (Within-Cluster Sum of Squares) to find a suitable value of K. WCSS is the sum of the squared distances between each point and the centroid of its cluster, added up over all clusters. As the number of clusters increases, the WCSS decreases: steeply at first, and then more and more slowly. When we plot the WCSS against K, the curve therefore looks like a bent arm, and the bend is the elbow.
The K value at the elbow point is usually taken as the optimal K, because adding more clusters beyond it reduces the WCSS only slightly.
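The idea can be sketched as follows: run K-means for several values of K and record the WCSS each time. This is a rough, standard-library-only illustration. The helper name `kmeans_wcss` and the sample points are assumptions made for this example, and the K-means inside it is a deliberately basic version with a single random start.

```python
import math
import random


def kmeans_wcss(points, k, seed=0, max_iter=100):
    """Run a basic K-means and return the WCSS of the final clustering."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random data points as initial centroids
    labels = None
    for _ in range(max_iter):
        # Assign each point to its closest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # no reassignment: converged
            break
        labels = new_labels
        # Move each centroid to the mean of its points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # WCSS: squared distance from each point to its own cluster's centroid.
    return sum(math.dist(p, centroids[lab]) ** 2
               for p, lab in zip(points, labels))


# Hypothetical data: three tight groups of points.
points = [(1, 1), (1, 2), (2, 1),
          (8, 8), (8, 9), (9, 8),
          (1, 8), (1, 9), (2, 9)]

wcss = {k: kmeans_wcss(points, k) for k in range(1, 7)}
for k, value in wcss.items():
    print(k, round(value, 2))
```

Plotting these values against K would give the elbow curve. Because this sketch uses a single random start, the values are not guaranteed to fall smoothly. Libraries typically run several starts and keep the best result; in scikit-learn, for instance, the WCSS of a fitted `KMeans` model is available as its `inertia_` attribute.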
Use cases of K-means in the security domain
1. Call detail record analysis
A call detail record (CDR) is the information captured by telecom companies during a customer's call, SMS, and internet activity. Combined with customer demographics, this information gives greater insight into the customer's needs. Using the unsupervised K-means clustering algorithm, we can cluster customer activity over a 24-hour period to identify segments of customers by their hourly usage.
2. Automatic clustering of IT alerts
Components of large enterprise IT infrastructure, such as networks, storage, and databases, generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be manually screened and prioritized for downstream processes. Clustering this data can provide insight into categories of alerts and mean time to repair, and can help with failure prediction.
I hope you now have a good understanding of K-means and of how it can be used in the real world. I have tried to explain it in the simplest possible way, using visuals.
Thanks for reading!