Exercise: K-means and Hierarchical Clustering

The USArrests dataset records arrest rates for violent crimes in the United States.

Overview of the USArrests Dataset:

Source: The data describe arrests made in 1973 and cover each of the 50 U.S. states.
Purpose: It provides arrests per 100,000 residents for three violent crimes (murder, assault, and rape), together with the percentage of each state's population living in urban areas. It is often used to explore clustering techniques such as K-means and hierarchical clustering.

Structure of the Dataset:

The dataset has 50 observations (each representing a U.S. state) and 5 variables. The variables are:

State: State names.
Murder: Murder arrests per 100,000 residents.
Assault: Assault arrests per 100,000 residents.
UrbanPop: Percent of the population living in urban areas.
Rape: Rape arrests per 100,000 residents.

Example (first few rows of the dataset):

       State Murder Assault UrbanPop Rape
1    Alabama   13.2     236       58 21.2
2     Alaska   10.0     263       48 44.5
3    Arizona    8.1     294       80 31.0
4   Arkansas    8.8     190       50 19.5
5 California    9.0     276       91 40.6
6   Colorado    7.9     204       78 38.7

Key Insights:

The dataset allows comparison of violent crime arrest rates across states and exploration of how those rates relate to urbanization.
You can apply various clustering techniques to it to group states by their crime profiles.

Typical Applications:

Clustering: To identify groups of states with similar crime rates.
Visualization: Create plots (e.g., scatter plots, heatmaps) to understand the relationships between crime rates and urbanization.

The USArrests dataset is simple yet rich in information, making it a great resource for exploring clustering and dimensionality reduction techniques in data science.

Download the Excel file and the Orange file.

K-means

Double-click on the ‘Select Columns’ widget. Which variable is listed under Metas?

ANSWER:

Double-click on the ‘k-Means’ widget. The number of clusters is set to range from 3 to 8. Based on the Silhouette scores, how many clusters should we use?

ANSWER:

Silhouette values range from -1 to 1:

1: The data point is very well clustered (it is far from neighboring clusters and close to its own cluster).
0: The data point lies on the boundary between two clusters.
Negative value (< 0): The data point is closer to a neighboring cluster than to the points in its own cluster.

A negative Silhouette value therefore suggests that a data point may have been assigned to the wrong cluster.
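
The interpretation above follows from how the value is computed: for each point, a is the mean distance to the other points in its own cluster, b is the smallest mean distance to the points of any other cluster, and the Silhouette value is (b - a) / max(a, b). The sketch below applies that formula to the six sample rows shown earlier. The cluster labels are hypothetical and chosen by hand for illustration; they are not the output of the k-Means widget, and the distances use raw, unnormalized values, which may differ from the settings in the Orange workflow.

```python
import math
from statistics import mean

# First six rows of the dataset: (Murder, Assault, UrbanPop, Rape)
rows = {
    "Alabama": (13.2, 236, 58, 21.2),
    "Alaska": (10.0, 263, 48, 44.5),
    "Arizona": (8.1, 294, 80, 31.0),
    "Arkansas": (8.8, 190, 50, 19.5),
    "California": (9.0, 276, 91, 40.6),
    "Colorado": (7.9, 204, 78, 38.7),
}

# Hypothetical cluster labels, assigned by hand for illustration only
labels = {
    "Alabama": 0, "Arkansas": 0, "Colorado": 0,
    "Alaska": 1, "Arizona": 1, "California": 1,
}


def euclidean(p, q):
    """Euclidean distance between two rows (raw values, an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette(name):
    """Silhouette value of one state under the labels above."""
    own = labels[name]
    same = [euclidean(rows[name], rows[o])
            for o in rows if o != name and labels[o] == own]
    if not same:
        return 0.0  # common convention for a cluster with a single member
    a = mean(same)  # mean distance to the rest of its own cluster
    b = min(        # mean distance to the nearest other cluster
        mean(euclidean(rows[name], rows[o]) for o in rows if labels[o] == c)
        for c in set(labels.values()) - {own}
    )
    return (b - a) / max(a, b)


scores = {name: silhouette(name) for name in rows}
for name, score in scores.items():
    print(f"{name:<12}{score:6.2f}")
```

A state whose printed value is close to 0 or negative sits near the boundary between the two hand-picked groups, which is the same pattern to look for in the Silhouette Plot widget.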

Double-click on the Silhouette Plot widget. Which states are not clearly assigned to either cluster 1 or cluster 2?

ANSWER:

Double-click on the Scatter Plot widget connected to the k-Means widget. If we could choose where to live in the USA in 1973 based on crime rates, which state should we choose?

ANSWER:

Hierarchical Clustering

Double-click on the Distances widget. Which distance metric is used in the analysis?

ANSWER:

Double-click on the Hierarchical Clustering widget. Which linkage method is used?

ANSWER:

In the Hierarchical Clustering widget with 2 clusters, compare the listed states to identify which pair of states is separated by the shortest distance.

ANSWER:

ANSWER:
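
To check this kind of answer by hand, the pairwise distances can be computed directly. The sketch below does so for the six sample rows only, assuming Euclidean distance on raw, unnormalized values. The metric and normalization in the Distances widget may differ, and the closest pair among all 50 states is not necessarily the pair found here.

```python
import math
from itertools import combinations

# First six rows of the dataset: (Murder, Assault, UrbanPop, Rape)
rows = {
    "Alabama": (13.2, 236, 58, 21.2),
    "Alaska": (10.0, 263, 48, 44.5),
    "Arizona": (8.1, 294, 80, 31.0),
    "Arkansas": (8.8, 190, 50, 19.5),
    "California": (9.0, 276, 91, 40.6),
    "Colorado": (7.9, 204, 78, 38.7),
}


def euclidean(p, q):
    """Euclidean distance between two rows (raw values, an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Distance for every unordered pair of states
pairs = {
    (s, t): euclidean(rows[s], rows[t])
    for s, t in combinations(rows, 2)
}

# The pair with the smallest distance is the first to merge in a dendrogram
closest = min(pairs, key=pairs.get)
print(closest, round(pairs[closest], 2))
```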

From Hierarchical Clustering

If we decide to use 2 clusters, explaining the results comes down to understanding how the U.S. states are grouped based on the four variables (Murder, Assault, UrbanPop, and Rape).

How to Explain the 2 Clusters:

Cluster 1: This group could represent states with lower crime rates, indicating that the states in this cluster are less prone to violent crimes or are less urbanized.
Cluster 2: This group of states could have higher crime rates for Murder, Assault, and/or Rape, or they may be more urbanized based on the UrbanPop feature. These states may have higher violent crime activity overall.
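
One way to back up such an explanation is to compare the mean of each variable per cluster. The sketch below runs a small agglomerative clustering on the six sample rows, cuts it at 2 clusters, and prints each cluster's profile. Average linkage and raw Euclidean distance are assumptions made for this illustration; the Orange workflow may use a different linkage, metric, or normalization, so the grouping here is not the worksheet's result.

```python
import math

FEATURES = ("Murder", "Assault", "UrbanPop", "Rape")

# First six rows of the dataset, in the order of FEATURES
rows = {
    "Alabama": (13.2, 236, 58, 21.2),
    "Alaska": (10.0, 263, 48, 44.5),
    "Arizona": (8.1, 294, 80, 31.0),
    "Arkansas": (8.8, 190, 50, 19.5),
    "California": (9.0, 276, 91, 40.6),
    "Colorado": (7.9, 204, 78, 38.7),
}


def euclidean(p, q):
    """Euclidean distance between two rows (raw values, an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(c1, c2):
    """Mean distance over all cross-cluster pairs (assumed linkage)."""
    dists = [euclidean(rows[s], rows[t]) for s in c1 for t in c2]
    return sum(dists) / len(dists)


def agglomerate(names, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[n] for n in names]
    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def profile(cluster):
    """Mean of each variable over the states in one cluster."""
    return {
        f: round(sum(rows[s][k] for s in cluster) / len(cluster), 1)
        for k, f in enumerate(FEATURES)
    }


clusters = agglomerate(list(rows), 2)
for number, cluster in enumerate(clusters, start=1):
    print(f"Cluster {number}: {sorted(cluster)}")
    print(f"  means: {profile(cluster)}")
```

Reading the printed means side by side shows which variables actually separate the two groups, which is the evidence the explanation of Cluster 1 and Cluster 2 should rest on.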