— R, Classification — 2 min read
Detecting breast (or any other type of) cancer before symptoms appear is a key first step in fighting the disease. The process involves examining breast tissue for lumps or masses. If such an irregularity is found, a fine needle aspirate (FNA) biopsy is performed, and the extracted tissue is examined under a microscope by a clinician.
Can a machine help the clinician do a better job? Can the doctor focus more on treating the disease rather than detecting it? Recently, Deep Learning (DL) has seen major advances in the area of computer vision. Naturally, some scientists tried to apply it to breast cancer detection - and did so with great success!
Here, we will look at a dataset created by Dr. William H. Wolberg, W. Nick Street and Olvi L. Mangasarian from the University of Wisconsin. Each row describes the cell nuclei features extracted from a digitized image of the FNA, along with the diagnosis (M = malignant, B = benign) and the ID of a patient with a lump in her breast.
Here is a list of the measured cell nuclei features (for each one the mean, standard error and "worst"/largest value are recorded, giving 30 feature columns in total):

- radius
- texture
- perimeter
- area
- smoothness
- compactness
- concavity
- concave points
- symmetry
- fractal dimension
Can we predict whether the lump is benign or malignant?
Figure: sample image from which the cell nuclei features are extracted.
library(ggplot2)
library(Amelia)
library(class)
library(gmodels)

set.seed(42)
df <- read.csv("data/breast_cancer.csv", stringsAsFactors = FALSE)
print(paste("rows:", nrow(df), "cols:", ncol(df)))
[1] "rows: 569 cols: 32"
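Before doing anything else, it can help to peek at the first few columns and their types (an optional sanity check of mine, not part of the original analysis; output omitted):

str(df[, 1:5])  # id, diagnosis and the first few *_mean features, assuming the usual column order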
Let’s remove the ID column and recode the diagnosis.
df <- df[-1]
df$diagnosis <- factor(df$diagnosis, levels = c("B", "M"),
                       labels = c("Benign", "Malignant"))
Do we have missing data?
missmap(df, main = "Missing Data Map", col = c("#FF4081", "#3F51B5"),
        legend = FALSE)
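An equivalent, plot-free check (just a quick addition on top of the missingness map) is to count the NA values directly:

# total number of missing values across the whole data frame; 0 means complete data
sum(is.na(df))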
Nope. That's good! What is the distribution of the two tumor types?
barplot(table(df$diagnosis), xlab = "Type of tumor", ylab = "Numbers per type")
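To see the class balance as proportions as well (a small addition, not in the original post), we can tabulate the diagnosis column:

# share of each diagnosis in percent; roughly 63% benign vs. 37% malignant
round(prop.table(table(df$diagnosis)) * 100, 1)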
Let's see if we can tell the two tumor types apart using a few (more or less arbitrarily chosen) features:
qplot(radius_mean, data = df, colour = diagnosis, geom = "density",
      main = "Radius mean for each tumor type")
qplot(smoothness_mean, data = df, colour = diagnosis, geom = "density",
      main = "Smoothness mean for each tumor type")
qplot(concavity_mean, data = df, colour = diagnosis, geom = "density",
      main = "Concavity mean for each tumor type")
Let's normalize our data, i.e. rescale every feature to the range [0, 1]. This will come in handy when we try to classify the tumor type for each patient.
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

df_normalized <- as.data.frame(lapply(df[2:31], normalize))
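As a quick sanity check (not part of the original post), normalize maps any numeric vector onto [0, 1]:

normalize(c(1, 2, 3, 4, 5))
# [1] 0.00 0.25 0.50 0.75 1.00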
Additionally, let's create a scaled version of our dataset too! The formula for scaling is the following:

$$\text{scale}(x) = \frac{x - \text{mean}(x)}{\sigma(x)}$$

where $x$ is a vector of real numbers and $\sigma(x)$ is its standard deviation.
df_scaled <- as.data.frame(scale(df[-1]))
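To convince ourselves the scaling worked (a small check of my own, not in the original post), each scaled column should now have a mean of roughly 0 and a standard deviation of 1:

round(c(mean = mean(df_scaled$radius_mean), sd = sd(df_scaled$radius_mean)), 2)
# expected: mean ~ 0, sd = 1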
Now, let's split our dataset into three new ones: training, test and validation. First, let's put aside 150 rows for test/validation and use the rest for training:
train_idx <- sample(nrow(df_normalized), nrow(df_normalized) - 150,
                    replace = FALSE)
df_normalized_train <- df_normalized[train_idx, ]
Let’s use 100 of the rest for testing and 50 for validation:
# the 150 rows not used for training
test_validation_idx <- setdiff(seq_len(nrow(df_normalized)), train_idx)
test_idx <- sample(test_validation_idx, 100, replace = FALSE)
validation_idx <- setdiff(test_validation_idx, test_idx)  # the remaining 50 rows

df_normalized_test <- df_normalized[test_idx, ]
df_normalized_validation <- df_normalized[validation_idx, ]
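As a quick sanity check (my addition, not in the original post), the three index sets should be pairwise disjoint, so no patient leaks between the splits:

length(intersect(train_idx, test_idx))        # expected: 0
length(intersect(train_idx, validation_idx))  # expected: 0
length(intersect(test_idx, validation_idx))   # expected: 0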
We will use the simple k-nearest neighbors (k-NN) algorithm to predict whether a patient has a benign or malignant tumor.
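The post does not explain the choice of k used below; a common rule of thumb (my assumption, not stated in the original) is to start near the square root of the number of training examples and pick an odd value to avoid ties between the two classes:

sqrt(nrow(df_normalized_train))  # about 20.5, so k = 21 (odd) is a natural starting point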
df_train_labels <- df[train_idx, 1]
df_test_labels <- df[test_idx, 1]
df_validation_labels <- df[validation_idx, 1]

df_normalized_pred_labels <- knn(train = df_normalized_train,
                                 test = df_normalized_test,
                                 cl = df_train_labels,
                                 k = 21)
Ok, that was quick. How did we do? Let’s evaluate our model using a cross table and see:
evaluate_model <- function(expected_labels, predicted_labels) {
  CrossTable(x = expected_labels, y = predicted_labels, prop.chisq = FALSE)
  true_predictions <- table(expected_labels == predicted_labels)["TRUE"]
  correct_predictions <- true_predictions / length(predicted_labels)
  print(paste("Correctly predicted: ", correct_predictions))
}
evaluate_model(df_test_labels, df_normalized_pred_labels)
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  100

                | predicted_labels
expected_labels |    Benign | Malignant | Row Total |
----------------|-----------|-----------|-----------|
         Benign |        60 |         0 |        60 |
                |     1.000 |     0.000 |     0.600 |
                |     0.952 |     0.000 |           |
                |     0.600 |     0.000 |           |
----------------|-----------|-----------|-----------|
      Malignant |         3 |        37 |        40 |
                |     0.075 |     0.925 |     0.400 |
                |     0.048 |     1.000 |           |
                |     0.030 |     0.370 |           |
----------------|-----------|-----------|-----------|
   Column Total |        63 |        37 |       100 |
                |     0.630 |     0.370 |           |
----------------|-----------|-----------|-----------|

[1] "Correctly predicted: 0.97"
Not bad, only 3 errors. Can we do better? Let’s use our scaled dataset:
df_scaled_train <- df_scaled[train_idx, ]
df_scaled_test <- df_scaled[test_idx, ]
df_scaled_validation <- df_scaled[validation_idx, ]
df_scaled_pred_labels <- knn(train = df_scaled_train,
                             test = df_scaled_test,
                             cl = df_train_labels,
                             k = 21)
evaluate_model(df_test_labels, df_scaled_pred_labels)
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  100

                | predicted_labels
expected_labels |    Benign | Malignant | Row Total |
----------------|-----------|-----------|-----------|
         Benign |        60 |         0 |        60 |
                |     1.000 |     0.000 |     0.600 |
                |     0.938 |     0.000 |           |
                |     0.600 |     0.000 |           |
----------------|-----------|-----------|-----------|
      Malignant |         4 |        36 |        40 |
                |     0.100 |     0.900 |     0.400 |
                |     0.062 |     1.000 |           |
                |     0.040 |     0.360 |           |
----------------|-----------|-----------|-----------|
   Column Total |        64 |        36 |       100 |
                |     0.640 |     0.360 |           |
----------------|-----------|-----------|-----------|

[1] "Correctly predicted: 0.96"
Huh, even worse! Let’s try different k values:
train_and_evaluate <- function(train, test, train_labels, test_labels, k) {
  predicted_labels <- knn(train = train, test = test,
                          cl = train_labels, k = k)
  evaluate_model(test_labels, predicted_labels)
}
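Before checking individual values, note that we could also sweep a whole range of k in one go; the helper below is only a sketch of mine (it is not part of the original post) and reports the raw test-set accuracy for each k:

# test-set accuracy for a single value of k
accuracy_for_k <- function(k) {
  predicted <- knn(train = df_normalized_train, test = df_normalized_test,
                   cl = df_train_labels, k = k)
  mean(predicted == df_test_labels)
}

# try a handful of candidate values
sapply(c(1, 5, 11, 15, 21, 25), accuracy_for_k)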
train_and_evaluate(df_normalized_train, df_normalized_test,
                   df_train_labels, df_test_labels, 1)
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  100

                | predicted_labels
expected_labels |    Benign | Malignant | Row Total |
----------------|-----------|-----------|-----------|
         Benign |        60 |         0 |        60 |
                |     1.000 |     0.000 |     0.600 |
                |     0.952 |     0.000 |           |
                |     0.600 |     0.000 |           |
----------------|-----------|-----------|-----------|
      Malignant |         3 |        37 |        40 |
                |     0.075 |     0.925 |     0.400 |
                |     0.048 |     1.000 |           |
                |     0.030 |     0.370 |           |
----------------|-----------|-----------|-----------|
   Column Total |        63 |        37 |       100 |
                |     0.630 |     0.370 |           |
----------------|-----------|-----------|-----------|

[1] "Correctly predicted: 0.97"
train_and_evaluate(df_normalized_train, df_normalized_test,
                   df_train_labels, df_test_labels, 5)
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  100

                | predicted_labels
expected_labels |    Benign | Malignant | Row Total |
----------------|-----------|-----------|-----------|
         Benign |        60 |         0 |        60 |
                |     1.000 |     0.000 |     0.600 |
                |     0.952 |     0.000 |           |
                |     0.600 |     0.000 |           |
----------------|-----------|-----------|-----------|
      Malignant |         3 |        37 |        40 |
                |     0.075 |     0.925 |     0.400 |
                |     0.048 |     1.000 |           |
                |     0.030 |     0.370 |           |
----------------|-----------|-----------|-----------|
   Column Total |        63 |        37 |       100 |
                |     0.630 |     0.370 |           |
----------------|-----------|-----------|-----------|

[1] "Correctly predicted: 0.97"
train_and_evaluate(df_normalized_train, df_normalized_test,
                   df_train_labels, df_test_labels, 15)
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  100

                | predicted_labels
expected_labels |    Benign | Malignant | Row Total |
----------------|-----------|-----------|-----------|
         Benign |        60 |         0 |        60 |
                |     1.000 |     0.000 |     0.600 |
                |     0.952 |     0.000 |           |
                |     0.600 |     0.000 |           |
----------------|-----------|-----------|-----------|
      Malignant |         3 |        37 |        40 |
                |     0.075 |     0.925 |     0.400 |
                |     0.048 |     1.000 |           |
                |     0.030 |     0.370 |           |
----------------|-----------|-----------|-----------|
   Column Total |        63 |        37 |       100 |
                |     0.630 |     0.370 |           |
----------------|-----------|-----------|-----------|

[1] "Correctly predicted: 0.97"
Not much change. Let’s see how our model performs on the validation set:
train_and_evaluate(df_normalized_train, df_normalized_validation,
                   df_train_labels, df_validation_labels, 21)
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  127

                | predicted_labels
expected_labels |    Benign | Malignant | Row Total |
----------------|-----------|-----------|-----------|
         Benign |        79 |         0 |        79 |
                |     1.000 |     0.000 |     0.622 |
                |     0.929 |     0.000 |           |
                |     0.622 |     0.000 |           |
----------------|-----------|-----------|-----------|
      Malignant |         6 |        42 |        48 |
                |     0.125 |     0.875 |     0.378 |
                |     0.071 |     1.000 |           |
                |     0.047 |     0.331 |           |
----------------|-----------|-----------|-----------|
   Column Total |        85 |        42 |       127 |
                |     0.669 |     0.331 |           |
----------------|-----------|-----------|-----------|

[1] "Correctly predicted: 0.952755905511811"
Our final accuracy is about 95%. What does this mean? If our model were to replace a doctor, it would misclassify 6 malignant tumors as benign. That is bad! The other type of error (misclassifying a benign tumor as malignant) is serious too. So any improvement in accuracy might save lives! Can you improve the model?
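Since missing a malignant tumor is the costliest mistake, it is worth tracking sensitivity (the fraction of malignant tumors the model actually flags as malignant) explicitly. Here is a minimal sketch of how that could be computed on the validation set; it is my addition, not part of the original post:

# re-run k-NN on the validation set and compute sensitivity for the malignant class
validation_pred <- knn(train = df_normalized_train,
                       test = df_normalized_validation,
                       cl = df_train_labels, k = 21)
sensitivity <- sum(validation_pred == "Malignant" &
                   df_validation_labels == "Malignant") /
  sum(df_validation_labels == "Malignant")
sensitivity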
P.S. This post was written as an IPython notebook. Download it from here. The dataset can be downloaded from here.