Prediction of Breast Cancer

Mar 10, 2021

By Dewi Ayu Paraswati

Description

This report describes breast cancer diagnosis using machine learning algorithms. I investigated four algorithms:
- Logistic Regression
- Decision Tree
- Random Forest
- Support Vector Machine

The dataset used in this report is Breast Cancer Wisconsin, hosted on Kaggle. The dataset can be downloaded from https://www.kaggle.com/buddhiniw/breast-cancer-prediction

Report Outline:

Data Extraction
Exploratory Data Analysis
Data Preparation
Modelling
Evaluation
Recommendation

1. Data Extraction

This code extracts the data from the downloaded dataset file. I used the read.csv() function to load the data.

# read.csv() is part of base R, so no extra library is needed here
bcw_df <- read.csv("data/data.csv")

To see the number of rows and columns, I used the dim() function. The dataset has 569 rows and 33 columns.

dim(bcw_df)
## [1] 569  33

2. Exploratory Data Analysis

This section explores the data through summaries and visualizations. To find out the columns and their types, I used the str() function.

str(bcw_df)
## 'data.frame':    569 obs. of  33 variables:
##  $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis              : chr  "M" "M" "M" "M" ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave.points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...
##  $ X                      : logi  NA NA NA NA NA NA ...

From the result above, we know the following:
1. The first column is id. It is a unique identifier and unnecessary for prediction, so it should be removed. I did this by assigning NULL to the column.

bcw_df$id <- NULL

2. The second column is diagnosis. This should be the class variable. Currently its type is character, so it should be converted to a factor.

bcw_df$diagnosis <- as.factor(bcw_df$diagnosis)

3. The last column is X. All of its values are NA, so it should be removed. Again, I did this by assigning NULL to the column.

bcw_df$X <- NULL

2.1 Univariate Data Analysis

Analysis of a single variable.
Examples: boxplot, histogram, pie chart

Number of benign (B) and malignant (M) cases in the diagnosis column of the breast cancer dataset.

library(ggplot2)
ggplot(data = bcw_df, aes(x=diagnosis)) +
  geom_bar()

Distribution of the radius_mean variable as a boxplot.

ggplot(data = bcw_df,aes(y=radius_mean))+
  geom_boxplot()+
  labs(title = "Breast Cancer Wisconsin Data", y= "Radius Mean")

Distribution of the radius_mean variable as a histogram.

ggplot(data = bcw_df, aes(x=radius_mean))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

After making the individual diagrams, we can combine them into a single figure so the data is easier to read. For this I used the gridExtra package (for example, its grid.arrange() function). The library must be loaded before the function can be used. Here is the result:
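For reference, the combined figure could be produced along these lines. This is only a minimal sketch: it assumes grid.arrange() from gridExtra is the function used, and demo_df is a hypothetical stand-in for bcw_df filled with simulated values so that the block runs on its own.

```r
library(ggplot2)
library(gridExtra)

# Hypothetical stand-in for bcw_df (simulated values, not the real data)
set.seed(1)
demo_df <- data.frame(
  diagnosis   = factor(rep(c("B", "M"), times = c(60, 40))),
  radius_mean = c(rnorm(60, mean = 12, sd = 2), rnorm(40, mean = 17, sd = 3))
)

# Store the three univariate plots as objects instead of printing them
p_bar  <- ggplot(demo_df, aes(x = diagnosis)) + geom_bar()
p_box  <- ggplot(demo_df, aes(y = radius_mean)) + geom_boxplot()
p_hist <- ggplot(demo_df, aes(x = radius_mean)) + geom_histogram(bins = 30)

# Draw the plots side by side in one figure
combined <- grid.arrange(p_bar, p_box, p_hist, ncol = 3)
```

With the real data, replace demo_df with bcw_df. Setting bins explicitly also avoids the stat_bin() message shown above.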

2.2 Bivariate Data Analysis

Analysis of two variables.
Examples: pair plot, scatterplot, and point plot.

Distribution of the radius_mean variable by diagnosis.

ggplot(data = bcw_df, aes(x=diagnosis, y = radius_mean))+
  geom_boxplot()+
  labs(title = "Breast Cancer Wisconsin Data", x= "Diagnosis", y="Radius Mean")

Here is the result of the bivariate boxplot:

Distribution of the radius_mean variable by diagnosis (with jitter and colour).

ggplot(data = bcw_df, aes(x=diagnosis, y = radius_mean))+
  geom_boxplot()+
  geom_jitter(alpha= 0.3,
              color ="blue",
              width= 0.2)+
  labs(title = "Breast Cancer Wisconsin Data", x= "Diagnosis", y="Radius Mean")

Here is the result of combining geom_boxplot() and geom_jitter():

Analysis of a variable by diagnosis using a density plot, drawn with the geom_density() function.

ggplot(data = bcw_df, aes(x=radius_mean, fill=diagnosis))+
  geom_density(alpha=.3)

Here is the result of geom_density():

Observations plotted by the radius_mean and texture_mean variables. Each point is a single observation.
The colour and shape of each point are based on the diagnosis (benign or malignant).
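The code for this scatterplot is not shown, so here is a minimal sketch of how such a plot can be drawn with geom_point(). The data frame demo_df is a hypothetical stand-in for bcw_df with simulated values, and the title and axis labels are my assumptions, following the earlier plots.

```r
library(ggplot2)

# Hypothetical stand-in for bcw_df (simulated values, not the real data)
set.seed(1)
demo_df <- data.frame(
  diagnosis    = factor(rep(c("B", "M"), times = c(60, 40))),
  radius_mean  = c(rnorm(60, mean = 12, sd = 2), rnorm(40, mean = 17, sd = 3)),
  texture_mean = c(rnorm(60, mean = 18, sd = 4), rnorm(40, mean = 22, sd = 4))
)

# Map diagnosis to both colour and shape so the classes are easy to tell apart
scatter_plot <- ggplot(demo_df,
                       aes(x = radius_mean, y = texture_mean,
                           colour = diagnosis, shape = diagnosis)) +
  geom_point(alpha = 0.6) +
  labs(title = "Breast Cancer Wisconsin Data",
       x = "Radius Mean", y = "Texture Mean")

print(scatter_plot)
```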

Here is the result:

In general, benign cases have lower radius_mean and texture_mean measurements than malignant cases. However, these two variables are not enough to separate the classes.

2.3 Multivariate Data Analysis

Analysis of three or more variables. Example: correlation coefficient

Visualize Pearson’s Correlation Coefficient for *_mean variables.

##install.packages("corrgram")
library(corrgram)
corrgram(bcw_df[2:11], order = TRUE,
         upper.panel = panel.pie)

Visualize Pearson’s Correlation Coefficient for *_se variables.

##install.packages("corrgram")
library(corrgram)
corrgram(bcw_df[12:21], order = TRUE,
         upper.panel = panel.pie)

Visualize Pearson’s Correlation Coefficient for *_worst variables.

##install.packages("corrgram")
library(corrgram)
corrgram(bcw_df[22:31], order = TRUE,
         upper.panel = panel.pie)

From the correlation coefficients, we can see that area, radius, and perimeter are collinear. So, we need to remove two of them: area and perimeter.

We can also see that compactness, concavity, and concave points are collinear. So, we need to remove two of them: compactness and concavity.

3. Data Preparation

This section explains the data preparation steps.

3.1 Feature Selection

Remove the *_worst variables. Based on a discussion with a domain expert, all variables ending in _worst should be removed.

Remove area, perimeter, compactness, and concavity.
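The two removal steps above can be sketched by matching column names with grepl(). This is an illustration, not the original code: demo_df is a hypothetical one-row stand-in carrying the 31 column names left after cleaning, and dropping both the _mean and _se versions of the four collinear measures is my assumption.

```r
# Hypothetical stand-in: one row with the 31 column names left after cleaning
measures <- c("radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave.points", "symmetry",
              "fractal_dimension")
col_names <- c("diagnosis",
               paste0(measures, "_mean"),
               paste0(measures, "_se"),
               paste0(measures, "_worst"))
demo_df <- as.data.frame(matrix(0, nrow = 1, ncol = length(col_names),
                                dimnames = list(NULL, col_names)))

# Columns ending in _worst
drop_worst <- grepl("_worst$", names(demo_df))

# Columns for the collinear measures (assumption: both _mean and _se are dropped)
drop_collinear <- grepl("^(area|perimeter|compactness|concavity)_", names(demo_df))

selected_df <- demo_df[, !(drop_worst | drop_collinear), drop = FALSE]
names(selected_df)
```

The pattern is anchored at the start of the name, so concave.points_mean is kept while concavity_mean is removed.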

3.2 Remove Outliers

3.3 Feature Scaling

3.4 PCA
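The PCA code is not shown here. One common way to run PCA in base R is prcomp() with centering and scaling turned on, which also standardizes the features. The sketch below uses a hypothetical demo_df with simulated values; the choice of prcomp() and its settings is my assumption, not necessarily what was used for this report.

```r
# Hypothetical stand-in for the prepared data (simulated values)
set.seed(1)
demo_df <- data.frame(
  diagnosis     = factor(rep(c("B", "M"), times = c(60, 40))),
  radius_mean   = c(rnorm(60, mean = 12, sd = 2), rnorm(40, mean = 17, sd = 3)),
  texture_mean  = c(rnorm(60, mean = 18, sd = 4), rnorm(40, mean = 22, sd = 4)),
  symmetry_mean = rnorm(100, mean = 0.18, sd = 0.03)
)

# PCA on the numeric columns only; center and scale each feature first
numeric_cols <- demo_df[, sapply(demo_df, is.numeric)]
pca_fit <- prcomp(numeric_cols, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
var_explained

# Principal component scores for the first few observations
head(pca_fit$x)
```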

3.5 Training and Test Division

Use set.seed() for reproducible results. The train:test ratio is 70:30.
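A minimal sketch of a 70:30 split with base R follows. The seed value is arbitrary and demo_df is a hypothetical stand-in with the same number of rows as the dataset; both are assumptions for illustration.

```r
# Hypothetical stand-in for the prepared data (569 rows, simulated values)
set.seed(1)
demo_df <- data.frame(
  diagnosis   = factor(sample(c("B", "M"), 569, replace = TRUE)),
  radius_mean = rnorm(569, mean = 14, sd = 3.5)
)

set.seed(2021)  # arbitrary seed so the same rows are drawn on every run
n_rows    <- nrow(demo_df)
train_idx <- sample(n_rows, size = floor(0.7 * n_rows))

train_df <- demo_df[train_idx, ]   # 70% of the rows
test_df  <- demo_df[-train_idx, ]  # the remaining 30%

dim(train_df)
dim(test_df)
```

With 569 rows this gives 398 training rows and 171 test rows.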

4. Modelling

In this part, we used four machine learning algorithms.

4.1 Logistic Regression
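The model code is not shown, so here is a hedged sketch of a logistic regression fit with base R's glm(). The data frame demo_df is simulated, and the 0.5 cut-off for calling a case malignant is my assumption.

```r
# Hypothetical stand-in for the prepared data (simulated values)
set.seed(1)
demo_df <- data.frame(
  diagnosis    = factor(rep(c("B", "M"), times = c(120, 80))),
  radius_mean  = c(rnorm(120, mean = 12, sd = 2), rnorm(80, mean = 17, sd = 3)),
  texture_mean = c(rnorm(120, mean = 18, sd = 4), rnorm(80, mean = 22, sd = 4))
)
train_idx <- sample(nrow(demo_df), size = floor(0.7 * nrow(demo_df)))
train_df  <- demo_df[train_idx, ]
test_df   <- demo_df[-train_idx, ]

# A binomial glm models the probability of the second factor level ("M")
logit_model <- glm(diagnosis ~ ., data = train_df, family = binomial)

# Predicted probability of malignant, turned into a class with a 0.5 cut-off
prob_m     <- predict(logit_model, newdata = test_df, type = "response")
pred_logit <- factor(ifelse(prob_m > 0.5, "M", "B"),
                     levels = levels(demo_df$diagnosis))

# Confusion matrix on the test set
table(Predicted = pred_logit, Actual = test_df$diagnosis)
```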

The result is:

4.2 Decision Tree
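As an illustration, a classification tree can be fitted with the rpart package. Whether rpart was the package used for this report is an assumption, and demo_df again holds simulated values.

```r
library(rpart)

# Hypothetical stand-in for the prepared data (simulated values)
set.seed(1)
demo_df <- data.frame(
  diagnosis    = factor(rep(c("B", "M"), times = c(120, 80))),
  radius_mean  = c(rnorm(120, mean = 12, sd = 2), rnorm(80, mean = 17, sd = 3)),
  texture_mean = c(rnorm(120, mean = 18, sd = 4), rnorm(80, mean = 22, sd = 4))
)
train_idx <- sample(nrow(demo_df), size = floor(0.7 * nrow(demo_df)))
train_df  <- demo_df[train_idx, ]
test_df   <- demo_df[-train_idx, ]

# Fit a classification tree and predict classes for the test set
tree_model <- rpart(diagnosis ~ ., data = train_df, method = "class")
pred_tree  <- predict(tree_model, newdata = test_df, type = "class")

# Variable importance as reported by the fitted tree
tree_model$variable.importance

# Basic plot of the tree with class counts at each node
plot(tree_model, margin = 0.1)
text(tree_model, use.n = TRUE)
```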

The resulting decision tree plot is:

4.3 Random Forest
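A random forest can be sketched with the randomForest package. The package choice and ntree = 500 are assumptions on my part, and demo_df is simulated data rather than the real dataset.

```r
library(randomForest)

# Hypothetical stand-in for the prepared data (simulated values)
set.seed(1)
demo_df <- data.frame(
  diagnosis    = factor(rep(c("B", "M"), times = c(120, 80))),
  radius_mean  = c(rnorm(120, mean = 12, sd = 2), rnorm(80, mean = 17, sd = 3)),
  texture_mean = c(rnorm(120, mean = 18, sd = 4), rnorm(80, mean = 22, sd = 4))
)
train_idx <- sample(nrow(demo_df), size = floor(0.7 * nrow(demo_df)))
train_df  <- demo_df[train_idx, ]
test_df   <- demo_df[-train_idx, ]

# A factor response makes randomForest() fit a classification forest
rf_model <- randomForest(diagnosis ~ ., data = train_df,
                         ntree = 500, importance = TRUE)
pred_rf  <- predict(rf_model, newdata = test_df)

# Confusion matrix on the test set and per-variable importance
table(Predicted = pred_rf, Actual = test_df$diagnosis)
importance(rf_model)
```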

The result of the random forest is:

4.4 Support Vector Machine (SVM)
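For the SVM, one option is svm() from the e1071 package. The package and the radial kernel are assumptions for this sketch, and demo_df holds simulated values.

```r
library(e1071)

# Hypothetical stand-in for the prepared data (simulated values)
set.seed(1)
demo_df <- data.frame(
  diagnosis    = factor(rep(c("B", "M"), times = c(120, 80))),
  radius_mean  = c(rnorm(120, mean = 12, sd = 2), rnorm(80, mean = 17, sd = 3)),
  texture_mean = c(rnorm(120, mean = 18, sd = 4), rnorm(80, mean = 22, sd = 4))
)
train_idx <- sample(nrow(demo_df), size = floor(0.7 * nrow(demo_df)))
train_df  <- demo_df[train_idx, ]
test_df   <- demo_df[-train_idx, ]

# svm() scales the features by default; a factor response gives classification
svm_model <- svm(diagnosis ~ ., data = train_df, kernel = "radial")
pred_svm  <- predict(svm_model, newdata = test_df)

# Confusion matrix on the test set
table(Predicted = pred_svm, Actual = test_df$diagnosis)
```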

The result of the SVM is:

5. Evaluation

We compute accuracy, precision, recall, and F1 Score.
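Taking malignant (M) as the positive class, these metrics follow from the counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN): accuracy = (TP + TN) / n, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 * precision * recall / (precision + recall). The sketch below computes them in base R. The helper evaluate_predictions() and the ten example labels are hypothetical, made up for illustration, and are not results from this report.

```r
# Hypothetical helper: metrics with "M" (malignant) as the positive class
evaluate_predictions <- function(actual, predicted, positive = "M") {
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  tn <- sum(predicted != positive & actual != positive)

  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)

  c(accuracy  = (tp + tn) / length(actual),
    precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall))
}

# Made-up labels for illustration only
actual    <- factor(c("M", "M", "M", "B", "B", "B", "B", "M", "B", "B"),
                    levels = c("B", "M"))
predicted <- factor(c("M", "M", "B", "B", "B", "M", "B", "M", "B", "B"),
                    levels = c("B", "M"))

metrics <- evaluate_predictions(actual, predicted)
metrics
# accuracy = 0.80, precision = 0.75, recall = 0.75, f1 = 0.75
```

The same function can be applied to the test-set predictions of each of the four models to compare them.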

6. Recommendation

The random forest algorithm is the best among all the tested algorithms.
Based on the decision tree model, the most important variables are concave.points, radius_mean, and texture_mean.
The results can be improved by better data preparation or by using other algorithms. However, the current results surpass human-level performance (79% accuracy), so the model can be deployed as a second opinion for the doctor.