Prediction of Breast Cancer

By Dewi Ayu Paraswati


This report describes breast cancer diagnosis using machine learning algorithms. I investigated four algorithms:
- Logistic Regression
- Decision Tree
- Random Forest
- Support Vector Machine

The dataset used in this report is the Breast Cancer Wisconsin dataset hosted on Kaggle, where it can be downloaded.

Report Outline:

  1. Data Extraction
  2. Exploratory Data Analysis
  3. Data Preparation
  4. Modelling
  5. Evaluation
  6. Recommendation

1. Data Extraction

This section loads the data from the dataset link. I used R's CSV reader, read.csv(), to read the file into a data frame.

To see the number of rows and columns, I used the dim() function. The dataset has 569 rows and 33 columns.
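A minimal sketch of this step, assuming the file is read with read.csv(). The inline text below is a tiny made-up stand-in for the Kaggle file (the values and the name `wbcd` are only for illustration), so the dimensions printed here are not those of the real dataset. The trailing comma on each line mimics how a file can produce an extra, empty last column.

```r
# Tiny made-up stand-in for the real CSV file (illustration only).
# Each line ends with a comma, which creates an empty last column.
csv_text <- "id,diagnosis,radius_mean,texture_mean,
1,M,17.99,10.38,
2,B,13.54,14.36,
3,M,20.57,17.77,"

# With the real file this would be read.csv("<path to the downloaded file>").
wbcd <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Number of rows and columns
dim(wbcd)

# Column names; the unnamed last column is called X by read.csv()
names(wbcd)
```

With the real file, dim() should report the 569 rows and 33 columns mentioned above.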

2. Exploratory Data Analysis

This section explores the data through visualization. To find out the columns and their types, I used the str() function.

From the result above, we know the following:
1. The first column is id. It is unique for every row and unnecessary for prediction, so it should be removed. I removed it by assigning NULL to the column.

2. The second column is diagnosis. This should be the class variable. Currently its type is character, and it should be converted to a factor.

3. The last column is X. All of its values are NA, so it should be removed. I removed it by assigning NULL to the column.
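The three clean-up steps above can be sketched as follows. The small data frame is made up for illustration and only mirrors the column names discussed here; `wbcd` is a hypothetical variable name.

```r
# Made-up rows that mirror the structure described above (illustration only).
wbcd <- data.frame(
  id          = c(101, 102, 103),
  diagnosis   = c("M", "B", "B"),
  radius_mean = c(17.99, 13.54, 12.45),
  X           = NA,
  stringsAsFactors = FALSE
)

# Inspect the columns and their types
str(wbcd)

# 1. Drop the identifier column: it carries no predictive information
wbcd$id <- NULL

# 2. Convert the class variable from character to factor
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"))

# 3. Drop the empty last column
wbcd$X <- NULL

str(wbcd)
```

Assigning NULL to a column removes it from the data frame in place, which is why no separate subsetting step is needed.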

2.1 Univariate Data Analysis

Analysis of a single variable.
Examples: boxplot, histogram, pie chart.

Number of benign (B) and malignant (M) cases in the diagnosis column of the breast cancer dataset.

Distribution of the radius_mean variable as a boxplot.

Distribution of the radius_mean variable as a histogram.

After making the individual diagrams, we combine them into one figure using the gridExtra package so that the data can be viewed at a glance. The library must be loaded before its functions can be used. Here is the result:
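A minimal sketch of the three univariate plots combined into one figure. It assumes the ggplot2 and gridExtra packages are installed and uses grid.arrange() from gridExtra for the layout; the data are simulated stand-ins, not the real measurements, and the object names are hypothetical.

```r
library(ggplot2)
library(gridExtra)

# Simulated stand-in data (not the real measurements).
set.seed(1)
wbcd <- data.frame(
  diagnosis   = factor(rep(c("B", "M"), times = c(60, 40))),
  radius_mean = c(rnorm(60, mean = 12, sd = 1.8),
                  rnorm(40, mean = 17, sd = 3))
)

# Count of each class in the diagnosis column
p_bar <- ggplot(wbcd, aes(x = diagnosis)) +
  geom_bar()

# Distribution of radius_mean as a boxplot
p_box <- ggplot(wbcd, aes(y = radius_mean)) +
  geom_boxplot()

# Distribution of radius_mean as a histogram
p_hist <- ggplot(wbcd, aes(x = radius_mean)) +
  geom_histogram(bins = 20)

# Place the three plots side by side in one figure
grid.arrange(p_bar, p_box, p_hist, ncol = 3)
```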

2.2 Bivariate Data Analysis

Analysis of two variables.
Examples: pair plot, scatter plot, and point plot.

Distribution of the radius_mean variable by diagnosis.

Here is the result for the bivariate diagram using a boxplot:

Distribution of the radius_mean variable by diagnosis (with jitter and colour).

Here is the result for the diagram using a boxplot with jitter:

Analysis of all variables by diagnosis using density plots, drawn with the geom_density() function.

Here is the result for geom_density():

Observations plotted by the radius_mean and texture_mean variables. Each point is a single observation.
The colour and shape of each observation are based on the diagnosis (benign or malignant).

Here is the result:

In general, benign cases have lower radius_mean and texture_mean measurements than malignant cases. However, these two variables are not enough to separate the classes.
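The bivariate plots described in this section can be sketched as below, assuming ggplot2 is installed. The data are simulated so that the malignant class has larger values on average; they are not the real measurements, and the object names are hypothetical.

```r
library(ggplot2)

# Simulated stand-in data (not the real measurements).
set.seed(2)
wbcd <- data.frame(
  diagnosis    = factor(rep(c("B", "M"), times = c(60, 40))),
  radius_mean  = c(rnorm(60, mean = 12, sd = 1.8),
                   rnorm(40, mean = 17, sd = 3)),
  texture_mean = c(rnorm(60, mean = 18, sd = 4),
                   rnorm(40, mean = 22, sd = 4))
)

# Boxplot per class, with the individual observations jittered on top
p_box <- ggplot(wbcd, aes(x = diagnosis, y = radius_mean)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(aes(colour = diagnosis), width = 0.2, alpha = 0.6)

# Density of radius_mean per class
p_dens <- ggplot(wbcd, aes(x = radius_mean, fill = diagnosis)) +
  geom_density(alpha = 0.5)

# Scatter plot: colour and shape both encode the diagnosis
p_scatter <- ggplot(wbcd, aes(x = radius_mean, y = texture_mean,
                              colour = diagnosis, shape = diagnosis)) +
  geom_point()

print(p_box)
print(p_dens)
print(p_scatter)
```

In the scatter plot, overlapping regions between the two colours are where these two variables alone cannot tell the classes apart.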

2.3 Multivariate Data Analysis

Analysis of three or more variables. Example: correlation coefficient.

Visualize Pearson’s Correlation Coefficient for *_mean variables.

Visualize Pearson’s Correlation Coefficient for *_se variables.

Visualize Pearson’s Correlation Coefficient for *_worst variables.

From the correlation coefficient, we can see that area, radius, and perimeter are co-linear. So, we need to remove two of them: area and perimeter.

We can also see that compactness, concavity, and concave points are collinear. So, we need to remove two of them: compactness and concavity.
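A small sketch of how collinear variables can be spotted from Pearson's correlation matrix, using base R only. The data are simulated so that perimeter and area are derived from radius, and the 0.9 threshold is an arbitrary choice for illustration, not a value taken from this report. Plotting the matrix is left out.

```r
# Simulated data: perimeter and area are functions of radius plus noise.
set.seed(3)
radius <- rnorm(200, mean = 14, sd = 3.5)
means <- data.frame(
  radius_mean    = radius,
  perimeter_mean = 2 * pi * radius + rnorm(200, sd = 1),
  area_mean      = pi * radius^2 + rnorm(200, sd = 20),
  texture_mean   = rnorm(200, mean = 19, sd = 4)
)

# Pearson's correlation coefficient for every pair of variables
corr <- cor(means, method = "pearson")
round(corr, 2)

# Pairs whose absolute correlation exceeds the (arbitrary) threshold;
# upper.tri() keeps each pair once and skips the diagonal.
high <- which(abs(corr) > 0.9 & upper.tri(corr), arr.ind = TRUE)
data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = round(corr[high], 3)
)
```

From each group of highly correlated variables, one is kept and the others are dropped.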

3. Data Preparation

This section covers the data preparation steps.

3.1 Feature Selection

Remove the *_worst variables. Based on discussion with a domain expert, all variables ending in worst should be removed.

Remove area, perimeter, compactness, concavity.
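A minimal sketch of this selection using pattern matching on the column names. The data frame is made up and holds only a few of the columns, and dropping the listed measurements from both the _mean and the _se groups is an assumption made for this example.

```r
# Made-up data frame with a subset of the column names (illustration only).
wbcd <- data.frame(
  diagnosis           = factor(c("M", "B", "B")),
  radius_mean         = c(17.99, 13.54, 12.45),
  texture_mean        = c(10.38, 14.36, 15.70),
  perimeter_mean      = c(122.8, 87.46, 82.57),
  area_mean           = c(1001.0, 566.3, 477.1),
  compactness_mean    = c(0.28, 0.08, 0.17),
  concavity_mean      = c(0.30, 0.07, 0.16),
  concave.points_mean = c(0.15, 0.05, 0.08),
  radius_se           = c(1.10, 0.27, 0.33),
  area_se             = c(153.4, 23.6, 27.2),
  radius_worst        = c(25.4, 15.1, 15.5),
  area_worst          = c(2019, 711, 741)
)

# Columns ending in _worst
drop_worst <- grepl("_worst$", names(wbcd))

# Columns for area, perimeter, compactness, and concavity
# (assumption: dropped for every suffix, not only _mean)
drop_collinear <- grepl("^(area|perimeter|compactness|concavity)_", names(wbcd))

selected <- wbcd[, !(drop_worst | drop_collinear), drop = FALSE]
names(selected)
```

Note that the pattern `concavity_` does not match `concave.points_mean`, so the concave points variable is kept.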

3.2 Remove Outliers

3.3 Feature Scaling

3.4 PCA

3.5 Training and Test Division

Use set.seed() for reproducible results. The train:test ratio is 70:30.
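A minimal sketch of a reproducible 70:30 split with base R. The seed value is arbitrary, the data are simulated with the same number of rows as the dataset, and this is a plain random split; the exact splitting function used for this report is not shown here, so treat it as one possible approach.

```r
# Fix the random number generator so the split is the same on every run
set.seed(123)  # arbitrary seed value

# Simulated stand-in data with 569 rows (not the real measurements)
n <- 569
wbcd <- data.frame(
  diagnosis   = factor(sample(c("B", "M"), n, replace = TRUE)),
  radius_mean = rnorm(n, mean = 14, sd = 3.5)
)

# Draw 70% of the row indices for training; the rest form the test set
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- wbcd[train_idx, ]
test  <- wbcd[-train_idx, ]

c(train = nrow(train), test = nrow(test))
```

Because the indices are sampled without replacement, no row appears in both sets.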

4. Modelling

In this part, we used four machine learning algorithms.

4.1 Logistic Regression
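As an illustration of this model, here is a minimal sketch of logistic regression with glm() from base R. The data are simulated, the two predictors and the 0.5 cut-off are assumptions chosen for the example, and the numbers it produces are not the results of this report.

```r
# Simulated stand-in data (not the real measurements).
set.seed(42)
n <- 300
diagnosis <- factor(sample(c("B", "M"), n, replace = TRUE), levels = c("B", "M"))
is_m <- diagnosis == "M"
sim <- data.frame(
  diagnosis    = diagnosis,
  radius_mean  = rnorm(n, mean = ifelse(is_m, 17, 12), sd = 2.5),
  texture_mean = rnorm(n, mean = ifelse(is_m, 22, 18), sd = 4)
)

# 70:30 train/test split
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# With a factor response, glm() models the probability of the second level ("M")
fit <- glm(diagnosis ~ radius_mean + texture_mean,
           data = train, family = binomial)

# Predicted probability of malignant, turned into a class with a 0.5 cut-off
prob_m <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob_m > 0.5, "M", "B"), levels = c("B", "M"))

# Confusion matrix on the test set
table(predicted = pred, actual = test$diagnosis)
```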

The result is:

4.2 Decision Tree

The plot of the decision tree is:

4.3 Random Forest

The result from the random forest is:

4.4 Support Vector Machine (SVM)

The result for the SVM is:

5. Evaluation

We compute accuracy, precision, recall, and F1 Score.
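A short sketch of how these four metrics follow from a confusion matrix, treating malignant (M) as the positive class. The two label vectors are made up for illustration and are not predictions from any of the models above.

```r
# Made-up true and predicted labels (illustration only)
actual    <- factor(c("M", "M", "M", "M", "B", "B", "B", "B", "B", "B"),
                    levels = c("B", "M"))
predicted <- factor(c("M", "M", "M", "B", "B", "B", "B", "B", "M", "B"),
                    levels = c("B", "M"))

# Confusion matrix: rows are predictions, columns are true labels
cm <- table(predicted = predicted, actual = actual)

tp <- cm["M", "M"]  # malignant predicted as malignant
fp <- cm["M", "B"]  # benign predicted as malignant
fn <- cm["B", "M"]  # malignant predicted as benign
tn <- cm["B", "B"]  # benign predicted as benign

accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 3)
```

For a diagnosis task, recall on the malignant class deserves particular attention, since a false negative means a missed cancer.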

6. Recommendation

  1. The random forest algorithm is the best among all the tested algorithms.
  2. Based on the decision tree model, the most important variables are concave.points, radius_mean, and texture_mean.
  3. The results can be improved by better data preparation or by using other algorithms. However, the current results surpass human-level performance (79% accuracy), so the model can be deployed as a second opinion for the doctor.

Learn by yourself :)