Phishing Websites Detection Using Machine Learning
Article · September 2019
DOI: 10.35940/ijrte.B1018.0982S1119
1 author:
Akila D.
Saveetha College of Liberal Arts and Sciences
Abstract— Phishing is a common attack on credulous people by
making them disclose their personal information through
counterfeit websites. The objective of phishing websites is to
steal personal information such as user names, passwords and
online banking transactions. Phishers use websites that are
visually and semantically similar to the real websites. As
technology continues to advance, phishing techniques are
evolving rapidly, and anti-phishing mechanisms are needed to
detect and prevent them. Machine learning is a powerful tool for
combating phishing attacks. This paper surveys the features used
for detection and the detection techniques based on machine
learning.
Keywords— Phishing, Phishing Websites, Detection, Machine
Learning
Phishing is among the most dangerous criminal activities in
cyberspace. Since most users go online to access the services
provided by government and financial institutions, phishing
attacks have increased significantly over the past few years.
Phishers have turned these attacks into a profitable business.
They use various methods to attack vulnerable users, such as
messaging, VoIP, spoofed links and counterfeit websites. It is
very easy to create a counterfeit website that looks like a
genuine website in terms of layout and content; the content of
such a site may even be identical to that of the legitimate
site. The reason for creating these websites is to obtain
private data from users, such as account numbers, login IDs and
debit and credit card passwords. Moreover, attackers ask users
to answer security questions, presenting this as a high-level
security measure. When users respond to those questions, they
are easily trapped by the phishing attack. Many research
communities around the world are working to prevent phishing
attacks. Phishing attacks can be prevented by detecting
phishing websites and by making users aware of how to identify
them. Machine learning algorithms have been among the most
powerful techniques for detecting phishing websites. This study
discusses various methods of detecting phishing websites.
The authors of [1] described a novel approach to detecting
phishing websites using machine learning algorithms. They
compared the accuracy of five machine learning algorithms:
Decision Tree (DT), Random Forest (RF), Gradient Boosting
(GBM), Generalized Linear Model (GLM) and Generalized Additive
Model (GAM). Accuracy, precision and recall were calculated for
each algorithm and compared. Thirty website attributes were
extracted with Python, and the performance evaluation was done
in the open source programming language R. The performance of
the top three algorithms, namely Decision Tree, Random Forest
and GBM, was compared in a table. The reported results show
that Random Forest performed best, with 98.4% accuracy, 98.59%
recall and 97.70% precision.
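The three evaluation measures compared above can be computed from the counts of a binary confusion matrix. The following is a minimal Python sketch, assuming "phishing" is the positive class; the function name `classification_metrics` and the example counts are hypothetical and are not taken from [1].

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision and recall from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives and
    false negatives, with "phishing" assumed to be the positive class.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical counts for illustration only (not results from [1]).
example = classification_metrics(tp=90, fp=10, tn=85, fn=15)
```

Precision penalizes legitimate sites flagged as phishing, while recall penalizes phishing sites that are missed, which is why surveys of this kind usually report both alongside accuracy.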
The authors of [2] proposed a classification model to classify
phishing attacks. The model comprises feature extraction from
websites and classification of the websites. For feature
extraction, 30 features were taken from the UCI Machine
Learning Repository dataset, and the phishing feature
extraction rules were clearly defined. To classify these
features, Support Vector Machine (SVM), Naïve Bayes (NB) and
Extreme Learning Machine (ELM) were used. The ELM was tested
with six activation functions and achieved 95.34% accuracy,
higher than SVM and NB. The results were obtained with MATLAB.
The authors of [3] presented an approach to detecting phishing
email attacks using natural language processing and machine
learning. It performs semantic analysis of the text to detect
malicious intent. A natural language processing (NLP) technique
parses each sentence and finds the semantic roles of the words
in the sentence in relation to the predicate. Based on the role
of each word in the sentence, the method recognizes whether the
sentence is a question or a command. Supervised machine
learning is used to generate the blacklist of malicious pairs.
The authors defined an algorithm, SEAHound, for detecting
phishing emails, and the Netcraft Anti-Phishing Toolbar is used
to verify the validity of a URL. The algorithm is implemented
with Python scripts and uses the Nazario phishing email
dataset. The results of Netcraft and SEAHound were compared;
they obtained precision of 98% and 95%, respectively.
R. Kiruthiga, D. Akila
This result demonstrates that semantic information is a strong
indicator of social engineering.
Another approach, by the authors of [4], applies feature
selection algorithms to reduce the dimensionality of the
dataset and obtain higher classification performance. It was
also compared with other data mining classification algorithms,
and the results were reported. The phishing website dataset was
taken from the UCI Machine Learning Repository. The results
show that some classification methods improve in performance
with the reduced feature set, while others decline. Bayesian
Network, Stochastic Gradient Descent (SGD), lazy.K.Star,
Randomizable Filtered Classifier, Logistic Model Tree (LMT) and
ID3 (Iterative Dichotomiser 3) benefit from the reduced
phishing dataset, whereas Multilayer Perceptron, JRip, PART,
J48, Random Forest and Random Tree do not. Lazy.K.Star obtained
97.58% accuracy with 27 reduced features. The study was carried
out with the WEKA software.
The authors of [5] proposed a model for recognizing phishing
sites through URL detection using the Random Forest algorithm.
The model has three stages: parsing, heuristic classification
of data, and performance analysis. Parsing is used to analyze
the feature set. The dataset was gathered from PhishTank. Out
of 31 features, only 8 are considered for parsing. The Random
Forest method obtained an accuracy of 95%.
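A parsing stage of this kind can be sketched with Python's standard `urllib.parse` module. The survey does not list the 8 features selected in [5], so the features below are illustrative assumptions of commonly used lexical URL features, and the function name `url_features` is hypothetical.

```python
import re
from urllib.parse import urlparse


def url_features(url):
    """Extract a few illustrative lexical features from a URL string.

    These features are assumptions for demonstration; they are not the
    feature set used in [5].
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "has_at_symbol": "@" in url,
        # Host written as a dotted IPv4 address instead of a domain name.
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "num_dots_in_host": host.count("."),
        "has_hyphen_in_host": "-" in host,
        "uses_https": parsed.scheme == "https",
    }


features = url_features("http://192.168.0.1/login")
```

A feature dictionary of this form could then be converted to a numeric vector and passed to any classifier, such as a Random Forest.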
The authors of [6] proposed a flexible filtering decision
module that uses a neural network model to extract features
automatically, without any expert knowledge of the URL domain.
In this approach the authors use all the characters in the URL
string and count their byte values. They not only count byte
values but also overlap parts of neighbouring characters by
shifting 4 bits, which embeds combined information about two
characters appearing in sequence. Counting how many times each
value appears in the original URL string yields a
512-dimensional vector. The neural network model was tested
with three optimizers: Adam, AdaDelta and SGD. Adam was the
best optimizer, with an accuracy of 94.18%. The authors also
conclude that the accuracy of this model is higher than that of
a previously proposed complex neural network topology.
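One possible reading of the byte-counting scheme described above can be sketched in Python. The first 256 entries count raw byte values; the remaining 256 count an 8-bit value formed from the low 4 bits of one byte and the high 4 bits of the next. The function name `url_to_vector` and this exact bit layout are assumptions made for illustration; the layout used in [6] may differ.

```python
def url_to_vector(url):
    """Map a URL string to a 512-dimensional count vector (a sketch)."""
    data = url.encode("utf-8")
    vector = [0] * 512

    # Entries 0..255: how often each byte value occurs in the URL.
    for byte in data:
        vector[byte] += 1

    # Entries 256..511: counts of the value obtained by shifting the
    # window 4 bits, i.e. the low nibble of one byte joined with the
    # high nibble of the following byte (assumed layout).
    for first, second in zip(data, data[1:]):
        shifted = ((first & 0x0F) << 4) | (second >> 4)
        vector[256 + shifted] += 1

    return vector


vec = url_to_vector("http://example.com")
```

The resulting vector has a fixed length regardless of the URL length, so it can be fed directly to a neural network without hand-crafted features.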
The authors of [7] made a comparative study of malicious URL
detection using a classical machine learning technique,
logistic regression with bigrams, and deep learning
architectures, namely a convolutional neural network (CNN) and
a CNN long short-term memory network (CNN-LSTM). Phishing URLs
were collected from PhishTank and OpenPhish, and malicious URLs
from MalwareDomainlist and MalwareDomains. In the comparison,
CNN-LSTM obtained 98% accuracy. The authors used TensorFlow in
conjunction with Keras for deep learning.
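To make the bigram representation concrete, the sketch below counts character bigrams in a URL using only the Python standard library. Whether [7] uses character-level or token-level bigrams is not stated in this survey, so the character-level choice and the function name `char_bigrams` are assumptions.

```python
from collections import Counter


def char_bigrams(url):
    """Count overlapping two-character substrings of a URL (a sketch)."""
    return Counter(url[i:i + 2] for i in range(len(url) - 1))


counts = char_bigrams("paypal.com")
```

Counts of this kind can serve as sparse input features for a linear classifier such as logistic regression.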
The authors of [8] also proposed a reduced feature selection
model to detect phishing websites. They used Logistic
Regression (LR) and Support Vector Machine (SVM) as
classification methods to validate the feature selection
method. Of the 30 site features, 19 were selected and used for
phishing detection. The performance of LR and SVM was assessed
in terms of precision, recall, F-measure and accuracy. The
study shows that SVM performed better than LR.
The authors of [9] proposed a model that detects phishing
effectively by mining the semantic features of word embeddings,
semantic features and multi-scale statistical features in
Chinese web pages. Eleven features were extracted and
categorized into five classes to acquire the statistical
features of web pages. AdaBoost, Bagging, Random Forest and SMO
were used for training and testing the model. The legitimate
URL dataset was obtained from the DirectIndustry web guides,
and the phishing data from the Anti-Phishing Alliance of China.
According to the study, the semantic features alone identified
phishing sites with high detection efficiency, and the fusion
model achieved the best detection performance. This model is
unique to Chinese web pages and it has dependency in certain
The paper [10] proposes an efficient way to detect phishing
URLs using the C4.5 decision tree approach. The technique
extracts features from the sites and calculates heuristic
values. These values are given to the C4.5 decision tree
algorithm to determine whether the site is phishing or not. The
dataset was collected from PhishTank and Google. The process
has two phases, a pre-processing phase and a detection phase.
In the pre-processing phase, features are extracted based on
rules; in the detection phase, the features and their
respective values are input to the C4.5 algorithm, which
obtained 89.40% accuracy.
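C4.5 chooses the attribute to split on by its gain ratio, that is, the information gain divided by the split information. The sketch below computes this criterion for a discrete feature in plain Python. It is a simplified illustration: it omits C4.5's handling of continuous attributes, missing values and pruning, and the function names and toy data are hypothetical rather than taken from [10].

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Gain ratio of a discrete feature with respect to the class labels."""
    total = len(labels)
    # Group the class labels by feature value.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after splitting on the feature.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data for illustration only.
labels = ["phishing", "phishing", "legitimate", "legitimate"]
informative = gain_ratio([1, 1, 0, 0], labels)    # separates the classes
uninformative = gain_ratio([1, 0, 1, 0], labels)  # tells nothing about them
```

A feature that perfectly separates phishing from legitimate sites gets the highest gain ratio and would be chosen for the split.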
The authors of [11] created a Google Chrome extension that
detects phishing website content with the help of machine
learning algorithms. The UCI Machine Learning Repository
dataset was used, and 22 features were extracted from it. The
kNN, SVM and Random Forest algorithms were compared on
precision, recall, F1-score and accuracy. Random Forest
obtained the best score. The Chrome extension was implemented
with HTML, JavaScript and CSS, along with Python. A drawback of
this extension is that it relies on a declared list of
malicious sites, which grows every day.
The paper [12] presents a framework for extracting features
flexibly and simply with new strategies. Phishing data was
collected from PhishTank and legitimate URLs from Google. C#
and R were used to obtain the text properties. In total, 133
features were obtained from the dataset and from third-party
service providers. CFS subset-based and consistency
subset-based feature selection methods were used, and the
results were analyzed with the WEKA tool. Naïve Bayes and
Sequential Minimal Optimization (SMO) were compared for
performance evaluation, and the authors preferred SMO over NB
for phishing detection.
Another heuristic feature detection method, by the authors of
[13], uses URL features such as PrimaryDomain, SubDomain and
PathDomain, and website ranking features such as PageRank,
AlexaRank and AlexaReputation, to identify phishing websites.
The dataset was taken from PhishTank, and the experiment was
split into 6 phases using MySQL and PHP with 10 testing
datasets. The proposed model contains two phases. In Phase I
the site features are extracted, and in Phase II six heuristic
values are calculated. According to the authors, if the
heuristic value is close to one, the site is considered
legitimate, and if it is close to zero, the site is suspected
of being a phishing site. Root Mean Square Error (RMSE) is used
to calculate accuracy, and the method obtained 97% accuracy.
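The decision rule and the error measure described above can be sketched in Python. The survey does not say how the six heuristic values are combined in [13], so averaging them and using 0.5 as the cut-off are assumptions made for illustration, as are the function names and the example values.

```python
import math


def classify(heuristics, threshold=0.5):
    """Average the heuristic values and label the site (assumed rule)."""
    score = sum(heuristics) / len(heuristics)
    label = "legitimate" if score >= threshold else "phishing"
    return score, label


def rmse(predicted, actual):
    """Root mean square error between predicted and true scores."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical heuristic values for two sites (not data from [13]).
score_a, label_a = classify([0.9, 0.8, 1.0, 0.7, 0.9, 0.8])
score_b, label_b = classify([0.1, 0.2, 0.0, 0.3, 0.1, 0.2])

# True scores: 1.0 for a legitimate site, 0.0 for a phishing site.
error = rmse([score_a, score_b], [1.0, 0.0])
```

A lower RMSE means the heuristic scores lie closer to the ideal values of one for legitimate sites and zero for phishing sites.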
The author of [14] introduces a phishing URL detection system
based on URL lexical analysis, named PhishScore. The approach
is based on intra-URL relatedness [14][18], which reflects the
relationship among the parts of a URL. About 12 features
extracted from a single URL are fed to machine learning
algorithms to identify phishing URLs. The experiment yielded an
accuracy of 94.91%.
The paper [15] focuses on detecting phishing website URLs using
domain name features. The content-based, heuristic-based and
blacklist-based approaches to detecting web spoofing attacks
[8][17] are explained, and the proposed model, PhishChecker,
was developed with Microsoft Visual Studio Express 2013 and the
C# language. The dataset was taken from PhishTank and the Yahoo
directory, and the model obtained an accuracy of 96%. This
approach checks only the validity of URLs.
Table 1: Outline of Algorithms used to detect Phishing Website URLs
This survey presented various algorithms and approaches that
researchers in machine learning have used to detect phishing
websites. On reviewing the papers, we conclude that most of the
work relies on familiar machine learning algorithms such as
Naïve Bayes, SVM, Decision Tree and Random Forest. Some authors
proposed new systems, such as PhishScore and PhishChecker, for
detection. Different combinations of features were evaluated
with regard to accuracy, precision, recall, etc. Experimentally
successful techniques for detecting phishing website URLs are
summarized in Table 1. As the number of phishing websites
increases day by day, some features may need to be added or
replaced with new ones to detect them.
