Chapter 4 Probabilistic Learning - Classification Using Naive Bayes
Probabilistic methods use data on past events to extrapolate to future events
E.g., the chance of rain describes the proportion of prior days with similar measurable atmospheric conditions in which precipitation occurred
Topics of the chapter

Basic principles of probability
The specialized methods and data structures needed to analyze text data with R
How to employ Naive Bayes to build an SMS junk message filter
Understanding Naive Bayes
technique descended from the work of the 18th century mathematician Thomas Bayes
Bayesian methods - foundational principles to describe the probability of events

probability is a number between 0 and 1 (that is, between 0 percent and 100 percent)
The idea: utilize training data to calculate an observed probability of each outcome based on the evidence provided by feature values
When the classifier is later applied to unlabeled data, it uses the observed probabilities to predict the most likely class for the new features
Suitable application areas - successful in:

Text classification, such as junk e-mail (spam) filtering
Intrusion or anomaly detection in computer networks
Diagnosing medical conditions given a set of observed symptoms
best applied to problems in which the information from numerous attributes should be considered simultaneously in order to estimate the overall probability of an outcome
Bayesian methods utilize all the available evidence to subtly change the predictions

Basic concepts of Bayesian methods
Bayesian probability theory is rooted in the idea that the estimated likelihood of an event, or a potential outcome, should be based on the evidence at hand across multiple trials, or opportunities for the event to occur
Event and trial: an event is an outcome of interest; a trial is a single opportunity for the event to occur
Understanding probability
The probability of an event is estimated from the observed data by dividing the number of trials in which the event occurred by the total number of trials
E.g., if it rained 3 out of 10 days with similar conditions as today, the probability of rain today can be estimated as 3 / 10 = 0.30, i.e., P(rain) = 0.30
E.g., if 10 out of 50 prior email messages were spam, then the probability of any incoming message being spam can be estimated as 10 / 50 = 0.20, i.e., P(spam) = 0.20
The probability of all the possible outcomes of a trial must always sum to 1 - mutually exclusive and exhaustive events
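The count-based estimates above can be sketched in base R; the counts are taken from the spam example in this section.

```r
# Estimate P(spam) from observed counts: 10 spam messages out of 50 prior messages
n_spam  <- 10
n_total <- 50
p_spam  <- n_spam / n_total   # 10 / 50 = 0.20
p_ham   <- 1 - p_spam         # mutually exclusive and exhaustive: outcomes sum to 1
```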

complement
Complement (A^c or A') - the event comprising the outcomes in which the event of interest (A) does not happen
P(A) = 1 - P(A^c)
E.g., P(spam) and P(ham), where P(spam) + P(ham) = 1
Understanding joint probability
monitoring several nonmutually exclusive events for the same trial
If certain events occur with the event of interest, we may be able to use them to make predictions

Venn diagram - intersection of two events
P(spam ∩ Viagra) - the probability that both spam and Viagra occur
A ∩ B refers to the event in which both A and B occur
joint probability - how the probability of one event is related to the probability of the other
independent events - two events are totally unrelated; knowing the outcome of one event does not provide any information about the outcome of the other
E.g., the outcome of a heads result on a coin flip is independent of whether the weather is rainy or sunny on any given day
dependent events
E.g., the presence of clouds is predictive of a rainy day
E.g., the appearance of the word Viagra is predictive of a spam e-mail
Calculating the probability of dependent events
For independent events A and B, the probability of both happening can be expressed as P(A ∩ B) = P(A) * P(B)
Computing conditional probability with Bayes' theorem
P(A|B) is read as the probability of event A, given that event B occurred
conditional probability: the probability of A is dependent on what happened with event B
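The independence rule can be sketched in R, using a fair coin and a hypothetical chance of rain for the coin-flip and weather example:

```r
# For independent events, P(A intersect B) = P(A) * P(B)
p_heads <- 0.5   # fair coin flip
p_rain  <- 0.3   # hypothetical probability of rain
p_heads_and_rain <- p_heads * p_rain   # valid only because the events are independent
```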

Another formulation of Bayes' theorem: P(A ∩ B) = P(A|B) * P(B) and P(A ∩ B) = P(B|A) * P(A), because P(A ∩ B) = P(B ∩ A)
prior probability, e.g., P(spam) - the probability that any prior message was spam
likelihood, e.g., P(Viagra|spam) - the probability that the word Viagra was used in previous spam messages
marginal likelihood, e.g., P(Viagra) - the probability that Viagra appeared in any message at all
posterior probability, e.g., P(spam|Viagra) - how likely the message is to be spam
Frequency table & likelihood table
Frequency table (left):
one dimension of the table indicates levels of the class variable (spam or ham)
the other dimension indicates levels for features (Viagra: yes or no)
Likelihood table (right):
rows of the likelihood table indicate the conditional probabilities for Viagra (yes/no), given that an e-mail was either spam or ham
Computations
P(Viagra=Yes|spam) = 4/20 = 0.20

P(spam ∩ Viagra) = P(Viagra|spam) * P(spam) = (4/20) * (20/100) = 0.04
The same result can be found in the frequency table
four times greater than the previous estimate of 0.01 we calculated as P(A ∩ B) = P(A) * P(B) under the false assumption of independence
posterior probability: P(spam|Viagra) = P(Viagra|spam) * P(spam) / P(Viagra) = (4/20) * (20/100) / (5/100) = 0.80
The Naive Bayes algorithm
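The posterior computation above can be reproduced as a short R sketch, using the counts from the frequency table (20 of 100 messages are spam; Viagra appears in 4 of the 20 spam messages and in 5 of the 100 messages overall):

```r
p_viagra_given_spam <- 4 / 20    # likelihood
p_spam              <- 20 / 100  # prior probability
p_viagra            <- 5 / 100   # marginal likelihood
# Bayes' theorem: P(spam|Viagra) = P(Viagra|spam) * P(spam) / P(Viagra)
p_spam_given_viagra <- p_viagra_given_spam * p_spam / p_viagra
```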

the most common machine learning method that utilizes Bayesian methods
de facto standard for text classification
"naive" assumption - assumes that all of the features in the dataset are equally important and independent; rarely true in most real-world applications
E.g., the e-mail sender may be a more important indicator of spam than the message text
E.g., the appearance of some words is a very good indication that other words are also likely to appear
Naive Bayes still performs fairly well in most cases when these assumptions are violated; this holds true even in extreme circumstances where strong dependencies are found among the features
One explanation: it is not important to obtain a precise estimate of probability, so long as the predictions are accurate
Classification with Naive Bayes

adding a few additional terms to be monitored in addition to the term Viagra: Money, Groceries, and Unsubscribe The Naive Bayes learner is trained by constructing a likelihood table for the appearance of these four words (labeled W1, W2, W3, and W4) As new messages are received, we need to calculate the posterior probability to determine whether they are more likely to be spam or ham, given the likelihood of the words found in the message text The problem to solve

the probability that a message is spam, given that Viagra = Yes, Money = No, Groceries = No, and Unsubscribe = Yes
this formula is computationally difficult to solve - tremendous amounts of memory are needed to store probabilities for all of the possible intersecting events
class-conditional independence: events are independent so long as they are conditioned on the same class value
Simplified formula: P(spam | W1 ∩ ¬W2 ∩ ¬W3 ∩ W4) ∝ P(W1|spam) * P(¬W2|spam) * P(¬W3|spam) * P(W4|spam) * P(spam)

Note: Because the denominator does not depend on the class (spam or ham), it is treated as a constant value and can be ignored for the time being
the equals symbol has been replaced by the proportional-to symbol
overall likelihood of spam: (4/20) * (10/20) * (20/20) * (12/20) * (20/100) = 0.012
overall likelihood of ham: (1/80) * (66/80) * (71/80) * (23/80) * (80/100) = 0.002
probability of spam: 0.012 / (0.012 + 0.002) = 0.857
probability of ham: 0.002 / (0.012 + 0.002) = 0.143
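The same arithmetic in R; note that the chapter's 0.857 uses the ham likelihood rounded to 0.002, so the unrounded posterior is closer to 0.85:

```r
# Class-conditional likelihoods from the likelihood table, times the class priors
lik_spam <- (4/20) * (10/20) * (20/20) * (12/20) * (20/100)  # 0.012
lik_ham  <- (1/80) * (66/80) * (71/80) * (23/80) * (80/100)  # ~0.0021
# Normalize so the two posteriors sum to 1
p_spam_post <- lik_spam / (lik_spam + lik_ham)
p_ham_post  <- lik_ham  / (lik_spam + lik_ham)
```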

The Naive Bayes classification algorithm
The probability of level L for class C, given the evidence provided by features F1 through Fn:
P(C_L | F_1, ..., F_n) = (1/Z) * p(C_L) * product of p(F_i | C_L) over i = 1 to n
i.e., the product of the probabilities of each piece of evidence conditioned on the class level, times the prior probability of the class level, times a scaling factor 1/Z, which converts the likelihood values into probabilities
The Laplace estimator

Suppose that we received another message, this time containing all four terms: Viagra, Groceries, Money, and Unsubscribe the likelihood of spam: (4/20) * (10/20) * (0/20) * (12/20) * (20/100) = 0 likelihood of ham: (1/80) * (14/80) * (8/80) * (23/80) * (80/100) = 0.00005 very likely that the message has been incorrectly classified!

This problem arises if an event never occurs for one or more levels of the class, e.g., P(Groceries|spam) = 0
the absence of the word Groceries in spam will always veto the other evidence and result in the probability of spam being zero
The solution: the Laplace estimator
named after the French mathematician Pierre-Simon Laplace
adds a small number to each of the counts in the frequency table
Typically, the Laplace estimator is set to 1

ensures that each class-feature combination is found in the data at least once The Laplace estimator can be set to any value and does not necessarily have to be the same for each of the features. you could use a Laplace estimator to reflect a presumed prior probability of how the feature relates to the class. In practice, given a large enough training dataset, this step is unnecessary and the value of 1 is almost always used how this affects our prediction for this message

add one to each numerator in the likelihood function
The total number of added 1 values (here, 4) must also be added to each conditional probability denominator
The likelihood of spam is: (5/24) * (11/24) * (1/24) * (13/24) * (20/100) = 0.0004
The likelihood of ham is: (2/84) * (15/84) * (9/84) * (24/84) * (80/100) = 0.0001
This means that the probability of spam is 80 percent, and the probability of ham is 20 percent
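The Laplace-corrected arithmetic in R (counts from the likelihood table, plus 1 in each numerator and 4 in each denominator); the exact posterior is about 0.805, which the chapter rounds to 80 percent:

```r
# Laplace-smoothed likelihoods times the class priors
lik_spam <- (5/24) * (11/24) * (1/24) * (13/24) * (20/100)   # ~0.0004
lik_ham  <- (2/84) * (15/84) * (9/84) * (24/84) * (80/100)   # ~0.0001
p_spam_post <- lik_spam / (lik_spam + lik_ham)
```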

Using numeric features with Naive Bayes
Because Naive Bayes uses frequency tables to learn the data, each feature must be categorical in order to create the combinations of class and feature values comprising the matrix. Since numeric features do not have categories of values, the preceding algorithm does not work directly with numeric data
discretize numeric features - the numbers are put into categories known as bins. Discretization is also sometimes called binning

the most common way is to explore the data for natural categories or cut points in the distribution of the data
Example: add a feature to the spam dataset that records the time of night or day the e-mail was sent, from 0 to 24 hours past midnight
If there are no obvious cut points, one option is to discretize the feature using quantiles:

three bins with tertiles, four bins with quartiles, five bins with quintiles, etc.
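A minimal sketch of quantile-based binning in base R, using a simulated hour-of-day feature (the hypothetical time-of-day feature described above, not an actual column in the SMS dataset):

```r
set.seed(123)
hour <- runif(100, min = 0, max = 24)   # hypothetical send times, 0-24 hours past midnight
# cut() with quantile() breaks puts roughly equal numbers of observations in each bin
hour_bin <- cut(hour,
                breaks = quantile(hour, probs = seq(0, 1, by = 0.25)),
                include.lowest = TRUE)
table(hour_bin)   # counts per quartile bin
```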

Too few bins can result in important trends being obscured. Too many bins can result in small counts in the Naive Bayes frequency table, which can increase the algorithm's sensitivity to noisy data Example filtering mobile phone spam with the Naive Bayes algorithm SMS spam unwanted advertising via Short Message Service (SMS) text messages

particularly troublesome because many cellular phone users pay a fee per SMS received
additional challenges for automated filters
SMS messages are often limited to 160 characters, reducing the amount of text that can be used to identify whether a message is junk. SMS shorthand lingo further blurs the line between legitimate messages and spam
Step 1 collecting data

data adapted from the SMS Spam Collection at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/. dataset includes the text of SMS messages along with a label indicating whether the message is unwanted notable characteristics two of the three spam messages use the word "free," yet the word does not appear in any of the ham messages. two of the ham messages cite specific days of the week, as compared to zero in spam messages

Our Naive Bayes classifier will take advantage of word frequency to determine whether the SMS messages seem to better fit the profile of spam or ham - The classifier will compute the probability of spam and ham, given the evidence provided by all the words in the message. Step 2 exploring and preparing the data bag-of-words - ignores word order and simply provides a variable indicating whether the word appears at all

download the sms_spam.csv file from the Packt website > sms_raw <- read.csv("sms_spam.csv", stringsAsFactors = FALSE) Converting type vector to a factor Since the type element is a categorical variable, it would be better to convert it into a factor > sms_raw$type <- factor(sms_raw$type) Data preparation cleaning and standardizing text data

One needs to consider how to remove numbers and punctuation; handle uninteresting words such as and, but, and or; and break apart sentences into individual words
text mining package titled tm
originally created by Ingo Feinerer as a dissertation project at the Vienna University of Economics and Business
> install.packages("tm")

> library(tm)
creating a corpus
Corpus - a collection of text documents
In our case, the corpus will be a collection of SMS messages
VCorpus() function - refers to a volatile corpus
requires us to specify the source of documents for the corpus, which could be from a computer's filesystem, a database, the Web, or elsewhere
we'll use the VectorSource() reader function to create a source object from the existing sms_raw$text vector
> sms_corpus <- VCorpus(VectorSource(sms_raw$text))
PCorpus() function - used to access a permanent corpus stored in a database
By specifying an optional readerControl parameter, the VCorpus() function offers functionality to import text from sources such as PDFs and Microsoft Word files
Inspect sms_corpus
view the actual message text

Single document:
> as.character(sms_corpus[[1]])
[1] "Hope you are having a good week. Just checking in"
Multiple documents:
> lapply(sms_corpus[1:2], as.character)
$`1`
[1] "Hope you are having a good week. Just checking in"
$`2`
[1] "K..give back my thanks."

clean the text
tm_map() function provides a method to apply a transformation (also known as mapping) to a tm corpus
standardize the messages to use only lowercase characters
> sms_corpus_clean <- tm_map(sms_corpus, content_transformer(tolower))
> as.character(sms_corpus[[1]])
[1] "Hope you are having a good week. Just checking in"
> as.character(sms_corpus_clean[[1]])
[1] "hope you are having a good week. just checking in"

removing numbers from the SMS messages
> sms_corpus_clean <- tm_map(sms_corpus_clean, removeNumbers)
removeNumbers() is built into tm along with several other mapping functions that do not need to be wrapped in content_transformer()
To see the other built-in transformations, simply type getTransformations()
stop words - filler words such as to, and, but, and or
typically removed prior to text mining

stopwords() function provided by the tm package. access various sets of stop words, across several languages. By default, common English language stop words are used. To see the default list, type stopwords() at the command line.

To see the other languages and options available, type ?stopwords for the documentation page remove stop words & punctuation removeWords() function > sms_corpus_clean <- tm_map(sms_corpus_clean, removeWords, stopwords()) eliminate any punctuation from the text messages built-in removePunctuation() transformation > sms_corpus_clean <- tm_map(sms_corpus_clean, removePunctuation)

> removePunctuation("hello...world")
[1] "helloworld"
An alternative that replaces punctuation characters with a blank space:
> replacePunctuation <- function(x) { gsub("[[:punct:]]+", " ", x) }
this uses R's gsub() function to substitute any punctuation characters in x with a blank space
Stemming - reducing words to their root form
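The difference matters for tokenization: removePunctuation() fuses "hello...world" into a single token, while the custom function keeps two words. A self-contained check of the gsub() version:

```r
replacePunctuation <- function(x) {
  gsub("[[:punct:]]+", " ", x)  # replace each run of punctuation with one space
}
replacePunctuation("hello...world")   # "hello world"
```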

E.g., learned, learning, and learns -> learn SnowballC package install.packages("SnowballC") SnowballC package maintained by Milan Bouchet-Valat provides an R interface to the C-based libstemmer library, which is based on M.F. Porter's "Snowball" word stemming algorithm, a widely used open source stemming method.

see http://snowball.tartarus.org wordStem() function > library(SnowballC) > wordStem(c("learn", "learned", "learning", "learns")) [1] "learn" "learn" "learn" "learn" stemDocument() transformation To apply the wordStem() function to an entire corpus of text documents > sms_corpus_clean <- tm_map(sms_corpus_clean, stemDocument) Remove additional whitespace, using the built-in stripWhitespace()

transformation > sms_corpus_clean <- tm_map(sms_corpus_clean, stripWhitespace) Data preparation splitting text documents into words Tokenization - split the messages into individual components A token is a single element of a text string DocumentTermMatrix() function - take a corpus and create a Document Term Matrix (DTM) rows indicate documents (SMS messages)

columns indicate terms (words) Term Document Matrix (TDM) - a transposed DTM rows are terms columns are documents a small portion of the DTM for the SMS corpus the complete matrix has 5,559 rows and over 7,000 columns

sparse matrix - the vast majority of the cells in the matrix are filled with zeros. although each message must contain at least one word, the probability of any one word appearing in a given message is small Create DTM sparse matrix > sms_dtm <- DocumentTermMatrix(sms_corpus_clean) Or, if we hadn't performed the preprocessing > sms_dtm2 <- DocumentTermMatrix(sms_corpus, control = list( tolower = TRUE, removeNumbers = TRUE,

stopwords = TRUE, removePunctuation = TRUE, stemming = TRUE )) comparing sms_dtm to sms_dtm2 slight difference in the number of terms in the matrix ordering of the preprocessing steps DocumentTermMatrix() function applies its cleanup functions to the

text strings only after they have been split apart into words it uses a slightly different stop words removal function To force the two prior document term matrices to be identical override the default stop words function with our own that uses the original replacement function. replace stopwords = TRUE with the following: stopwords = function(x) { removeWords(x, stopwords()) } an important principle of cleaning text data: the order of operations matters

Data preparation creating training and test datasets split the data into training and test datasets it is important that the split occurs after the data have been cleaned and processed divide the data into two portions: 75 percent for training - first 4,169 messages 25 percent for testing - remaining 1,390 messages

> sms_dtm_train <- sms_dtm[1:4169, ] > sms_dtm_test <- sms_dtm[4170:5559, ] labels labels for each of the rows in the training and testing matrices > sms_train_labels <- sms_raw[1:4169, ]$type > sms_test_labels <- sms_raw[4170:5559, ]$type To confirm that the subsets are representative of the complete set of SMS data
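A sketch of the representativeness check using prop.table(); the label vectors are simulated stand-ins here, since the real sms_train_labels and sms_test_labels come from the split above:

```r
# Simulated stand-ins for the label vectors produced by the train/test split
sms_train_labels <- factor(c(rep("ham", 87), rep("spam", 13)))
sms_test_labels  <- factor(c(rep("ham", 86), rep("spam", 14)))
# Similar spam/ham proportions in each subset suggest the split is representative
prop.table(table(sms_train_labels))
prop.table(table(sms_test_labels))
```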

Visualizing text data word clouds word cloud - a way to visually depict the frequency at which words appear in text data. The cloud is composed of words scattered somewhat randomly around the figure Words appearing more often in the text are shown in a larger font Less common terms are shown in smaller fonts wordcloud package written by Ian Fellows, his blog at http://blog.fellstat.com/?cat=11

install the package > install.packages("wordcloud") load the package > library(wordcloud) word cloud > wordcloud(sms_corpus_clean, min.freq = 50, random.order = FALSE)

comparing the clouds for SMS spam and ham create a subset where the message type is spam > spam <- subset(sms_raw, type == "spam") create a subset where the message type is ham > ham <- subset(sms_raw, type == "ham") Creating word clouds > wordcloud(spam$text, max.words = 40, scale = c(3, 0.5)) > wordcloud(ham$text, max.words = 40, scale = c(3, 0.5))

Given a vector of raw text strings, wordcloud() function will automatically apply common text preparation processes before displaying the cloud resulting word clouds which one is the spam cloud and which represents ham? Data preparation creating indicator features for frequent words transform the sparse matrix into a data structure that can be used to train a Naive Bayes classifier

To reduce the number of features, we will eliminate any word that appears in fewer than five SMS messages, or in fewer than about 0.1 percent of the records in the training data
findFreqTerms() function - takes a DTM and returns a character vector containing the words that appear at least the specified number of times
> findFreqTerms(sms_dtm_train, 5)
> sms_freq_words <- findFreqTerms(sms_dtm_train, 5)

> str(sms_freq_words)
chr [1:1136] "abiola" "abl" "abt" "accept" "access" "account" "across" "act" "activ" ...
filter our DTM
> sms_dtm_freq_train <- sms_dtm_train[ , sms_freq_words]
> sms_dtm_freq_test <- sms_dtm_test[ , sms_freq_words]
cells in the sparse matrix: numeric -> categorical variable
convert_counts() function - convert counts to Yes/No strings:

> convert_counts <- function(x) { x <- ifelse(x > 0, "Yes", "No") } apply convert_counts() to each of the columns apply() function - allows a function to be used on each of the rows or columns in a matrix uses a MARGIN parameter to specify either rows or columns MARGIN = 1 rows MARGIN = 2 columns

convert the training and test matrices > sms_train <- apply(sms_dtm_freq_train, MARGIN = 2, convert_counts) > sms_test <- apply(sms_dtm_freq_test, MARGIN = 2, convert_counts) Step 3 training a model on the data the e1071 package developed in the statistics department of the Vienna University of Technology (TU Wien) install.packages("e1071") and library(e1071)
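A toy illustration of convert_counts() with apply() on a small count matrix (the matrix and its values here are made up just for the demo):

```r
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}
counts <- matrix(c(0, 2,
                   1, 0), nrow = 2, byrow = TRUE)  # rows = messages, cols = words
# MARGIN = 2 applies convert_counts() column by column (i.e., per word)
indicators <- apply(counts, MARGIN = 2, convert_counts)
```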

Naive Bayes learner
> sms_classifier <- naiveBayes(sms_train, sms_train_labels)
Step 4 evaluating model performance
test the classifier on the unseen test messages
> sms_test_pred <- predict(sms_classifier, sms_test)

compare the predictions to the true values > library(gmodels) > CrossTable(sms_test_pred, sms_test_labels, prop.chisq = FALSE, prop.t = FALSE, dnn = c('predicted', 'actual')) The table The performance 6 + 30 = 36 of the 1,390 SMS messages were incorrectly classified (2.6 percent).

6 out of 1,207 ham messages were misidentified as spam 30 of the 183 spam messages were incorrectly labeled as ham six legitimate messages incorrectly classified as spam could cause significant problems - the filter could cause a person to miss an important text message Step 5 improving model performance set laplace = 1

> sms_classifier2 <- naiveBayes(sms_train, sms_train_labels, laplace = 1) > sms_test_pred2 <- predict(sms_classifier2, sms_test) > CrossTable(sms_test_pred2, sms_test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE, dnn = c('predicted', 'actual')) The results The End of Chapter 4