R is a free software programming language and a software environment for statistical computing and graphics. It used by statisticians, data miners, and anyone who does (or wants to do) serious data analysis. You can do some very heavy linear or nonlinear modeling, statistical tests, all kinds of clustering and so much more.
I started learning R last year when I took a Data Analysis class on Coursera. Obviously I’m still learning it and just now getting into the real good parts but figured I share some cool uses that might attract some new people to the language.
NOTE: You can do the same things in Excel but I find it faster in R and I know how to write R better then I know how to write VBA. Also, I believe, if you are in the data analysis world you should try and learn and data analysis programming language and this could hopefully help you get interested in R.
Download and Install R here then install R Studio, which is a much more intuitive GUI.
Import the data and load necessary packages:
library(ggplot2) backlinks <- read.csv("/backlinks.csv")
I like having the root domains for the url’s that the links are on and root domains of where they’re linking to. For example take www.domain.com/category/123 and add it to a new column as domain.com.
cleanurl <- backlinks$URL cleantarget <- backlinks$Target.URL backlinks$cleanURL <- sub("^http://(?:www[.])?([^/]*).*$", "\\1", cleanurl) backlinks$cleantarget <- sub("^http://(?:www[.])?([^/]*).*$", "\\1", cleantarget)
I’m not crazy about the headers in the export. I like my data (and headers) as uniform and concise as possible.
#clean up dots, lowercase, and shorten names names(backlinks) <- tolower(names(backlinks)) names(backlinks) <- gsub("\\.","",names(backlinks)) names(backlinks)[[5]] <- "da" names(backlinks)[[4]] <- "pa" names(backlinks)[[6]] <- "links" names(backlinks)[[7]] <- "lrds"
I love love love looking at distribution & histograms and think it is not used enough in internet marketing. So I want to see how spread out the links are.
#find the ranges of da daRanges <- cut(backlinks$da,seq(0,100,by=10)) paRanges <- cut(backlinks$pa,seq(0,100,by=10)) #add those ranges backlinks$daranges <- daRanges backlinks$paranges <- paRanges #creating tables for future reference daFactorTable <- table(daRanges,useNA="ifany") paFactorTable <- table(paRanges,useNA="ifany")
Lets take a look at the ranges
daFactorTable
Also lets check out how many subdomains they have
table(backlinks$cleantarget,useNA="ifany")
Now for the fun stuff – Graphs!
#BAR GRAPH qplot(factor(backlinks$cleantarget), data=backlinks,geom="bar",fill=backlinks$followable) #SCATTER PLOTS #follow vs non-follow qplot(backlinks$pa, backlinks$da, data = backlinks,colour = backlinks$followable) #follow vs non-follow + jitter qplot(backlinks$pa, backlinks$da, data = backlinks,colour = backlinks$followable, geom="jitter") # clean targets + jitter qplot(backlinks$pa, backlinks$da, data = backlinks,colour = backlinks$cleantarget, geom="jitter") #HISTOGRAMS qplot(backlinks$da, data = backlinks, geom = "histogram",binwidth = 10) #follow vs non-follow qplot(backlinks$da, data = backlinks, fill=backlinks$followable, geom = "histogram",binwidth = 10) #clean targets qplot(backlinks$da, data = backlinks, fill=backlinks$cleantarget, geom = "histogram",binwidth = 10) #DENSITY PLOTS qplot(backlinks$da, data = backlinks, geom = "density") #follow vs non-follow qplot(backlinks$da, data = backlinks, geom = "density", color=backlinks$followable) #clean targets qplot(backlinks$da, data = backlinks, geom = "density", color=backlinks$cleantarget)
mozCleanUrls = function(x){ cleanurl <- x$URL cleantarget <- x$Target.URL x$cleanURL <- sub("^http://(?:www[.])?([^/]*).*$", "\\1", cleanurl) x$cleantarget <- sub("^http://(?:www[.])?([^/]*).*$", "\\1", cleantarget) return(x) } mozHeadClean = function(x){ names(x) <- tolower(names(x)) names(x) <- gsub("\\.","",names(x)) names(x)[[5]] <- "da" names(x)[[4]] <- "pa" names(x)[[7]] <- "lrds" names(x)[[6]] <- "links" return(x) } dapaRanges = function(x){ daRanges <- cut(x$da,seq(0,100,by=10)) paRanges <- cut(x$pa,seq(0,100,by=10)) x$daranges <- daRanges x$paranges <- paRanges daFactorTable <- table(daRanges,useNA="ifany") paFactorTable <- table(paRanges,useNA="ifany") return(x) }
backlinks <-read.csv('./backlinks_for_R.csv') RAWbacklinks <- backlinks cleanurl <- backlinks$URL cleantarget <- backlinks$Target.URL backlinks$cleanURL <- sub("^http://(?:www[.])?([^/]*).*$", "\\1", cleanurl) backlinks$cleantarget <- sub("^http://(?:www[.])?([^/]*).*$", "\\1", cleantarget) names(backlinks) <- tolower(names(backlinks)) names(backlinks) <- gsub("\\.","",names(backlinks)) names(backlinks)[[5]] <- "da" names(backlinks)[[4]] <- "pa" names(backlinks)[[6]] <- "links" names(backlinks)[[7]] <- "lrds" daRanges <- cut(backlinks$da,seq(0,100,by=10)) paRanges <- cut(backlinks$pa,seq(0,100,by=10)) backlinks$daranges <- daRanges backlinks$paranges <- paRanges daFactorTable <- table(daRanges,useNA="ifany") paFactorTable <- table(paRanges,useNA="ifany") qplot(factor(backlinks$cleantarget), data=backlinks,geom="bar",fill=backlinks$followable) qplot(backlinks$pa, backlinks$da, data = backlinks,colour = backlinks$followable) qplot(backlinks$pa, backlinks$da, data = backlinks,colour = backlinks$followable, geom="jitter") qplot(backlinks$pa, backlinks$da, data = backlinks,colour = backlinks$cleantarget, geom="jitter") qplot(backlinks$da, data = backlinks, geom = "histogram",binwidth = 10) qplot(backlinks$da, data = backlinks, fill=backlinks$followable, geom = "histogram",binwidth = 10) qplot(backlinks$da, data = backlinks, fill=backlinks$cleantarget, geom = "histogram",binwidth = 10) qplot(backlinks$da, data = backlinks, geom = "density") qplot(backlinks$da, data = backlinks, geom = "density", color=backlinks$followable) qplot(backlinks$da, data = backlinks, geom = "density", color=backlinks$cleantarget)
Let me know if you have any questions in the comments below.
Are you using R in any cool ways for simple data munging? If so, add it to the comments.