Quickly Clean & Visualize Data from Moz’s OSE Backlink Exports using R [Video]

You think it's good?

Data Munging & Cleaning Sucks!

Make it faster with R!

R is a free software programming language and a software environment for statistical computing and graphics. It used by statisticians, data miners, and anyone who does (or wants to do) serious data analysis. You can do some very heavy linear or nonlinear modeling, statistical tests, all kinds of clustering and so much more.

I started learning R last year when I took a Data Analysis class on Coursera. Obviously I’m still learning it and just now getting into the real good parts but figured I share some cool uses that might attract some new people to the language.

Use case: Cleaning, organizing, and visualizing backlinks from Moz’s Open Site Explorer. Here is what I want to do:

    • I want to clean up the headers.
    • I want to quickly see the distribution of Domain Authority & Page Authority.
    • I want to see any quick graphs to look for correlation.
    • I want to quickly see the distribution and density of the different sub-domains.
    • I want to be able to save the clean version and open it up in excel.

NOTE: You can do the same things in Excel but I find it faster in R and I know how to write R better then I know how to write VBA. Also, I believe, if you are in the data analysis world you should try and learn and data analysis programming language and this could hopefully help you get interested in R.

Download and Install R here then install R Studio, which is a much more intuitive GUI.

In the video below I go explain the whole process, explain some of the code, and show the graphs.

The code break down:

Import the data and load necessary packages:


library(ggplot2)

backlinks <- read.csv("/backlinks.csv")

I like having the root domains for the url’s that the links are on and root domains of where they’re linking to. For example take www.domain.com/category/123 and add it to a new column as domain.com.


cleanurl <- backlinks$URL
cleantarget <- backlinks$Target.URL
backlinks$cleanURL <- sub("^http://(?:www[.])?([^/]*).*$", "\\1", cleanurl)
backlinks$cleantarget <- sub("^http://(?:www[.])?([^/]*).*$", "\\1", cleantarget)

I’m not crazy about the headers in the export. I like my data (and headers) as uniform and concise as possible.


#clean up dots, lowercase, and shorten names
names(backlinks) <- tolower(names(backlinks))
names(backlinks) <- gsub("\\.","",names(backlinks))
names(backlinks)[[5]] <- "da"
names(backlinks)[[4]] <- "pa"
names(backlinks)[[6]] <- "links"
names(backlinks)[[7]] <- "lrds"

I love love love looking at distribution & histograms and think it is not used enough in internet marketing. So I want to see how spread out the links are.

#find the ranges of da
daRanges <- cut(backlinks$da,seq(0,100,by=10))
paRanges <- cut(backlinks$pa,seq(0,100,by=10))

#add those ranges
backlinks$daranges <- daRanges
backlinks$paranges <- paRanges
#creating tables for future reference
daFactorTable <- table(daRanges,useNA="ifany")
paFactorTable <- table(paRanges,useNA="ifany")

Lets take a look at the ranges

daFactorTable

Also lets check out how many subdomains they have

table(backlinks$cleantarget,useNA="ifany")


Now for the fun stuff – Graphs!

#BAR GRAPH

qplot(factor(backlinks$cleantarget), data=backlinks,geom="bar",fill=backlinks$followable)
#SCATTER PLOTS
#follow vs non-follow
qplot(backlinks$pa, backlinks$da, data = backlinks,colour = backlinks$followable)
#follow vs non-follow + jitter
qplot(backlinks$pa, backlinks$da, data = backlinks,colour = backlinks$followable, geom="jitter")
# clean targets + jitter
qplot(backlinks$pa, backlinks$da, data = backlinks,colour = backlinks$cleantarget, geom="jitter")

#HISTOGRAMS
qplot(backlinks$da, data = backlinks, geom = "histogram",binwidth = 10)
#follow vs non-follow
qplot(backlinks$da, data = backlinks, fill=backlinks$followable, geom = "histogram",binwidth = 10)
#clean targets
qplot(backlinks$da, data = backlinks, fill=backlinks$cleantarget, geom = "histogram",binwidth = 10)

#DENSITY PLOTS
qplot(backlinks$da, data = backlinks, geom = "density")
#follow vs non-follow
qplot(backlinks$da, data = backlinks, geom = "density", color=backlinks$followable)
#clean targets
qplot(backlinks$da, data = backlinks, geom = "density", color=backlinks$cleantarget)

Updated 11/30 Here are some functions that you can load into R

mozCleanUrls = function(x){
 cleanurl <- x$URL
 cleantarget <- x$Target.URL
 x$cleanURL <- sub("^http://(?:www[.])?([^/]*).*$", "\\1", cleanurl)
 x$cleantarget <- sub("^http://(?:www[.])?([^/]*).*$", "\\1", cleantarget)
 return(x)
}

mozHeadClean = function(x){
 names(x) <- tolower(names(x))
 names(x) <- gsub("\\.","",names(x))
 names(x)[[5]] <- "da"
 names(x)[[4]] <- "pa"
 names(x)[[7]] <- "lrds"
 names(x)[[6]] <- "links"
 return(x)
}

dapaRanges = function(x){
 daRanges <- cut(x$da,seq(0,100,by=10))
 paRanges <- cut(x$pa,seq(0,100,by=10))
 x$daranges <- daRanges
 x$paranges <- paRanges
 daFactorTable <- table(daRanges,useNA="ifany")
 paFactorTable <- table(paRanges,useNA="ifany")
 return(x)
}

 

Here is all the code together and uncommented

backlinks <-read.csv('./backlinks_for_R.csv')
RAWbacklinks <- backlinks
cleanurl <- backlinks$URL
cleantarget <- backlinks$Target.URL
backlinks$cleanURL <- sub("^http://(?:www[.])?([^/]*).*$", "\\1", cleanurl)
backlinks$cleantarget <- sub("^http://(?:www[.])?([^/]*).*$", "\\1", cleantarget)
names(backlinks) <- tolower(names(backlinks))
names(backlinks) <- gsub("\\.","",names(backlinks))
names(backlinks)[[5]] <- "da"
names(backlinks)[[4]] <- "pa"
names(backlinks)[[6]] <- "links"
names(backlinks)[[7]] <- "lrds"
daRanges <- cut(backlinks$da,seq(0,100,by=10))
paRanges <- cut(backlinks$pa,seq(0,100,by=10))
backlinks$daranges <- daRanges
backlinks$paranges <- paRanges
daFactorTable <- table(daRanges,useNA="ifany")
paFactorTable <- table(paRanges,useNA="ifany")
qplot(factor(backlinks$cleantarget), data=backlinks,geom="bar",fill=backlinks$followable)
qplot(backlinks$pa, backlinks$da, data = backlinks,colour = backlinks$followable)
qplot(backlinks$pa, backlinks$da, data = backlinks,colour = backlinks$followable, geom="jitter")
qplot(backlinks$pa, backlinks$da, data = backlinks,colour = backlinks$cleantarget, geom="jitter")
qplot(backlinks$da, data = backlinks, geom = "histogram",binwidth = 10)
qplot(backlinks$da, data = backlinks, fill=backlinks$followable, geom = "histogram",binwidth = 10)
qplot(backlinks$da, data = backlinks, fill=backlinks$cleantarget, geom = "histogram",binwidth = 10)
qplot(backlinks$da, data = backlinks, geom = "density")
qplot(backlinks$da, data = backlinks, geom = "density", color=backlinks$followable)
qplot(backlinks$da, data = backlinks, geom = "density", color=backlinks$cleantarget)

You can also use this for ahrefs and majestic data you just need to rewrite some of the formatting. I am working on a post of combing all and will update this post when I do.

Let me know if you have any questions in the comments below.

Are you using R in any cool ways for simple data munging? If so, add it to the comments.

Hi, thanks for reading! If you liked this post and saw value in it please consider sharing it. There are share buttons at the top.
Loading Facebook Comments ...