A Total Beginner's Guide to Web Scraping Football Data – Part 2

This mini-tutorial series aims to give the total beginner a few pointers in the direction of scraping football data from the web.

Part 1 : Introduction & Scraping a List of Web Links of Clubs
Part 2 : Loop Through the Web Links and Scrape a List of Web Links of Players
Part 3 : Loop Through the Player Links and Collect the Name and TM Value of the Players [up soon]
Part 4 : Using ggplot2 to Create a Density Graph of Players Values within the EPL [up soon]

DISCLAIMER: The data we collect from transfermarkt.com should be used for private purposes only.

At the end of Part 1 we finished with a list of URLs of Team Overview Pages for the EPL 2016/17 Season. Like so…

[screenshot]

In Part 2 we are going to create a ‘loop’ which will go through each of our URLs and extract information from each of them. The information we will be extracting is the URL address of every player within each squad, leaving us with a list of 516 EPL players’ URL addresses for their personal overview pages, from which we will extract further data in Part 3. Loops are our friends and are the key to scraping large amounts of data from a multitude of webpages whilst we sit back and watch. Loops are also very easy to set up!

Step 1 : Set Up the Empty Data Frame to Store Data

As we loop through the webpages in our list we want to store the information in a safe place. Within R, data can be stored in data frames. To catch the data generated by our loop we will create an empty data frame that we fill up as the loop runs.

library(rvest)

URL <- "http://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1"

WS <- read_html(URL)

URLs <- WS %>% html_nodes(".hide-for-pad .vereinprofil_tooltip") %>% html_attr("href") %>% as.character()

URLs <- paste0("http://www.transfermarkt.com",URLs)

Catcher1 <- data.frame(Player=character(),P_URL=character())

Let me break down the new line:

Catcher1 <-
This segment means that everything to the right will be saved as Catcher1

data.frame()
This function creates a data frame

Player=character()
We create a string variable within our data frame called Player. This can be named anything, but Player makes sense on this occasion.

P_URL=character()
We create a string variable within our data frame called P_URL, which will hold the web addresses of the players within each squad.

Select the new line of code and run it. In the console, enter print(Catcher1). The <0 rows> message confirms that the data frame is empty and waiting to be filled.

[screenshot]
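For reference, printing an empty data frame produces output along these lines (exact spacing may vary slightly):

> print(Catcher1)
[1] Player P_URL
<0 rows> (or 0-length row.names)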

Step 2 : Create the Overall Structure of the ‘For’ Loop

There are different types of loops but we are going to use a ‘for’ loop. Simply put: ‘for’ each item in this list, do this _____.
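Before we apply one to our scraper, here is a minimal toy example of a ‘for’ loop, just to show the shape of the construct (the club names are arbitrary examples):

for (i in c("Arsenal","Chelsea","Everton")) {
  print(i)   # the body runs once per item, printing each name in turn
}

With that shape in mind, here is the structure added to the bottom of our existing script: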

library(rvest)

URL <- "http://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1"

WS <- read_html(URL)

URLs <- WS %>% html_nodes(".hide-for-pad .vereinprofil_tooltip") %>% html_attr("href") %>% as.character()

URLs <- paste0("http://www.transfermarkt.com",URLs)

Catcher1 <- data.frame(Player=character(),P_URL=character())
for (i in URLs) {


#### put the code we want to run for each webpage in the list here 


}

Let me break down the new code:

for (i in URLs)
This simply means for each item, or i, in our list called URLs

{
Signifies the opening of the block of code that we want to repeat

}
Signifies the closing of the block of code that we want to repeat. Everything between the { and the } will be run for each item within our list URLs – in our case, for each Club Overview webpage.

Step 3 : Filling our Loop with Instructions

We want our code to go through each webpage and extract the URL for each player’s overview page. I will introduce each line of the code within the loop, but we will only run the code once we have written it all up.

Code Line 1

As in Part 1, we want to ‘read’ a webpage into R in order to access the elements we want to scrape.

for (i in URLs) {

WS1 <- read_html(i)

}

WS1 <- read_html(i)
We therefore assign WS1 (web scrape 1, though it can be named anything) the information from each webpage. The URL changes each time R loops through the code because we pass ‘i’ into the () of read_html.

Code Line 2

We want to scrape each player’s name from the club’s overview page. To do this, we first open up one of the club overview pages – let’s open up Manchester United’s page. We need to follow the same process as in Part 1, using SelectorGadget to isolate the element we want to extract – in this case the player names of the squad. After selecting the element I want and then deselecting unwanted elements I am left with 54 items – this is more than the number of players in the squad, but I can’t see any other element on the page still highlighted in yellow. This could be a problem. However, as I can’t see anything else highlighted I am going to continue to scrape the data and see what comes out at the end. Importantly, this process has given me the unique CSS identifier for the element I want: “#yw1 .spielprofil_tooltip”.

[screenshot]

for (i in URLs) {

WS1 <- read_html(i)
Player <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_text() %>% as.character()

}

Player <-
This segment means that everything to the right will be saved as a variable called Player

WS1
Is simply selecting our read webpage

html_nodes(“#yw1 .spielprofil_tooltip”)
Selects all elements that match the unique CSS identifier that we ascertained from our work with SelectorGadget.

html_text()
Unlike in Part 1 where we took information from an html attribute, here we simply want the displayed text. html_text() does exactly this.

as.character()
Saves the selected items as a string

%>%
As explained in Part 1, this is called piping and is used to link bits of code together and pass information from one bit of code to the next. It makes coding in R much quicker and more logical.
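If piping is still unclear, here is a tiny sketch showing that the pipe simply feeds the result of one function into the next (rvest loads the %>% operator for us):

sort(c(3,1,2))        # the nested way: returns 1 2 3
c(3,1,2) %>% sort()   # the piped way: exactly the same result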

Code Line 3

We now want to find the web address for each player. Luckily this is found via the same CSS identifier, but in the attribute information. This is an identical process to Part 1, just with a different unique CSS identifier.

for (i in URLs) {

WS1 <- read_html(i)
Player <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_text() %>% as.character()
P_URL <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_attr("href") %>% as.character()

}

P_URL <-
This segment means that everything to the right will be saved as P_URL (short for Player URL; once again, this could be named anything).

WS1
Is simply selecting our read webpage

html_nodes(“#yw1 .spielprofil_tooltip”)
Selects all elements that match the unique CSS identifier that we ascertained from our work with SelectorGadget.

html_attr(“href”)
This selects an attribute of the matched elements and we have selected the “href” attribute which is the link address.

as.character()
Saves the selected items as a string

%>%
This is called piping and is used to link bits of code together and pass information from one bit of code to the next. It makes coding in R much quicker and more logical.
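To see the difference between html_text() and html_attr() without hitting the live site, you can feed read_html() a literal snippet of HTML (this markup is invented for illustration):

snippet <- read_html('<a class="spielprofil_tooltip" href="/some-player/profil/spieler/123">Some Player</a>')
snippet %>% html_nodes("a") %>% html_text()         # returns "Some Player"
snippet %>% html_nodes("a") %>% html_attr("href")   # returns "/some-player/profil/spieler/123"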

Code Line 4

Once the code has collected the data, we want to store it within the empty dataframe we created called Catcher1. The first step is to create a simple temporary dataframe which we will use to quickly store the data for the team we just scraped. We call it ‘temp’. It will consist of the ‘Player’ variable and the ‘P_URL’ variable.

for (i in URLs) {

WS1 <- read_html(i)
Player <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_text() %>% as.character()
P_URL <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_attr("href") %>% as.character()
temp <- data.frame(Player,P_URL)
}

temp <-
This segment means that everything to the right will be saved as temp

data.frame(Player,P_URL)
This creates a data frame with the columns Player and P_URL
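To see what temp will look like for one club, here is a toy version built from two made-up players (the names and paths are invented for illustration):

Player <- c("Player One","Player Two")                                      # hypothetical names
P_URL  <- c("/player-one/profil/spieler/1","/player-two/profil/spieler/2")  # hypothetical paths
temp   <- data.frame(Player,P_URL)
print(temp)   # a two-row data frame with columns Player and P_URL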

Code Line 5

Now we are in the great position of having the data from the club’s page in a dataframe, but unless we save that data to our Catcher1 dataframe, as the code loops through to the next club it will simply overwrite the ‘temp’ dataframe and we will be left with only the data from the last club in the loop. We are going to use the ‘rbind’ function, which sticks the rows of dataframes together with ease (as long as the columns are the same in name and number).

for (i in URLs) {

WS1 <- read_html(i)
Player <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_text() %>% as.character()
P_URL <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_attr("href") %>% as.character()
temp <- data.frame(Player,P_URL)
Catcher1 <- rbind(Catcher1,temp)
}

Catcher1 <-
This segment means that everything to the right will be saved as Catcher1, but because we already have a dataframe called Catcher1 it will simply overwrite it with the new values.

rbind(Catcher1,temp)
This takes both dataframes we want to paste together and joins them effortlessly. This has essentially filled up Catcher1 with the information from the scrape and stored it safely outside of the loop. Fantastic progress!
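As a quick sketch of how rbind() accumulates rows across loop iterations:

caught <- data.frame(x=numeric())
for (i in 1:3) {
  temp   <- data.frame(x=i)
  caught <- rbind(caught,temp)   # caught grows by one row each pass
}
print(caught)   # three rows: 1, 2, 3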

Code Line 6

When we run web scraping code it can take a long time to complete the process and return the results, and as you wait you have no idea whether the code is running properly or not. It would therefore be useful to add some sort of indication of progress; we do this with the cat() function.

for (i in URLs) {

WS1 <- read_html(i)
Player <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_text() %>% as.character()
P_URL <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_attr("href") %>% as.character()
temp <- data.frame(Player,P_URL)
Catcher1 <- rbind(Catcher1,temp)
cat("*")
}

cat(“*”)
When the looping code reaches this line it will print * to the console, giving us an extremely basic indication that the code is running and pages are being scraped. You can replace the * with anything you want printed to the console: “Processed”, “Grabbed”, “Get in!”
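If you want something chattier than a row of stars, cat() will happily take several arguments plus a newline, so you could, for example, swap the cat("*") line inside the loop for:

cat("Scraped:", i, "\n")   # prints the club URL after each pass through the loop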

Step 4 : Run the Code and Hit a Wall

Select all of the new code we wrote today and press run. After a few minutes (depending on your internet connection) the process will complete and you will be left with this in the console.

[screenshot]

Now let’s check our results by entering head(Catcher1) into the console; the head() function prints just the top 6 rows of our dataframe for us to look at, without having to print all 1032 rows! Which leaves us with this:

[screenshot]

Ah, we have a problem!! The scrape has collected each player in the EPL twice: once with his full name and once with his first name abbreviated to an initial. Damn… now we have to find a solution to fix the problem… *scratches head* … OK, I want to keep the rows with the full names, and these happen to all be on odd-numbered rows. Unfortunately I have no idea how to select the odd rows within this dataframe. I have hit a wall in my knowledge. Let’s Google for an answer…

[screenshot]

The top result looks great! I also notice that it links to stackoverflow.com, which is the ultimate place for finding solutions when you have hit a wall. It’s a fantastic resource where you can ask questions, post your code and get answers. However, you need to follow the rules to get people to help you!

  1. The problem you are experiencing has most likely been experienced before and asked about on Stack Overflow. Therefore, before asking a question, search hard for the answer on the site.
  2. Make sure you post a question that follows this guide
  3. If someone answers your question, upvote it and mark it as the accepted answer. Good manners would also be to use the comments section to thank them afterwards.

The page we found via Google is exactly what we want! Thank god for the internet!

[screenshot]

As it’s the question I want answered, I will scroll down and see if anyone has posted an accepted answer:

[screenshot]

Bingo!!!! Mali Remorker, you legend! Even though I don’t fully understand the code, we will try to get it working within our script, and by doing so we might begin to understand it well enough to reuse it on another project without having to Google it again. I have found so many solutions on Stack Overflow and it remains the biggest influence on my ability to learn R.

Step 5 : Breaking Through the Wall

odd_indexes<-seq(1,41,2) … this is the solution, but we need to understand it a little before we can successfully implement it in our code. It uses the function seq(), which is short for ‘sequence’. To better understand the function I did a bit of Googling and also used RStudio’s built-in help() function … enter help(seq) into the console. This is what I learnt:

seq(starting number, ending number, gap between the numbers)

So for our example we want to use:

seq(1,1032,2)

Starting Number = 1, our first odd row in our dataset

Ending Number = 1032, this is the number of rows in our Catcher1 dataframe

Gap Between the Numbers = 2, as odd numbers are spaced 2 apart.
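You can sanity-check seq() straight in the console with a small example:

seq(1, 9, 2)   # returns 1 3 5 7 9 – the odd numbers from 1 to 9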

Therefore the code we want to use is…

odd_indexes<-seq(1,1032,2)

This will give us a list of odd numbers from 1 to 1032, but let’s pause. What if we wanted to use the same code to scrape La Liga and there were 1052 rows in our Catcher1 dataframe? This would result in 10 players being lost from the dataset, and we might not even realise the error. A better way of future-proofing our code is therefore to use a function to find the number of rows in our Catcher1 data frame rather than hard-coding the exact number 1032. Another trip to Stack Overflow got me just what I was looking for: the nrow() function!

no.of.rows <- nrow(Catcher1)

no.of.rows <-
This segment means that everything to the right will be saved as the variable no.of.rows

nrow(Catcher1)
Returns the number of rows within the Catcher1 dataframe
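Again, this is easy to verify in the console with a throwaway data frame:

nrow(data.frame(x=1:5))   # returns 5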

We can now improve our code by replacing the line

odd_indexes<-seq(1,1032,2)

with

no.of.rows <- nrow(Catcher1)
odd_indexes<-seq(1,no.of.rows,2)

Great progress – we now have a sequence of odd numbers covering the length of our Catcher1 dataframe. Now we want to select these rows and drop the even-numbered rows so there are no duplicated players within Catcher1.

no.of.rows <- nrow(Catcher1)
odd_indexes<-seq(1,no.of.rows,2)
Catcher1 <- data.frame(Catcher1[odd_indexes,])

Catcher1 <-
This segment means that everything to the right will be saved as Catcher1, but because we already have a dataframe called Catcher1 it will simply overwrite it with the new values.

data.frame()
Creates a dataframe

Catcher1[odd_indexes,]
Selects only the odd-numbered rows (using the numbers we created in our odd_indexes list).
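Here is the same trick on a tiny data frame, so you can see the row selection in isolation:

df <- data.frame(x=1:4, y=c("keep","drop","keep","drop"))
df[seq(1,nrow(df),2),]   # keeps rows 1 and 3 only

(As an aside, if the duplicate rows always share the same URL, Catcher1[!duplicated(Catcher1$P_URL),] would be another way to de-duplicate, but the odd-row method works fine here.)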

Make sure your overall code matches the below, then select everything from “for (i …” downwards and run it.

library(rvest)

URL <- "http://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1"

WS <- read_html(URL)

URLs <- WS %>% html_nodes(".hide-for-pad .vereinprofil_tooltip") %>% html_attr("href") %>% as.character()
URLs <- paste0("http://www.transfermarkt.com",URLs)

Catcher1 <- data.frame(Player=character(),P_URL=character())

for (i in URLs) {
 
 WS1 <- read_html(i)
 Player <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_text() %>% as.character()
 P_URL <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_attr("href") %>% as.character()
 temp <- data.frame(Player,P_URL)
 Catcher1 <- rbind(Catcher1,temp)
 cat("*")
}

no.of.rows <- nrow(Catcher1)
odd_indexes<-seq(1,no.of.rows,2)
Catcher1 <- data.frame(Catcher1[odd_indexes,])

The code should run smoothly; once it has finished, enter print(Catcher1) into the console, which should leave you with the following:

[screenshot]

 

Step 6 : The Sharp-Eyed Fix

The sharp-eyed among you will notice the selected links are missing “http://www.transfermarkt.com” and will therefore not work when we come to use them in Part 3. We rectify this using the paste0() function, which we can use to stick strings together.

library(rvest)

URL <- "http://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1"

WS <- read_html(URL)

URLs <- WS %>% html_nodes(".hide-for-pad .vereinprofil_tooltip") %>% html_attr("href") %>% as.character()
URLs <- paste0("http://www.transfermarkt.com",URLs)

Catcher1 <- data.frame(Player=character(),P_URL=character())

for (i in URLs) {
 
 WS1 <- read_html(i)
 Player <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_text() %>% as.character()
 P_URL <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_attr("href") %>% as.character()
 temp <- data.frame(Player,P_URL)
 Catcher1 <- rbind(Catcher1,temp)
 cat("*")
}

no.of.rows <- nrow(Catcher1)
odd_indexes<-seq(1,no.of.rows,2)
Catcher1 <- data.frame(Catcher1[odd_indexes,])

Catcher1$P_URL <- paste0("http://www.transfermarkt.com",Catcher1$P_URL)

Let me break down the new line:

Catcher1$P_URL <-
Catcher1 is selected, and then using the $ sign we select a specific column, in this case P_URL. Once again this segment means that everything to the right will be saved as P_URL within our Catcher1 dataframe. In fact, because we already have values for the URLs, this will overwrite our previous values.

paste0("http://www.transfermarkt.com",Catcher1$P_URL)
This will ‘paste’ together the string between the “” and the value in each row of the P_URL column within Catcher1.
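A one-liner in the console shows the effect (the path here is made up for illustration):

paste0("http://www.transfermarkt.com", "/some-player/profil/spieler/123")
# returns "http://www.transfermarkt.com/some-player/profil/spieler/123"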

Select our new line and press run; once the code has run, enter head(Catcher1) into the console. You should be left with the below:

[screenshot]

Fantastic!!! A summary of what we have achieved….

  1. Entered the webpage of a league from transfermarkt
  2. Got R to extract all of the links to the webpages of the clubs
  3. Got R to loop through all of the clubs and extract all of the links to the webpages of the players

We are perfectly set up for Part 3, where we will write a loop to get R to go through each of the player pages and extract transfermarkt’s ‘Market Value’.

 
