In this Part 1 I will share how I built a web scraper using Python and Selenium to obtain all the data that I needed without having to manually click around and copy/paste everything, which would probably have taken days.
In this series I will share my journey of building a model to predict the lowest grade needed to get into a given program at a given university in Sweden. It all started with me getting frustrated at the lack of functionality and visualization on the official website for admission statistics in Sweden (https://statistik.uhr.se/). The entire code can be found on my GitHub page.
To me there were several problems with the site:
- You had to click around like a maniac to view all the data and could not get an overview with all the stats you wanted.
- Everything was in tabular form, i.e. no graphs.
- You could only see the stats for one semester at a time.
“If you want a thing done well, do it yourself.” — Napoleon Bonaparte
To get anything useful out of the site, I realized I had to build a web scraper to extract all the data I needed and build a database from which I could do some analysis, visualization and hopefully (accurate) predictions.
I had absolutely no experience in web scraping prior to this, so I had no idea what I was getting myself into or where to start. As with almost every problem or question in today’s society, I visited everybody’s good friend Google, who directed me to BeautifulSoup. I began learning it mainly through YouTube and some articles/blog posts, and I must say it is a very effective and easy-to-use package. However, I rather quickly ran into a problem I could not solve with BeautifulSoup (if you indeed can use BS for this, please let me know in the comments). In order to extract all the data I wanted in one go, I had to click around and make choices with various filters to show the different tables the data was presented in, and these could not be accessed through the URL alone, since the URL stayed the same no matter what choices you made or where you clicked on the site.
To solve my problem I had to revisit my old friend Google, and this time he pointed me to Selenium. So it was back to (almost) square one with new YouTube tutorials and blog posts. This time around, however, I felt like I had come to the right place. With Selenium I could easily write code to manipulate drop-down lists, radio buttons, search fields etc. Now I could incrementally work my way forward, starting with the first page that came up after you hit “search” and then adding step after step.
As I mentioned, I started with the first page that came up after hitting “search”, from which I could extract how many applicants each program had. At first I built a semi-complex model to extract specific data for each row and column, which consisted of a lot of lines of code each tied to a certain XPath. Thankfully, however, there is Pandas. I later realized I could just as well use pandas, since it has a function to read and extract HTML marked with the “table” tag, which all my data had. So I scraped all the admissions data into a list of dataframes and went on to the next interesting category of data using the same methodology.
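The pandas function in question is `read_html`, which parses every `<table>` element it finds and returns a list of dataframes. A minimal sketch; the HTML snippet is a made-up stand-in for what `driver.page_source` returns on the real site:

```python
import io

import pandas as pd

# Hypothetical fragment standing in for the page source Selenium returns;
# the real tables have many more columns.
html = """
<table>
  <tr><th>Program</th><th>Applicants</th></tr>
  <tr><td>Computer Science</td><td>1200</td></tr>
  <tr><td>Economics</td><td>950</td></tr>
</table>
"""

# read_html finds every <table> tag and returns a list of DataFrames
tables = pd.read_html(io.StringIO(html))
df = tables[0]
print(df.shape)  # (2, 2)
```

In the scraper, the same call is made on `driver.page_source` after each click, and the resulting dataframes are appended to a list.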
Patience my friend, patience…
One problem I ran into rather quickly was something called “NoSuchElementException: Message: no such element: Unable to locate element”. Basically, this means that the code sometimes runs faster than the page it is trying to scrape. For instance, if the program is supposed to click the search button and then extract the result, but the results have not yet appeared when the code tries to access a certain XPath, tag name or id, it will not find said target and a NoSuchElementException will occur. To solve this you can tell the program either to wait and repeatedly retry for a certain number of seconds before raising an error (Implicit Waits), or to wait until a certain condition is met, e.g. an object is present (Explicit Waits). I chose to write a function using explicit waits where the condition was the presence of both a new table (with a new number) and the first row of data. I then used this function whenever the program clicked on an element and a new table was about to appear.
Page 2 and onwards
Another problem I had was getting the code to stop when I had reached the last page. I tried several solutions for this:
- Stop when the “next” button is not present. The problem: it always was.
- A loop with a range based on the number of page-number elements shown. The problem: whenever there were 5 or more pages, only 5 page numbers were shown.
Finally I realized that there were two pieces of information on the page that were constant and together determined the number of pages: (1) the total number of results and (2) the number of results per page. I could then write code that extracted this information and turned it into the variable I needed to loop through all the pages.
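Once those two numbers are extracted, the page count is a one-line calculation (a ceiling division, since a partially filled last page still counts as a page):

```python
import math


def num_pages(total_results: int, results_per_page: int) -> int:
    """Number of result pages, derived from the two constants on the page."""
    return math.ceil(total_results / results_per_page)


# e.g. 47 results shown 10 per page -> 5 pages
print(num_pages(47, 10))  # 5
```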
One of the original problems with the page was that you could only show the data for one semester at a time. So once I had built a scraper that could extract all the data for one semester, I had to make it dynamic so it could perform the same tasks for each semester I wanted. I decided to put all the previous code in a while loop, where the index of a semester in the drop-down list of semesters was the deciding factor for when to stop. I also made this index an input to the entire scraper, i.e. if I only want to scrape one semester I enter 1, if I want the last 20 semesters I enter 20, and so on. Finally, I had to make an exception for the upcoming semester, which has no admission grade data but is still added to the list.
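The control flow described above can be sketched as follows. Here `scrape_one` stands in for the single-semester Selenium routine, and the exception type and handling are assumptions about how the missing-grade-data case might be signalled:

```python
def scrape_semesters(num_semesters: int, scrape_one):
    """Run the single-semester scraper for each of the last num_semesters.

    scrape_one(index) is a placeholder for the Selenium routine that
    selects semester `index` in the drop-down and returns its data.
    Here it is assumed to raise LookupError for the upcoming semester,
    which has no admission grade data yet.
    """
    results = []
    for index in range(num_semesters):
        try:
            results.append(scrape_one(index))
        except LookupError:
            continue  # skip the semester without grade data
    return results
```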
MERGING THE DATA
One final problem I encountered was how to combine all the data into one dataframe. All the application data (i.e. number of applicants, gender and age of the applicants) was easy, since it was all structured the same way, with all the data for one program on each row of each respective table. Hence, I could merge on the code for each program (which was unique for each program, school and semester). However, the admission grade data was structured differently, making a direct merge hard (or at least I did not know how to solve it). So I had to create a new dataframe where the program code acted as an index and the different admission groups (whether you were accepted based on your grades, SAT etc.) became columns. Once it was structured the same way as the rest, the merge was easy and voilà, I had a complete dataframe with all the data in one place.
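That restructuring step can be sketched with a pandas pivot followed by a merge. The column names, group labels and values below are made up for illustration; they are not the real dataset:

```python
import pandas as pd

# Toy application data: one row per program, keyed by program code
applications = pd.DataFrame({
    "code": ["P1", "P2"],
    "applicants": [1200, 950],
})

# Toy grade data in "long" form: one row per program and admission group
grades_long = pd.DataFrame({
    "code": ["P1", "P1", "P2", "P2"],
    "group": ["BI", "HP", "BI", "HP"],  # e.g. grades vs. test-based admission
    "lowest": [19.5, 1.35, 17.8, 1.10],
})

# Pivot so each admission group becomes its own column, keyed by program code
grades_wide = grades_long.pivot(
    index="code", columns="group", values="lowest"
).reset_index()

# Both tables now share the program code, so the merge is straightforward
merged = pd.merge(applications, grades_wide, on="code")
print(merged)
```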
Ready to rumble
Now I was finally ready to put my program to the test and extract all the data I wanted. I set it to the last 25 semesters, i.e. 12 years, which was the limit of the available data on the site. Furthermore, I limited my first try to one university. The code ran for 25:39, and altogether the data consisted of 14,795 rows and a total of 295,900 elements. These were the first 5 rows of the result.
Thank you for reading Part 1 of this project. In the next part I will begin to analyze and visualize the data. If you have any questions or suggestions on how I could improve my code, please feel free to drop me a comment below.