Compiling a UFC Events and Matches Dataset

Motivation

GitHub: https://github.com/cinhui/ufc-events-stats
Technologies used: Python, Beautiful Soup, Plot.ly

The motivation behind this project is to compile a dataset of matches from past UFC events. The dataset can be used to analyze fighter win/loss/draw records, title defenses, and other fighter activity within the UFC. The dataset also includes the date and location of each UFC event.

The first Ultimate Fighting Championship event, UFC 1, took place on November 12, 1993 in Denver, Colorado. Since then, the UFC has held over 500 events across the world, including Pay-Per-View events as well as network-aired events.

Data Collection

The UFC events data was obtained from Wikipedia, whose event pages present results in consistently structured tables, making it a good source to scrape. The data collection was performed via web scraping using Python and Beautiful Soup.

The first step was to collect a list of past UFC events using the Wikipedia page that contains a table with past events along with links to their respective Wikipedia pages.

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Visit the Wikipedia page
# Last visited 4/27/2020
url = 'https://en.wikipedia.org/wiki/List_of_UFC_events'
url_request = requests.get(url).text
soup = BeautifulSoup(url_request, 'html.parser')

# Find the table for Past Events and store it into a data frame
events_data = []
table = soup.find('table', {'id': 'Past_events'})
table_rows = table.find_all('tr')
# Skip the header row, then collect the cell text of each event row
for row in table_rows[1:]:
    events_data.append([t.text.strip() for t in row.find_all('td')])
# Build the data frame once, after all rows have been collected
events_df = pd.DataFrame(events_data, columns=['Index', 'Event', 'Date', 'Venue', 'Location', 'Attendance', 'Ref'])
...
...
# Write list of past UFC events to csv file
events_df.to_csv("list_of_UFC_past_events.csv", index=False)
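
The elided steps are not shown in the post, so how the `wikipage` column used later was built is not known. As one possible approach, here is a minimal sketch that pulls the first link out of each event row. It runs on a small inline HTML table rather than the live page. The helper name `extract_event_links`, the sample HTML, and the assumption that the event name sits in the second cell are all illustrative and not the author's code.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://en.wikipedia.org"

# Made-up table that mimics the shape of the Past Events table
SAMPLE_HTML = """
<table id="Past_events">
  <tr><th>#</th><th>Event</th><th>Date</th></tr>
  <tr><td>2</td><td><a href="/wiki/Example_A">Example A</a></td><td>-</td></tr>
  <tr><td>1</td><td><a href="/wiki/Example_B">Example B</a></td><td>-</td></tr>
  <tr><td>0</td><td>Event without a page</td><td>-</td></tr>
</table>
"""


def extract_event_links(html, table_id="Past_events"):
    """Return one absolute URL (or None) per data row of the table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": table_id})
    links = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = row.find_all("td")
        if len(cells) < 2:
            continue
        # Assumption: the event name (and its link) is in the second cell
        anchor = cells[1].find("a", href=True)
        links.append(urljoin(BASE_URL, anchor["href"]) if anchor else None)
    return links


links = extract_event_links(SAMPLE_HTML)
print(links)
```

Rows without a link come back as `None`, which makes events that lack a Wikipedia page easy to spot before scraping.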

The cancelled events are excluded, since they don’t contain any matches.

events_df = pd.read_csv("data/list_of_UFC_past_events.csv")
events_occurred_df = events_df[events_df['Attendance']!="Cancelled"]

Next, we visit each event’s Wikipedia page to obtain the table that contains the match results.

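result_df = pd.DataFrame()
for index, row in events_occurred_df.iterrows():
    event_name = row['Event']
    event_url = row['wikipage']
    event_date = row['Date']

    try:
        url_request = requests.get(event_url).text
        soup = BeautifulSoup(url_request, 'html.parser')
        data = []
        # Assume the first table with class "toccolours" holds the match results
        table = soup.find('table', {'class': 'toccolours'})
        table_rows = table.find_all('tr')
        # Use a separate name so the outer loop's `row` is not overwritten
        for table_row in table_rows:
            data.append([t.text.strip() for t in table_row.find_all('td')])
        # Build the event's data frame once, after all rows are collected
        df = pd.DataFrame(data, columns=['Weight class', 'Fighter1', 'Result', 'Fighter2', 'Method', 'Round', 'Time', 'Note'])
        # Drop rows without <td> cells (such as header rows)
        df = df[~df['Weight class'].isnull()]
        df.insert(0, "Date", event_date)
        df.insert(1, "Event", event_name)
        # Add this event's matches to the combined data frame
        result_df = pd.concat([result_df, df], ignore_index=True)
    except Exception as e:
        # Report which event failed so its page can be inspected manually
        print("Error:", event_name, e)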

Several links produced an error, mainly because some of the older event Wikipedia pages followed a different format. For example, some events shared the same Wikipedia page, so the first table element with the class toccolours did not always contain the match information for the event in question. Those pages were inspected manually and supplementary scraping code was written to gather the data.
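
The supplementary scraping code is not shown in the post. As a rough illustration of one way to handle a page that holds more than one results table, the sketch below collects every `toccolours` table instead of only the first, so the right one can be picked afterwards. The helper name `parse_result_tables` and the inline HTML are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up page with two results tables, as on a shared event page
SAMPLE_HTML = """
<table class="toccolours">
  <tr><th>Weight class</th><th></th><th></th><th></th></tr>
  <tr><td>Heavyweight</td><td>Fighter A</td><td>def.</td><td>Fighter B</td></tr>
</table>
<table class="toccolours">
  <tr><td>Lightweight</td><td>Fighter C</td><td>def.</td><td>Fighter D</td></tr>
  <tr><td>Lightweight</td><td>Fighter E</td><td>def.</td><td>Fighter F</td></tr>
</table>
"""


def parse_result_tables(html):
    """Return a list of tables, each a list of rows of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    for table in soup.find_all("table", {"class": "toccolours"}):
        rows = []
        for tr in table.find_all("tr"):
            cells = [td.text.strip() for td in tr.find_all("td")]
            if cells:  # header rows contain only <th> cells, so skip them
                rows.append(cells)
        tables.append(rows)
    return tables


tables = parse_result_tables(SAMPLE_HTML)
print(len(tables), [len(rows) for rows in tables])
```

Which table belongs to which event still has to be decided per page, for example by position or by the nearest heading.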

Missing Data

There was also one event that did not have a Wikipedia page. Therefore, the match data was manually entered after doing a web search for the event results.

# Add info for this event
event_date = 'Aug 6, 2005'
event_name = 'UFC Ultimate Fight Night'
# Sources: https://www.ufc.com/event/UFC-Fight-Night-1
# https://www.sherdog.com/events/UFC-Fight-Night-1-Marquardt-vs-Salaverry-3100

The data was then appended to one single data frame and written to a file.

result_df.to_csv("data/ufc_matches.csv", index=False)
result_df.to_json("data/ufc_matches.json",orient='records')

Data Cleaning

The data cleaning component involved inspecting the columns of data for any missing values or incomplete data. Some of the earlier events did not have weight classes or a specified number of rounds, since these were not adopted until later. The missing weight classes were replaced with a “-” and the number of rounds was set to 1.
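
A minimal sketch of this fill step, using a made-up three-row frame with the same `Weight class` and `Round` column names as the scraped data. The sample values are illustrative only.

```python
import pandas as pd

# Made-up rows: the first and last mimic early events with no weight class or rounds
sample = pd.DataFrame({
    "Weight class": [None, "Heavyweight", None],
    "Round": [float("nan"), 3, float("nan")],
})

cleaned = sample.copy()
# Replace missing weight classes with "-"
cleaned["Weight class"] = cleaned["Weight class"].fillna("-")
# Treat matches without a recorded round as a single round
cleaned["Round"] = cleaned["Round"].fillna(1).astype(int)

print(cleaned)
```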

There were some inconsistencies and typos in the women’s weight class labels, which were corrected. The UFC has eight men’s divisions and four women’s divisions. Some matches were also labeled catchweight: when a fighter misses the weight limit for the scheduled weight class, the bout can go ahead as a catchweight fight at the heavier fighter’s weight.

Two matches were missing Time values. The values were found by searching the web and entered manually. A Time (secs) column was introduced that contains the total time of the match in seconds. The Date column was cast to a datetime type so that the data frame can be easily sorted by event date.
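
The post does not say exactly how Time (secs) was computed. One plausible reading is sketched below: completed rounds plus the time elapsed in the final round. The five-minute round length is an assumption (early events used other formats), and the helper names `match_seconds` and `parse_event_date` are hypothetical.

```python
from datetime import datetime


def match_seconds(final_round, time_str, round_minutes=5):
    """Total match time in seconds, assuming fixed-length rounds.

    final_round: the round in which the match ended (1-based)
    time_str: time elapsed in that round, formatted as "M:SS"
    """
    minutes, seconds = (int(part) for part in time_str.split(":"))
    completed_rounds = final_round - 1
    return completed_rounds * round_minutes * 60 + minutes * 60 + seconds


def parse_event_date(date_str):
    """Parse dates written like 'Aug 6, 2005' into a datetime."""
    return datetime.strptime(date_str, "%b %d, %Y")


print(match_seconds(1, "4:32"))
print(match_seconds(3, "5:00"))
print(parse_event_date("Aug 6, 2005"))
```

Parsing the date into a real datetime is what lets rows sort chronologically rather than alphabetically by month name.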

The resulting data frame was then saved to a new file.

Exploratory Data Analysis

The initial dataset contained matches up to March 14th, 2020 (UFC Fight Night: Lee vs. Oliveira). Since UFC 249: Ferguson vs. Gaethje took place on May 9th, 2020, the match data for that event was later appended to the dataset.

# Load dataset
matches_df = pd.read_csv("data/ufc_matches_cleaned.csv")
# Load supplementary dataset
df = pd.read_csv("data/UFC 249 Ferguson vs. Gaethje.csv")
# Combine both data frames (DataFrame.append is not available in pandas 2.0 and later)
matches_df = pd.concat([matches_df, df], sort=True)

As of May 9th, 2020, there have been 523 events, 9 of which were cancelled.

# Quick summary statistics
Time frame: 1993-11-12 00:00:00 to 2020-05-09 00:00:00 
Total number of matches: 5576 
Total number of events: 514 
Total number of dates: 507 
Total number of rounds: 12782.0
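
Figures like these can be derived with a few pandas aggregations. The sketch below uses a made-up four-row frame with `Date`, `Event`, and `Round` columns and is not the author's code. Two of the sample events share a date, which is why the event count and the date count can differ.

```python
import pandas as pd

# Made-up matches: "Event A" and "Event B" fall on the same date
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-10", "2021-01-10", "2021-01-10", "2021-02-06"]),
    "Event": ["Event A", "Event A", "Event B", "Event C"],
    "Round": [1, 3, 2, 5],
})

summary = {
    "first_date": sample["Date"].min(),
    "last_date": sample["Date"].max(),
    "matches": len(sample),               # one row per match
    "events": sample["Event"].nunique(),  # distinct event names
    "dates": sample["Date"].nunique(),    # distinct event dates
    "rounds": sample["Round"].sum(),
}

for key, value in summary.items():
    print(key, value)
```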

Events Over Time

Here is the number of events over time by month. Since 2006, the UFC has been holding events more frequently, with at least one event each month.

In fact, April 2020 was the first month without a UFC event since December 2005. This was because of event cancellations due to the coronavirus pandemic.
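
A month with no events can be found by counting events per calendar month and filling the gaps with zero. A minimal sketch follows. The dates are made up for illustration and are not real event dates.

```python
import pandas as pd

# Made-up event dates with nothing scheduled in March 2021
event_dates = pd.Series(
    pd.to_datetime(["2021-01-10", "2021-01-24", "2021-02-06", "2021-04-17"])
)

# Count events per calendar month
months = event_dates.dt.to_period("M")
counts = months.value_counts().sort_index()

# Reindex over every month in the range so empty months show up as 0
all_months = pd.period_range(months.min(), months.max(), freq="M")
counts = counts.reindex(all_months, fill_value=0)

empty_months = [str(month) for month in counts[counts == 0].index]
print(counts)
print(empty_months)
```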

Here is the plot of the number of days between consecutive UFC events since 2006. The average gap between events was 11.77 days and the median was 7 days. Since 2013, the UFC has been producing events more frequently, as shown in the plot.

The longest break between events since 2006 was between UFC 56 (on Nov 19th, 2005) and UFC Fight Night 3 (on Jan 16th, 2006), which took place 58 days apart.

The second longest break was between UFC Fight Night: Lee vs. Oliveira (on March 14th, 2020) and UFC 249: Ferguson vs. Gaethje (on May 9th, 2020) with a total of 56 days between events.
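
The gaps can be computed by sorting the distinct event dates and differencing neighbours. The sketch below uses only the standard library and checks the two breaks quoted above. The helper name `days_between_events` is hypothetical.

```python
from datetime import date


def days_between_events(dates):
    """Days between consecutive distinct event dates, in chronological order."""
    # Several events can share a date, so de-duplicate before differencing
    unique_sorted = sorted(set(dates))
    return [
        (later - earlier).days
        for earlier, later in zip(unique_sorted, unique_sorted[1:])
    ]


# UFC 56 to UFC Fight Night 3
longest = days_between_events([date(2005, 11, 19), date(2006, 1, 16)])
# UFC Fight Night: Lee vs. Oliveira to UFC 249
second_longest = days_between_events([date(2020, 3, 14), date(2020, 5, 9)])

print(longest, second_longest)
```

Applied to the full list of event dates, the same list of gaps gives the mean and median via `statistics.mean` and `statistics.median`.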

Weight Classes

Here is the number of matches over time by weight class, with the total number of matches plotted by year. Prior to 1997, there were no specified weight classes. UFC 12 on Feb 7th, 1997 was the first event to introduce weight classes, with lightweight, middleweight, and heavyweight.

Additional weight classes were not introduced until 2000-2001. UFC 28 on November 17th, 2000 was the first UFC event to adopt the Unified Rules of MMA. These sanctioned rules introduced judges, time limits, rounds, strict weight classes, and a 10-point scoring system, as well as regulations on fighters’ gloves and apparel. The women’s division was introduced in 2013. Currently, the UFC has a total of twelve weight divisions: eight men’s divisions and four women’s divisions.
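
The per-year counts behind a plot like this can be built with a cross-tabulation of year against weight class. The sketch below uses a made-up four-row frame. The dates and labels are illustrative, with “-” standing in for matches without a weight class.

```python
import pandas as pd

# Made-up matches for illustration only
sample = pd.DataFrame({
    "Date": pd.to_datetime(["1996-03-01", "1997-03-01", "1997-03-01", "1997-09-01"]),
    "Weight class": ["-", "Heavyweight", "Lightweight", "Heavyweight"],
})

# Rows are years, columns are weight classes, cells are match counts
per_year = pd.crosstab(
    sample["Date"].dt.year,
    sample["Weight class"],
    rownames=["Year"],
)

print(per_year)
```

Each column of the resulting table is one line or bar series for a plotting library.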

The plots were created using Python and Plot.ly. Full source code and dataset can be found in the project GitHub repository.
