Olympics Twitter Bot

Toy Project | Twitter bot to post when events are about to start

This project started when I realized Olympics schedule will be super confusing for Brazilians this year, since our timezone is completely different from Japan’s. So I decided to create a Twitter Bot to post when any event is about to start and, by turning notifications on, I could receivve an alert.

There are two main parts here:

  • Scraping the schedule
  • Creating the actual bot

All of that is made using Python3, Requests and Tweepy.

Scraping the schedule

To scrape the schedule, first I found a page with a proper table with all the data. Sporting News has it all, and with that we can go straight to the code.

Step 1 - Import libs

We need to import all the modules we’ll use:

import os
import pytz
import tweepy
import requests
import env_file
import pandas as pd
import dateutil.parser

import pandas as pd

from lxml import html
from time import sleep
from datetime import datetime, timedelta

Step 2 - Get the data

In order to get the data, we’ll use the Requests module to make the requests and lxml to parse the response. To make the requests and parse the response we simply do:

url = 'https://www.sportingnews.com/us/athletics/news/olympics-2021-start-schedule-opening-ceremony/9z5omct2mqe211c0ajna5tyj1'
response = requests.get(url)
tree = html.fromstring(response.text)

By inspecting the page we can find the table elements and their xpaths //div[@class=”content-element__table-container”]. Now for each table we can iterate over the rows and columns to get the data:

for table in tables:                   # iterate over the tables
    rows = table[0].xpath('.//tr')     # get all rows

    for row in rows[1:]:               # iterate over the rows
        cols = row.xpath('.//td')      # get all colunms for the current row

        sport = cols[0].text           
        event = cols[1].text
        time = cols[2].text

        new_row = {                    # create dict with the current row
            'Sport': sport,
            'Event': event,
            'Time': time,
            'Post': False
        }

        data.append(new_row)            # add new dict to the data

df = pd.DataFrame(data)                 # create dataframe from list

We can make one improvement when it comes to dealing with the dates. Not only they have an weird format, they’re also based in ET timezone. For that, we’ll create a new function that receives the text we got from the table, the amount of days since the start of the olympics and the olympics start date.

def parse_time(time, days_since, start_day):
    day = start_day + timedelta(days=days_since)                                                   # calculate day
    
    start_time, end_time = time.replace('.', '').replace(u'\xa0', ' ').replace(' ', '').split('-') # clean text
    
    if ':' in start_time:                                                                          # parses text according to the minutes format
        start_time = datetime.strptime(start_time, '%I:%M%p')
    else:
        start_time = datetime.strptime(start_time, '%I%p')
        
    start_time = start_time.replace(year=day.year, month=day.month, day=day.day)                   # add date info to the time
    
    if ':' in end_time:
        end_time = datetime.strptime(end_time, '%I:%M%p')
    else:
        end_time = datetime.strptime(end_time, '%I%p')
    
        
    end_time = end_time.replace(year=day.year, month=day.month, day=day.day)
    
    if end_time < start_time:                                                                      # check if the end is in the following day 
        end_time += timedelta(days=1)
    
    start_time = pytz.timezone('US/Eastern').localize(start_time)                                  # add timezone
    end_time = pytz.timezone('US/Eastern').localize(end_time)
    return start_time, end_time

And the last thing is to change a bit our get function.

def get(self):
    if not os.path.isdir('data'):
        os.mkdir('data')

    url = 'https://www.sportingnews.com/us/athletics/news/olympics-2021-start-schedule-opening-ceremony/9z5omct2mqe211c0ajna5tyj1'
    response = requests.get(url)
    tree = html.fromstring(response.text)
    tables = tree.xpath('//div[@class="content-element__table-container"]')
    data = []
    start_day = datetime(day=20, month=7, year=2021)

    for i, table in enumerate(tables[1:]):
        rows = table[0].xpath('.//tr')

        for row in rows[1:]:
            cols = row.xpath('.//td')

            sport = cols[0].text
            event = cols[1].text
            start_time, end_time = self.parse_time(cols[2].text, i, start_day)

            new_row = {
                'Sport': sport,
                'Event': event,
                'Start Time': start_time,
                'End Time': end_time,
                'Post': False
            }

            data.append(new_row)

    df = pd.DataFrame(data)
    return df

Step 3 - Create Bot

Now that we have the schedule, we can go ahead and create a bot to post it. The first thing I usually do is create a Bot class, in order to authenticate and create useful wrappers:

class TwitterBot:
    def __init__(self, creds_path=None):
        self.api = self.get_api(creds_path)
    
    # assumes you have a .env file with valid credentials
    def get_api(self, creds_path):
        creds = env_file.get('.env')
        api_key = creds['API_KEY']
        api_secret = creds['API_SECRET']
        access_token = creds['ACCESS_TOKEN']
        access_token_secret = creds['ACCESS_TOKEN_SECRET']

        auth = tweepy.OAuthHandler(api_key, api_secret) 
        auth.set_access_token(access_token, access_token_secret)
        api = tweepy.API(auth)
        return api
    
    def post_tweet(self, tweet):
        self.api.update_status(tweet)

We can now create a function to check if there’s any event happening soon that should be posted:

def check_post(schedule, bot):
    ET = pytz.timezone('US/Eastern')

    # iterate over the schedule checking for events happening in a 20-second window from now that has not yet been posted
    for i, row in schedule.iterrows():
        now = datetime.now(ET)
        
        if abs(row['Start Time'] - now) < timedelta(seconds=20) and not row['Post']:
            schedule.loc[i, 'Post'] = True
            post_text = get_post_text(row)
            bot.post_tweet(post_text)

The last thing is to create a main function to keep a loop on that:

if __name__ == '__main__':
    bot = TwitterBot()

    while True:
        check_post(schedule, bot)
        sleep(5)

The end

The project is not that complex, but yet uses a lot of different technologies and modules. It was nice doing it to practice and to have some fun =D