Old Faithful has long been known for its large plume and for the frequency and regularity of its eruptions. Though common lore holds that it erupts every hour, on the hour, its eruptions are more variable, occurring roughly every 90 minutes.
Predicting the geyser's next eruption time matters to the Yellowstone park service because it lets visitors get in position to watch the event. The current forecasting algorithm for the time until the next eruption is a simple regression on the duration of the previous eruption. The distribution of time to next eruption is bimodal, with one peak at 65 minutes following short eruptions and another at 92 minutes following long eruptions.
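To make the bimodal behavior concrete, here is a minimal sketch of a two-regime predictor built only from the two peaks quoted above. It is not the park service's actual regression: the function name `predict_interval_minutes` and the `threshold_minutes` cutoff between "short" and "long" eruptions are my own assumptions for illustration.

```python
# Peaks of the time-to-next-eruption distribution quoted in the text.
SHORT_ERUPTION_WAIT = 65.0  # minutes until next eruption after a short eruption
LONG_ERUPTION_WAIT = 92.0   # minutes until next eruption after a long eruption


def predict_interval_minutes(duration_minutes: float,
                             threshold_minutes: float = 3.0) -> float:
    """Return a rough wait time based on the previous eruption's duration.

    threshold_minutes is a hypothetical cutoff, not a value from the text;
    it would need to be fit to real duration data.
    """
    if duration_minutes < 0:
        raise ValueError("duration_minutes must be non-negative")
    if duration_minutes < threshold_minutes:
        return SHORT_ERUPTION_WAIT
    return LONG_ERUPTION_WAIT
```

A real regression would replace the hard cutoff with a fitted slope and intercept, but even this lookup shows why the previous eruption's duration is such a strong predictor.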
After visiting Yellowstone in 2018, I wondered whether a neural network could tease out subtler correlations in the eruption patterns and forecast more accurately.
The ideal dataset for prediction would be the temperature at the mouth of the geyser, sampled frequently. This would capture the precise time of each eruption, as well as its duration. Unfortunately, no public dataset of this kind is available for Old Faithful.
For this model, I found a dataset from geyserstudy.org that logged the time of each eruption from 2000 to 2011, and a dataset from geysertimes.org that logged the temperature in a channel away from the mouth of the geyser every minute from April through October of 2015. The eruption time dataset contains ~58,000 eruptions, whereas the temperature dataset covers a much shorter period and contains only ~7,100 eruptions.
Unfortunately, neither of these datasets captures the duration of each eruption, which is the most notable predictor of the next eruption time. The eruption time dataset only logs the time at which each eruption started. The channel temperature dataset comes closer, but its low sampling frequency (once per minute) and its dependence on the volume of water in the channel make it difficult to determine the duration of an eruption explicitly.
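Because the eruption time dataset records only start times, the intervals between eruptions have to be derived from consecutive timestamps. The sketch below shows one way to do that; the function name `intervals_in_minutes` and the sample timestamps are made up for illustration and are not taken from the geyserstudy.org data.

```python
from datetime import datetime
from typing import List, Sequence


def intervals_in_minutes(start_times: Sequence[datetime]) -> List[float]:
    """Convert eruption start times into minutes between consecutive eruptions."""
    ordered = sorted(start_times)  # guard against out-of-order log entries
    return [
        (later - earlier).total_seconds() / 60.0
        for earlier, later in zip(ordered, ordered[1:])
    ]


# Illustrative timestamps only, not real observations.
sample_starts = [
    datetime(2005, 6, 1, 8, 0),
    datetime(2005, 6, 1, 9, 32),
    datetime(2005, 6, 1, 10, 37),
]
sample_intervals = intervals_in_minutes(sample_starts)
```

Note that a missed eruption in the log would show up here as one unusually long interval, so a real pipeline would probably need to filter or flag such gaps.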
With the datasets in hand, I needed to structure an appropriate objective for the model to optimize. Framing the question in the best possible way for the network to learn is one of the most interesting parts of machine learning for me. As described earlier, the overall goal is to predict what time the next eruption will take place. Using the temperature dataset, I could ask the network to give me this answer implicitly in a few different ways.
Using the eruption interval dataset, the model's objective is more straightforward. Given a set of prior eruption intervals, predict the next element in the series.
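The "predict the next element" framing can be sketched as a sliding window over the interval series: each training example is a fixed number of prior intervals, and the target is the interval that follows. The helper name `make_windows`, the window length, and the sample values below are assumptions for illustration, not the configuration actually used for the model.

```python
from typing import List, Sequence, Tuple


def make_windows(intervals: Sequence[float],
                 window: int) -> Tuple[List[List[float]], List[float]]:
    """Split a series into (prior intervals, next interval) training pairs."""
    if window < 1:
        raise ValueError("window must be at least 1")
    inputs: List[List[float]] = []
    targets: List[float] = []
    # Stop early enough that every window has a following element to predict.
    for start in range(len(intervals) - window):
        inputs.append(list(intervals[start:start + window]))
        targets.append(intervals[start + window])
    return inputs, targets


# Illustrative interval values in minutes, not real data.
example_series = [65.0, 92.0, 90.0, 66.0, 91.0]
X, y = make_windows(example_series, window=3)
# X == [[65.0, 92.0, 90.0], [92.0, 90.0, 66.0]]
# y == [66.0, 91.0]
```

The window length controls how much history the network sees, so it is a natural hyperparameter to tune.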