
With a cutoff of 5, I’d choose a random option for about 1 in every 20 decisions I make with the algorithm. I chose 5 as the cutoff because it seemed like a reasonable frequency for occasional randomness. For the enterprising, there are more principled ways to pick the cutoff, and even to change it as learning continues, but your best bet is usually to try a few values and see which works best. Reinforcement learning algorithms sometimes act randomly precisely because they rely on past experience: always choosing the option predicted to be best can mean missing out on better options that were never tried.
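For the curious, here is a minimal sketch of the "epsilon-greedy" rule I'm following. The names (`past_rewards`, `cutoff`) and the default of scoring untried options at zero are my own assumptions for illustration, not part of any particular library:

```python
import random

def choose(options, past_rewards, cutoff=5):
    """Epsilon-greedy choice: with a cutoff of 5 out of 100 (about 1 in 20),
    pick a random option; otherwise exploit the best-known one.

    options      -- list of possible actions, e.g. ["sleep in", "get up"]
    past_rewards -- dict mapping each option to a list of past outcome scores
    """
    roll = random.randint(1, 100)
    if roll <= cutoff:
        # Explore: try something regardless of how it has worked out before.
        return random.choice(options)

    # Exploit: pick the option with the best average outcome so far.
    # Options never tried default to 0 here -- one of many reasonable choices.
    def average(opt):
        scores = past_rewards.get(opt, [])
        return sum(scores) / len(scores) if scores else 0.0

    return max(options, key=average)

# Example: should I get up at 8:30? (made-up scores)
print(choose(["sleep in", "wake up on time"],
             {"sleep in": [7, 8], "wake up on time": [6]}))
```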
I doubt that this algorithm can actually improve my life, but an optimization framework backed by mathematical proofs, peer-reviewed papers, and billions in Silicon Valley revenue seems perfectly reasonable on paper. How exactly will it fall apart in practice?
8:30 am
First decision: do I get up at 8:30 like I planned? I turn off the alarm, open the RNG, and hold my breath as it spins and spits out… a 9!
Now for the big question: did sleeping in or waking up on time have better outcomes for me in the past? My gut tells me to skip any reasoning and go straight back to sleep, but to be fair, I try to ignore that and sort through my vague memories of past mornings. Happiness in bed usually won: I decide that, as long as I didn’t miss anything important, extra sleep has mattered more to me than a leisurely weekend morning.
9:00 am
I couldn’t sleep for long: I had a group project meeting in the morning, plus some machine learning reading to finish before it (“Bayesian Deep Learning with Subnetwork Inference,” anyone?). The RNG instructed me to decide whether to skip the meeting based on past experience; I chose to attend. To decide whether to do the reading, I rolled again and got a 5, which means I choose randomly between reading and skipping it.
It’s a small decision, but I’m surprisingly nervous as I roll another random number on my phone. If I get 50 or less, I skip the reading to respect the “explore” part of the decision algorithm, but I really don’t want to. Apparently, skipping a reading is only fun if you do it on purpose.
I hit the generate button.
65. I get to do the reading after all.
11:15 am
I made a list of options for how to spend the vast amount of free time I now face. I can walk to a distant coffee shop I’ve always wanted to visit, call home, start my homework, look at PhD programs to apply to, go down an unrelated Internet rabbit hole, or just take a nap. The RNG turns up a high number, so I need to decide what to do based on my past experience.
This is the first decision of the day with more options than a simple yes or no, and when I start trying to judge how “preferable” each option is, it becomes clear that I can’t make an accurate estimate. When an AI agent makes a decision following an algorithm like mine, computer scientists have already told it what counts as “preferable.” They translate the agent’s experience into a reward score that the AI then tries to maximize, such as “time survived in a video game” or “money earned in the stock market.” However, reward functions can be difficult to define. Smart cleaning robots are a classic example: if you instruct the robot simply to throw out as much trash as it can, it may learn to knock over the trash can and throw the same trash out again to increase its score.
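A toy sketch of how that goes wrong, with entirely made-up action names and numbers: if the reward only counts pieces of trash thrown out, the looping exploit scores higher than honest cleaning.

```python
def reward(actions):
    """Naive reward: +1 for every 'throw_out' action; nothing else matters."""
    return sum(1 for a in actions if a == "throw_out")

# Cleans five real pieces of trash.
honest_cleaning = ["pick_up", "throw_out"] * 5

# Knocks the can over and re-throws the same trash, over and over.
exploit_loop = ["knock_over_can", "throw_out"] * 50

print(reward(honest_cleaning))  # 5
print(reward(exploit_loop))     # 50 -- the "better" policy under this reward
```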