Gem City Tech ML/AI Meetup
Presented by Evelyn J. Boettcher, DiDacTex, LLC
Dec. 15 2022
We meet every third Thursday and are part of the Gem City Tech meetup group.
GemCity TECH’s mission is to grow the local industry and the community by providing a centralized destination for technical training and workshops, and a forum for collaboration.
Currently, it supports several special interest groups from a variety of technical disciplines.
Gem City Tech ML/AI: Third Thursday at 6.
The GemCity TECH Meetup calendar of upcoming events: www.meetup.com/gem-city-tech
The May 18th meetup will be at 1 pm at WBI.
It costs money to run experiments, train algorithms, and implement algorithms.
We saw this in March’s meetup, where initially one needed to upgrade to Colab Pro. But then we saw that we could reduce the sample size and runtime, which might have allowed us to scrape by with the free version of Colab.
There are mathematical models you can use to identify the number of samples you need to get good results.
In addition, not knowing how to adjust your experiment to meet the risk can cost you your job and/or cost the company a lot of money.
Recently, Google debuted their version of ChatGPT. This was a high-risk demo. It failed and cost the company $100B in valuation.
You need to know how to adjust your experiments for the risk.
This is where experimental design comes in. How does one adjust for risk while reducing the cost of running the experiment?
One way to reduce cost and risk is to set up an A/B test.
A/B testing is a randomized controlled experiment done in production.
There are two versions, A and B, which differ in a single adjusted variable (the B version).
This variation might affect a user’s behavior.
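A minimal sketch of how users might be split between A and B in production. The source does not say how the assignment is done. Hashing the user ID is an assumption made here so that a returning user always sees the same version, and `assign_variant`, the experiment name `new-pizza`, and `b_fraction` are hypothetical names used only for illustration.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "new-pizza",
                   b_fraction: float = 0.5) -> str:
    """Return 'A' or 'B' for a user; the same user always gets the same answer."""
    # Hash the experiment name together with the user ID so that different
    # experiments split the same users in unrelated ways.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "B" if bucket < b_fraction else "A"


if __name__ == "__main__":
    counts = {"A": 0, "B": 0}
    for i in range(10_000):
        counts[assign_variant(f"user-{i}")] += 1
    print(counts)  # roughly an even split between A and B
```

Setting `b_fraction` below 0.5 exposes fewer users to the change, which is one way to limit risk while the test runs.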
Domain knowledge only gets you so far!
Subject-matter experts (SMEs) from Microsoft, Netflix, etc. find that when they implement changes, only a small fraction of their ideas have the planned outcome (Sweet 2022).
Even though we are isolating a single variable, that variable interacts with a million other variables.
You simply cannot model everything or know everything.
Say there is a grocer called Trader Jane’s and it wants to add a new pizza to its lineup.
However, it needs to keep the number of pizza types it sells constant: the freezer only holds N pizzas.
They will have to remove one pizza from their lineup to add the new item.
A week later sales go down.
A/B Testing is used to clarify a vision, but does not create vision.
For example, an ophthalmologist quickly gives you a series of two choices (1 or 2, 2 or 3) that lead to sharper vision. Their test, like an A/B test, cannot give you vision.
A/B Testing is an extension of DevOps.
Feedback will be the results of the A/B Testing.
We need to create the smallest, fastest A/B test that is statistically meaningful.
We want the number of samples N that gives a 5% false positive rate and a 20% false negative rate.
\[ N = ? \]
The minimum number of samples (N) is defined as:
\[ N > \left( 2.48 \, \frac{\sigma_\Delta}{\Delta} \right)^2 \]
In addition, our testing and product should do no harm.
\[ N > \left( 2.48 \, \frac{\sigma_\Delta}{\Delta} \right)^2 \]
\[ \Delta = B - A \]
If there is no real difference between A and B:
\[ \Delta = 0 \]
False positive
False negative
\(\alpha\) == False positive rate
You conclude B is better than A when it is not
\(\beta\) == False negative rate
You conclude B is no better than A when it is
Mean is 0
Standard deviation: \(\sigma\)
\(z_{score}\) measures the distance between a point and the mean in units of \(\sigma\)
\(z_{score}\) = -1.64 (5% tail), so the magnitude is 1.64
\(z_{score}\) = 0.84 (20% tail)
\[ 1.64 + 0.84 = 2.48 \]
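The step from the two z-scores to a sample size can be spelled out. Assuming one-sided tests and N independent measurements of \(\Delta\), the standard error of the measured mean is \(\sigma_\Delta / \sqrt{N}\), and the true difference has to clear both thresholds measured in those units:

\[ \frac{\Delta}{\sigma_\Delta / \sqrt{N}} \ge 1.64 + 0.84 \quad\Rightarrow\quad N \ge \left( 2.48 \, \frac{\sigma_\Delta}{\Delta} \right)^2 \]

Note that the summed z-score sits inside the square. The sketch below computes the same quantity with Python's standard library. The function name `min_samples` and the example numbers are hypothetical and not from the talk.

```python
import math
from statistics import NormalDist


def min_samples(sigma_delta: float, delta: float,
                alpha: float = 0.05, beta: float = 0.20) -> int:
    """Smallest N for a one-sided test with false positive rate alpha
    and false negative rate beta (assumes independent measurements)."""
    if sigma_delta <= 0 or delta <= 0:
        raise ValueError("sigma_delta and delta must be positive")
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z_alpha = std_normal.inv_cdf(1 - alpha)  # about 1.64 for alpha = 5%
    z_beta = std_normal.inv_cdf(1 - beta)    # about 0.84 for beta = 20%
    return math.ceil(((z_alpha + z_beta) * sigma_delta / delta) ** 2)


if __name__ == "__main__":
    # Hypothetical numbers: measurements of the difference have a standard
    # deviation of 10, and we care about a difference of 3.
    print(min_samples(sigma_delta=10.0, delta=3.0))  # prints 69
```

Because N scales with \(1/\Delta^2\), asking the test to detect a smaller difference gets expensive quickly.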
We have our minimum viable test.
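One way to sanity-check a minimum viable test is to simulate it many times and count how often it gets the wrong answer. The sketch below is an illustration under stated assumptions, not part of the talk: measurements are taken to be independent and normally distributed, so the mean of N of them can be drawn directly with standard deviation \(\sigma_\Delta / \sqrt{N}\). The name `simulate_error_rates` and the example numbers are hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate_error_rates(sigma_delta: float, delta: float,
                         trials: int = 20_000, seed: int = 0):
    """Return (N, false positive rate, false negative rate) from simulation."""
    rng = random.Random(seed)
    z_alpha = NormalDist().inv_cdf(0.95)  # about 1.64
    z_beta = NormalDist().inv_cdf(0.80)   # about 0.84
    n = math.ceil(((z_alpha + z_beta) * sigma_delta / delta) ** 2)
    std_err = sigma_delta / math.sqrt(n)
    # Decision rule: accept B only if the measured mean difference exceeds this.
    threshold = z_alpha * std_err

    # A and B are really the same (true difference 0), but the test accepts B.
    false_pos = sum(rng.gauss(0.0, std_err) > threshold for _ in range(trials))
    # B really is better by delta, but the test does not accept it.
    false_neg = sum(rng.gauss(delta, std_err) <= threshold for _ in range(trials))
    return n, false_pos / trials, false_neg / trials


if __name__ == "__main__":
    n, fp, fn = simulate_error_rates(sigma_delta=10.0, delta=3.0)
    print(f"N={n}, false positive rate={fp:.3f}, false negative rate={fn:.3f}")
```

The simulated rates should land near 5% and 20%. The false negative rate tends to come out slightly under 20% because N is rounded up to a whole number of samples.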
But we are not done yet.
One more thing to worry about
If we implement B, how F-ed up will that make the users?
If we test B, how F-ed up will that make the users?
A/B testing is being done on users without their consent or knowledge, and at scale (on the order of 100K users).
Group mindset has been around since the 1950s. Current research shows that our minds physically change when we work together socially (Hughes, n.d.).
So it is scary to read
There are already strong standards for testing on human subjects.
There is the IRB (Institutional Review Board) process.
It has required, continuous training and certification: CITI training.
It is only required for organizations receiving federal government funding: universities, Air Force, Army, etc.
It is not required for companies that work with schools (state and local funding) or for social media companies.
In light of this, I propose researchers use the F potential.
(Currently, not a real thing. Just something to think about.)
\[ F_{upped} = \begin{cases} 1, &\quad\text{if seriously harmed} \\ 0.5, &\quad\text{if slightly harmed} \\ 0, &\quad\text{if not measurable} \\ -0.5, &\quad\text{if made better} \\ -1, &\quad\text{if made a lot better} \end{cases} \]
If \(F_{upped}\) > 0, test should be a no-go.
If \(F_{upped}\) < 0, \(\Delta\) should be halved.
Example: Trader Jane’s pizza needs sales to increase by 3% (\(\Delta\)).
If the pizza made people better off, then \(\Delta=1.5\%\).
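The proposed rule can be written as a small function, together with what it does to the sample size. This is only a sketch of the idea above. Leaving \(\Delta\) unchanged when \(F_{upped} = 0\) is an assumption, since the rule does not cover that case. The names `adjust_for_f_potential` and `samples_needed` and the value \(\sigma_\Delta = 10\%\) are hypothetical. The sample size uses \(N \ge (2.48\,\sigma_\Delta/\Delta)^2\), with the 2.48 inside the square.

```python
import math
from typing import Optional


def adjust_for_f_potential(delta: float, f_upped: float) -> Optional[float]:
    """Apply the proposed F-potential rule to the target effect size delta.

    Returns None when the test should be a no-go.
    """
    if f_upped > 0:
        return None       # users would be harmed: do not run the test
    if f_upped < 0:
        return delta / 2  # users benefit: a smaller improvement is acceptable
    return delta          # assumption: no measurable effect leaves delta as is


def samples_needed(sigma_delta: float, delta: float) -> int:
    """Minimum N for 5% false positives and 20% false negatives."""
    return math.ceil((2.48 * sigma_delta / delta) ** 2)


if __name__ == "__main__":
    sigma_delta = 10.0  # hypothetical spread of the sales difference, in percent
    for f in (1, 0, -0.5):
        delta = adjust_for_f_potential(3.0, f)
        if delta is None:
            print(f"F={f}: no-go")
        else:
            print(f"F={f}: delta={delta}%, N={samples_needed(sigma_delta, delta)}")
```

Halving \(\Delta\) roughly quadruples the number of samples, so a change that benefits users is allowed a smaller improvement but needs a longer test to detect it.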
Biases can increase the F-potential.
Luckily, A/B Testing can help with both unseen and seen bias.
I know of three small businesses that were started by young women in the Dayton area.
Their original logo designs used a beautifully detailed font.
Unfortunately, this detailed font made the logos difficult for people like me (over 40) to read.
They could literally see their logo.
Their logo was not readable to me when I drove by!
These women (all of whom were lovely and kind) did not know that they had made a logo that I could not read.
The human eye’s ability to resolve a spatial frequency is dependent on contrast. This contrast threshold function will change with age.
The Starbucks logo has evolved to reduce high spatial frequency information.
(2022)
With a more diverse workforce, researcher bias will go down.
Google and Apple had a hard time getting voice recognition to work for kids (Scanlon, n.d.).
Not only do kids speak at higher frequencies than women, they also have different speaking patterns.
One cannot simply take an adult’s voice and shift the frequency, so ML/AI models have a hard time figuring out what kids are saying.
The market for voice recognition for kids looks set for strong growth.
A/B testing is a randomized controlled experiment done in production.
There are two versions, A and B, which differ in a single adjusted variable (the B version).
This variation might affect a user’s behavior.
No free lunch.
Even after testing, test results might not make the company more successful.