Beyond user feedback
Presented by Evelyn J. Boettcher, Founder of DiDacTex, LLC
Developers & Data Analytics
A/B testing is a randomized controlled experiment done in production.
There are two tests, A and B, in which a single variable is adjusted (the B test).
This variation might affect a user’s behavior.
Domain knowledge only gets you so far!
Subject-matter experts (SMEs) from Microsoft, Netflix, etc. find that when they implement changes, only a small fraction of their ideas have the planned outcome (Sweet 2022).
Even though we are isolating a single variable, that variable interacts with a million other variables.
You simply cannot model everything or know everything.
Say there is a grocer called Trader Jane’s that wants to add a new pizza to its lineup.
However, the number of pizza types it sells must stay constant, because the freezer only holds N pizzas.
They will have to remove one pizza from their lineup to add the new item.
A week later sales go down.
A/B Testing is used to clarify a vision, but does not create vision.
For example, an ophthalmologist quickly gives you a series of two-way choices (1 or 2? 2 or 3?) that lead to sharper vision. Their test, like A/B testing, cannot give you vision.
The First Way: Principles of Flow
The Second Way: Principles of Feedback
The Third Way: Principles of Continuous Learning
A/B Testing is an extension of the second and third ways.
Feedback will be the results of the A/B Testing.
Need to create the smallest, fastest A/B Test that is statistically meaningful.
We want a sample size N that gives a 5% false-positive rate and a 20% false-negative rate.
\[ N = ? \]
The minimum number of samples (N) is defined as:
\[ N > \left( 2.48 \, \frac{\sigma_\Delta}{\Delta} \right)^2 \]
In addition, our testing and product should do no harm.
\[ N > \left( 2.48 \, \frac{\sigma_\Delta}{\Delta} \right)^2 \]
\[ \Delta = 0 \\ \Delta = B - A \]
False positive
False negative
\(\alpha\) == False Positive rate
You can assume B is better than A
\(\beta\) == False Negative rate
You can assume A is better than B
Mean is 0
Standard deviation: \(\sigma\)
\(z_{score}\) measures the distance between a point and the mean in units of \(\sigma\)
\(Z_{score}\) = -1.64 (5%)
\(Z_{score}\) = 0.84 (20%)
\[ 1.64 + 0.84 = 2.48 \]
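The two z-scores can be reproduced from the standard normal distribution (mean 0, standard deviation 1). The following is a minimal sketch using Python's standard library; the variable names are illustrative and not from the talk.

```python
from statistics import NormalDist

ALPHA = 0.05  # false-positive rate used in the slides
BETA = 0.20   # false-negative rate used in the slides

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Point below which 5% of the distribution lies (negative, left tail).
z_alpha = std_normal.inv_cdf(ALPHA)

# Point below which 80% of the distribution lies (1 - BETA).
z_beta = std_normal.inv_cdf(1.0 - BETA)

# Total distance, in units of sigma, that the test must resolve.
z_total = abs(z_alpha) + z_beta

print(f"z_alpha = {z_alpha:.2f}, z_beta = {z_beta:.2f}, total = {z_total:.3f}")
```

The slides round each z-score to two decimals before adding, which gives 2.48; summing the unrounded values gives a slightly larger number (about 2.49). The difference is negligible for planning purposes.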
\(\alpha\) == False Positive rate
Assume: B is better than A
\(\beta\) == False Negative rate
Assume: A is better than B
We have our minimal viable test.
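As a rough sketch of what that minimal test size looks like in numbers, the bound can be wrapped in a small helper. This is not the speaker's code: `min_samples` and the example values for `sigma_delta` and `delta` are hypothetical. It assumes the 2.48 factor is a z-score distance, so it is squared together with the ratio \(\sigma_\Delta / \Delta\), as in the standard power-analysis form.

```python
import math

Z_TOTAL = 2.48  # 1.64 + 0.84, from the 5% / 20% error rates in the slides


def min_samples(sigma_delta: float, delta: float, z_total: float = Z_TOTAL) -> int:
    """Smallest integer N at or above (z_total * sigma_delta / delta) ** 2.

    sigma_delta: standard deviation of the difference B - A (assumed known
                 or estimated from earlier data).
    delta:       smallest difference B - A that matters to the business.
    """
    if delta == 0:
        raise ValueError("delta must be non-zero; a zero effect cannot be detected")
    return math.ceil((z_total * sigma_delta / delta) ** 2)


# Hypothetical numbers: spread of 0.10 in the metric, looking for a 0.03 lift.
n_required = min_samples(sigma_delta=0.10, delta=0.03)
print(n_required)
```

A smaller target difference or a noisier metric both push the required N up, which is why the smallest statistically meaningful test still depends on knowing \(\sigma_\Delta\) reasonably well.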
But we are not done yet.
One more thing to worry about
If we implement B, how F–ed up will that make the users?
If we tested B, how F–ed up will that make the users?
A/B Testing is being done on users without their consent or knowledge, and at scale (on the order of 100K users).
The idea of a group mindset has been around since the 1950s. Current research shows that our minds physically change when we work together socially (Hughes, n.d.).
So it is scary to read
There are already strong standards for testing on human subjects.
There is the IRB (Institutional Review Board) process.
It has required and continuous training and certification: CITI training.
Only required for companies receiving federal government funding: Universities, Air Force, Army etc.
Not required for companies that work with schools (state and local funding) and social media companies.
In light of this, I propose researchers use the F potential.
(Currently, not a real thing. Just something to think about.)
\[ F_{upped} = \begin{cases} \text{1,} &\quad\text{if seriously harmed}\\ \text{0.5,} &\quad\text{if slightly harmed} \\ \text{0,} &\quad\text{if not measurable}\\ \text{-0.5,} &\quad\text{becomes better} \\ \text{-1,} &\quad\text{becomes a lot better} \\ \end{cases} \]
If \(F_{upped}\) > 0, test should be a no-go.
If \(F_{upped}\) < 0, \(\Delta\) should be halved.
Example: Trader Jane’s Pizza needs sales to increase by 3% (\(\Delta\)).
If pizza made people better, then \(\Delta=1.5\%\).
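The go/no-go rule and the pizza example can be written out as a short sketch. Since the F potential is only a proposal, everything here is hypothetical: the function name `adjusted_delta` is invented for illustration, and leaving \(\Delta\) unchanged when \(F_{upped} = 0\) is an assumption, because the slides only cover the positive and negative cases.

```python
import math
from typing import Optional


def adjusted_delta(f_upped: float, delta: float) -> Optional[float]:
    """Apply the proposed F-potential rule to a target effect size.

    Returns None when the test should be a no-go (f_upped > 0),
    half of delta when users are expected to benefit (f_upped < 0),
    and delta unchanged when f_upped == 0 (assumption, not in the slides).
    """
    if f_upped > 0:
        return None
    if f_upped < 0:
        return delta / 2
    return delta


# Trader Jane's needs a 3% sales increase; the new pizza "becomes better" (-0.5).
pizza_delta = adjusted_delta(f_upped=-0.5, delta=0.03)
print(pizza_delta)

# A change that slightly harms users (0.5) is rejected outright.
harmful = adjusted_delta(f_upped=0.5, delta=0.03)
print(harmful)
```

Because the required N grows as \(1/\Delta^2\), halving \(\Delta\) roughly quadruples the number of samples needed, so under this proposal a beneficial change earns a lower bar but a longer test.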
Biases can increase the F-potential.
Luckily, A/B Testing can help with both unseen and seen bias.
I know of three small businesses that were started by young women in the Dayton area.
Their original logo design used a beautifully detailed font.
Unfortunately, this detailed font would make it difficult for people like me (over 40) to read it.
They literally could see their logo.
Their logo was not readable to me when I drove by!
These women (all of whom were lovely and kind) did not know that they had made a logo that could not be read by me.
The human eye’s ability to resolve a spatial frequency is dependent on contrast. This contrast threshold function will change with age.
The Starbucks logo has evolved to reduce high-spatial-frequency information.
(2022)
Most spoken consonants fall in the 400 to 4,500 Hz range.
Women have most of their consonant sounds in the higher frequencies.
Green Bar shows the cutoff for the voice spectrum
This caused women’s voices to sound shrill.
It also made it hard to understand what they said, since their voice was cut off.
Margaret Thatcher changed her voice during her career.
Women’s voices dropped on average by over 23 Hz from 1945 to 1993 (Pemberton 1998).
Women’s voices have been becoming more manly.
In 1927 a voice spectrum had to be defined.
J.C. Steinberg (from AT&T) knew that the proposed voice frequency band cut off women’s voices. He wrote a letter titled “Understanding Women”.
He states that men traditionally have an inability to understand women except when their tone is soft.
So, it is a “biological failing of women” (Tallon, n.d.) that we can’t understand them. \(\therefore\) The technology as is, is good.
In A/B Testing, we focus on the question: does doing A or B make the company more successful?
When another company/technology comes along that covers women’s voices better, it is reasonable to assume that they will get that market share.
Note: Narrow Band (free to use) VOIP is 300 to 3,400 Hz.
With a more diverse workforce, research(er) bias will go down.
Google and Apple had a hard time getting voice recognition to work for kids (Scanlon, n.d.).
Not only do kids speak at higher frequencies than women, they have different speaking patterns.
One cannot simply take an adult’s voice and shift the frequency. So ML/AI have a hard time figuring out what kids are saying.
The market for kids’ voice recognition looks set for strong growth.
A/B testing is a randomized controlled experiment done in production.
There are two tests, A and B, in which a single variable is adjusted (the B test).
This variation might affect a user’s behavior.
No free lunch.
Even after testing, test results might not make the company more successful.
Please check out Gem City Tech.
Gem City TECH’s mission is to grow the local industry and the community by providing a centralized destination for technical training, workshops and providing a forum for collaborating.
Evelyn Boettcher ejb@DiDacTex.com
Taste of IT Conference