
What can AI learn from market research about bias?

“Any researcher worth their salt was acutely aware of sample design and delivery on that design back in the 1970s. We knew that it would make or break our study and our competitive advantage lay in our data quality.”

So says Butch Rice, a South African market research industry pioneer who founded one of that country’s most successful agencies, Research Surveys, before it was subsumed by TNS and later Kantar. I spoke to him while preparing this article.

Over time, sampling has become such a fundamental and pervasive part of our industry that for many researchers it is almost an afterthought. Yet our once-dogged focus on sampling is resurfacing in new and novel ways. With the large-scale adoption of AI technologies, we need to think more carefully about the data samples we use to train our AI models, and our industry’s past diligence on sampling has much to teach us.

Much has been said about bias in AI – bias towards specific races, genders, age groups and so on – but little has been said about the inherent biases in AI models that make them less relevant to the market research industry than they could be. There is a disconnect between the language of insights – the shared paradigm between researchers and clients that allows us to quickly convey concepts about brands – and the language of the internet that is encoded into many AI models. Industry stalwart Larry Friedman highlighted for me the disconnect between traditional brand concepts and new data sources like social media:

“Too often, researchers haven’t been careful enough when attempting to ‘translate’ established survey constructs like Brand Consideration or Purchase Intent into equivalent social metrics. How people discuss their interest in buying brands on Twitter may not correspond simply to a survey top-box score; it needs to be thought through very carefully.”

How we construct AI models

Often, we don’t give enough thought to this consilience, or lack thereof. Most AI models are created in one of two ways:

  1. A training dataset that consists of inputs (variables, text verbatims, etc.) and associated examples of what we are trying to predict is used to train an AI model. Often these are datasets shared publicly by academia or industry, unless a company invests in creating its own.
  2. A public model that encodes the relationships within a massive dataset is used to predict where our new, unseen data falls based on these previously encoded relationships. These datasets and models are often based on huge swathes of the internet.

In the first scenario, traditional sampling considerations are very relevant when it comes to constructing a training dataset from scratch. Does the training dataset capture the same domain or paradigm as the unseen data that I am going to be applying the model to? Is the training dataset ‘representative’ of the way that whatever I am trying to predict falls out in real life? These kinds of questions will resonate with any market researcher and are vitally important to consider when constructing an AI model.

In the second scenario, rather than starting from scratch each time, AI practitioners increasingly leverage datasets and models shared in the public domain by organisations that have the resources to collect and process large portions of the internet.

Fine-tuning required?

For a long time, we thought that these models incorporated so much data that they captured universal relationships that could be leveraged in most contexts. However, what has become apparent over time is that these models, while amazing public goods, still need to be tweaked, or “fine-tuned”, to be sensitive towards specific contexts. This process is known as “transfer learning” – a popular field of AI research and application at the moment. These public models represent massive, solid foundations to build on, but we still need to choose the paint colour, window shape and roof style (to butcher an analogy).
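To make the fine-tuning idea concrete, here is a deliberately toy sketch (all model weights and verbatims are invented for illustration): a small “pretrained” word-weight table stands in for a large public model, and a handful of labelled brand-tracking verbatims nudge those weights towards our domain, where words like “premium” and “pricey” carry sentiment the general model misses.

```python
# Toy illustration of transfer learning / fine-tuning (all data invented).
# A "pretrained" word-weight table stands in for a large public model; a few
# labelled domain verbatims then adjust a copy of those weights.

PRETRAINED = {  # frozen general-purpose sentiment weights ("public model")
    "great": 1.0, "love": 1.0, "terrible": -1.0, "hate": -1.0,
    "premium": 0.0, "pricey": 0.0,  # words the general model treats as neutral
}

def score(text, weights):
    """Sum the weights of known words; the sign gives predicted sentiment."""
    return sum(weights.get(w, 0.0) for w in text.lower().split())

def fine_tune(examples, base, lr=0.5, epochs=10):
    """Perceptron-style updates on a copy of the base weights."""
    w = dict(base)  # the public model itself stays untouched
    for _ in range(epochs):
        for text, label in examples:  # label: +1 positive, -1 negative
            pred = 1 if score(text, w) > 0 else -1
            if pred != label:  # misclassified: nudge word weights toward label
                for word in text.lower().split():
                    w[word] = w.get(word, 0.0) + lr * label
    return w

# Domain examples: in brand research, "premium" reads positive and
# "pricey" negative, even where a general model is neutral on both.
domain = [("feels premium", 1), ("great but pricey", -1)]
tuned = fine_tune(domain, PRETRAINED)
```

After fine-tuning, the copy learns the domain-specific readings of “premium” and “pricey” while retaining the general model’s knowledge of words like “great” – which is the essence of building on the public foundation rather than starting from scratch.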

Indeed, when it comes to these public models, the promise of big data was that we no longer needed to sample. Why would we, when we had access to “all” the data? Many of the public resources that AI models are built on do draw on vast swathes of the internet. However, regardless of how big these datasets might be, the reality is that we seldom have all the data. And when we do have access to large datasets, they are often biased towards specific domains; few ever capture the market research paradigm, with its unique concepts like brand equity or common product attributes. Consumers on the internet simply don’t frame their discussions around these topics in the way we talk about them as brand researchers and owners.

Models created using millions (or even hundreds of millions) of tweets, for example, encode the semantic associations and perceptions of a self-selected group of vocal, passionate and opinionated people. Models based on IMDB, Amazon or Yelp reviews encode consumer language specific to certain categories and Wikipedia, boon to humanity that it is, captures dry relationships between facts that hardly reflect how real people talk or think.

Models based on these sources are used by most AI practitioners around the world. Released by companies such as OpenAI, Hugging Face, DeepMind, Google and Facebook, they are surely valuable public goods, but they often fail to give us the quality that the market research industry needs out of the box when we apply them to our own text, voice, image and video data.

So what next?

This brings us back to the concept of sampling, or at least thinking deeply about the data that goes into your models, and market researchers can teach the AI community a thing or two about diligence in this area.

When constructing training datasets or fine-tuning public models, careful thought needs to be given to the alignment between domains. Was the training dataset or public model created using data from a domain similar to the one I am applying it to? If not, how should I go about creating a new training dataset to build a new model or fine-tune an existing one? How many classes (metrics, tags, KPIs, etc.) am I trying to predict? How many training examples do I need to cover all these classes appropriately? Where am I sourcing this data from? How am I going to code it? Will I do it in-house? Will I use a coding team? Will I use a crowd-sourcing company such as Amazon’s Mechanical Turk or Figure Eight? How many coders should review each document? And so on…
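On the question of how many coders should review each document, one standard piece of diligence is an inter-coder reliability check before trusting a coded training set. The sketch below computes Cohen’s kappa (agreement between two coders beyond what chance would produce) in plain Python; the codes and coder decisions are invented for illustration.

```python
# Cohen's kappa: inter-coder agreement corrected for chance (toy data).
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Agreement between two coders beyond chance; 1.0 = perfect agreement."""
    assert len(codes_a) == len(codes_b)
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    labels = set(codes_a) | set(codes_b)
    # Chance agreement: probability both coders independently pick each label
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Two coders tagging ten invented verbatims as pos / neg / neutral
coder_1 = ["pos", "pos", "neg", "neutral", "pos",
           "neg", "neg", "pos", "neutral", "pos"]
coder_2 = ["pos", "neg", "neg", "neutral", "pos",
           "neg", "pos", "pos", "neutral", "pos"]
kappa = cohens_kappa(coder_1, coder_2)
```

A low kappa on a pilot batch is a signal to tighten the codebook or add coders before scaling up, which is exactly the kind of sampling-era diligence the article argues AI training data deserves.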

It’s important that AI practitioners do not overlook this crucial aspect of building AI models and there is much that experienced marketing scientists and other insights professionals can impart in this regard.

AI has been a bit like the Wild West but it’s time to straighten things up with a bit of market research wisdom.
