How to avoid confusing the terminology in Machine Learning

In this blog post, our test experts Eva Holmquist and Rik Marselis look at testing terms that are used in conflicting ways, and explain how to avoid the resulting confusion.

Terminology may seem like a trivial subject, but conflicting terminology has caused many misunderstandings and will continue to do so. One example from machine learning is a frequent source of misunderstanding: the term test data.

The problem often occurs when data scientists and testers talk about test data. To a data scientist who trains a machine learning algorithm, it is perfectly clear what test data is. It’s equally clear to a tester who tests a machine learning algorithm. Unfortunately, they often don’t mean the same thing. The next figure shows how, as far as we have found, the terms “training data”, “test data” and “validation data” are commonly used among data scientists.

The data that is used to train the machine learning algorithm is split into training data and test data. The algorithm learns from the training data and (automatically) uses the test data to check whether the learning was successful. After the first round of learning, the original dataset is split again, but differently, into training data and test data, and the second round of learning starts. After many rounds, the algorithm is considered to have finished learning. Then a separate set of data that the algorithm has never seen before, the validation data, is used to validate that the algorithm has indeed learned properly and is ready for operational use.
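To make the three kinds of data concrete, here is a minimal Python sketch of the splitting described above. It is only an illustration under our own assumptions: the function name `split_rounds`, its parameters and the fractions are hypothetical, no actual learning takes place, and the variable names follow the usage described in this post. Many machine learning libraries and textbooks use “validation” and “test” the other way round, which is exactly why it pays to agree on the names first.

```python
import random


def split_rounds(records, holdout_fraction=0.2, test_fraction=0.25, rounds=3, seed=0):
    """Split records as described in this post (hypothetical helper).

    Returns a list of (training_data, test_data) pairs, one per round,
    plus the validation data that is kept apart from every round.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)

    # Data the algorithm never sees while learning: "validation data" in this post.
    n_validation = int(len(shuffled) * holdout_fraction)
    validation_data = shuffled[:n_validation]
    learning_pool = shuffled[n_validation:]

    n_test = int(len(learning_pool) * test_fraction)
    splits = []
    for _ in range(rounds):
        # Each round splits the same pool again, but differently.
        pool = list(learning_pool)
        rng.shuffle(pool)
        test_data = pool[:n_test]
        training_data = pool[n_test:]
        splits.append((training_data, test_data))
    return splits, validation_data


splits, validation_data = split_rounds(range(100))
for round_number, (training_data, test_data) in enumerate(splits, start=1):
    print(round_number, len(training_data), len(test_data))
print("validation:", len(validation_data))
```

With 100 records and the default fractions, every round gets 60 training records and 20 test records, and 20 records are held back as validation data. A tester reading the same code might well prefer to call the held-back set the test data.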

If data scientists and testers are aware of this process and of the fact that three different types of data are involved, they can easily agree on which name to use for which part of the data. Our impression is that data scientists prefer the terms as shown in the figure above, but testers would usually swap the terms test data and validation data because that seems to align better with common testing terminology.

Our advice is to always check terms and definitions as long as there’s no worldwide consensus. The need for discussions around terminology increases when you’re working with people from different backgrounds. Even the most trivial of terms can be the source of grave misunderstandings. So, an open discussion around the terms used can be crucial for the success of the project.

So, how do we avoid confusing the terminology? In machine learning, we use a lot of data for different purposes, as we’ve discussed. One way to avoid misunderstandings is to gather everybody involved in a workshop. Start by discussing the different purposes you will use data for. This will result in a list. For each purpose, discuss what term you use for it. To find every term in use, each person should write the term they use on a post-it. For some purposes everybody will use the same term, but for most they will not. Discuss the purposes for which you use different terms and decide on one term for each. The result of the workshop is a list of the purposes you use data for and the agreed term for each. When new people join, it’s easy to go through the list with them to avoid misunderstandings. This is a simple technique that has worked well for us.

This blog was written by Eva Holmquist (Sogeti Sweden) and Rik Marselis (Sogeti Netherlands).

Eva Holmquist is a senior test specialist at Sogeti. She has worked on activities ranging from test planning to test execution, as well as Test Process Improvement and Test Education. She is the author of "Praktisk mjukvarutestning", a book on software testing in practice.

Rik Marselis is a test expert at Sogeti. He has worked with many organizations and people to improve their testing practices and skills. Rik has contributed to 19 books on quality and testing. His latest book is “Testing in the digital age; AI makes the difference”, about testing OF and testing WITH intelligent machines.

CONTACT

Eva Holmquist
Senior Test Specialist
072-502 83 93

Rik Marselis
Quality and Testing Consultant | Netherlands
+31 886 606 600
