Linguistic Analysis

By: Ian Makgill

To conduct the linguistic analysis, we took the narratives supplied by users and assessed them to establish which colours had the strongest associations with different words. To complete this task, we had to reduce the text to words with a pertinent meaning and then normalise them, so that variations of the same word could be grouped together.

To remove common words without descriptive meaning, such as “it”, “and” and “the”, we used a list of English stop words. This let us filter out the common words while retaining the most descriptive ones.
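As a rough illustration of this filtering step, the sketch below lower-cases a narrative, splits it into words and drops anything found in a stop-word list. The list shown is a short, hypothetical stand-in (the actual list used in the study is not given here), and the function name `descriptive_words` is ours.

```python
import re

# Hypothetical, abbreviated stop-word list for illustration only;
# the list used in the study is not specified in this article.
STOP_WORDS = {"a", "an", "and", "is", "it", "me", "of", "the", "this", "to"}

def descriptive_words(narrative):
    """Lower-case the text, split it into words, and drop stop words."""
    tokens = re.findall(r"[a-z']+", narrative.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(descriptive_words("It reminds me of the sea and the sky"))
# ['reminds', 'sea', 'sky']
```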

The Colour of Happy, R240 G190 B2

To allow for variations of the same word, we stemmed the words, reducing every variation to a single form, so that “Happy”, “Happier”, “Happiest” and “Happiness” were all counted as the word “Happy”. We used the Porter stemming algorithm to carry out this normalisation.
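The grouping idea can be sketched with a deliberately simplified suffix stripper. This is not the Porter algorithm, which applies a much longer sequence of rules; the suffix list and the names `toy_stem` and `count_stems` are assumptions made purely to show how variants collapse into one counted key.

```python
from collections import Counter

# Toy suffix list, checked longest first. NOT the Porter rule set.
SUFFIXES = ("iness", "iest", "ier", "ily", "y")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def count_stems(words):
    """Count words by their stem so that variants share one total."""
    return Counter(toy_stem(w) for w in words)

counts = count_stems(["Happy", "happier", "Happiest", "happiness", "calm"])
print(counts.most_common())
# [('happ', 4), ('calm', 1)]
```

The stem itself ("happ") is only a grouping key; a readable label such as “Happy” would be chosen separately for reporting.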

Once this was complete, we took the most popular stemmed words and created new images from the lists of colours whose descriptions had used those words. These images were again subjected to K-Means clustering to extract the three centroid colours for each word.
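A minimal sketch of the clustering step follows, assuming the colours linked to a word are available as a list of RGB tuples (here treated directly as the "pixels" of the image). The sample values, the evenly spaced starting centroids and the fixed iteration cap are illustrative choices, not details taken from the study, and the function expects at least `k` colours.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two RGB tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k=3, iterations=20):
    """Plain K-Means on RGB tuples; returns k centroid colours."""
    # Deterministic start: pick k points spread evenly through the input.
    step = max(1, len(points) // k)
    centroids = [points[i * step] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(ch) / len(members) for ch in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids

# Hypothetical colours associated with one word.
pixels = [
    (240, 190, 2), (236, 186, 6), (244, 194, 4),
    (20, 40, 200), (24, 44, 204), (16, 36, 196),
    (250, 250, 250), (246, 246, 246), (254, 254, 254),
]
palette = [tuple(round(c) for c in centroid) for centroid in kmeans(pixels)]
print(palette)
# [(240, 190, 4), (20, 40, 200), (250, 250, 250)]
```

In practice a library implementation with randomised initialisation would usually be preferred; the fixed start here simply keeps the example reproducible.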

Most frequent stemmed words used in descriptions: Happy (1,305), Calm (1,257), Bright (1,098), Warm (899), Sky (791), Sea (660), Fresh (543), Vibrant (542), Summer (491), Beautiful (414), Day (373), Deep (357), Shade (349), Rich (322), Eye (316), Time (291), Life (288), Cool (284), Nature (266), Bold (252), Ocean (250), Pretty (239), Dark (224), Cheerful (215), Peaceful (202), Sunny (198), Fun (194), Perfect (190), Spring (187), Soothe (186), Wear (181), Great (180), Warmth (180), Light (179), Water (173), Strong (168), Smile (166), Work (161), Positive (156).

To read the full report, contact G . F Smith to order a physical copy.


Ian Makgill is the Founder and Analyst of Spend Network, a company that specialises in combining complex and inconsistent data sets, cleansing and linking the data.