Variable Length, Zipf's Law and Density Functions
When coding, variables should be named based on both their frequency of use and the length of the context in which they appear. Variables, structures, and other constructs should have name lengths that follow Zipf's law as it applies to word length and frequency. Please glance at those references before continuing.
When these laws of linguistic use are ignored, readability is demonstrably diminished.
Consider a variable, the hated "x", used 100 times in a file. If "x" is used 100 times in 40 lines of mathematical manipulation, such as in a digest function, this would be acceptable: the high frequency of use argues strictly against a longer name. If, however, it were spread across 10,000 lines of code, a far longer name would be needed.
Of course, suppose that function was embedded in 10,000 lines of code. Does this change the frequency of x? No. One important rule is that, for the purposes of linguistic calculation, the frequency of any variable's usage should be computed relative to the scope of its use.
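That scope-relative frequency can be made concrete. Here is a minimal sketch, assuming a crude per-line occurrence count is enough; the usage_density function and its inputs are hypothetical illustrations, not a real tool:

```python
import re

def usage_density(identifier, scope_lines):
    """Occurrences of `identifier` per line within its enclosing scope."""
    pattern = re.compile(r"\b" + re.escape(identifier) + r"\b")
    uses = sum(len(pattern.findall(line)) for line in scope_lines)
    return uses / max(len(scope_lines), 1)

# The same variable reads very differently in a dense 40-line digest
# function versus sparse use across 10,000 lines:
dense = usage_density("x", ["x = x + x"] * 40)        # 3 uses per line
sparse = usage_density("x", ["x = 1"] + [""] * 9999)  # 0.0001 uses per line
```

Under this view, "x" in the dense block has an enormous local frequency and earns its one-letter name, while the sparse case calls for something longer.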
As a demonstration, cluttering a short for i = 1 to 10 ... loop with for iterationIndex = 1 to 10 can make code nigh unreadable.
An operation can often be obscured by foolish verbosity:
for iterationIndex = 1 to 10
    countArray[iterationIndex] = sizeArray[iterationIndex] * factorArray[iterationIndex] + computeDensity(histogram, iterationIndex)
Versus
for i = 1 to 10
    countArray[i] = sizeArray[i] * factorArray[i] + computeDensity(histogram, i)
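The short-name version above can be sketched as runnable Python; the arrays and the compute_density helper are hypothetical stand-ins chosen for illustration:

```python
def compute_density(histogram, i):
    # Hypothetical density: this bin's share of the histogram's total mass.
    total = sum(histogram)
    return histogram[i] / total if total else 0.0

sizeArray = [2, 3, 4]
factorArray = [10, 10, 10]
histogram = [1, 2, 1]

# With "i" as the index, the arithmetic stays visible at a glance.
countArray = [sizeArray[i] * factorArray[i] + compute_density(histogram, i)
              for i in range(len(sizeArray))]
# countArray == [20.25, 30.5, 40.25]
```

Substituting iterationIndex for i throughout adds nothing but length: within this tiny scope, the index is used so frequently that the one-letter name is the Zipfian choice.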
Likewise, stuffing common prefixes onto sets of variables in order to lengthen them can hide their meaning within a limited scope:
for i = 1 to 10
    wordSuffixCountArray[i] = wordSuffixSizeArray[i] * wordSuffixFactorArray[i] + ComputeDensityForWordSuffix(currentWordSuffixHistogram, i)
Clearly, a minimal set of descriptive, distinct terms within the scope of use should be used instead.
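As a hypothetical sketch of that principle: inside a scope that handles only word suffixes, the shared "wordSuffix" prefix adds length but no information, so the prefixed loop above collapses to short, distinct names (the helper and data here are invented for illustration):

```python
def computeDensity(histogram, i):
    # Hypothetical density helper, mirroring the earlier example.
    total = sum(histogram)
    return histogram[i] / total if total else 0.0

# Within a suffix-only scope, the "wordSuffix" prefix is implied:
size = [4, 8]        # formerly wordSuffixSizeArray
factor = [2, 2]      # formerly wordSuffixFactorArray
histogram = [1, 3]   # formerly currentWordSuffixHistogram

count = [size[i] * factor[i] + computeDensity(histogram, i)
         for i in range(len(size))]
# count == [8.25, 16.75]
```

The names remain distinct from one another, which is all the scope requires; the dropped prefix would only matter if suffix and non-suffix data were mixed in the same scope.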
Recently, software engineering pundits seem to have taken the stance that there is no such thing as a variable name that is too long. This is false. Numerous linguistic studies on readability, word length, and frequency bear that out, as does common sense. Maximum clarity can be achieved in coding languages by following the same statistical distributions that occur in natural languages.