Qualitative variation

An index of qualitative variation (IQV) is a measure of statistical dispersion in nominal distributions. There are a variety of these, but they have been relatively little-studied in the statistics literature. The simplest is the variation ratio, while more complex indices include the information entropy.

Properties

There are several types of indexes used for the analysis of nominal data. Several are standard statistics that are used elsewhere - range, standard deviation, variance, mean deviation, coefficient of variation, median absolute deviation, interquartile range and quartile deviation.

In addition to these, several statistics have been developed with nominal data in mind. A number have been summarized and devised by Wilcox (Wilcox 1967), (Wilcox 1973), who requires the following standardization properties to be satisfied:

Variation varies between 0 and 1.
Variation is 0 if and only if all cases belong to a single category.
Variation is 1 if and only if cases are evenly divided across all categories.^[1]

In particular, the value of these standardized indices does not depend on the number of categories or number of samples.

For any index, the closer to uniform the distribution, the larger the variance, and the larger the differences in frequencies across categories, the smaller the variance.

Indices of qualitative variation are then analogous to information entropy, which is minimized when all cases belong to a single category and maximized in a uniform distribution. Indeed, information entropy can be used as an index of qualitative variation.

One characterization of a particular index of qualitative variation (IQV) is as a ratio of observed differences to maximum differences.

Wilcox's indexes

Wilcox gives a number of formulae for various indices of QV (Wilcox 1973). The first, which he designates DM for "Deviation from the Mode", is a standardized form of the variation ratio and is analogous to variance as deviation from the mean.

ModVR

The formula for the variation around the mode (ModVR) is derived as follows:

M=\sum _{{i=1}}^{K}(f_{m}-f_{i})

where f_m is the modal frequency, K is the number of categories and f_i is the frequency of the i^th group.

This can be simplified to

M=Kf_{m}-N

where N is the total size of the sample.

Freeman's index (or variation ratio) is^[2]

v=1-{\frac {f_{m}}{N}}

This is related to M as follows:

{\frac {({\frac {f_{m}}{N}})-{\frac {1}{K}}}{{\frac {N}{K}}{\frac {(K-1)}{N}}}}={\frac {M}{N(K-1)}}

The ModVR is defined as

ModVR=1-{\frac {Kf_{m}-N}{N(K-1)}}={\frac {K(N-f_{m})}{N(K-1)}}={\frac {Kv}{K-1}}

where v is Freeman's index.

Low values of ModVR correspond to small amounts of variation and high values to larger amounts of variation.

When K is large, ModVR is approximately equal to Freeman's index v.
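As an illustration, the two formulae above can be sketched in Python. The function names and the example counts are hypothetical and chosen only for this sketch; the input is assumed to be a list of category frequencies with K of at least 2.

```python
def variation_ratio(freqs):
    """Freeman's index v = 1 - f_m / N for a list of category counts."""
    n = sum(freqs)
    return 1 - max(freqs) / n


def mod_vr(freqs):
    """ModVR = K * v / (K - 1), where K is the number of categories."""
    k = len(freqs)
    return k * variation_ratio(freqs) / (k - 1)


counts = [10, 5, 5]  # hypothetical counts for K = 3 categories, N = 20
print(variation_ratio(counts))  # 0.5
print(mod_vr(counts))           # 0.75
```

With all cases in a single category the sketch returns 0, and with equal counts it returns 1 (up to floating-point rounding), in line with the standardization properties listed earlier.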

RanVR

This is based on the range around the mode. It is defined to be

RanVR=1-{\frac {f_{m}-f_{l}}{f_{m}}}={\frac {f_{l}}{f_{m}}}

where f_m is the modal frequency and f_l is the lowest frequency.

AvDev

This is an analog of the mean deviation. It is defined as the arithmetic mean of the absolute differences of each value from the mean.

AvDev=1-{\frac {1}{2N}}{\frac {K}{K-1}}\sum _{{i=1}}^{K}|f_{i}-{\frac {N}{K}}|

MNDif

This is an analog of the mean difference - the average of the differences of all the possible pairs of variate values, taken regardless of sign. The mean difference differs from the mean and standard deviation because it is dependent on the spread of the variate values among themselves and not on the deviations from some central value.^[3]

MNDif=1-{\frac {1}{N(K-1)}}\sum _{{i=1}}^{{K-1}}\sum _{{j=i+1}}^{K}|f_{i}-f_{j}|

where f_i and f_j are the i^th and j^th frequencies respectively.

The MNDif is the Gini coefficient applied to qualitative data.
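The RanVR, AvDev and MNDif formulae above might be computed as in the following Python sketch. The function names are hypothetical, and the input is assumed to be a list of all K category frequencies (including any zero counts), with K of at least 2.

```python
from itertools import combinations


def ran_vr(freqs):
    """RanVR = f_l / f_m: lowest frequency over modal frequency."""
    return min(freqs) / max(freqs)


def av_dev(freqs):
    """AvDev = 1 - (1 / 2N) * (K / (K - 1)) * sum |f_i - N / K|."""
    n, k = sum(freqs), len(freqs)
    total = sum(abs(f - n / k) for f in freqs)
    return 1 - total * k / (2 * n * (k - 1))


def mn_dif(freqs):
    """MNDif = 1 - sum over pairs i < j of |f_i - f_j| / (N (K - 1))."""
    n, k = sum(freqs), len(freqs)
    total = sum(abs(fi - fj) for fi, fj in combinations(freqs, 2))
    return 1 - total / (n * (k - 1))
```

For the hypothetical counts [10, 5, 5], RanVR is 0.5, while AvDev and MNDif both come out at approximately 0.75.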

VarNC

This is an analog of the variance.

VarNC=1-{\frac {1}{N^{2}}}{\frac {K}{(K-1)}}\sum (f_{i}-{\frac {N}{K}})^{2}

It is the same index as Mueller and Schuessler's Index of Qualitative Variation^[4] and Gibbs' M2 index.

It is distributed as a chi square variable with K - 1 degrees of freedom.^[5]
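A short Python sketch of VarNC follows; the function name is hypothetical. Expanding the squared deviations shows that the result equals K/(K − 1) multiplied by one minus the sum of squared proportions, which is how the sketch can be checked.

```python
def var_nc(freqs):
    """VarNC = 1 - (1 / N^2) * (K / (K - 1)) * sum (f_i - N / K)^2."""
    n, k = sum(freqs), len(freqs)
    squared_dev = sum((f - n / k) ** 2 for f in freqs)
    return 1 - squared_dev * k / (n ** 2 * (k - 1))


counts = [10, 5, 5]  # hypothetical counts
proportions = [f / sum(counts) for f in counts]
check = len(counts) / (len(counts) - 1) * (1 - sum(p ** 2 for p in proportions))
print(abs(var_nc(counts) - check) < 1e-12)  # True
```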

StDev

Wilcox has suggested two versions of this statistic.

The first is based on AvDev.

StDev_{1}=1-{\sqrt {{\frac {\sum _{{i=1}}^{K}(f_{i}-{\frac {N}{K}})^{2}}{(N-{\frac {N}{K}})^{2}+(K-1)({\frac {N}{K}})^{2}}}}}

The second is based on MNDif

StDev_2 = 1 - \sqrt{ \frac{ \sum^{ K - 1 }_{ i = 1 } \sum^K_{j = i + 1 } ( f_i - f_j )^2 }{ N^2 ( K - 1 )} }

HRel

This index was originally developed by Claude Shannon for use in specifying the properties of communication channels.

HRel={\frac {-\sum p_{i}\log _{2}p_{i}}{\log _{2}K}}

where p_i = f_i / N.

This is equivalent to the information entropy divided by $\log_2(K)$ and is useful for comparing relative variation between frequency tables of different sizes.
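A minimal Python sketch of HRel is given below. The function name is hypothetical, and the sketch assumes the usual convention that a category with zero frequency contributes nothing to the entropy, while still counting towards K.

```python
import math


def h_rel(freqs):
    """Shannon entropy (base 2) of the counts divided by log2(K)."""
    n, k = sum(freqs), len(freqs)
    # Assumption: 0 * log(0) is treated as 0, so empty categories are skipped.
    probs = [f / n for f in freqs if f > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    return entropy / math.log2(k)


print(h_rel([5, 5, 5, 5]))  # 1.0 for a uniform distribution
```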

B index

Wilcox adapted a proposal of Kaiser^[6] based on the geometric mean and created the B index. The B index is defined as

B = 1 - \sqrt{ 1 - \left[ \sqrt[K]{ \prod_{ i = 1 }^K \frac{ f_i K }{ N } } \right]^2 }

R packages

Several of these indices have been implemented in the R language.^[7]

Gibbs' indices and related formulae

Gibbs et al. proposed six indexes.^[8]

M1

The unstandardized index (M1) (Gibbs 1975, p. 471) is

M1=1-\sum _{{i=1}}^{K}p_{i}^{2}

where K is the number of categories and $p_{i}=f_{i}/N$ is the proportion of observations that fall in a given category i.

M1 can be interpreted as one minus the likelihood that a random pair of samples will belong to the same category (Lieberson 1969, p. 851), that is, as the likelihood that a random pair falls in different categories. This index has also been referred to as the index of differentiation, the index of sustenance differentiation and the geographical differentiation index, depending on the context in which it has been used.

M2

A second index, M2^[9] (Gibbs 1975, p. 472), is:

M2={\frac {K}{K-1}}\left(1-\sum _{{i=1}}^{K}p_{i}^{2}\right)

where K is the number of categories and $p_{i}=f_{i}/N$ is the proportion of observations that fall in a given category i. The factor of ${\frac {K}{K-1}}$ is for standardization.

M1 and M2 can be interpreted in terms of variance of a multinomial distribution (Swanson 1976) (there called an "expanded binomial model"). M1 is the variance of the multinomial distribution and M2 is the ratio of the variance of the multinomial distribution to the variance of a binomial distribution.
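The M1 and M2 formulae can be sketched in Python as follows. The function names and example counts are hypothetical.

```python
def m1(freqs):
    """M1 = 1 - sum of squared proportions."""
    n = sum(freqs)
    return 1 - sum((f / n) ** 2 for f in freqs)


def m2(freqs):
    """M2 = K / (K - 1) * M1, rescaled so that the maximum is 1."""
    k = len(freqs)
    return k / (k - 1) * m1(freqs)


counts = [10, 5, 5]  # hypothetical counts, proportions 0.5, 0.25, 0.25
print(m1(counts))  # 0.625
print(m2(counts))  # 0.9375
```

Without the factor K/(K − 1), the largest value M1 can take is 1 − 1/K, which depends on the number of categories; this is the reason for the standardization in M2.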

M4

The M4 index is

M4={\frac {\sum _{{i=1}}^{K}|X_{i}-m|}{2\sum _{{i=1}}^{K}X_{i}}}

where m is the mean.

M6

The formula for M6 is

M6=K\left[1-{\frac {\sum _{{i=1}}^{K}|X_{i}-m|}{2N}}\right]

where K is the number of categories, X_i is the number of data points in the i^th category, N is the total number of data points, || is the absolute value (modulus) and m is the mean number of data points per category:

m={\frac {\sum _{{i=1}}^{K}X_{i}}{K}}={\frac {N}{K}}

This formula can be simplified to

M6=K\left[1-{\frac {\sum _{{i=1}}^{K}|p_{i}-{\frac {1}{K}}|}{2}}\right]

where p_i is the proportion of the sample in the i^th category.

In practice M1 and M6 tend to be highly correlated, which militates against their combined use.

Related indices

The sum

\sum _{{i=1}}^{K}p_{i}^{2}

has also found application. This is known as the Simpson index in ecology and as the Herfindahl index or the Herfindahl-Hirschman index (HHI) in economics. A variant of this is known as the Hunter–Gaston index in microbiology.^[10]

In linguistics and cryptanalysis this sum is known as the repeat rate. The index of coincidence (IC) is an unbiased estimator of this statistic:^[11]

IC=\sum {\frac {f_{i}(f_{i}-1)}{n(n-1)}}

where f_i is the count of the i^th grapheme in the text and n is the total number of graphemes in the text.
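The IC statistic can be sketched in Python as below. The function name is hypothetical, and the sketch assumes, purely for illustration, that each alphabetic character counts as one grapheme and that case is ignored.

```python
from collections import Counter


def index_of_coincidence(text):
    """IC = sum f_i (f_i - 1) / (n (n - 1)) over the letters of a text."""
    # Assumption: graphemes are single alphabetic characters, case-folded.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    n = len(letters)
    counts = Counter(letters)
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))


print(index_of_coincidence("aaaa"))  # 1.0: every pair of letters coincides
print(index_of_coincidence("abcd"))  # 0.0: no pair of letters coincides
```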

M1

The M1 statistic defined above has been proposed several times in a number of different settings under a variety of names. These include Gini's index of mutability,^[12] Simpson's measure of diversity,^[13] Bachi's index of linguistic homogeneity,^[14] Mueller and Schuessler's index of qualitative variation,^[15] Gibbs and Martin's index of industry diversification,^[16] Lieberson's index^[17] and Blau's index in sociology, psychology and management studies.^[18] The formulations of all these indices are identical.

Simpson's D is defined as

D=1-\sum _{{i=1}}^{K}{{\frac {n_{i}(n_{i}-1)}{n(n-1)}}}

where n is the total sample size and n_i is the number of items in the i^th category.

For large n we have

D\sim 1-\sum _{{i=1}}^{K}p_{i}^{2}

Another statistic that has been proposed is the coefficient of unalikeability, which ranges between 0 and 1.^[19]

u={\frac {\sum _{{i\neq j}}c(x_{i},x_{j})}{n^{2}-n}}

where n is the sample size, the sum runs over all ordered pairs of distinct observations, and c(x,y) = 1 if x and y are unalike and 0 otherwise.

For large n we have

u\sim 1-\sum _{{i=1}}^{K}p_{i}^{2}

where K is the number of categories.

Another related statistic is the quadratic entropy

H^{2}=2\left(1-\sum _{{i=1}}^{K}p_{i}^{2}\right)

which is itself related to the Gini index.

M2

Greenberg's monolingual non weighted index of linguistic diversity^[20] is the M2 statistic defined above.

M7

Another index – the M7 – was created based on the M4 index of Gibbs et al.^[21]

M7={\frac {\sum _{{i=1}}^{K}\sum _{{j=1}}^{L}|R_{{ij}}-R|}{2\sum _{{i=1}}^{K}\sum _{{j=1}}^{L}R_{{ij}}}}

where

R_{{ij}}={\frac {O_{{ij}}}{E_{{ij}}}}={\frac {O_{{ij}}}{n_{i}p_{j}}}

and

R={\frac {\sum _{{i=1}}^{K}\sum _{{j=1}}^{L}R_{{ij}}}{\sum _{{i=1}}^{K}n_{i}}}

where K is the number of categories, L is the number of subtypes, O_ij and E_ij are the number observed and expected respectively of subtype j in the i^th category, n_i is the number in the i^th category and p_j is the proportion of subtype j in the complete sample.

Note: This index was designed to measure women's participation in the work place: the two subtypes it was developed for were male and female.

Other single sample indices

These indices are summary statistics of the variation within the sample.

Berger–Parker index

The Berger–Parker index equals the maximum $p_{i}$ value in the dataset, i.e. the proportional abundance of the most abundant type.^[22] This corresponds to the weighted generalized mean of the $p_{i}$ values when q approaches infinity, and hence equals the inverse of true diversity of order infinity ($1/{}^{\infty }\!D$).

Brillouin index of diversity

This index is strictly applicable only to entire populations rather than to finite samples. It is defined as

I_B = \frac{ \log( N! ) - \sum_{ i = 1 }^K ( \log( n_i! ) ) }{ N }

where N is total number of individuals in the population, n_i is the number of individuals in the i^th category and N! is the factorial of N. Brillouin's index of evenness is defined as

E_{B}=I_{B}/I_{{B(\max )}}

where I_B(max) is the maximum value of I_B.
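A possible Python sketch of Brillouin's index and evenness follows. The function names are hypothetical. The log-factorials are computed with `math.lgamma`, since lgamma(x + 1) = log(x!). The sketch assumes that I_B(max) is obtained by spreading the N individuals over the K categories as evenly as integer counts allow; that choice is an assumption of this example and is not stated in the text above.

```python
import math


def brillouin(counts):
    """I_B = (log N! - sum log n_i!) / N, using natural logarithms."""
    n = sum(counts)
    log_factorials = sum(math.lgamma(c + 1) for c in counts)
    return (math.lgamma(n + 1) - log_factorials) / n


def brillouin_evenness(counts):
    """E_B = I_B / I_B(max), with I_B(max) from the most even allocation."""
    n, k = sum(counts), len(counts)
    # Assumption: the maximum is reached when counts differ by at most one.
    q, r = divmod(n, k)
    most_even = [q + 1] * r + [q] * (k - r)
    return brillouin(counts) / brillouin(most_even)
```

For two categories holding two individuals each, the index is log(6)/4 and the evenness is 1.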

Hill's diversity numbers

Hill suggested a family of diversity numbers^[23]

N_{a}=\left[\sum _{{i=1}}^{K}p_{i}^{a}\right]^{{\frac {1}{1-a}}}

For given values of a, several of the other indices can be computed:

a = 0: N_a = species richness
a = 1: N_a = exponential of Shannon's index (taken as the limit a → 1)
a = 2: N_a = 1/Simpson's index (without the small sample correction)
a = ∞: N_a = 1/Berger–Parker index

Hill also suggested a family of evenness measures

E_{{a,b}}={\frac {N_{a}}{N_{b}}}

where a > b.

Hill's E₄ is

$E_{4}={\frac {N_{2}}{N_{1}}}$

Hill's E₅ is

$E_{5}={\frac {N_{2}-1}{N_{1}-1}}$
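Hill's numbers and the evenness ratios built from them might be computed as in the following Python sketch. It assumes the standard form N_a = (Σ p_i^a)^(1/(1 − a)), with a = 1 handled as the limiting case exp(H) and a = ∞ as 1/max(p_i). The function name and the example proportions are hypothetical.

```python
import math


def hill_number(probs, a):
    """Hill number of order a for a list of category proportions."""
    probs = [p for p in probs if p > 0]
    if a == 1:
        # Limit a -> 1: exponential of the Shannon entropy (natural log).
        return math.exp(-sum(p * math.log(p) for p in probs))
    if math.isinf(a):
        # Limit a -> infinity: inverse of the largest proportion.
        return 1 / max(probs)
    return sum(p ** a for p in probs) ** (1 / (1 - a))


p = [0.5, 0.25, 0.25]  # hypothetical proportions
n1, n2 = hill_number(p, 1), hill_number(p, 2)
e4 = n2 / n1              # Hill's E4
e5 = (n2 - 1) / (n1 - 1)  # Hill's E5
```

For a uniform distribution every N_a equals K, so E₄ is 1; E₅ is then 1 as well whenever K is greater than 1.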

Margalef's index

$I_{{\mathrm {Marg}}}={\frac {S-1}{\log _{e}N}}$

where S is the number of data types in the sample and N is the total size of the sample.^[24]

Menhinick's index

I_{{\mathrm {Men}}}={\frac {S}{{\sqrt {N}}}}

where S is the number of data types in the sample and N is the total size of the sample.^[25]

In linguistics this index is identical to the Kuraszkiewicz index (Guiard index), where S is the number of distinct words (types) and N is the total number of words (tokens) in the text being examined.^[26]^[27] This index can be derived as a special case of the Generalised Torquist function.^[28]

Q statistic

This is a statistic invented by Kempton and Taylor^[29] and involves the quartiles of the sample. It is defined as

Q={\frac {{\frac {1}{2}}(n_{{R_{1}}}+n_{{R_{2}}})+\sum _{{j=R_{1}+1}}^{{R_{2}-1}}n_{j}}{\log(R_{2}/R_{1})}}

where R₁ and R₂ are the 25% and 75% quartiles respectively on the cumulative species curve, n_j is the number of species in the j^th category and n_Ri is the number of species in the class where R_i falls (i = 1 or 2).

Shannon–Wiener index

This is taken from information theory

H=\log _{e}N-{\frac {1}{N}}\sum n_{i}\log _{e}(n_{i})=-\sum p_{i}\log _{e}(p_{i})

where N is the total number in the sample, n_i is the number in the i^th category and p_i = n_i / N is the proportion in the i^th category.

In ecology where this index is commonly used, H usually lies between 1.5 and 3.5 and only rarely exceeds 4.0.

An approximate formula for the variance of H, whose square root gives the standard deviation (SD), is

\operatorname {var}(H)\approx {\frac {1}{N}}\left[\sum p_{i}[\log _{e}(p_{i})]^{2}-H^{2}\right]

where p_i is the proportion made up by the i^th category and N is the total in the sample.

A more accurate approximate value of the variance of H, var(H), is given by^[30]

\operatorname {var}(H)={\frac {\sum p_{i}[\log(p_{i})]^{2}-\left[\sum p_{i}\log(p_{i})\right]^{2}}{N}}+{\frac {K-1}{2N^{2}}}+{\frac {-1+\sum p_{i}^{2}-\sum p_{i}^{{-1}}\log(p_{i})+\sum p_{i}^{{-1}}\sum p_{i}\log(p_{i})}{6N^{3}}}

where N is the sample size and K is the number of categories.

A related index is the Pielou J defined as

J={\frac {H}{\log _{e}(S)}}

One difficulty with this index is that S is unknown for a finite sample. In practice S is usually set to the maximum present in any category in the sample.

Rényi entropy

The Rényi entropy is a generalization of the Shannon entropy to other values of q than unity. It can be expressed:

{}^{q}H={\frac {1}{1-q}}\;\ln \left(\sum _{{i=1}}^{K}p_{i}^{q}\right)

which equals

{}^{q}H=\ln \left({1 \over {\sqrt[ {q-1}]{{\sum _{{i=1}}^{K}p_{i}p_{i}^{{q-1}}}}}}\right)=\ln({}^{q}\!D)

This means that taking the logarithm of true diversity based on any value of q gives the Rényi entropy corresponding to the same value of q.

The value of ${}^q\!D$ is also known as the Hill number.^[23]
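The relation between the Rényi entropy and true diversity can be checked numerically with a small Python sketch. The function name and example proportions are hypothetical; q = 1 is treated as the Shannon limit.

```python
import math


def renyi_entropy(probs, q):
    """Renyi entropy of order q (natural log) for category proportions."""
    probs = [p for p in probs if p > 0]
    if q == 1:
        # Limit q -> 1: Shannon entropy.
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** q for p in probs)) / (1 - q)


p = [0.5, 0.25, 0.25]  # hypothetical proportions
# Exponentiating the order-2 entropy recovers the inverse of sum p_i^2.
diversity_2 = math.exp(renyi_entropy(p, 2))
print(abs(diversity_2 - 1 / sum(x ** 2 for x in p)) < 1e-12)  # True
```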

McIntosh's D and E

D={\frac {N-{\sqrt {\sum _{{i=1}}^{K}n_{i}^{2}}}}{N-{\sqrt {N}}}}

where N is the total sample size and n_i is the number in the i^th category.

E={\frac {N-{\sqrt {\sum _{{i=1}}^{K}n_{i}^{2}}}}{N-{\frac {N}{{\sqrt {K}}}}}}

where K is the number of categories.

Fisher's alpha

This was the first index to be derived for diversity.^[31]

$K=\alpha \ln(1+{\frac {N}{\alpha }})$

where K is the number of categories and N is the number of data points in the sample. Fisher's α has to be estimated numerically from the data.

The expected number of individuals in the r^th category where the categories have been placed in increasing size is

E(n_{r})=\alpha {\frac {X^{r}}{r}}

where X is an empirical parameter lying between 0 and 1. While X is best estimated numerically an approximate value can be obtained by solving the following two equations

N={\frac {\alpha X}{1-X}}

K=-\alpha \ln(1-X)

where K is the number of categories and N is the total sample size.
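Since α has no closed-form solution, it is usually found numerically. The Python sketch below uses plain bisection on K = α ln(1 + N/α); the function name, the tolerance and the example values are hypothetical, and bisection is only one of several root-finding methods that could be used. The left-hand side increases with α and tends to N, so a root exists whenever 0 < K < N.

```python
import math


def fisher_alpha(n, k, tol=1e-10):
    """Solve k = alpha * ln(1 + n / alpha) for alpha by bisection."""
    if not 0 < k < n:
        raise ValueError("a solution requires 0 < K < N")

    def g(alpha):
        return alpha * math.log1p(n / alpha) - k

    lo, hi = 1e-9, 1.0
    while g(hi) < 0:           # widen the bracket until it contains the root
        hi *= 2
    while hi - lo > tol * hi:  # shrink the bracket to the relative tolerance
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


alpha = fisher_alpha(1000, 50)  # hypothetical sample: N = 1000, K = 50
x = 1000 / (1000 + alpha)       # X recovered from N = alpha X / (1 - X)
```

Substituting the recovered X into K = −α ln(1 − X) returns the original K, which gives a quick consistency check on the two equations above.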

The variance of α is approximately^[32]

\operatorname {var}(\alpha )={\frac {\alpha }{\ln(X)(1-X)}}

Strong's index

This index (D_w) is the distance between the Lorenz curve of species distribution and the 45 degree line. It is closely related to the Gini coefficient.^[33]

In symbols it is

D_{w}=\max _{i}\left[{\frac {c_{i}}{N}}-{\frac {i}{K}}\right]

where the maximum is taken over the K categories (or species) ranked from most to least abundant, N is the total number of data points and c_i is the cumulative total up to and including the i^th category.

Simpson's E

This is related to Simpson's D and is defined as

E={\frac {1}{D}}/K

where D is Simpson's D and K is the number of categories in the sample.

Smith & Wilson's indices

Smith and Wilson suggested a number of indices based on Simpson's D.

E_{1}={\frac {1-D}{1-{\frac {1}{K}}}}

E_{2}={\frac {\log _{e}(D)}{\log _{e}(K)}}

where D is Simpson's D and K is the number of categories.

Heip's index

E={\frac {e^{H}-1}{K-1}}

where H is the Shannon entropy and K is the number of categories.

This index is closely related to Sheldon's index which is

E={\frac {e^{H}}{K}}

where H is the Shannon entropy and K is the number of categories.

Camargo's index

This index was created by Camargo in 1993.^[34]

$E=1-\sum _{{i=1}}^{K}\sum _{{j=i+1}}^{K}{\frac {|p_{i}-p_{j}|}{K}}$

where K is the number of categories and p_i is the proportion in the i^th category.

Smith & Wilson's B

This index was proposed by Smith and Wilson in 1996.^[35]

B=1-{\frac {2}{\pi }}\arctan(\theta )

where θ is the slope of the log(abundance)-rank curve.

Nee, Harvey and Cotgreave's index

This is the slope of the log(abundance)-rank curve.

Bulla's E

There are two versions of this index - one for continuous distributions (E_c) and the other for discrete (E_d).^[36]

E_{c}={\frac {O-{\frac {1}{K}}}{1-{\frac {1}{K}}}}

E_{d}={\frac {O-{\frac {1}{K}}-{\frac {K-1}{N}}}{1-{\frac {1}{K}}-{\frac {K-1}{N}}}}

where

O=1-{\frac {1}{2}}\sum _{{i=1}}^{K}\left|p_{i}-{\frac {1}{K}}\right|

is the Schoener–Czekanowski index, p_i is the proportion in the i^th category, K is the number of categories and N is the sample size.

Horn's information theory index

This index (R_ik) is based on Shannon's entropy.^[37] It is defined as

R_{{ik}}={\frac {H_{\max }-H_{{\mathrm {obs}}}}{H_{\max }-H_{\min }}}

where

X=\sum x_{{ij}}

Y=\sum x_{{kj}}

H(X)=\sum {\frac {x_{{ij}}}{X}}\log {\frac {X}{x_{{ij}}}}

H(Y)=\sum {\frac {x_{{kj}}}{Y}}\log {\frac {Y}{x_{{kj}}}}

H_{\min }={\frac {X}{X+Y}}H(X)+{\frac {Y}{X+Y}}H(Y)

H_{\max }=\sum \left({\frac {x_{{ij}}}{X+Y}}\log {\frac {X+Y}{x_{{ij}}}}+{\frac {x_{{kj}}}{X+Y}}\log {\frac {X+Y}{x_{{kj}}}}\right)

H_{{\mathrm {obs}}}=\sum {\frac {x_{{ij}}+x_{{kj}}}{X+Y}}\log {\frac {X+Y}{x_{{ij}}+x_{{kj}}}}

In these equations x_ij and x_kj are the number of times the j^th data type appears in the i^th or k^th sample respectively.
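The quantities above can be assembled into R_ik as in the following Python sketch. It takes X and Y to be the totals of the i^th and k^th samples respectively, uses natural logarithms, and skips zero counts; the function names are hypothetical.

```python
import math


def _entropy_term(counts, total):
    """Sum of (x / total) * log(total / x) over the non-zero counts."""
    return sum(x / total * math.log(total / x) for x in counts if x > 0)


def horn_index(sample_i, sample_k):
    """R_ik = (H_max - H_obs) / (H_max - H_min) for two count vectors."""
    x_total, y_total = sum(sample_i), sum(sample_k)
    both = x_total + y_total
    h_min = (x_total / both * _entropy_term(sample_i, x_total)
             + y_total / both * _entropy_term(sample_k, y_total))
    h_max = _entropy_term(list(sample_i) + list(sample_k), both)
    pooled = [a + b for a, b in zip(sample_i, sample_k)]
    h_obs = _entropy_term(pooled, both)
    return (h_max - h_obs) / (h_max - h_min)
```

Two samples with identical composition give a value of 1, and two samples sharing no data types give 0, up to floating-point rounding.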

Rarefaction index

In a rarefied sample, a random subsample of n items is chosen from the total N items. Some groups may be absent from this subsample. Let $X_{n}$ be the number of groups still present in the subsample of n items. $X_{n}$ is less than K, the number of categories, whenever at least one group is missing from this subsample.

The rarefaction curve $f_{n}$ is defined as:

f_{n}=E[X_{n}]=K-{\binom {N}{n}}^{{-1}}\sum _{{i=1}}^{K}{\binom {N-N_{i}}{n}}

where N_i is the number of items in the i^th group. Note that 0 ≤ f(n) ≤ K.

Furthermore,

f(0)=0,\ f(1)=1,\ f(N)=K

Despite being defined at discrete values of n, these curves are most frequently displayed as continuous functions.^[38]

This index is discussed further in Rarefaction (ecology).
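The rarefaction curve can be evaluated directly from its definition, as in this Python sketch. The function name and group sizes are hypothetical; `math.comb` (Python 3.8 or later) returns 0 when the subsample is larger than the pool it is drawn from, which handles the boundary cases.

```python
from math import comb


def rarefaction(group_sizes, n):
    """Expected number of groups present in a random subsample of size n."""
    total = sum(group_sizes)
    k = len(group_sizes)
    # comb(total - size, n) counts the subsamples that miss a given group.
    missing = sum(comb(total - size, n) for size in group_sizes)
    return k - missing / comb(total, n)


sizes = [5, 3, 2]  # hypothetical group sizes, N = 10, K = 3
print(rarefaction(sizes, 0))   # 0.0
print(rarefaction(sizes, 1))   # 1.0
print(rarefaction(sizes, 10))  # 3.0
```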

Caswell's V

This is a z type statistic based on Shannon's entropy.^[39]

V={\frac {H-E(H)}{SD(H)}}

where H is the Shannon entropy, E(H) is the expected Shannon entropy for a neutral model of distribution and SD(H) is the standard deviation of the entropy. The standard deviation is estimated from the formula derived by Pielou

SD(H)={\sqrt {{\frac {1}{N}}\left[\sum p_{i}[\log _{e}(p_{i})]^{2}-H^{2}\right]}}

where p_i is the proportion made up by the i^th category and N is the total in the sample.

Lloyd & Ghelardi's index

This is

I_{{LG}}={\frac {K}{K'}}

where K is the number of categories and K' is the number of categories according to MacArthur's broken stick model yielding the observed diversity.

Average taxonomic distinctness index

This index is used to compare the relationship between hosts and their parasites.^[40] It incorporates information about the phylogenetic relationship amongst the host species.

S_{{TD}}=2{\frac {\sum \sum _{{i<j}}\omega _{{ij}}}{s(s-1)}}

where s is the number of host species used by a parasite and ω_ij is the taxonomic distinctness between host species i and j.

Index of qualitative variation

Several indices with this name have been proposed.

One of these is

$IQV={\frac {K(100^{2}-\sum _{i=1}^{K}p_{i}^{2})}{100^{2}(K-1)}}={\frac {K}{K-1}}(1-\sum _{i=1}^{K}(p_{i}/100)^{2})$

where K is the number of categories and p_i is the percentage of the sample that lies in the i^th category.

Indices for comparison of two or more data types within a single sample

Several of these indexes have been developed to document the degree to which different data types of interest may coexist within a geographic area.

Index of dissimilarity

Let A and B be two types of data item. Then the index of dissimilarity is

D={\frac {1}{2}}\sum _{{i=1}}^{K}\left|{\frac {A_{i}}{A}}-{\frac {B_{i}}{B}}\right|

where

A=\sum _{{i=1}}^{K}A_{i}

B=\sum _{{i=1}}^{K}B_{i}

A_i is the number of data type A at sample site i, B_i is the number of data type B at sample site i, K is the number of sites sampled and || is the absolute value.

This index, usually denoted D, is closely related to the Gini index.^[41]

This index is biased, as its expectation under a uniform distribution is greater than 0.

A modification of this index has been proposed by Gorard and Taylor.^[42] Their index (GT) is

GT=D\left(1-{\frac {A}{A+B}}\right)
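Both D and GT can be sketched in a few lines of Python. The function names and the site counts are hypothetical; each list holds the counts of one data type at the K sample sites, in the same site order.

```python
def dissimilarity(a_counts, b_counts):
    """D = 1/2 * sum |A_i / A - B_i / B| over the K sites."""
    a, b = sum(a_counts), sum(b_counts)
    return 0.5 * sum(abs(ai / a - bi / b)
                     for ai, bi in zip(a_counts, b_counts))


def gorard_taylor(a_counts, b_counts):
    """GT = D * (1 - A / (A + B))."""
    a, b = sum(a_counts), sum(b_counts)
    return dissimilarity(a_counts, b_counts) * (1 - a / (a + b))


# Hypothetical two-site example in which the two types never share a site.
print(dissimilarity([10, 0], [0, 10]))  # 1.0
print(gorard_taylor([10, 0], [0, 10]))  # 0.5
```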

Index of segregation

The index of segregation (IS)^[43] is

IS={\frac {1}{2}}\sum _{{i=1}}^{K}\left|{\frac {A_{i}}{A}}-{\frac {t_{i}-A_{i}}{T-A}}\right|

where

A=\sum _{{i=1}}^{K}A_{i}

T=\sum _{{i=1}}^{K}t_{i}

and K is the number of units, A_i is the number of data type A in unit i and t_i is the total number of all data types in unit i.

Hutchen's square root index

This index (H) is defined as^[44]

H=1-\sum _{{i=1}}^{K}\sum _{{j=1}}^{i}{\sqrt {p_{i}p_{j}}}

where p_i is the proportion of the sample composed of the i^th variate.

Lieberson's isolation index

This index ( L_xy ) was invented by Lieberson in 1981.^[45]

L_{{xy}}={\frac {1}{N}}\sum _{{i=1}}^{K}{\frac {X_{i}Y_{i}}{X_{{\mathrm {tot}}}}}

where X_i and Y_i are the variables of interest at the i^th site, K is the number of sites examined and X_tot is the total number of variate of type X in the study.

Bell's index

This index is defined as^[46]

I_{R}={\frac {p_{{xx}}-p_{x}}{1-p_{x}}}

where p_x is the proportion of the sample made up of variates of type X and

p_{{xx}}={\frac {\sum _{{i=1}}^{K}x_{i}p_{i}}{N_{x}}}

where N_x is the total number of variates of type X in the study, K is the number of samples in the study and x_i and p_i are the number of variates and the proportion of variates of type X respectively in the i^th sample.

Index of isolation

The index of isolation is

II=\sum _{{i=1}}^{K}{\frac {A_{i}}{A}}{\frac {A_{i}}{t_{i}}}

where K is the number of units in the study, A_i is the number of units of type A and t_i is the number of all units in the i^th sample.

A modified index of isolation has also been proposed

MII={\frac {II-{\frac {A}{T}}}{1-{\frac {A}{T}}}}

The MII lies between 0 and 1.
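A Python sketch of II and MII follows, taking A and T to be the totals of the A_i and t_i as in the index of segregation. The function names and counts are hypothetical, and samples with t_i = 0 are skipped to avoid division by zero.

```python
def isolation(a_counts, totals):
    """II = sum (A_i / A) * (A_i / t_i) over the K samples."""
    a = sum(a_counts)
    return sum((ai / a) * (ai / ti)
               for ai, ti in zip(a_counts, totals) if ti > 0)


def modified_isolation(a_counts, totals):
    """MII = (II - A / T) / (1 - A / T)."""
    share = sum(a_counts) / sum(totals)
    return (isolation(a_counts, totals) - share) / (1 - share)


# Hypothetical example: type A fills one sample and is absent from the other.
print(modified_isolation([10, 0], [10, 10]))  # 1.0
# Type A makes up the same share of both samples.
print(modified_isolation([5, 5], [10, 10]))   # 0.0
```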

Gorard's index of segregation

This index (GS) is defined as

GS={\frac {1}{2}}\sum _{{i=1}}^{K}|{\frac {A_{i}}{A}}-{\frac {t_{i}}{T}}|

where

A=\sum _{{i=1}}^{K}A_{i}

T=\sum _{{i=1}}^{K}t_{i}

and A_i and t_i are the number of data items of type A and the total number of items in the i^th sample.

Index of exposure

This index is defined as

IE=\sum _{{i=1}}^{K}{\frac {A_{i}}{A}}{\frac {B_{i}}{t_{i}}}

where

A=\sum _{{i=1}}^{K}A_{i}

and A_i and B_i are the number of types A and B in the i^th category and t_i is the total number of data points in the i^th category.

Ochiai index

This is a binary form of the cosine index.^[47] It is used to compare presence/absence data of two data types (here A and B). It is defined as

$O = \frac{ a }{ \sqrt{ ( a + b )( a + c ) } }$

where a is the number of sample units where both A and B are found, b is the number of sample units where A but not B occurs and c is the number of sample units where type B is present but not type A.

Kulczyński's coefficient

This coefficient was invented by Stanisław Kulczyński in 1927^[48] and is an index of association between two types (here A and B). It varies in value between 0 and 1. It is defined as

$K = \frac{ a }{ 2 } ( \frac{ 1 }{ a + b } + \frac{ 1 }{ a + c } )$

where a is the number of sample units where type A and type B are present, b is the number of sample units where type A but not type B is present and c is the number of sample units where type B is present but not type A.

Yule's Q

This index was invented by Yule in 1900.^[49] It concerns the association of two different types (here A and B). It is defined as

$Q = \frac{ ad - bc }{ ad + bc }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. Q varies in value between -1 and +1. In the ordinal case Q is known as the Goodman-Kruskal γ.

Because the denominator may be zero, Leinhert and Sporer have recommended adding 1 to each of a, b, c and d.^[50]

Yule's Y

This index is defined as

$Y = \frac{ \sqrt{ ad } - \sqrt{ bc } }{ \sqrt{ ad } + \sqrt{ bc } }$

Baroni-Urbani-Buser coefficient

This index was invented by Baroni-Urbani and Buser in 1976.^[51] It varies between 0 and 1 in value. It is defined as

$BUB = \frac{ \sqrt{ ad } + a }{ \sqrt{ ad } + a + b + c }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. When d = 0, this index is identical to the Jaccard index.
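The presence/absence coefficients in this section share the same four counts, so they can be computed together, as in the Python sketch below. The function name and the example counts are hypothetical. The Jaccard index is included in its usual form a / (a + b + c) only to illustrate the d = 0 case; no zero-denominator handling is attempted.

```python
import math


def association_indices(a, b, c, d):
    """Binary association coefficients from the counts a, b, c and d."""
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    return {
        "ochiai": a / math.sqrt((a + b) * (a + c)),
        "kulczynski": a / 2 * (1 / (a + b) + 1 / (a + c)),
        "yule_q": (a * d - b * c) / (a * d + b * c),
        "yule_y": (root_ad - root_bc) / (root_ad + root_bc),
        "bub": (root_ad + a) / (root_ad + a + b + c),
        "jaccard": a / (a + b + c),
    }


values = association_indices(3, 1, 2, 0)  # hypothetical counts with d = 0
print(values["bub"] == values["jaccard"])  # True
```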