Various Similarity Metrics for Vector Data and Language Embeddings

RapidFork Technology
5 min read · Mar 16, 2024


tl;dr: feel free to skip the first section if you already know what vectors and embeddings are.

What’s a vector?

https://rapidfork.medium.com/vector-f10090c57c96

What’s an embedding?

https://rapidfork.medium.com/embedding-f311e619898d

Various Metrics

  • Cosine Similarity: This metric measures the cosine of the angle between two non-zero vectors of an inner product space and is widely used in high-dimensional, non-negative settings such as text analysis. In language embeddings, cosine similarity captures how close two words or documents are in direction, irrespective of their magnitude.
  • Euclidean Distance (L2 norm): This is the “ordinary” straight-line distance between two points in Euclidean space. In the context of vector data and language embeddings, it measures the actual distance between points (or vectors) and is often used in clustering algorithms like K-means. A short NumPy sketch of both metrics follows these two bullets.
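Below is a minimal NumPy sketch of the two metrics above. The vectors u and v are made-up toy embeddings, deliberately chosen so that they point in the same direction but differ in length, which is exactly the case where the two metrics disagree.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b (ignores magnitude)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Ordinary straight-line (L2) distance between a and b."""
    return float(np.linalg.norm(a - b))

# Toy "embeddings": v is u scaled by 2, so same direction, twice the length.
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])

print(cosine_similarity(u, v))   # 1.0   -> identical direction, size ignored
print(euclidean_distance(u, v))  # ~3.74 -> yet the points are clearly apart
```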

What’s a norm now?

Imagine you and your friends are in different parts of a park and you’re all planning to meet up at the ice cream stand. The “norm” is like your personal rule for measuring how far you have to walk. It isn’t a single rule, though; there are different kinds of “rules” you can choose from, depending on what’s important to you:

L1 Norm (Manhattan or Taxicab Norm): This is like deciding to walk only along the paths in the park, turning left or right as needed, but not cutting across the grass. The distance you measure using this rule is like adding up the total number of steps you take along all the paths to get to the meeting point.

L2 Norm (Euclidean Norm): This is like using a big ruler to stretch a straight line from where you are to the ice cream stand and walking directly in that straight line, even if it means cutting across the grass. The distance you measure here is the shortest possible one, like the straight line distance between you and the stand.

L∞ Norm (Infinity or Maximum Norm): This one is a bit different: under this rule, the distance is simply the larger of your two gaps. If you’re far to the east of the stand but only a little to the north, the measurement only counts that east-west gap; the smaller north-south gap doesn’t change it at all. To get closer under this rule, you have to shrink whichever gap is currently the biggest.
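For concreteness, here is a tiny NumPy sketch of all three norms applied to the same made-up displacement (3 blocks east and 4 blocks north of the ice cream stand):

```python
import numpy as np

# Made-up displacement to the ice cream stand: 3 blocks east, 4 blocks north.
step = np.array([3.0, 4.0])

l1   = np.linalg.norm(step, ord=1)       # 7.0 -> walk the paths: 3 + 4
l2   = np.linalg.norm(step, ord=2)       # 5.0 -> straight line across the grass
linf = np.linalg.norm(step, ord=np.inf)  # 4.0 -> only the biggest gap counts

print(l1, l2, linf)
```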

  • Manhattan Distance (L1 norm): Also known as “city block” distance, this metric measures the distance between two points on a grid, following strictly horizontal and/or vertical paths, akin to the streets of Manhattan. It is useful when the effect of outliers needs to be diminished and the dimensions are on comparable scales. (Several of the metrics in this list, this one included, are demonstrated in the SciPy sketch that follows the list.)
  • Jaccard Similarity: This is used for comparing the similarity and diversity of sample sets. It measures how many attributes are shared between two documents as a proportion of the total number of attributes across them. It’s especially useful when the data representation is binary (such as set-based or sparse high-dimensional data).
  • Hamming Distance: This measures the distance between two strings of equal length by counting the number of positions at which the corresponding symbols are different. It’s useful in scenarios where the similarity between two strings (binary strings, for example) needs to be assessed.
  • Pearson Correlation: While not a distance metric in the strict sense, Pearson correlation measures the linear correlation between two variables. In the context of language embeddings and vector data, it can help understand the linear similarity between different documents or word vectors.
  • Spearman’s Rank Correlation: Similar to Pearson, but it measures the monotonicity of the relationship between two variables rather than the linearity. This makes it particularly useful for rank-based, non-parametric analysis in language model evaluation.
  • Levenshtein Distance: Although more commonly used in string manipulation, this metric measures the minimum number of single-character edits required to change one word into the other. It’s useful in text analysis for tasks like spelling correction or similarity checks.
  • Angular Distance: This measures the angle between two vectors, which can be related to cosine similarity but provides an actual angular measure rather than the cosine of the angle.
  • Mahalanobis Distance: This is a measure of the distance between a point and a distribution. In the context of vector data, it’s useful for identifying outliers or performing cluster analysis when the data has varying scales and the variables are correlated.
  • Tanimoto Coefficient (Jaccard Coefficient for binary data): This is used to measure the similarity and diversity between two sets. It is similar to the Jaccard similarity but specifically adjusted for binary data vectors. This metric is widely used in cheminformatics to measure the similarity between chemical structures.
  • Dice’s Coefficient: This is another similarity measure related to Jaccard’s Index, but it gives more weight to the intersection of the two sets. It is calculated as twice the intersection of the sets divided by the sum of their sizes. It’s often used in ecological and biological data analysis.
  • Hellinger Distance: This is used in statistics to quantify the similarity between two probability distributions. It is related to the Bhattacharyya distance and is used in various applications, including text analysis and image processing.
  • Bhattacharyya Distance: This measures the similarity of two discrete or continuous probability distributions; it is closely related to the Hellinger distance and is often used in classification problems.
  • KL Divergence (Kullback-Leibler Divergence): This is a measure from information theory, representing the difference between two probability distributions over the same variable. It’s a non-symmetric measure and can be seen as the amount of information lost when one distribution approximates the other. This is especially useful in machine learning models and natural language processing for measuring how one language model diverges from another.
  • Wasserstein Distance (Earth Mover’s Distance): This metric is a measure of the distance between two probability distributions, which is interpreted as the minimum amount of work needed to transform one distribution into the other. It’s widely used in optimal transport problems, image recognition, and deep learning.
  • Canberra Distance: This is a numerical measure of the distance between pairs of points in a vector space that is particularly sensitive to differences in coordinates close to zero, because each coordinate’s difference is divided by the sum of the corresponding absolute values. It’s useful for high-dimensional data analysis and genetic data.
  • Chebyshev Distance: This is a distance measure defined as the greatest absolute difference between corresponding coordinates of two vectors. It’s also known as maximum value distance and is particularly useful in infinity-norm problems, grid-based pathfinding, and when considering the worst-case scenario across dimensions.
  • Minkowski Distance: A generalization of both the Euclidean distance and the Manhattan distance. It’s a parameterized metric that can be adjusted according to the problem requirements. When the parameter is set to 1, it becomes the Manhattan distance, and when it’s set to 2, it becomes the Euclidean distance.
  • Normalized Google Distance: Based on information content from search engine results, it measures the semantic similarity between words or phrases by utilizing the number of hits returned by a search engine for various queries.
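Most of the vector and distribution metrics listed above are available off the shelf in SciPy. The sketch below exercises a handful of them on made-up vectors and toy probability distributions; none of the arrays are real embeddings, and the Hellinger distance is computed by hand since SciPy does not ship a dedicated function for it.

```python
import numpy as np
from scipy.spatial import distance
from scipy.stats import pearsonr, spearmanr, entropy, wasserstein_distance

# Arbitrary example vectors standing in for embeddings.
a = np.array([1.0, 0.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 0.0, 3.0])

print("Manhattan:", distance.cityblock(a, b))           # L1 / city-block distance
print("Chebyshev:", distance.chebyshev(a, b))           # L-infinity: worst single coordinate
print("Minkowski p=3:", distance.minkowski(a, b, p=3))  # generalizes L1 (p=1) and L2 (p=2)
print("Canberra:", distance.canberra(a, b))             # per-coordinate weighted differences

# Boolean vectors for the set-style metrics.
x = np.array([1, 1, 0, 1, 0], dtype=bool)
y = np.array([1, 0, 0, 1, 1], dtype=bool)
print("Jaccard distance:", distance.jaccard(x, y))      # 1 - Jaccard similarity
print("Dice distance:", distance.dice(x, y))            # 1 - Dice coefficient
print("Hamming:", distance.hamming(x, y))               # fraction of differing positions

# Correlation-style measures (each returned alongside a p-value).
r, _ = pearsonr(a, b)
rho, _ = spearmanr(a, b)
print("Pearson r:", r, "Spearman rho:", rho)

# Distribution metrics on two toy probability distributions p and q.
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])
print("KL(p || q):", entropy(p, q))                     # non-symmetric divergence
print("Hellinger:", np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))
print("Wasserstein:", wasserstein_distance([0, 1, 2], [0, 1, 2], p, q))

# Mahalanobis needs the inverse covariance matrix of the underlying data.
data = np.random.default_rng(0).normal(size=(100, 4))
VI = np.linalg.inv(np.cov(data, rowvar=False))
print("Mahalanobis:", distance.mahalanobis(a, b, VI))
```

Angular distance, for reference, is just the arccosine of the cosine similarity (optionally divided by π to land in [0, 1]). Levenshtein distance is not part of SciPy; libraries such as python-Levenshtein or RapidFuzz provide it.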
