Simhash Algorithm FAQs
Q1. What value should we choose for ?
It’s obvious that larger the
Q2. Why do we need multiple random hyperplanes?
We know that a random hyperplane has a probability
From intuition, we know the more hyperplanes we have, the more accurate it is. However, below is a mathematical explanation.
This involves calculating the expectation and variance of several functions.
Step 1: Define a function
Step 2: Calculate the expectation and variance for
Step 3: Scale by a constant factor of
Step 4: Define a function
Step 5: Calculate the expectation and variance of
Step 6: Scale by a constant factor of
We can see that expectation remains the same for both situations, but the variance is different. If we are using only one hyperplane, then the variance is
Q3. Why are we not using random vectors?
You may notice that we are using random signed vectors (vectors with only 1 and -1) instead of completely random vectors. This is because it is quite costly to produce large amount of random numbers. Instead, we can treat hash functions as random signed vectors (hash functions can quickly produce random 0s and 1s, which can then be used for signed vectors).
On the other hand, this also causes a decrease in accuracy, since the probability that the hyperplane cuts the angle is no longer
How to use xiangsi
?
Xiangsi is a Pypi library designed for text similarity. It provides 4 algorithms: Cosine, Jaccard, Simhash and Minhash. We can install it with pip:
pip3 install xiangsi
If you just want to have some quick takeaway code, run the following.
import xiangsi as xs
xs.simhash("A mathematician found a solution to the problem.", "The problem was solved by a young mathematician.")
Otherwise, here are some more details about the calculation.
Different Weights:
We can set different weighting methods for Simhash.
Default weight (frequency of the words).
import xiangsi as xs xs.simhash("A mathematician found a solution to the problem.", "The problem was solved by a young mathematician.")
TF-IDF
For TF-IDF, we have to first construct the IDF corpus. This calculates the IDF for all the strings inside the corpus. We only need to do this once for multiple Simhash calculations, given that you are using the same IDF corpus.
import xiangsi as xs arg = [ "There was a time in his life when her rudeness would have set him over the edge.", "He would have raised his voice and demanded to speak to the manager.", "That was no longer the case. He barely reacted at all, letting the rudeness melt away without saying a word back to her. ", "A mathematician found a solution to the problem." "The problem was solved by a young mathematician." ] # IDF xs.weight = "TFIDF" # Set weight method as TFIDF xs.construct(arg) # Constructs the IDF corpus. We only need to do this once. xs.simhash("A mathematician found a solution to the problem.", "The problem was solved by a young mathematician.")
No weight (all words have weight 1)
import xiangsi as xs xs.weight = "None" xs.simhash("A mathematician found a solution to the problem.", "The problem was solved by a young mathematician.")
Different Variables:
We can also change the
import xiangsi as xs
# The value N. Default is 64.
xs.feature = 64