kiwirafe.blog

Simhash Algorithm FAQs

Q1. What value should we choose for ?

It’s obvious that larger the means better the accuracy. According to Google’s “Detecting Near-Duplicates for Web Crawling”, 64-bit fingerprints are sufficient for 8 billion webpages. Hence it is common practice to set to 64.

Q2. Why do we need multiple random hyperplanes?

We know that a random hyperplane has a probability of cutting through the angle between the two vectors. So why do we need multiple random hyperplanes?
From intuition, we know the more hyperplanes we have, the more accurate it is. However, below is a mathematical explanation.
This involves calculating the expectation and variance of several functions.

Step 1: Define a function for a single hyperplane. This function doesn’t have anything to do with the Simhash algorithm, we are only using it to calculate and compare expectation and variance.

Step 2: Calculate the expectation and variance for .

Step 3: Scale by a constant factor of for easier comparison:

Step 4: Define a function using multiple hyperplanes:

Step 5: Calculate the expectation and variance of :

Step 6: Scale by a constant factor of for comparison:

We can see that expectation remains the same for both situations, but the variance is different. If we are using only one hyperplane, then the variance is . However, if we are using multiple hyperplanes, then the variance is , which is smaller by factor . This proves that the more hyperplanes we use, the better accuracy it is (which also proves the statement in Q1, since ).

Q3. Why are we not using random vectors?

You may notice that we are using random signed vectors (vectors with only 1 and -1) instead of completely random vectors. This is because it is quite costly to produce large amount of random numbers. Instead, we can treat hash functions as random signed vectors (hash functions can quickly produce random 0s and 1s, which can then be used for signed vectors).

On the other hand, this also causes a decrease in accuracy, since the probability that the hyperplane cuts the angle is no longer . I haven’t been able to calculate how much the accuracy decreases mathematically. If you have thoughts about this, you can email me or answer this Stackoverflow question I posted.

How to use xiangsi?

Xiangsi is a Pypi library designed for text similarity. It provides 4 algorithms: Cosine, Jaccard, Simhash and Minhash. We can install it with pip:

pip3 install xiangsi

If you just want to have some quick takeaway code, run the following.

import xiangsi as xs
xs.simhash("A mathematician found a solution to the problem.", "The problem was solved by a young mathematician.")

Otherwise, here are some more details about the calculation.

Different Weights:
We can set different weighting methods for Simhash.

  1. Default weight (frequency of the words).

    import xiangsi as xs
    xs.simhash("A mathematician found a solution to the problem.", "The problem was solved by a young mathematician.")
    
  2. TF-IDF

    For TF-IDF, we have to first construct the IDF corpus. This calculates the IDF for all the strings inside the corpus. We only need to do this once for multiple Simhash calculations, given that you are using the same IDF corpus.

    import xiangsi as xs
    
    arg = [
        "There was a time in his life when her rudeness would have set him over the edge.",
        "He would have raised his voice and demanded to speak to the manager.",
        "That was no longer the case. He barely reacted at all, letting the rudeness melt away without saying a word back to her. ",
        "A mathematician found a solution to the problem."
        "The problem was solved by a young mathematician."
    ] # IDF 
    xs.weight = "TFIDF" # Set weight method as TFIDF
    xs.construct(arg) # Constructs the IDF corpus. We only need to do this once.
    xs.simhash("A mathematician found a solution to the problem.", "The problem was solved by a young mathematician.")
    
  3. No weight (all words have weight 1)

    import xiangsi as xs
    
    xs.weight = "None"
    xs.simhash("A mathematician found a solution to the problem.", "The problem was solved by a young mathematician.")
    

Different Variables:
We can also change the for Simhash.

import xiangsi as xs
# The value N. Default is 64.
xs.feature = 64