How to create a “Maybe you meant” in your searches (JS + PHP)

I know, you use Google as a search engine, of course you do! And because of that you have been impressed by the “maybe you meant” suggestion it sometimes shows when you misspell something in your search. As they say in the TV series: how did they do it?

Before doing it

Before showing you an easy way to do it and, later, how you can improve it, you must know what you need to do it or, at least, the way I do it.

To do it I use a PHP backend and a JavaScript frontend (in my case with React, but you can do it in vanilla JavaScript without problems once you understand the approach).

The idea

The idea is very simple:

  1. Take a phrase.
  2. Separate the words.
  3. Compare every word with something known to be good (some real words).
  4. If it’s a very close match, it’s a misspelling and the user meant the good word.

So, you have two main problems: first, separating the words of the phrase and, second, comparing them with something known to be good. Nothing more.
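The four steps above can be sketched in a few lines of JavaScript. This is only an illustration of the idea, not the code of the web app: the word list, the function names and the very rough similarity measure (share of positions holding the same character) are all assumptions made for the example.

```javascript
// Hypothetical list of known-good words; in practice it comes from your own data.
const knownWords = ['madrid', 'barcelona', 'valencia'];

// Steps 1 and 2: take a phrase and separate the words.
function splitWords(phrase) {
  return phrase.toLowerCase().split(/\s+/).filter((word) => word !== '');
}

// Placeholder similarity (0 to 100): share of positions with the same character.
// It only stands in for a real measure such as the one used later on.
function roughSimilarity(a, b) {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 100;
  let same = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    if (a[i] === b[i]) same++;
  }
  return (same / longest) * 100;
}

// Steps 3 and 4: compare every word with the good words and keep the closest
// one when it scores above the threshold; otherwise keep the original word.
function suggest(phrase, dictionary, threshold = 65) {
  return splitWords(phrase).map((word) => {
    if (dictionary.includes(word)) return word;
    let best = word;
    let bestScore = threshold;
    for (const candidate of dictionary) {
      const score = roughSimilarity(word, candidate);
      if (score > bestScore) {
        best = candidate;
        bestScore = score;
      }
    }
    return best;
  });
}
```

With this sketch, suggest('hotel madrit', knownWords) keeps "hotel" as typed and replaces "madrit" with "madrid".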

The backend

I use PHP as a backend, first, because I’m a full-stack PHP backend developer; second, because PHP has a small trick that can help you do it faster and more easily (but that doesn’t mean you can’t reproduce it in other languages without problems); and, third, because the web application where it lives is made in PHP.

You can also use a database, files (in whatever format you like) or simply big arrays. It all depends on your infrastructure and on how well it works for you.

The frontend

Everything can be done in the backend, but to be modern and build good-quality software (and, well, because server time is more expensive than client time) part of it is done on the frontend. Because the web app has been refactored in React, I used it.

How it’s done

On the backstage

In the PHP file, catch (by GET or by POST) the query or the text you want to analyze. In my case I wrote a syntactic analyzer library called analizadorsintactico.php (it’s in Spanish, sorry) with an object called analizadorsintactico and some methods. I use CodeIgniter, an MVC framework for PHP.

Here is the code (well, it’s not complete, it’s only a part of the controller):

public function busqueda() {
    // Function that takes a query, processes it, runs a search and returns JSON

    $datos["q"] = $this -> input -> post_get('q');

    // Extract the words of the query
    $datos["palabrasQuery"] = $this -> analizadorsintactico -> queryTexto($datos["q"]);

    // Database of good words
    $palabrasBD = $this -> palabras_model -> devuelvePalabras();

    // Words for "maybe you meant" (similitudes() is the library method shown further down)
    $datos["palabrasQueryQuisoDecir"] = $this -> analizadorsintactico -> similitudes($datos["palabrasQuery"], $palabrasBD);

    // Do the search using (in this case) a model that returns the results
    $datos["resultados"] = $this -> buscador_model -> realizeQuery($datos["palabrasQuery"]);

    // Change the header of the response to JSON
    header('Content-Type: application/json');
    // Print the JSON-encoded data
    echo json_encode($datos);
}

As I said, first we take (by GET or POST) the query. Second, we clean the query, because in natural language you use a lot of words that are not needed for a typical search, for example “for”, “the”, “I”, “you”… and the best way to do it is a regular expression. So a part of the analizadorsintactico.php library will be:

function queryTexto($string) {
    // Method that returns an array with all the words separated (empty if there are none)

    // Create the empty response array
    $trozos = array();

    // Use a regular expression to clean the string (Spanish language)
    $textoSinMierda = preg_replace("/(\b(a|e|o|u)\ )|(\ben\b)|(\bun\b)|(\bde(\b|l))|(\bqu(|é|e)\b)|(\b(a|e)l\b)|(\bell(o|a)(\b|s))|(\bla(\b|s))|(\blo(\b|s))|(\bante\b)|(\bo\b)|(\by\b)|(\bes\b)|(\bsu\b)|(\,|\.|\;)/", "", $string);
    // Now we "explode" the words
    $trozostmp2 = explode(" ", $textoSinMierda);
    // The user can type double white spaces, so we skip the empty pieces
    foreach ($trozostmp2 as $cacho) {
        if ($cacho != "") {
            // If it's not a white space we add it to the array
            array_push($trozos, $cacho);
        }
    }

    // Then return the new array of words
    return $trozos;
}
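If you prefer to do this cleaning on the client, or just find the regular expression hard to maintain, the same idea can be sketched with a list of stop words. This is an assumption-heavy alternative, not a port of queryTexto: the STOP_WORDS list is a short illustrative sample, queryWords is a made-up name, and punctuation is replaced by a space instead of being deleted.

```javascript
// Short, illustrative list of Spanish stop words; a real list would be longer.
const STOP_WORDS = new Set([
  'a', 'e', 'o', 'u', 'y', 'en', 'un', 'de', 'del', 'el', 'al',
  'la', 'las', 'lo', 'los', 'que', 'qué', 'es', 'su', 'ante',
]);

// Returns the meaningful words of a query as an array (empty if there are none).
function queryWords(text) {
  return text
    .toLowerCase()
    .replace(/[,.;]/g, ' ') // punctuation becomes a separator
    .split(/\s+/) // one or more white spaces between words
    .filter((word) => word !== '' && !STOP_WORDS.has(word));
}
```

For example, queryWords('Hoteles en  el centro de Madrid.') gives back the words hoteles, centro and madrid, even with the double space in the middle.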

Next (remember we are back in the controller now), in my case, we get the good words from a table in a database. Because I use MVC, I call a model to retrieve them all. There’s not much to say about this 🙂

After that we call the analizadorsintactico.php library again to compare the array of words we got before with the array of good words retrieved from the database. So, let’s look at another part of the analizadorsintactico.php file.

function similitudes($arrayDatos, $arrayConQuienComparar) {
    // Method that searches for similar words
    // Returns an array with the similar words in the same order, or false if there are none
    // Needs a private method called similitudes_sale

    // First, the return array
    $returnArray = array();

    // We go through the words of the query
    foreach ($arrayDatos as $rowCiudadesBusqueda) {
        // Test if it's empty because... who knows
        if ($rowCiudadesBusqueda != '') {
            // Compare it with every good word
            foreach ($arrayConQuienComparar as $rowCiudadesArray) {
                // We do the similarity comparison in our private method
                $resulttmp = $this -> similitudes_sale($rowCiudadesArray, $rowCiudadesBusqueda);
                // Another test: keep the good word only if the comparison passed
                if ($resulttmp != '') {
                    // Reuse the result instead of comparing a second time
                    array_push($returnArray, $resulttmp);
                }
            }
        }
    }

    // The return. This can be done better, I know
    if (!empty($returnArray)) {
        // There's data
        return $returnArray;
    }
    // There's no data
    return false;
}


private function similitudes_sale($origen, $destino) {
    // Function that returns the word if it's more than 65% similar according to similar_text
    // if not, it returns false

    // similar_text fills $perc with the similarity as a percentage
    similar_text(strtoupper($origen), strtoupper($destino), $perc);

    if ($perc > 65) {
        return $origen;
    }
    return false;
}

As you can see, it measures how similar the two words are and, if the similarity is above 65 percent, it returns the good word.

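The Levenshtein distance is the minimum number of single-character changes (insertions, deletions or substitutions) you must make to a word in order to convert it into another. PHP's similar_text function does not compute that distance: it is a recursive function that counts the characters the two words have in common and can also give you the result as a percentage. It worked faster for me than the other options I had seen, so I used it instead of the levenshtein function (which also exists in PHP), although the PHP manual notes that similar_text has cubic complexity, so measure it with your own data.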
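For reference, the Levenshtein distance can be computed with the classic dynamic programming table. The sketch below is a generic textbook version written for this article (the function name is mine), not the code PHP uses internally.

```javascript
// Levenshtein distance: minimum number of insertions, deletions and
// substitutions needed to turn string a into string b.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // previous[j] holds the distance between the first i-1 chars of a
  // and the first j chars of b.
  let previous = [];
  for (let j = 0; j <= b.length; j++) previous.push(j);

  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current.push(Math.min(
        previous[j] + 1, // deletion
        current[j - 1] + 1, // insertion
        previous[j - 1] + cost // substitution (or match)
      ));
    }
    previous = current;
  }
  return previous[b.length];
}
```

So levenshtein('madrit', 'madrid') is 1 (one substitution) and levenshtein('kitten', 'sitting') is 3.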

Knowing that I use similar_text as the similarity measure, you can reproduce it in other languages easily, either by porting it or by using a ratio based on the Levenshtein distance. Remember that this function is the basis of “maybe you meant…”.
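If you want to try the same kind of comparison outside PHP, here is a sketch in JavaScript of the approach the PHP manual describes for similar_text: find the longest common substring, then repeat on the pieces to its left and to its right, and turn the count into a percentage. The function names are mine and I have not checked it against PHP in every edge case, so take it as an approximation.

```javascript
// Counts matching characters: longest common substring first,
// then recursively on the left and right remainders.
function similarText(first, second) {
  if (first.length === 0 || second.length === 0) return 0;

  let pos1 = 0;
  let pos2 = 0;
  let max = 0;
  for (let i = 0; i < first.length; i++) {
    for (let j = 0; j < second.length; j++) {
      let k = 0;
      while (
        i + k < first.length &&
        j + k < second.length &&
        first[i + k] === second[j + k]
      ) {
        k++;
      }
      if (k > max) {
        max = k;
        pos1 = i;
        pos2 = j;
      }
    }
  }
  if (max === 0) return 0;

  return (
    max +
    similarText(first.slice(0, pos1), second.slice(0, pos2)) +
    similarText(first.slice(pos1 + max), second.slice(pos2 + max))
  );
}

// Similarity as a percentage: matching characters, doubled,
// over the combined length of both strings.
function similarPercent(first, second) {
  const total = first.length + second.length;
  if (total === 0) return 0;
  return (similarText(first, second) * 2 * 100) / total;
}
```

With this sketch, "MADRID" against "MADRIT" shares 5 characters out of 12 in total, which is about 83 percent, so it would pass the 65 percent test.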

This similitudes_sale method can be improved, for example, by passing the rate as an argument, because different words (depending on what they mean) need different rates. But this is up to you.

And the final stage in the controller is the easiest one. Having the array of words, do the search. Nothing more.
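As a small illustration of passing the rate as an argument, this sketch (in JavaScript, to keep the examples in one language) takes the minimum percentage as a parameter and picks it from the length of the word. The function names and the cut-off values are invented for the example; they are not tuned values from the web app.

```javascript
// Returns the good word when the similarity is above the minimum, false otherwise.
// minPercent defaults to 65, the fixed value used in the article.
function pickIfSimilar(goodWord, percent, minPercent = 65) {
  return percent > minPercent ? goodWord : false;
}

// Hypothetical rule: short words need a stricter rate, long words a looser one.
function minPercentFor(word) {
  if (word.length <= 4) return 80;
  if (word.length <= 8) return 65;
  return 55;
}
```

Under these made-up cut-offs, an 83 percent match for a six-letter word is accepted, while a 67 percent match for a three-letter word is rejected.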

On the stage

The frontend is easier, because all the hard work is done on the backend (and the most CPU-intensive part too). Remember that I use React.

So, first, do the fetch in componentDidMount(), the place React recommends for this kind of request, and update the state of the component with what you receive. (My original code used componentWillMount(), which React has since marked as legacy.)

componentDidMount() {

    // We take the query from the URL
    let datosBusqueda = window.location.search.substring(1).split('=')[1];

    // Decode it, because it's in the URL
    datosBusqueda = decodeURIComponent(datosBusqueda);

    // Do the fetch man 🙂 (encoding the query again so it's safe in the new URL)
    fetch('/index.php/components/busqueda/busqueda?q=' + encodeURIComponent(datosBusqueda))
      .then((respuesta) => respuesta.json())
      .then((respuestaJSON) => {
        // Changing the state of the component
        this.setState({
          resultados: respuestaJSON.resultados, // results
          query: respuestaJSON.q, // query
          palabrasQueryArreglada: respuestaJSON.palabrasQueryQuisoDecir, // nice words
          isLoading: false // the typical flag for knowing that it has finished
        });
      })
      .catch((error) => {
        // Error!! (the message says: "Sorry, there was an error while doing the search")
        alert('Lo sentimos\n\rHa habido un error al realizar la busqueda');
        throw new Error('Error en busqueda: ' + error);
      });
}

You can see that these are the variables in the JSON generated by the PHP controller 🙂 nothing more to say.

So, the component has two main things (well, three): the results (resultados), the original query (query) and the good-quality query (palabrasQueryArreglada).

With this you can compose the “Maybe you meant” in the render() method of the component by comparing the two “arrays” from the query and palabrasQueryArreglada.

render() {

 [....... other code for your render ........]

    // If there are "bad words" that have good words
    var quisoDecir = this.state.query.split(" ");
    var quisoDecirRespuesta;

    if (this.state.palabrasQueryArreglada.length > 0) {
      // We create the new, well-written query
      quisoDecir = quisoDecir.map((valor, indice) => {
        // We look for a good word for this word of the query
        for (let i = 0; i < this.state.palabrasQueryArreglada.length; i++) {

          if (this.state.palabrasQueryArreglada[i][valor] !== undefined) {
            // If it's not undefined, here it is and we must return it
            return this.state.palabrasQueryArreglada[i][valor];
          }
        }

        // If not, we return the original text
        return valor;
      });

    [....... other code for your render ........]

}

So, what we are doing with this (it can be done better, of course) is going through the original query word by word, checking whether the word is in the array where I put the good words; if it's not there, we return the original. Simple!

The other parts of the render() method are whatever you want to do, so it's better not to put them here 🙂
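The same word-by-word replacement can be pulled out of render() into a small function that is easy to test on its own. This is a sketch with a made-up name (buildSuggestion), and it assumes each entry of the good-words array is an object keyed by the original word, which is how the render() code reads it; it also accepts false, which is what the backend sends when nothing similar was found.

```javascript
// Builds the "maybe you meant" text from the original query and the corrections.
// fixedWords is assumed to be an array of objects like { madrit: 'madrid' },
// or false when the backend found no similar words.
function buildSuggestion(query, fixedWords) {
  const corrections = Array.isArray(fixedWords) ? fixedWords : [];
  return query
    .split(' ')
    .map((word) => {
      // First correction that knows this word, if any
      const match = corrections.find((entry) => entry[word] !== undefined);
      return match ? match[word] : word;
    })
    .join(' ');
}
```

For example, buildSuggestion('hotel madrit', [{ madrit: 'madrid' }]) returns 'hotel madrid', and with false instead of the array it returns the query unchanged.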

Future?

You can improve this in many ways. One is something I told you before: the percentage at which two words count as similar depends on the word itself, its size, the type of word and so on. This is what makes Google so special (along with its background computing power), but you can do it yourself with some ML (machine learning).

With ML you can adjust the required percentage that you pass to the similitudes_sale private method and be more accurate.

You can have a very big database (or array, or JSON) with a lot of words, not only the ones you think are most common, but remember that you will need more computing power for the response.
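One cheap way to keep a big word list manageable is to avoid comparing against every word. The sketch below groups the words by length and only returns candidates whose length is close to the typed word; two words with very different lengths can never be a close match anyway. The function names and the default gap of 2 are assumptions for the example, not something taken from the web app.

```javascript
// Groups the dictionary by word length so lookups can skip most of it.
function indexByLength(words) {
  const buckets = new Map();
  for (const word of words) {
    if (!buckets.has(word.length)) buckets.set(word.length, []);
    buckets.get(word.length).push(word);
  }
  return buckets;
}

// Returns only the words whose length differs by at most maxLengthGap.
function candidatesFor(word, buckets, maxLengthGap = 2) {
  const result = [];
  for (let len = word.length - maxLengthGap; len <= word.length + maxLengthGap; len++) {
    if (buckets.has(len)) result.push(...buckets.get(len));
  }
  return result;
}
```

You would build the index once with indexByLength(allWords) and then run the similarity test only on candidatesFor(typedWord, index).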

If you don't use PHP's similar_text, you can improve the way you compare two strings, for example with the Levenshtein distance.

Maybe it's faster to send JSON to the client and do it on the client side.

Or maybe it's faster to read some JSON files than to wait for your database (this depends on your IT infrastructure, of course).

And, of course, you can refactor my code to make it better (I know that my programming skills are not perfect). That way you can improve it even more 🙂

Whatever you do, here's the idea and how I do it... and it works 🙂

And, last, remember, this code is made for a web app I'm refactoring, so part of the code (the variables in this case) is in Spanish and I have translated the comments to English... be kind with it. You can find it on my GitHub.

Via: My LinkedIn
