Using Manticore/Sphinx Search with MySQL

I started using sphinxsearch for fulltext search a long time ago and I have been quite satisfied with it. A year ago I decided migrating to manticore which is an up-to-date, fully open source fork of sphinx maintained by original sphinx developers. In this article and following blog posts, I am going to talk about some tips, workarounds and how to use it efficiently in combination with mysql. For the rest of the article, I will use the name "manticore" only.

Using MySQL as Data Source

Manticore is well integrated with mysql, which means you can just set the configuration parameters and indexer does all the job. Of course, this is true for only fully (re)indexing. However, a common use case is, running a cron job once per every day/week to index all documents at once and inserting new documents one by one during each period. In this case, you must create a real-time index and modify it in your app using mysql protocol. More on this will be explained in the next article.

If you use myisam storage engine, you could easily use builtin fulltext capabilities of mysql/myisam; however, it is highly inefficient because of its locking behaviour as well as engine being outdated.

Let's continue with a simplified real-world example. Suppose that we have a mysql table in which we store documents for fulltext search purposes. Here is a product table that has an integer primary key, a variable length name and product category list.

CREATE TABLE `product`  (
  `id` int(0) UNSIGNED NOT NULL AUTO_INCREMENT,
  `name` VARCHAR(255) NOT NULL,
  `categories` text NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE = InnoDB;

# a table to keep some indexing information
CREATE TABLE `product_search_status`  (
  `id` varchar(30) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
  `value` bigint(20) UNSIGNED NOT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB;

We need to start configuring manticore by defining data source in configuration file. Assuming that we want to search in both name and categories, we need 2 fulltext indexes.

source product_data
{
        type        = mysql

        # mysql connection params
        sql_host  = localhost
        sql_user  = root
        sql_pass  = 123456
        sql_port  = 3306
        sql_db = testdb
        # mysql query to create manticore documents
        # we set unicode charset and wait_timeout to a high value to prevent connection timeout errors
        sql_query_pre = SET NAMES utf8
        sql_query_pre = SET SESSION wait_timeout=3600
        # we store the index time for information
        sql_query_pre = REPLACE INTO product_search_status (id, value) VALUES ('last_indexed_time', UNIX_TIMESTAMP())
        # we set start-end document ids so that manticore will know where to start and stop indexing 
        sql_query_range = SELECT MIN(id), MAX(id) FROM product
        sql_range_step = 10000
        # this is the main query to create documents
        sql_query = SELECT \
                         id, \
                         name AS name_ft, \
                         categories AS categories_ft, \
                         name \
                   FROM product \
                   WHERE id >= $start AND id <= $end
        # we store the most recent document id for information
        sql_query_post_index = REPLACE INTO product_search_status (id, value) VALUES ('last_indexed_id', $maxid)

        # the first column in above query must be **unique** document id
        # the others must be defined as follows unless they are fulltext columns, which means any column that is not defined here except id will be indexed for fulltext search.
        sql_attr_string = name
}

In the configuration above, I added product name twice to the query to get product name from the result set, so that I can response back to the client as fast as possible for better search experience. The reason behind this is that manticore does not store fulltext indexes in plain text, but in rather complex format, so it cannot return the text back. This is also why it can search billions of documents very fast.

In most scenarios there is a lot more to display in search results, in which case indexing a column twice is not a good idea. It would consume unnecessary disk space and memory. Instead, the document IDs should be retrieved from manticore to query mysql database to fetch rest of the data.

Indexing

Manticore has few configuration blocks such as source, index, indexer, searchd. We already defined the data source using source block, and it's time to use index to define a fulltext index based on product data.

index product
{
        path = /var/lib/manticore/data/product
        source = product_data
        morphology = none
        min_word_len  = 3
        min_prefix_len = 3
}

The configuration above is pretty simple and straightforward: Do not rely on any morphology (e.g. stemming), do not index words less than 3 chars to gain a bit space and performance, enable prefix/wildcard search (e.g. lap* -> laptop) and restrict it to minimum 3 chars. Now, the following command will (re)build the index based on product data.

$ indexer --rotate product
using config file '/etc/sphinxsearch/sphinx.conf'...
indexing index 'product'...
collected 829 docs, 0.2 MB
creating lookup: 0.8 Kdocs, 100.0% done
creating histograms: 0.8 Kdocs, 100.0% done
sorted 0.0 Mhits, 100.0% done
total 829 docs, 223480 bytes
total 0.416 sec, 536143 bytes/sec, 1988.82 docs/sec
total 4 reads, 0.000 sec, 53.1 kb/call avg, 0.0 msec/call avg
total 15 writes, 0.000 sec, 26.3 kb/call avg, 0.0 msec/call avg
rotating indices: successfully sent SIGHUP to searchd (pid=13968).

The Unicode Characters

Unicode characters on searches will be ignored, if not configured correctly. If you want to support searching some language specific characters the right way, you must use a charset table, otherwise they will be ignored.

An example would be "şçğüöıŞÇĞÜÖI" chars in tr-TR (Turkish) language. When you store these chars in your index, by default, your search query including these chars will respond you an empty or wrong resultset, because the words containing these chars in your query will not be processed.

There was only one workaround for this problem until manticore v3: creating a charset table having a unicode list of special characters. This is quite useful for few languages, but otherwise it would be a mess. Since version 3.0, fortunately there is now cjk and non_cjk values that handles almost all the charsets internally. An example of this can be seen as below.

index product
{
        morphology = none
        min_word_len  = 3
        min_prefix_len = 3
        # manticore v3.0+
        charset_table = non_cjk
        # before v3.0
        regexp_filter = (ı|I) => İ
        charset_table = 0..9, a..h, j..z, A..H->a..h, J..Z->j..z, \
                        U+131->U+049, U+049, \
                        U+069->U+130, U+130, \
                        U+0FC->U+0DC, U+0DC, \
                        U+15F->U+15E, U+15E, \
                        U+0F6->U+0D6, U+0D6, \
                        U+0E7->U+0C7, U+0C7, \
                        U+11F->U+11E, U+11E, \
                        U+40, U+2E, U+5F, U+2B, U+2D, U+25, U+23
}

U+131->U+049, U+049 means treat lowercase dotless "ı" as uppercase latin "I", and ignore itself.

The use of regexp_filter is like an edge case caused by Turkish language. Unlike in other latin languages, uppercase counterparts of "i" and "ı" in this language are "İ" and "I" with respectively. Here we can use regexp_filter = (ı|I) => İ to make all four of "iİıI" interchangeable similar to English.

Querying

The following command will start mysql console for manticore with default port (as long as mysql client is installed on the same machine).

mysql -h0 -P9306
  • Find all products whose name contain words starting with 'laptop':
mysql> SELECT * FROM `product` WHERE MATCH('@name_ft laptop*') LIMIT 0, 10;
+------+----------------------------------------------------------------------------------------------------------------------------------------+
| id   | name                                                                                                                                   |
+------+----------------------------------------------------------------------------------------------------------------------------------------+
|   40 | Samsung - 850 PRO 512GB Internal SATA III Solid State Drive for Laptops                                                                |
|  153 | DENAQ - AC Power Adapter and Charger for Select HP Omnibook, Pavilion and Presario Laptops - Black                                     |
|  207 | Corsair CMSA8GX3M2A1066C7 Apple 8 GB Dual Channel Kit DDR3 1066 (PC3 8500) 204-Pin DDR3 Laptop SO-DIMM Memory 1.5V                     |
|  228 | DENAQ - AC Power Adapter and Charger for Select Toshiba Satellite and Satellite Pro Laptops - Black                                    |
|  238 | Lenovo - AC Adapter for Select Lenovo Yoga Laptops - Black                                                                             |
+------+----------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.01 sec)
  • Find all products whose name include "wireless" and, categories include "mobile":
mysql> SELECT * FROM `product` WHERE MATCH('@name_ft wireless @categories_ft mobile') Limit 0, 5;
+------+-------------------------------------------------------------------------------------------------+
| id   | name                                                                                            |
+------+-------------------------------------------------------------------------------------------------+
| 3820 | h.ear go Wireless Speaker (Viridian Blue)                                                       |
| 6406 | Sony SRS-XB41 Portable Wireless Bluetooth Speaker - Black - SRSXB41/B (Certified Refurbished)   |
| 5996 | h.ear go Wireless Speaker (Cinnabar Red)                                                        |
| 6977 | Boombox+ Bluetooth Wireless Speaker                                                             |
|  107 | Air-Fi Runaway AF32 Stereo Bluetooth Wireless Headphones with Hidden Microphone (Black and Red) |
+------+-------------------------------------------------------------------------------------------------+
5 rows in set (0.00 sec)

Since manticore is using mysql protocol, you can easily use any mysql client to connect and query. There is a well writen php library that encapsulates mysqli and provides orm-like interface to query as easy as possible.

Conclusion

In the next article, I will talk about real-time indexing, clustering and performance in more detail.

No Comments Yet