63597

Need MySQL optimization for complex search on EAV structured data

Question:

I have a large database with EAV structured data that has to be searchable and pageable. I tried every trick in my book to get it fast enough, but under certain circumstances, it still fails to complete in a reasonable time.

This is my table structure (relevant parts only, ask away if you need more):

CREATE TABLE IF NOT EXISTS `object` ( `object_id` bigint(20) NOT NULL AUTO_INCREMENT, `oid` varchar(32) CHARACTER SET utf8 NOT NULL, `status` varchar(100) CHARACTER SET utf8 DEFAULT NULL, `created` datetime NOT NULL, `updated` datetime NOT NULL, PRIMARY KEY (`object_id`), UNIQUE KEY `oid` (`oid`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; CREATE TABLE IF NOT EXISTS `version` ( `version_id` bigint(20) NOT NULL AUTO_INCREMENT, `type_id` bigint(20) NOT NULL, `object_id` bigint(20) NOT NULL, `created` datetime NOT NULL, `status` varchar(100) CHARACTER SET utf8 DEFAULT NULL, PRIMARY KEY (`version_id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; CREATE TABLE IF NOT EXISTS `value` ( `value_id` bigint(20) NOT NULL AUTO_INCREMENT, `object_id` int(11) NOT NULL, `attribute_id` int(11) NOT NULL, `version_id` bigint(20) NOT NULL, `type_id` bigint(20) NOT NULL, `value` text NOT NULL, PRIMARY KEY (`value_id`), KEY `field_id` (`attribute_id`), KEY `action_id` (`version_id`), KEY `form_id` (`type_id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

This is a sample object. I have around <strong>1 million</strong> of those in my database. each object may have different number of attributes with different attribute_id

INSERT INTO `owner` (`owner_id`, `uid`, `status`, `created`, `updated`) VALUES (1, 'cwnzrdxs4dzxns47xs4tx', 'Green', NOW(), NOW()); INSERT INTO `object` (`object_id`, `type_id`, `owner_id`, `created`, `status`) VALUES (1, 1, 1, NOW(), NOW()); INSERT INTO `value` (`value_id`, `owner_id`, `attribute_id`, `object_id`, `type_id`, `value`) VALUES (1, 1, 1, 1, 1, 'Munich'); INSERT INTO `value` (`value_id`, `owner_id`, `attribute_id`, `object_id`, `type_id`, `value`) VALUES (2, 1, 2, 1, 1, 'Germany'); INSERT INTO `value` (`value_id`, `owner_id`, `attribute_id`, `object_id`, `type_id`, `value`) VALUES (3, 1, 3, 1, 1, '123'); INSERT INTO `value` (`value_id`, `owner_id`, `attribute_id`, `object_id`, `type_id`, `value`) VALUES (4, 1, 4, 1, 1, '2012-01-13'); INSERT INTO `value` (`value_id`, `owner_id`, `attribute_id`, `object_id`, `type_id`, `value`) VALUES (5, 1, 5, 1, 1, 'A cake!'); <hr />

Now on to my current mechanism. My first try was the typical approach to Mysql. Do one huge SQL with loads of joins on anything I require. Complete desaster! Took way to long to load and even crashed the PHP and MySQL servers due to exhausted RAM.

So I split my queries up into several steps:

<strong>1 Determine all needed attribute_ids.</strong>

I can look them up in another table that references the type_id of an object. The result is a list of attribute_ids. (this table is not very relevant to the performance, so it's not included in my sample.)

:type_id contains all type_ids from any objects I want to include in my search. I already got this information in my application. So this is inexpensive.

SELECT * FROM attribute WHERE form_id IN (:type_id)

Result is an array of type_id integers.

<strong>2 Search for matching objects</strong> A big SQL query is compiled that adds one INNER JOIN for each and every condition I want. This sounds horrible, but in the end, it was the fastest method :(

A typical generated query might look like this. The LIMIT sadly is necessary or I will potentially get so many IDs that the resulting array makes PHP explode or break the IN statement in the next Query:

SELECT DISTINCT `version`.object_id FROM `version` INNER JOIN `version` AS condition1 ON `version`.version_id = condition1.version_id AND condition1.created = '2012-03-04' -- Filter by version date INNER JOIN `value` AS condition2 ON `version`.version_id = condition2.version_id AND condition2.type_id IN (:type_id) -- try to limit joins to object types we need AND condition2.attribute_id = :field_id2 -- searching for a value in a specific attribute AND condition2.value = 'Munich' -- searching for the value 'Munich' INNER JOIN `value` AS condition3 ON `version`.version_id = condition3.version_id AND condition3.type_id IN (:type_id) -- try to limit joins to object types we need AND condition3.attribute_id = :field_id3 -- searching for a value in a specific attribute AND condition3.value = 'Green' -- searching for the value 'Green' WHERE `version`.type_id IN (:type_id) ORDER BY `version`.version_id DESC LIMIT 10000

The result will contain all object_ids from any object I might need. I am selecting object_ids and not version_ids as I need to have all versions of matching objects, regardless of which version matched.

<strong>3 Sort and page results</strong> Next I will create a query that sorts the objects by a certain attribute and then pages the resulting array.

SELECT DISTINCT object_id FROM value WHERE object_id IN (:foundObjects) AND attribute_id = :attribute_id_to_sort AND value > '' ORDER BY value ASC LIMIT :limit OFFSET :offset

The result is a sorted and paged list of object ids from former search

<strong>4 Get our complete objects, versions and attributes</strong> In the last step, I will select all values for any objects and versions the former queries found.

SELECT `value`.*, `object`.*, `version`.*, `type`.* `object`.status AS `object.status`, `object`.flag AS `object.flag`, `version`.created AS `version.created`, `version`.status AS `version.status`, FROM version INNER JOIN `type` ON `version`.form_id = `type`.type_id INNER JOIN `object` ON `version`.object_id = `object`.object_id LEFT JOIN value ON `version`.version_id = `value`.version_id WHERE version.object_id IN (:sortedObjectIds) AND `version.type_id IN (:typeIds) ORDER BY version.created DESC

The result will then be compiled via PHP into nice object->version->value array structures.

<hr />

<strong>Now the question</strong>:

<ul><li>Can this whole mess be accelerated in any way?</li> <li>Can I somehow remove the LIMIT 10000 restriction from my search query?</li> </ul>

If all else fails, maybe switch database technology? See my other question: <a href="https://stackoverflow.com/questions/21390391/database-optimized-for-searching-in-large-number-of-objects-with-different-attri" rel="nofollow">Database optimized for searching in large number of objects with different attributes</a>

<hr />

<strong>Real Life samples</strong>

Table sizes: object - 193801 rows, version - 193841 rows, value - 1053928 rows

SELECT * FROM attribute WHERE attribute_id IN (30) SELECT DISTINCT `version`.object_id FROM version INNER JOIN value AS condition_d4e328e33813 ON version.version_id = condition_d4e328e33813.version_id AND condition_d4e328e33813.type_id IN (30) AND condition_d4e328e33813.attribute_id IN (377) AND condition_d4e328e33813.value LIKE '%e%' INNER JOIN value AS condition_2c870b0a429f ON version.version_id = condition_2c870b0a429f.version_id AND condition_2c870b0a429f.type_id IN (30) AND condition_2c870b0a429f.attribute_id IN (376) AND condition_2c870b0a429f.value LIKE '%s%' WHERE version.type_id IN (30) ORDER BY version.version_id DESC LIMIT 10000 -- limit to 10000 or it breaks!

Explain:

id select_type table type possible_keys key key_len ref rows Extra 1 SIMPLE condition_2c870b0a429f ref field_id,action_id,form_id field_id 4 const 178639 Using where; Using temporary; Using filesort 1 SIMPLE action eq_ref PRIMARY PRIMARY 8 condition_2c870b0a429f.action_id 1 Using where 1 SIMPLE condition_d4e328e33813 ref field_id,action_id,form_id action_id 8 action.action_id 11 Using where; Distinct

objects search completed (Peak RAM: 5.91MB, Time: 4.64s)

SELECT DISTINCT object_id FROM version WHERE object_id IN (193793,193789, ... ,135326,135324) -- 10000 ids in here! ORDER BY created ASC LIMIT 50 OFFSET 0

objects sort completed (Peak RAM: 6.68MB, Time: 0.352s)

SELECT `value`.*, object.*, version.*, type.*, object.status AS `object.status`, object.flag AS `object.flag`, version.created AS `version.created`, version.status AS `version.status`, version.flag AS `version.flag` FROM version INNER JOIN type ON version.type_id = type.type_id INNER JOIN object ON version.object_id = object.object_id LEFT JOIN value ON version.version_id = `value`.version_id WHERE version.object_id IN (135324,135326,...,135658,135661) AND version.type_id IN (30) ORDER BY quality DESC, version.created DESC

objects load query completed (Peak RAM: 6.68MB, Time: 0.083s)<br /> objects compilation into arrays completed (Peak RAM: 6.68MB, Time: 0.007s)

Answer1:

Just try to add an EXPLAIN before your search query :

EXPLAIN SELECT DISTINCT `version`.object_id FROM `version`, etc ...

then check the results in the "Extra" column, it will give you some clues to speedup your query, like adding INDEX on the right fields.

Also some times you can removeINNER JOIN, get more results in your Mysql response and filter the big array by processing with PHP loops.

Answer2:

I would start by trying to have covering indexes (ie: all columns to match the criteria you are querying on and even pulling out as result). This way the engine does not have to go back to the raw page data.

Since you need the "object_id" from version, and using the "version_id" as join basis to the other tables. Your version table also has a WHERE clause on the TYPE_ID, so I would have an index on

version table -- (object_id, version_id, type_id)

For your "value" table, match there too for criteria

value table -- ( version_id, type_id, attribute_id, value, created )

Recommend

  • MYSQLi bind_result allocates too much memory
  • Android Mapview: Control ordering of multiple types of OverlayItems?
  • FPDF Fatal error: Allowed memory size of 33554432 bytes exhausted (PHP)
  • scala.MatchError: java.lang.StackOverflowError (of class java.lang.StackOverflowError)
  • Eclipse won't start
  • SQL Azure SPLIT AT Backend Process and Resource Throttling
  • limited threads in soapUI free version
  • Program crashes when run outside IDE
  • ASP.NET, C# How to Pass a StringQuery to a custom SQL Command
  • how to increment a message header
  • Mongodb update() vs. findAndModify() performace
  • How to determine the CCSID used in CPYFRMIMPF command?
  • Python: Split a String Field into 3 Separate Fields using Lambda
  • It is possible use the same sql azure instance from two different cloud service of two different sub
  • TSQL Rolling Average of Time Groupings
  • Send emails through VB6 if no email client
  • how can I compare dates in array to find the earliest one?
  • How to discover Font Type?
  • Divide a $1 by 3 and adjusting 1 cent
  • How do I import an existing Grails 3 (3.0.12) project in IntelliJ 15
  • Cypher - matching two different possible paths and return both
  • C# - Most efficient way to iterate through multiple arrays/list
  • MySQL performance when updating row with FK
  • Consuming a WCF service in a Java Client using wsHttpBinding
  • Rest Services conventions
  • How do I shift the decimal place in Python?
  • Sequential (transactional) API calls in angular 4 with state management
  • Access Android Market through SSH tunnel
  • Make VS2015 use angular-cli ng at build time in a .NET project
  • Row Count Is Returning the incorrect number using RaptureXML
  • The plugin 'org.apache.maven.plugins:maven-jboss-as-plugin' does not exist or no valid ver
  • How to rebase a series of branches?
  • Is my CUDA kernel really runs on device or is being mistekenly executed by host in emulation?
  • JavaScriptCore crash on iOS9
  • SetUp method failed while running tests from teamcity
  • What are the advantages and disadvantages of reading an entire file into a single String as opposed
  • Turn off referential integrity in Derby? is it possible?
  • Add sale price programmatically to product variations
  • Unable to use reactive element in my shiny app
  • How do I use LINQ to get all the Items that have a particular SubItem?