Zovy maintains some of the largest and most successful IDOL implementations in the world: as part of our products, at our customers' facilities, and hosted in our data centers. One of the keys to making IDOL perform well is a deep understanding of the software, the hardware, and the interaction between them that drives the system. Through our experience, we have developed many best practices and processes that we deploy to ensure success. Our experience with IDOL spans the globe, with customers throughout North America, Europe, the Middle East, and Asia. This gives us a range of experience in diverse products and use cases, from Enterprise Search, Media Monitoring, and Digital Asset Management to Archiving, eDiscovery, Compliance, and Surveillance/Supervision.
During its existence, IDOL has been incorporated into many software products, covering areas such as Enterprise Search, Web Content Management, Document Management, Media Monitoring, Archiving, Compliance tools, Legal Discovery, Contact Centre Management, and Information Management & Governance. This is a credit to the flexibility of the design: the core processing engine remains unchanged, while the peripheral connectors, which understand and collect source data, link into it.
This evolution has been possible because, over time, IDOL has incorporated new features and new sources from which to collect and process data. As audio and video became relevant, IDOL integrated language packs and video processors. New internet-based streams such as Facebook, Twitter, Instagram, and YouTube became sources, and support for popular business products such as Salesforce, Confluence, ServiceNow, and Objective has allowed new, disparate sources to be consolidated into a single source of truth. As cloud and big data have become relevant, Amazon S3, Azure Blob, and Hadoop connectors have been added to IDOL.
Regardless of the implementation, there are a number of best practices for managing and maintaining your core IDOL index data. I group these into three main areas: Performance, Reliability, and Efficiency. Today I want to talk about one of these areas: Performance, specifically indexing performance.
As an IDOL index grows, its query performance can drop off sharply once the proportion of the index held in RAM is too small to provide fast results and the disk index is constantly being read to build the result set. This does not occur at a predictable point; it depends on the configuration, the available server RAM, and disk performance. For a typical email/document index on a typical corporate server, it appears at around 5-6 million documents and a 150 GB disk index per content engine.
Around this point, server RAM and disk resources become the limits to performance. Implementations where the queries are predictable, regular, and repeatable (e.g. medical) are more likely to let you push the envelope, because for a given index size on a standard server the RAM cache is more likely to still hold the results of previous queries. Random queries (such as email archive or enterprise search) benefit most from maximizing the RAM used to hold the index data, and in this case disk read performance becomes more important.
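As a rough illustration of the rule of thumb above, the sketch below estimates how many content engines a given document count would need. The function name and the 5 million default are assumptions made for this example (taken from the low end of the "typical" range quoted above), not IDOL settings; the real threshold depends on your configuration, RAM, and disk.

```python
def engines_needed(total_docs: int, docs_per_engine: int = 5_000_000) -> int:
    """Estimate how many content engines are needed for total_docs.

    docs_per_engine is a hypothetical planning figure, not an IDOL
    parameter. The default uses the low end of the 5-6 million range
    mentioned in the text.
    """
    if total_docs < 0 or docs_per_engine <= 0:
        raise ValueError("total_docs must be >= 0 and docs_per_engine > 0")
    # Integer ceiling division; always plan for at least one engine.
    return max(1, -(-total_docs // docs_per_engine))
```

For example, `engines_needed(40_000_000)` gives 8 engines under these assumptions. Treat the result as a starting point for capacity planning rather than a guarantee of query performance.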
The biggest hit to performance comes from indexing data, as this is likely to be occurring at the same time that queries are being run. Take a look at a typical disk usage chart for an IDOL system as it fills up over time:
The spikes in usage are due to the regular flush to disk of newly indexed data. Depending on the configuration of ‘repository storage’ (default = true), indexing even a single document into an IDOL content engine requires a complete rebuild of the index tables once the index is ‘flushed’ to disk, that is, committed to the disk index. This is highly disk intensive and, as you would expect, the larger the index (and the slower the disk), the longer the ‘flush’ takes. During this time the engine will be effectively unavailable or slowed when running queries (again, depending on configuration).
Therefore, to minimize these ‘flushes’, the indexcachemaxsize setting (default 102400, or 100MB) defines the RAM cache for the disk index. This works by allowing newly indexed data to be built into a new index in RAM before being written to disk. The larger the index cache, the more data can be added to the index before a disk flush is required. Once the index cache is full (or a flush request is sent), a flush to disk occurs. Because a flush takes the same time to complete whether it adds a single new document or 100,000 (the new documents have links and patterns relevant to other documents), it makes sense to choose an indexcachemaxsize that maximizes the work done in RAM and minimizes the number of disk flushes. The limits on all of this are, of course, the amount of RAM available and the fact that newly indexed items cannot be queried until the index cache is flushed to disk. You therefore need to balance the efficiency of a large index cache against the RAM required for the index proper (the query caches) and how quickly the newly indexed data needs to be queryable.
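To make the trade-off concrete, here is a minimal back-of-the-envelope sketch of how the index cache size affects the number of flushes. It is an illustration, not IDOL code: the function names are hypothetical, the setting is assumed to be expressed in KB (consistent with 102400 being 100MB), and it assumes that new index data fills the cache one-for-one and that a flush happens only when the cache is full. The per-flush duration is something you would measure on your own system.

```python
def flushes_required(new_index_data_kb: int,
                     index_cache_max_size_kb: int = 102400) -> int:
    """Number of cache-full flushes needed to index new_index_data_kb.

    Simplifying assumptions (not IDOL behaviour guarantees): the cache
    size is in KB, data fills the cache one-for-one, and a flush is
    triggered only when the cache is full.
    """
    if new_index_data_kb < 0 or index_cache_max_size_kb <= 0:
        raise ValueError("data size must be >= 0 and cache size > 0")
    # Integer ceiling division: a partly filled cache still needs a flush.
    return -(-new_index_data_kb // index_cache_max_size_kb)


def total_flush_seconds(new_index_data_kb: int,
                        index_cache_max_size_kb: int,
                        seconds_per_flush: float) -> float:
    """Total time spent flushing, assuming each flush costs the same
    regardless of how many documents it commits (as described above).
    seconds_per_flush is a value measured on your own system."""
    return flushes_required(new_index_data_kb,
                            index_cache_max_size_kb) * seconds_per_flush
```

Under these assumptions, indexing 2GB of new index data (2,097,152KB) needs 21 flushes with the default 102400 cache but only 6 with a cache four times larger (409600). The RAM given to the larger cache is no longer available for query caches, and new documents wait longer before they become searchable.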
In the diagram below, you can see that the indexer has reserved 104MB (the middle blue ring in the graph) and how this compares with the other IDOL memory pools in use (in this case, a significant 70% of the total used):
In larger implementations, at least one content engine may be flushing to disk at any given time. Because flushing affects query performance, having the indexes of multiple content engines on a single disk means that one engine's flush can slow queries on the other engines, causing a constant reduction in overall query performance. Possible solutions are to split the engines among different physical disks, apply a ‘flushlock’ file, or configure IDOL to index into only a fixed number of engines at any given time… something we can go into another time.
For now, I hope this discussion shows that the interplay of server RAM, disk, and configuration can have a significant effect on the performance of your index, and that there are a number of steps you can take to mitigate or remove the effects. The foremost of these is to design, from the start, an IDOL infrastructure that takes account of your future needs and performs well all the way to the edge.