Recommended system sizing for workload

Beyond the minimum system requirements, Data Management system sizing is driven by the size of your data, workload complexity, workload frequency, and scale-out.

Persistent storage

The sizing and performance requirements for persistent storage are determined by how much data you want to store and how fast you need to move it.

Sizing

Sizing of persistent storage is generally straightforward, but keep the following in mind:

  • Some data formats are more efficient than others. If you have a choice, store data in Data Management’s native DLD format: it compresses about 3:1, is very fast, and stores complete metadata. XML and JSON are the largest file formats, but they can store hierarchical data. Parquet has very high compression, but it uses a lot of memory, is slow to write, and has incomplete metadata. Gzip-compressed CSV compresses about 8:1 and is a good interchange format, but it is slow to write.

  • You might need to store multiple versions of data, or to keep old and new versions in place for a short time during updates.

  • Data stored in relational and document databases does not count toward Data Management's persistent storage requirements.
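
As a rough illustration, you can estimate persistent storage from your raw data size, the compression ratio of the chosen format, and the number of versions you keep. The figures below are assumptions, not measurements:

# Hypothetical example: 300GB of raw data stored as DLD (roughly 3:1 compression),
# keeping old and new versions side by side during updates
echo $(( 300 / 3 * 2 ))GB    # prints 200GB of persistent storage needed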

Performance

  • Storage I/O performance is often not the limiting factor. If your projects read and write a lot of data to and from relational or document databases, that's likely to be the bottleneck.

  • Data Management projects generally read entire files from beginning to end using sequential I/O.

  • You may be running multiple Data Management projects at once. Scale your performance requirements to meet the time windows for expected data processing.

  • While bare-metal SSD/NVMe will be fastest, Data Management will rarely need that bandwidth. Most Data Management processing involves only sequential I/O, which is fairly fast even on cloud-managed SSD-based storage.

  • Be skeptical of cloud-managed spinning media storage, as this tends to be quite low-performance. Compare its rated sequential I/O read and write performance with the amount of data you need to move.

  • IOPS ratings are mostly irrelevant to Data Management, as they tend to describe a random/sequential workload mix more appropriate to databases. Focus on the sequential I/O throughput rating.
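
As a quick sanity check, compare the storage's rated sequential throughput with the volume of data you need to move within your processing window. The volumes and window below are hypothetical:

# Hypothetical example: 500GB read plus 500GB written within a 2-hour window
echo $(( (500 + 500) * 1024 / (2 * 3600) )) MB/s    # about 142 MB/s of sustained sequential throughput required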

CPU

Adding CPU cores (or vCPUs in cloud environments) will make Data Management run faster, up to a point. A single Data Management project will rarely need more than 4 cores or vCPUs. However, the following processes tend to be CPU-intensive and may benefit from extra CPU power:

  • Address standardization

  • Geocoding

  • Fuzzy matching

CPU requirements scale with the amount of work being done simultaneously; each concurrently running project adds to the total CPU power needed.

Finally, some Data Management activities exert CPU pressure even when they are not processing data:

  • If your system has many live automations running, you'll need an extra CPU core for every 10 live automations, including those in a "waiting" state.

  • Simply having projects and automations open in the client consumes resources. You’ll need another CPU core for every 10 projects and automations that are open in clients simultaneously.
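
Putting these rules together, a rough core-count estimate might look like the following. The workload mix is assumed for illustration:

# Hypothetical example: 5 projects running at once (4 cores each),
# 20 live automations (1 core per 10), and 30 projects/automations open in clients (1 core per 10)
echo $(( 5 * 4 + 20 / 10 + 30 / 10 )) cores    # 25 cores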

Physical memory (RAM)

The amount of physical memory (RAM) you need is determined by your data size, workload, and other factors.

  • Projects need enough memory to carry out their processing functions. Inadequate memory can drastically reduce performance or cause a project to fail. Configuring more memory for a dataflow project handling large volumes of data may improve its performance, but there will be a point of diminishing returns.

  • Certain tools and Data Management’s installable modules can require significant amounts of memory. Projects using multiple instances of these modules (for parallelism) will require proportionately more memory. See Memory usage for specifics.

  • Each project or automation that is open in the client requires approximately 150MB of memory, even when not running. Projects involving very wide records, or using many hundreds or thousands of tools, will need more.

  • Each live automation needs approximately 150MB-300MB of physical memory.

  • Each published Web Service needs approximately 150MB-300MB of physical memory.
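
A back-of-the-envelope RAM estimate based on the figures above might look like this. The workload mix and per-project allowances are assumptions:

# Hypothetical example: 5 running projects at about 2GB each, 40 open projects/automations at about 150MB each,
# plus 20 live automations and 4 published Web Services at about 300MB each
echo $(( 5 * 2048 + 40 * 150 + (20 + 4) * 300 )) MB    # prints 23440 MB, roughly 23GB of RAM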

Virtual memory

Virtual memory allows Data Management and other processes to utilize “reserved but not yet allocated” memory resources. Data Management uses more virtual memory than physical memory (RAM).

  • The amount of virtual memory must be at least equal to the amount of physical memory.

  • Each open or running project and automation requires approximately 150MB of virtual memory.

  • Each open or running project and automation also requires an amount of virtual memory equal to the jvm_memory_mb setting specified in the properties file (and optionally configured on each project).

  • The jvm_memory_mb setting controls how much memory is available to the Java subsystem (including JDBC drivers), and you may need to increase it to handle certain file formats or JDBC drivers.

  • Each running project needs virtual memory for data processing (generally in the 1GB – 8GB range, depending on your project configurations).

  • A running project also needs about 100MB of virtual memory beyond the physical memory needed to execute the project.

Windows and Linux handle virtual memory completely differently.

Windows

On Windows systems, the cumulative virtual memory reserved by all processes (called the commit charge) must be less than RAM + pagefile.sys size. In practice, this means that your pagefile might need to be very large. For example, if you expect your system to have 100 projects open with five of them running, this could easily push total virtual memory requirements past 64GB. In this case, if you have 32GB of RAM, your pagefile must be at least 32GB.
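
The arithmetic behind that example looks roughly like this; the jvm_memory_mb value and per-project processing memory are assumed for illustration:

# 100 open projects/automations at 150MB each plus an assumed jvm_memory_mb of 256,
# plus 5 running projects at an assumed 5GB of processing memory each
echo $(( 100 * (150 + 256) + 5 * 5120 )) MB    # 66200 MB, roughly 65GB of commit charge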

We recommend that you:

  • Leave your system configured to Automatically manage paging file size for all drives (the default).

  • Make sure that the disk hosting your pagefile has plenty of free space.

Linux

Linux has a relaxed policy towards virtual memory: overcommit. Overcommit means that the “reserved but not yet allocated” virtual memory is allowed to grow unchecked. However, attempts to turn the virtual memory into real memory can fail abruptly and the OS will start to kill processes to avoid catastrophe. Fortunately, Data Management almost never uses that unallocated space. If overcommit is enabled, Data Management will use virtual memory equal to about 2x to 3x physical memory, and you do not need a swap partition.

It is possible to turn off overcommit. We do not recommend this. Most Linux installations enable overcommit by default, but your organization (or cloud provider) may not.

To check your virtual memory overcommit setting, run the following command:

cat /proc/sys/vm/overcommit_memory

If the output is 0 or 1, your system allows overcommit of virtual memory, and no further action is required.
If the output is anything else, change it to 0 by running this command as root:

echo "0" > /proc/sys/vm/overcommit_memory

Restart your system after changing this value.
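
Because values written directly under /proc/sys do not persist across reboots, you may also want to record the setting in /etc/sysctl.conf (as root) so it remains in effect after the restart:

# Persist the overcommit setting and reload it
echo "vm.overcommit_memory = 0" >> /etc/sysctl.conf
sysctl -p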

Linux threads

Data Management services and projects can require many threads (also known as lightweight user processes). An execution service on a loaded system may use 300+ threads, and a typical project or automation service will use 25-40 threads. If there are 100 projects and automations open on your system, this can result in more than 4000 threads! Typical Linux default settings limit the number of threads to 4096 for any given user across the entire host. This means that if all 100 projects are run as the same user (either the same Data Management user, or when Data Management is not configured for Advanced Security), the thread limit may be reached. When this happens, system stability is threatened and graceful recovery is difficult.

Data Management installations that hit thread limits may experience project failures, and the Data Management server trace logs will contain messages like failed to spawn thread or failed to spawn svc() thread. To avoid this problem, we recommend setting the thread limit to 8000.

To check the thread limit on Linux:

  1. Log into the system as the user running the services or projects of interest. (When advanced security is on, the Administrator user is mapped to root by default.)

  2. Run the commands:

ulimit -u -S
ulimit -u -H

  3. The output of the first command is the "soft" limit, and that of the second command is the "hard" limit. These values should both be 8000 or higher. If they are lower, change them by editing system configuration files (usually the file /etc/security/limits.conf). Add the following lines:

* soft nproc 8000
* hard nproc 8000

  4. Restart the system, and then test that the new limits are in effect by running the commands again:

ulimit -u -S
ulimit -u -H
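
To see how close you are to the limit, you can count the threads currently used by the account that runs Data Management. The user name dmuser below is a placeholder; substitute your actual service account:

# Sum the thread counts (NLWP) of all processes owned by the Data Management user
ps -u dmuser -o nlwp= | awk '{total += $1} END {print total " threads"}'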

Linux open file handles

Data Management services and projects can require many open file handles (also known as open file descriptors). Open file handles are used both for actual files and for TCP sockets and other communication. If your system has many users and many open projects, you can run out of file handles.

To check the open file handles limit on Linux:

  1. Log into the system as the user running the services or projects of interest. (When advanced security is on, the Administrator user is mapped to root by default.)

  2. Run the commands:

ulimit -n -S
ulimit -n -H

  3. The output of the first command is the "soft" limit, and that of the second command is the "hard" limit. These values should both be 4000 or higher. If they are lower, change them by editing system configuration files (usually the file /etc/security/limits.conf). Add the following lines:

* soft nofile 4000
* hard nofile 4000

  4. Restart the system, and then test that the new limits are in effect by running the commands again:

ulimit -n -S
ulimit -n -H
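
To gauge current usage against the limit, you can count the file descriptors held by the account running Data Management. Again, dmuser is a placeholder for the actual service account:

# Count open file descriptors across all processes owned by the Data Management user
for pid in $(pgrep -u dmuser); do ls /proc/$pid/fd 2>/dev/null; done | wc -l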

Windows desktop heap

On Windows systems, open or running projects and automations require the desktop heap resource. When the number of such projects or automations exceeds approximately 100, Data Management will fail to launch new processes and report errors. Contact Redpoint Global Inc. support for help changing this limit in your operating system.

Performance tips and memory settings

For CPU-intensive tasks like CASS address standardization, more and faster CPU cores will improve throughput.

CASS and Geocoding need at least 2GB of memory per running project, in addition to memory normally used for operations like sorts and joins.

For disk-intensive tasks like ETL, sorting, joining, summarizing, and so on, more and faster temporary disk spaces will improve throughput, especially if using SSDs for temp space.

For record-matching and other hybrid tasks, both CPU and temporary disks are important.

You should generally configure your projects to use as many tool threads as you have CPU cores.

Recommended memory per simultaneously running project:

  • Without CASS or Geocoding: 2GB minimum.

  • With CASS or Geocoding: 3GB minimum.

  • For 100MM+ records in batch: add 2GB.

  • For multiple instances of CASS, Geocoder, and SERP: add 1GB.

  • More/larger data sets and complex projects will benefit from additional memory.
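
Combining these guidelines, a memory estimate for a specific project mix might look like the following. The mix itself is hypothetical:

# Hypothetical example: 3 simultaneous projects with CASS or Geocoding (3GB each) on 100MM+ record batches (add 2GB each),
# plus 2 plain ETL projects (2GB each)
echo $(( 3 * (3 + 2) + 2 * 2 )) GB    # 19GB recommended for the running projects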
