Tensorflow – How to compile Tensorflow with SSE4.2 and AVX instructions

compiler-optimizationcompiler-optionssimdtensorflowx86

This is the message received from running a script to check if Tensorflow is working:

I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:95] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:95] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

I noticed that it has mentioned SSE4.2 and AVX,

What are SSE4.2 and AVX?
How do these SSE4.2 and AVX improve CPU computations for Tensorflow tasks.
How to make Tensorflow compile using the two libraries?

Best Answer

I just ran into this same problem, it seems like Yaroslav Bulatov's suggestion doesn't cover SSE4.2 support, adding --copt=-msse4.2 would suffice. In the end, I successfully built with

bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.2 --config=cuda -k //tensorflow/tools/pip_package:build_pip_package

without getting any warning or errors.

Probably the best choice for any system is:

bazel build -c opt --copt=-march=native --copt=-mfpmath=both --config=cuda -k //tensorflow/tools/pip_package:build_pip_package

(Update: the build scripts may be eating -march=native, possibly because it contains an =.)

-mfpmath=both only works with gcc, not clang. -mfpmath=sse is probably just as good, if not better, and is the default for x86-64. 32-bit builds default to -mfpmath=387, so changing that will help for 32-bit. (But if you want high-performance for number crunching, you should build 64-bit binaries.)

I'm not sure what TensorFlow's default for -O2 or -O3 is. gcc -O3 enables full optimization including auto-vectorization, but that sometimes can make code slower.

What this does: --copt for bazel build passes an option directly to gcc for compiling C and C++ files (but not linking, so you need a different option for cross-file link-time-optimization)

x86-64 gcc defaults to using only SSE2 or older SIMD instructions, so you can run the binaries on any x86-64 system. (See https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). That's not what you want. You want to make a binary that takes advantage of all the instructions your CPU can run, because you're only running this binary on the system where you built it.

-march=native enables all the options your CPU supports, so it makes -mavx512f -mavx2 -mavx -mfma -msse4.2 redundant. (Also, -mavx2 already enables -mavx and -msse4.2, so Yaroslav's command should have been fine). Also if you're using a CPU that doesn't support one of these options (like FMA), using -mfma would make a binary that faults with illegal instructions.

TensorFlow's ./configure defaults to enabling -march=native, so using that should avoid needing to specify compiler options manually.

-march=native enables -mtune=native, so it optimizes for your CPU for things like which sequence of AVX instructions is best for unaligned loads.

This all applies to gcc, clang, or ICC. (For ICC, you can use -xHOST instead of -march=native.)

Related Solutions

Python – way to suppress the messages TensorFlow prints

UPDATE (beyond 1.14): see my more thorough answer here (this is a dupe question anyway): https://stackoverflow.com/a/38645250/6557588

In addition to Wintro's answer, you can also disable/suppress TensorFlow logs from the C side (i.e. the uglier ones starting with single characters: I, E, etc.); the issue open regarding logging has been updated to state that you can now control logging via an environmental variable. You can now change the level by setting the environmental variable called TF_CPP_MIN_LOG_LEVEL; it defaults to 0 (all logs shown), but can be set to 1 to filter out INFO logs, 2 to additionally filter out WARNING logs, and 3 to additionally filter out ERROR logs. It appears to be in master now, and will likely be a part of future version (i.e. versions after r0.11). See this page for more information. Here is an example of changing the verbosity using Python:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # or any {'0', '1', '2'}
import tensorflow as tf

You can set this environmental variable in the environment that you run your script in. For example, with bash this can be in the file ~/.bashrc, /etc/environment, /etc/profile, or in the actual shell as:

TF_CPP_MIN_LOG_LEVEL=2 python my_tf_script.py

Tensorflow – Re-build Tensorflow with desired optimization flags

By following the NVIDIA instructions you are resetting the TensorFlow repository to an older commit, before the SIMD instructions optimization was made available (1.0r):

git reset --hard 70de76e

This commit dates back to a previous release when this feature was not yet implemented, so it is actually working as it's supposed to.

The solution is to follow the official TensorFlow documentation.

For future situations, it is always recommended to use official resources before reaching out for third party solutions, as more they may be helpful, the official ones are more reliable and better maintained.

Notice during configure you're not prompted which CPU instructions you want to build TF with as it should due to the reason above, and therefore, you're unable to build with them.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:

Follow the official docs accordingly and it will work. If you have any follow up questions feel free to ask or if you face any problems open an issue on Github :)

Best Answer

Related Solutions

Python – way to suppress the messages TensorFlow prints

Tensorflow – Re-build Tensorflow with desired optimization flags

Related Topic