This is the message I received from running a script to check whether TensorFlow is working:
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:125] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:95] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:95] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I noticed that it mentions SSE4.2 and AVX:
- What are SSE4.2 and AVX?
- How do SSE4.2 and AVX improve CPU computations for TensorFlow tasks?
- How can I make TensorFlow compile using these two instruction sets?
Best Answer
I just ran into this same problem. It seems that Yaroslav Bulatov's suggestion doesn't cover SSE4.2 support; adding --copt=-msse4.2 would suffice. In the end, I built successfully without getting any warnings or errors.
Probably the best choice for any system is:
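As a rough illustration of what such an invocation could look like, here is a minimal sketch that assembles a bazel command line with each compiler option wrapped in --copt. The helper name bazel_build_command, its parameters, and the default pip-package target are assumptions made for illustration, not something taken from this answer; adjust the target and flags to your own checkout.

```python
import shlex


def bazel_build_command(copts,
                        target="//tensorflow/tools/pip_package:build_pip_package",
                        cuda=True):
    """Assemble (but do not run) a bazel build command line.

    Hypothetical helper: `copts` is a list of compiler options such as
    ["-march=native"]; each one is passed through with --copt=.
    The default target is an assumption about the TensorFlow source tree.
    """
    cmd = ["bazel", "build", "-c", "opt"]
    cmd += ["--copt=" + opt for opt in copts]
    if cuda:
        cmd.append("--config=cuda")
    cmd.append(target)
    return cmd


if __name__ == "__main__":
    # Print the command so it can be inspected before running it by hand.
    print(shlex.join(bazel_build_command(["-march=native"])))
```

Printing the command instead of executing it keeps the sketch harmless on machines without bazel; paste the printed line into a shell inside the TensorFlow source tree to actually build.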
(Update: the build scripts may be eating -march=native, possibly because it contains an =.)
-mfpmath=both only works with gcc, not clang. -mfpmath=sse is probably just as good, if not better, and is the default for x86-64. 32-bit builds default to -mfpmath=387, so changing that will help for 32-bit. (But if you want high-performance for number crunching, you should build 64-bit binaries.)
I'm not sure what TensorFlow's default for -O2 or -O3 is. gcc -O3 enables full optimization including auto-vectorization, but that sometimes can make code slower.
What this does:
--copt for bazel build passes an option directly to gcc for compiling C and C++ files (but not linking, so you need a different option for cross-file link-time-optimization).
x86-64 gcc defaults to using only SSE2 or older SIMD instructions, so you can run the binaries on any x86-64 system. (See https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). That's not what you want. You want to make a binary that takes advantage of all the instructions your CPU can run, because you're only running this binary on the system where you built it.
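To see which of these instruction-set extensions the build machine's CPU actually reports, something like the following sketch can help. It assumes a Linux x86 system, where /proc/cpuinfo contains a "flags" line; the function name cpu_simd_flags and the list of flag names checked are illustrative choices, not part of TensorFlow or gcc.

```python
def cpu_simd_flags(cpuinfo_text,
                   wanted=("sse4_2", "avx", "avx2", "fma", "avx512f")):
    """Report which of the `wanted` CPU flags appear in /proc/cpuinfo text.

    Assumes the Linux x86 format, where one line per core looks like
    "flags : fpu vme ... sse4_2 avx ...". Only the first such line is read.
    """
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            present = set(line.split(":", 1)[1].split())
            return {name: name in present for name in wanted}
    # No "flags" line found (non-Linux or non-x86): report nothing as present.
    return {name: False for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""
    for name, supported in cpu_simd_flags(text).items():
        print(name, "yes" if supported else "no")
```

A flag reported as "no" here is one you should not force on with an explicit -m option, since the resulting binary could fault on that machine.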
-march=native enables all the options your CPU supports, so it makes -mavx512f -mavx2 -mavx -mfma -msse4.2 redundant. (Also, -mavx2 already enables -mavx and -msse4.2, so Yaroslav's command should have been fine). Also if you're using a CPU that doesn't support one of these options (like FMA), using -mfma would make a binary that faults with illegal instructions.
TensorFlow's ./configure defaults to enabling -march=native, so using that should avoid needing to specify compiler options manually.
-march=native enables -mtune=native, so it optimizes for your CPU for things like which sequence of AVX instructions is best for unaligned loads.
This all applies to gcc, clang, or ICC. (For ICC, you can use -xHOST instead of -march=native.)
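One way to check what -march=native turns on for a given compiler is to dump its predefined macros and look for the ISA-related ones, as in the sketch below. It assumes gcc or clang is on the PATH and accepts -dM -E; the helper names enabled_isa_macros and macros_for, and the particular macro list, are illustrative assumptions.

```python
import shutil
import subprocess

# Predefined macros that gcc and clang set when the matching extension is enabled.
ISA_MACROS = ("__SSE4_2__", "__AVX__", "__AVX2__", "__FMA__", "__AVX512F__")


def enabled_isa_macros(preprocessor_output):
    """Pick the ISA macros out of `cc -dM -E` output ("#define NAME VALUE" lines)."""
    defined = set()
    for line in preprocessor_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "#define":
            defined.add(parts[1])
    return [macro for macro in ISA_MACROS if macro in defined]


def macros_for(compiler, march_flag="-march=native"):
    """Run the compiler's preprocessor on empty input and list enabled ISA macros."""
    result = subprocess.run(
        [compiler, march_flag, "-dM", "-E", "-x", "c", "-"],
        input="", capture_output=True, text=True, check=True)
    return enabled_isa_macros(result.stdout)


if __name__ == "__main__":
    compiler = shutil.which("gcc") or shutil.which("clang")
    if compiler is None:
        print("no gcc or clang found")
    else:
        try:
            print(macros_for(compiler))
        except subprocess.CalledProcessError as err:
            # For example, a target where -march=native is not accepted.
            print("compiler rejected the flags:", err)
```

Calling macros_for with march_flag="-mavx2" instead shows which other extensions that single option implies on your compiler.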