My initial experiments offloading MKL FFTs into the MIC (using C language in Linux) give me approximately 9.3 GFLOPS of performance, judging by the reported [MIC Time] numbers when I set the environment variable OFFLOAD_REPORT to 1 (or 2). This is about 0.46% of the advertized peak performance of 2 TFLOPS. But in fact, it is much less than that if I take into account the time for the data movement inside the offload section [CPU Time in the "report").
Am I missing something?
I am curious to know if my numbers are way off or consistent with other benchmarks (I could not find any).
I would appreciate it if someone could point me to related information or to know if someone had a different (or similar) experience.
The bottom line is that I hope I need to do something to drastically improve its performance, but I ran out of ideas. Any help will be appreciated.
Thanks!
Fernando