I am doing a lot of dgemm calls to compute
C = A*B
The matrices are fairly small, i.e. C might have 1000 rows and fewer than 200 columns. Typically I do the following:
- Store A in row major form.
- Store B in column major form.
- Choose C to have 56 columns.
- Make sure everything is aligned.
I believe this leads to good performance. In fact, I can control how many columns C has, so I could make it 64 or 128, for instance. So now my questions are:
- What is the optimal blocking, i.e. how many columns in C is optimal? Can the blocking be determined algorithmically?
- How should I formulate the matrix multiplication so that MKL can work directly with the data and the overhead of buffer management etc. is avoided?
Your documentation does not seem to answer such questions, though I might have overlooked something.
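For concreteness, here is a minimal sketch of the storage layout described above. It is only an illustration: the helper name gemm_rowA_colB is hypothetical, C is assumed to be stored in row major form (the layout of C is not stated above), and the loop is a plain reference product, not what MKL does internally. The comment at the end shows what should be the equivalent cblas_dgemm call for this layout.

```c
#include <stddef.h>

/* Reference product C = A*B for the layout in this post (hypothetical helper):
 *   A: m x k, row major,    element (i,p) at A[i*k + p]
 *   B: k x n, column major, element (p,j) at B[j*k + p]
 *   C: m x n, row major,    element (i,j) at C[i*n + j]  (assumption)
 * With this layout both A and B are read with unit stride in the inner loop. */
void gemm_rowA_colB(size_t m, size_t n, size_t k,
                    const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p)
                sum += A[i * k + p] * B[j * k + p];
            C[i * n + j] = sum;
        }
    }
}

/* The same product through CBLAS: a column major k x n buffer is the same
 * memory as the row major n x k matrix B^T, so B is passed with CblasTrans
 * and ldb = k:
 *
 *   cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
 *               m, n, k, 1.0, A, k, B, k, 0.0, C, n);
 */
```

With m around 1000, n = 56 and the buffers aligned, this is the call pattern my questions refer to.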