* Major refactoring - introduce C-style API
* Clean up
* Add <cassert>
* Add <iterator>
* Add <algorithm> ....
* Fix timing reporting and accumulation
* Measure eval time only for single-token calls
* Change llama_tokenize return meaning
* Update Makefile to detect AVX512 support and add compiler flags if it's available
* Based on the existing AVX2 implementation, compute the dot product on one 32-value block of 4-bit quantized ints at a time
* Perform 8 bit -> 16 bit sign extension and multiply+add on 32 values at a time instead of 16
* Use built-in AVX512 horizontal reduce add to get sum at the end
* Manually unroll the inner dot product loop to reduce loop counter overhead
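
The widening and reduction steps listed above can be sketched roughly as follows. This is an illustrative sketch only, not the code from ggml.c: the helper name `dot_i8_block32` is hypothetical, it operates on values already unpacked to signed 8-bit (nibble unpacking and per-block scales are left out), and it reduces an integer accumulator, which may differ from how the actual implementation accumulates. A scalar fallback is used when AVX512F/AVX512BW are not enabled at compile time.

```cpp
#include <cstdint>
#if defined(__AVX512F__) && defined(__AVX512BW__)
#include <immintrin.h>
#endif

// Hypothetical helper: dot product of two blocks of 32 signed 8-bit values.
inline int32_t dot_i8_block32(const int8_t* x, const int8_t* y) {
#if defined(__AVX512F__) && defined(__AVX512BW__)
    // Load 32 bytes from each operand (unaligned loads).
    const __m256i bx = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(x));
    const __m256i by = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(y));
    // Sign-extend 32 x int8 to 32 x int16 in a 512-bit register.
    const __m512i x16 = _mm512_cvtepi8_epi16(bx);
    const __m512i y16 = _mm512_cvtepi8_epi16(by);
    // Multiply int16 lanes and add adjacent pairs, giving 16 x int32.
    const __m512i prod = _mm512_madd_epi16(x16, y16);
    // Horizontal reduce-add across the 16 int32 lanes.
    return _mm512_reduce_add_epi32(prod);
#else
    // Scalar fallback with the same result.
    int32_t sum = 0;
    for (int i = 0; i < 32; ++i) {
        sum += static_cast<int32_t>(x[i]) * static_cast<int32_t>(y[i]);
    }
    return sum;
#endif
}
```

Handling all 32 values in one 512-bit register is what lets the multiply+add cover a full block per step rather than half of one.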
The README tells people to use the command line option "-t 8", causing 8
threads to be started. On systems with fewer than 8 cores, this causes a
significant slowdown. Remove the option from the example command lines
and use /proc/cpuinfo on Linux to determine a sensible default.
* Add AVX2 version of ggml_vec_dot_q4_1
* Small optimisations to q4_1 dot product (@Const-me)
* Rearrange Q4_1 quantization to work for multipart models (Fix #152)
* Fix ggml_vec_mad_q4_1 too
* Fix non-vectorised q4_1 vec mul
* Don't use vdotq_s32 if it's not available
`dotprod` extensions aren't available on some ARM CPUs (e.g. Raspberry Pi 4), so check for them and only use them if they're available.
Reintroduces the code removed in 84d9015 if `__ARM_FEATURE_DOTPROD` isn't defined.
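
The guard described above might look roughly like the sketch below. The helper name `dot_i8_16` is hypothetical and the code is not taken from ggml.c; it only illustrates selecting `vdotq_s32` when `__ARM_FEATURE_DOTPROD` is defined and falling back to a widening multiply otherwise. A scalar path is included so the sketch also builds on non-ARM targets.

```cpp
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: dot product of two arrays of 16 signed 8-bit values.
inline int32_t dot_i8_16(const int8_t* a, const int8_t* b) {
#if defined(__ARM_NEON)
    const int8x16_t va = vld1q_s8(a);
    const int8x16_t vb = vld1q_s8(b);
#if defined(__ARM_FEATURE_DOTPROD)
    // dotprod extension available: one instruction per 16 products.
    const int32x4_t acc = vdotq_s32(vdupq_n_s32(0), va, vb);
#else
    // No dotprod (e.g. Raspberry Pi 4): widen to int16, then pairwise-add to int32.
    const int16x8_t lo = vmull_s8(vget_low_s8(va), vget_low_s8(vb));
    const int16x8_t hi = vmull_s8(vget_high_s8(va), vget_high_s8(vb));
    const int32x4_t acc = vaddq_s32(vpaddlq_s16(lo), vpaddlq_s16(hi));
#endif
    // Sum the four int32 lanes without relying on AArch64-only reductions.
    int32_t lanes[4];
    vst1q_s32(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
#else
    // Scalar path for targets without NEON.
    int32_t sum = 0;
    for (int i = 0; i < 16; ++i) {
        sum += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
    }
    return sum;
#endif
}
```

Because the choice is made by the preprocessor, a binary built without dotprod support never contains the unsupported instruction.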
* Update ggml.c
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Apply fixes suggested to build on windows
Issue: https://github.com/ggerganov/llama.cpp/issues/22
* Remove unsupported VLAs
* MSVC: Remove features that are only available on MSVC C++20.
* Fix zero initialization of the other fields.
* Use std::vector in place of stack allocations.
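
As an illustration of the VLA removal, a run-time-sized scratch buffer can be held in a `std::vector` rather than in a variable-length array, which MSVC does not accept. The function `sum_of_squares` below is a hypothetical example, not code from the repository.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: sums the squares of n floats using a scratch buffer
// whose size is only known at run time.
inline float sum_of_squares(const float* x, std::size_t n) {
    // float scratch[n];            // VLA: a compiler extension in C++, rejected by MSVC
    std::vector<float> scratch(n);  // heap-backed, elements start at 0.0f
    for (std::size_t i = 0; i < n; ++i) {
        scratch[i] = x[i] * x[i];
    }
    float sum = 0.0f;
    for (float v : scratch) {
        sum += v;
    }
    return sum;
}
```

The vector costs a heap allocation per call, so for hot paths a caller-provided or reused buffer may be preferable.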