* Implement non-greedy tokenizer that tries to maximize token lengths
* Insert single space in front of the prompt
- this is to match the original llama tokenizer behavior
---------
Co-authored-by: Jakub Horak <jakub.horak@ibawizard.net>
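For reference, a minimal sketch of a length-maximizing tokenizer with the leading-space behavior described above. The vocabulary type, scoring function, and names are illustrative, not the actual llama.cpp implementation:

```cpp
#include <cstdint>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical vocabulary: token text -> id. Not the real llama.cpp structures.
using vocab_t = std::map<std::string, int32_t>;

// Split `text` into vocab tokens, preferring segmentations built from longer
// matches (dynamic programming over all split points instead of a greedy scan).
// Assumes every single byte exists in the vocab so a segmentation always exists.
static std::vector<int32_t> tokenize(const vocab_t & vocab, std::string text) {
    text.insert(0, 1, ' ');                // leading space, matching the original llama tokenizer
    const size_t n = text.size();

    // best[i] = best score for the prefix text[0..i), prev[i] = previous split point
    std::vector<int>     best(n + 1, -1), prev(n + 1, -1);
    std::vector<int32_t> tok (n + 1, -1);
    best[0] = 0;

    for (size_t i = 0; i < n; ++i) {
        if (best[i] < 0) continue;         // prefix not reachable
        for (size_t len = 1; i + len <= n; ++len) {
            auto it = vocab.find(text.substr(i, len));
            if (it == vocab.end()) continue;
            // score a match by its squared length so longer tokens are preferred
            const int score = best[i] + (int)(len * len);
            if (score > best[i + len]) {
                best[i + len] = score;
                prev[i + len] = (int) i;
                tok [i + len] = it->second;
            }
        }
    }

    std::vector<int32_t> out;
    for (int i = (int) n; i > 0; i = prev[i]) {
        out.push_back(tok[i]);
    }
    return {out.rbegin(), out.rend()};     // reverse into original order
}
```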
The README tells people to use the command line option "-t 8", causing 8
threads to be started. On systems with fewer than 8 cores, this causes a
significant slowdown. Remove the option from the example command lines
and use /proc/cpuinfo on Linux to determine a sensible default.
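A minimal sketch of how such a default could be derived, falling back to std::thread::hardware_concurrency() when /proc/cpuinfo is unavailable. The function name and fallback value are illustrative, not necessarily what the change uses:

```cpp
#include <cstdio>
#include <cstring>
#include <thread>

// Count "processor" entries (logical CPUs) in /proc/cpuinfo on Linux;
// fall back to std::thread::hardware_concurrency(), then to 4.
static int get_default_n_threads() {
    int n = 0;
#ifdef __linux__
    if (FILE * f = std::fopen("/proc/cpuinfo", "r")) {
        char line[256];
        while (std::fgets(line, sizeof(line), f)) {
            if (std::strncmp(line, "processor", 9) == 0) {
                ++n;
            }
        }
        std::fclose(f);
    }
#endif
    if (n == 0) {
        n = (int) std::thread::hardware_concurrency();
    }
    return n > 0 ? n : 4;
}
```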
* Add AVX2 version of ggml_vec_dot_q4_1
* Small optimisations to q4_1 dot product (@Const-me)
* Rearrange Q4_1 quantization to work for multipart models. (Fix #152)
* Fix ggml_vec_mad_q4_1 too
* Fix non-vectorised q4_1 vec mul
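For orientation, a scalar reference of a Q4_1 dot product, assuming each 32-element block stores a scale d, a minimum m, and packed 4-bit quants. Struct layout and names here are illustrative, not the exact ggml definitions; the AVX2 version replaces the inner loop:

```cpp
#include <cstdint>
#include <cstddef>

#define QK 32   // elements per quantization block (illustrative)

struct block_q4_1 {
    float   d;           // scale
    float   m;           // minimum
    uint8_t qs[QK / 2];  // nibbles: low 4 bits = even element, high 4 bits = odd element
};

// Scalar reference dot product: each element dequantizes to d*q + m,
// so products are accumulated block by block.
static float vec_dot_q4_1_ref(size_t n, const block_q4_1 * x, const block_q4_1 * y) {
    float sum = 0.0f;
    for (size_t b = 0; b < n / QK; ++b) {
        for (int i = 0; i < QK / 2; ++i) {
            const int x0 = x[b].qs[i] & 0x0F, x1 = x[b].qs[i] >> 4;
            const int y0 = y[b].qs[i] & 0x0F, y1 = y[b].qs[i] >> 4;
            sum += (x[b].d * x0 + x[b].m) * (y[b].d * y0 + y[b].m);
            sum += (x[b].d * x1 + x[b].m) * (y[b].d * y1 + y[b].m);
        }
    }
    return sum;
}
```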
* Added ctx_size parameter
* Added it in more places
* Apply suggestions from code review
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
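A sketch of what adding such a parameter could look like, assuming a gpt_params-style struct and a -c/--ctx_size flag; both names are placeholders, not confirmed by the commit:

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical subset of the parameter struct; field and flag names are illustrative.
struct gpt_params {
    int n_ctx     = 512;   // context size (the new ctx_size parameter)
    int n_threads = 4;
};

// Parse "-c/--ctx_size N" alongside the existing options.
static bool parse_args(int argc, char ** argv, gpt_params & params) {
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "-c") == 0 || std::strcmp(argv[i], "--ctx_size") == 0) {
            if (++i >= argc) return false;
            params.n_ctx = std::atoi(argv[i]);
        }
    }
    return true;
}
```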
* Initial work on interactive mode.
* Improve interactive mode. Make reverse prompt optional.
* Update README to explain interactive mode.
* Fix OS X build
* Add back top_k
* Update utils.cpp
* Update utils.h
---------
Co-authored-by: Bill Hamilton <bill.hamilton@shopify.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
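A minimal sketch of the reverse-prompt check that interactive mode relies on, assuming the generated text and the reverse prompt are plain strings. Names are placeholders, not the actual llama.cpp code:

```cpp
#include <string>

// When the tail of the generated text ends with the (optional) reverse prompt,
// stop generating and hand control back to the user.
static bool hit_reverse_prompt(const std::string & generated, const std::string & antiprompt) {
    if (antiprompt.empty()) {
        return false;   // reverse prompt is optional
    }
    return generated.size() >= antiprompt.size() &&
           generated.compare(generated.size() - antiprompt.size(),
                             antiprompt.size(), antiprompt) == 0;
}
```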
* Apply fixes suggested to build on Windows
Issue: https://github.com/ggerganov/llama.cpp/issues/22
* Remove unsupported VLAs
* MSVC: Remove features that are only available on MSVC C++20.
* Fix zero initialization of the other fields.
* Use std::vector in place of stack allocations.
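An illustrative example of the VLA-to-vector change, assuming a buffer previously sized by a runtime value; names are placeholders:

```cpp
#include <vector>

// MSVC does not support C variable-length arrays, so a buffer like
//     float logits[n_vocab];               // VLA, not portable
// can be replaced with a heap-backed vector of the same size:
void example(int n_vocab) {
    std::vector<float> logits(n_vocab, 0.0f);   // zero-initialized, works on MSVC
    // ... use logits.data() where a float* is expected ...
}
```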
* Adding repeat penalization
* Update utils.h
* Update utils.cpp
* Numeric fix
Should probably still scale by temp even if penalized
* Update comments; apply the penalty more properly
Logits can go negative, so adopt the fix from the referenced commit
* Minor formatting
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
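A minimal sketch of the penalty described above, assuming a CTRL-style repetition penalty where negative logits are multiplied rather than divided so the penalty never makes a repeated token more likely; names and the example value are placeholders, and temperature scaling would still be applied afterwards:

```cpp
#include <cstdint>
#include <vector>

// Tokens that appear in `last_n_tokens` get their logit pushed toward
// "less likely". Dividing by the penalty is only correct for positive
// logits; negative logits are multiplied instead.
static void apply_repeat_penalty(std::vector<float> & logits,
                                 const std::vector<int32_t> & last_n_tokens,
                                 float penalty /* e.g. 1.3f */) {
    for (const int32_t id : last_n_tokens) {
        if (logits[id] < 0.0f) {
            logits[id] *= penalty;
        } else {
            logits[id] /= penalty;
        }
    }
}
```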