llama.cpp

aditya/llama.cpp

mirror of https://git.adityakumar.xyz/llama.cpp.git synced 2024-11-08 15:09:44 +00:00

Commit graph

Author	SHA1	Message	Date
Georgi Gerganov	53aba3f393	clang-tidy : restore dot file from accidental deletion	2023-06-08 10:09:08 +03:00
Kawrakow	4161bdc04d	metal : add Q4_K implementation (#1733) * Metal implementation for Q4_K Very slow for now: 42 ms / token, Q4_0 runs in 28 ms/token on my 30-core M2 Max GPU. * Optimizing Q4_K on metal The first token always takes longer, I guess because the metal kernel is being jit-compiled. So, using n = 128 to measure time. At this point Q4_K takes 29.5 ms / token compared to 27.2 ms / token for Q4_0. Quite a bit better than the initial attempt, but still not good enough. * Optimizing q4_K metal dot some more For n = 256 it is now 28.1 ms/token compared to 27 ms/token for q4_0. * Fix after merge with master --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-06-08 10:08:23 +03:00
slaren	553fd4d4b5	Add clang-tidy reviews to CI (#1407)	2023-05-12 15:40:53 +02:00
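
The per-token timings quoted in commit 4161bdc04d are easier to compare as relative overhead. The sketch below is an illustration only: the numbers are copied from the commit message, while the stage labels and the `overhead_percent` helper are hypothetical names introduced here and are not part of llama.cpp.

```python
# Per-token latencies in milliseconds, as quoted in the message of commit 4161bdc04d.
# Stage labels are illustrative; the first measurement does not state its n.
timings = [
    ("first attempt", 42.0, 28.0),
    ("n = 128", 29.5, 27.2),
    ("n = 256", 28.1, 27.0),
]


def overhead_percent(q4_k_ms: float, q4_0_ms: float) -> float:
    """Extra time per token for Q4_K relative to Q4_0, in percent."""
    return (q4_k_ms / q4_0_ms - 1.0) * 100.0


for label, q4_k_ms, q4_0_ms in timings:
    pct = overhead_percent(q4_k_ms, q4_0_ms)
    print(f"{label}: Q4_K is {pct:.1f}% slower than Q4_0")
```

Read this way, the gap narrows from 50% in the first attempt to roughly 4% at n = 256. The measurements use different n, so they are not a strictly like-for-like comparison.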

3 commits