Commit graph

24 commits

Author SHA1 Message Date
Georgi Gerganov
ecb217db4f
llama : Metal inference (#1642)
* mtl : export the LLaMA computation graph

* ci : disable temporary

* mtl : adapt the MNIST example as starter

* mtl : no need for mtl-export tool, add cli arg for main instead

* mtl : export just a small part of the graph for now to make it easier

* mtl : move MSL code into separate file for easy editing

* mtl : initial get_rows_q4_0 kernel

* mtl : confirmed get_rows_q4_0 is working correctly

* mtl : add rms_norm kernel + confirm working

* mtl : add mul kernel + confirm working

* mtl : initial mul_mat Q4 kernel (wrong results)

* mtl : mul_mat fixes (still wrong)

* mtl : another mul_mat Q4 (still does not work)

* mtl : working mul_mat q4

* ggml : fix handling of "view" ops in ggml_graph_import()

* mtl : add rope kernel

* mtl : add reshape and transpose handling

* ggml : store offset as opt arg for ggml_view_xd() operators

* mtl : add cpy kernel + handle view ops

* mtl : confirm f16 x f32 attention mul mat

* mtl : add scale kernel

* mtl : add diag_mask_inf kernel

* mtl : fix soft_max kernel

* ggml : update ggml_nbytes() to handle non-contiguous tensors

* mtl : verify V tensor contents

* mtl : add f32 -> f32 cpy kernel

* mtl : add silu kernel

* mtl : add non-broadcast mul kernel

* mtl : full GPU inference of the computation graph

* mtl : optimize rms_norm and soft_max kernels

* mtl : add f16 mat x f32 vec multiplication kernel

* mtl : fix bug in f16 x f32 mul mat + speed-up computation

* mtl : faster mul_mat_q4_0_f32 kernel

* mtl : fix kernel signature + roll inner loop

* mtl : more threads for rms_norm + better timing

* mtl : remove printfs from inner loop

* mtl : simplify implementation

* mtl : add save/load vocab to ggml file

* mtl : plug Metal inference into llama.cpp (very quick-n-dirty)

* mtl : make it work with main example

Lots of hacks but at least now it generates text

* mtl : preparing for merge

* mtl : clean-up ggml mtl interface + suport scratch / inplace

* mtl : remove temp / debug code

* metal : final refactoring and simplification

* Revert "ci : disable temporary"

This reverts commit 98c267fc77fe811082f672538fc91bcfc9072d63.

* metal : add comments

* metal : clean-up stuff, fix typos

* readme : add Metal instructions

* readme : add example for main
2023-06-04 23:34:30 +03:00
Vladimir Zorin
337aea1139
examples : add --alias option to gpt_params to set use friendly model name (#1614) 2023-05-28 20:14:24 +03:00
Georgi Gerganov
4b7e245adf
minor : fix compile warnings 2023-05-19 20:14:51 +03:00
Stephan Walter
dc271c52ed
Remove unused n_parts parameter (#1509) 2023-05-17 22:12:01 +00:00
András Salamon
9560655409
define default model path once, sync path with readme (#1366) 2023-05-16 17:46:34 +02:00
Johannes Gäßler
905d87b70a
ggml : GPU-accelerated token generation (#1412)
* CUDA kernel for q4_0 dequant. + mat. vec. mult.

* Added q4_1 via template

* Added missing __syncthreads();

* --gpu_layers -> --gpu-layers

* Shorter dequantize_mul_mat_vec line

* q5_0 dequantize_mul_mat kernel

* More readable dequantize_mul_mat_vec logic

* dequantize_mul_mat_vec kernels for q5_1, q8_0, f16

* llama : offload "output" tensor to GPU too + coding style fixes

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2023-05-13 16:38:36 +03:00
Evan Jones
cf348a60e0
main : add option to save full output to session (#1338)
* main : add option to save full output to session

* split behavior into --session and --prompt-cache

* restore original implementation with new names

* PR comments

* move the check for incompatible parameters to gpt_params_parse

* Fix whitespace

Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>

---------

Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
2023-05-10 11:37:14 -04:00
DannyDaemonic
41654efea8
Interface improvements and --multiline-input (previously --author-mode) (#1040)
* Interface improvements
* Multiline input
* Track character width
* Works with all characters and control codes + Windows console fixes
2023-05-08 19:45:48 -07:00
44670
2edbdb0f99
main : add --in-suffix option (#1318)
* adding --in-suffix option

* print input suffix before generation
2023-05-04 18:41:12 +03:00
Ron Evans
67c77799e0
examples : add llama_init_from_gpt_params() common function (#1290)
Signed-off-by: deadprogram <ron@hybridgroup.com>
2023-05-02 23:39:51 +03:00
jon-chuang
a5d30b1f53
common : better default number of threads (#934)
* commit

* fix

* try-catch

* apply code review

* improve

* improve

* add macos headers

* done

* remove color

* fix windows

* minor

* fix

* Apply suggestions from code review

Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>

* remove

* minor

* minor

---------

Co-authored-by: jon-chuang <jon-chuang@users.noreply.github.com>
Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
2023-04-30 21:41:35 +03:00
Georgi Gerganov
334637e43e
common : change default parameters to pre-#1126 (#1223) 2023-04-29 09:51:06 +03:00
Ivan Stepanov
dd7eff57d8
llama : new sampling algorithms (#1126)
* Sample interface, new samplers.

New samplers:
- locally typical sampling
- tail free sampling
- frequency and presence penalty
- mirostat

Ignore EOS fix: -inf should be used.

* mirostat

* Added --logit-bias and --no-penalize-nl, removed std::span

* Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and *k)

Use C++11, clarify llama API documentation, rename Mirostat parameters to --mirostat_lr and --mirostat_ent, add temperature sampling for Mirostat, simplify Mirostat sampling API parameters (removed N and *k)

* Save and load example adjust

* Tests

* Windows build fix

* Windows test fix
2023-04-29 08:34:41 +03:00
Evan Jones
1481a9cf25
llama : add session file format and saved sessions in main (#1169) 2023-04-28 18:59:37 +03:00
mgroeber9110
9b0a4d4214
examples/main README improvements and some light refactoring (#1131) 2023-04-24 15:45:32 +00:00
eiery
10f19c1121
llama : have n_batch default to 512 (#1091)
* set default n_batch to 512 when using BLAS

* spacing

* alternate implementation of setting different n_batch for BLAS

* set n_batch to 512 for all cases
2023-04-22 11:27:05 +03:00
slaren
315a95a4d3
Add LoRA support (#820) 2023-04-17 17:28:55 +02:00
Pavol Rusnak
c85e03d12e
Revert "main : alternative instruct mode (Vicuna support, etc.) (#863)" (#982)
This reverts commit f4d277ae17.
2023-04-14 22:58:43 +03:00
Tomáš Pazdiora
f4d277ae17
main : alternative instruct mode (Vicuna support, etc.) (#863)
* Add support for configs, add configurable prefixes / suffixes, deprecate instruct mode, add stop prompt

* Add multiline mode, update text input.

* bugfix

* update implementation

* typos

* Change --multiline implementation to be toggled by EOF.

* bugfix

* default multiline mode

* add more configs

* update formating

* update formatting

* apply suggestions
2023-04-14 18:19:17 +03:00
comex
f963b63afa Rewrite loading code to try to satisfy everyone:
- Support all three formats (ggml, ggmf, ggjt).  (However, I didn't
  include the hack needed to support GPT4All files without conversion.
  Those can still be used after converting them with convert.py from my
  other PR.)

- Support both mmap and read (mmap is used by default, but can be
  disabled with `--no-mmap`, and is automatically disabled for pre-ggjt
  files or on platforms where mmap is not supported).

- Support multi-file models like before, but automatically determine the
  number of parts rather than requiring `--n_parts`.

- Improve validation and error checking.

- Stop using the per-file type field (f16) entirely in favor of just
  relying on the per-tensor type/size fields.  This has no immediate
  benefit, but makes it easier to experiment with different formats, and
  should make it easier to support the new GPTQ-for-LLaMa models in the
  future (I have some work in progress on that front).

- Support VirtualLock on Windows (using the same `--mlock` option as on
  Unix).

    - Indicate loading progress when using mmap + mlock.  (Which led me
      to the interesting observation that on my Linux machine, with a
      warm file cache, mlock actually takes some time, whereas mmap
      without mlock starts almost instantly...)

      - To help implement this, move mlock support from ggml to the
        loading code.

- madvise/PrefetchVirtualMemory support (based on #740)

- Switch from ifstream to the `fopen` family of functions to avoid
  unnecessary copying and, when mmap is enabled, allow reusing the same
  file descriptor for both metadata reads and mmap (whereas the existing
  implementation opens the file a second time to mmap).

- Quantization now produces a single-file output even with multi-file
  inputs (not really a feature as much as 'it was easier this way').

Implementation notes:

I tried to factor the code into more discrete pieces than before.

Regarding code style: I tried to follow the code style, but I'm naughty
and used a few advanced C++ features repeatedly:

- Destructors to make it easier to ensure everything gets cleaned up.

- Exceptions.  I don't even usually use exceptions when writing C++, and
  I can remove them if desired... but here they make the loading code
  much more succinct while still properly handling a variety of errors,
  ranging from API calls failing to integer overflow and allocation
  failure.  The exceptions are converted to error codes at the
  API boundary.)

Co-authored-by: Pavol Rusnak <pavol@rusnak.io> (for the bit I copied from #740)
2023-04-10 01:10:46 +02:00
Tomáš Pazdiora
aaf3b23deb
fix for windows utf-8 input (#840)
Use UTF-16 as input on Windows, since UTF-8 does not work and reads multibyte characters as zeros
2023-04-08 17:49:39 +02:00
anzz1
7b8dbcb78b
main.cpp fixes, refactoring (#571)
- main: entering empty line passes back control without new input in interactive/instruct modes
- instruct mode: keep prompt fix
- instruct mode: duplicate instruct prompt fix
- refactor: move common console code from main->common
2023-03-28 17:09:55 +03:00
Georgi Gerganov
e2d490dafd
Inifinite generation via context swapping (#71) 2023-03-25 21:36:22 +02:00
Georgi Gerganov
a316a425d0
Overhaul the examples structure
- main -> examples
- utils -> examples (renamed to "common")
- quantize -> examples
- separate tools for "perplexity" and "embedding"

Hope I didn't break something !
2023-03-25 20:26:40 +02:00
Renamed from utils.h (Browse further)