* MPI support, first cut
* fix warnings, update README
* fixes
* wrap includes
* PR comments
* Update CMakeLists.txt
* Add GH workflow, fix test
* Add info to README
* mpi : trying to move more MPI stuff into ggml-mpi (WIP) (#2099)
* mpi : add names for layer inputs + prep ggml_mpi_graph_compute()
* mpi : move all MPI logic into ggml-mpi
Not tested yet
* mpi : various fixes - communication now works but results are wrong
* mpi : fix output tensor after MPI compute (still not working)
* mpi : fix inference
* mpi : minor
* Add OpenMPI to GH action
* [mpi] continue-on-error: true
* mpi : fix after master merge
* [mpi] Link MPI C++ libraries to fix OpenMPI
* tests : fix new llama_backend API
* [mpi] use MPI_INT32_T
* mpi : factor out recv / send in functions and reuse
* mpi : extend API to allow usage with outer backends (e.g. Metal)
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama : make model stateless and context stateful
* llama : minor cleanup
* llama : update internal API declaration
* Apply suggestions from code review
fix style
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Missing model memory release
* Fix style
* Add deprecated warning for public API function llama_init_from_file
* Update public API use cases: move away from deprecated llama_init_from_file
* Deprecate public API function llama_apply_lora_from_file
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Replaced static initialization of complex objects with a initialization on first use. This prevents an undefined behavior on program run, for example, crash in Release build, works in Debug build
* replaced use of auto with exact type to avoid using -std=c++14
* Made the assessors functions for static maps be static const
- main -> examples
- utils -> examples (renamed to "common")
- quantize -> examples
- separate tools for "perplexity" and "embedding"
Hope I didn't break something !
* Major refactoring - introduce C-style API
* Clean up
* Add <cassert>
* Add <iterator>
* Add <algorithm> ....
* Fix timing reporting and accumulation
* Measure eval time only for single-token calls
* Change llama_tokenize return meaning
* Add test-tokenizer-0 to do a few tokenizations - feel free to expand
* Added option to convert-pth-to-ggml.py script to dump just the vocabulary
* Added ./models/ggml-vocab.bin containing just LLaMA vocab data (used for tests)
* Added utility to load vocabulary file from previous point (temporary implementation)
* Avoid using std::string_view and drop back to C++11 (hope I didn't break something)
* Rename gpt_vocab -> llama_vocab
* All CMake binaries go into ./bin/ now