llama.cpp

mirror of https://git.adityakumar.xyz/llama.cpp.git synced 2024-11-14 08:59:45 +00:00

Author	SHA1	Message	Date
Georgi Gerganov	2d5db48371	ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 (#1508 ) * ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0 * llama : bump LLAMA_FILE_VERSION to 3 * cuda : update Q4 and Q8 dequantize kernels * ggml : fix AVX dot products * readme : update performance table + hot topics	2023-05-19 22:17:18 +03:00
David Kennedy	79e3efb0e9	readme : adds WizardLM to the list of supported models (#1485 )	2023-05-19 20:16:30 +03:00
Georgi Gerganov	cdd5350892	readme : update Q4_0 perplexities I think these were affected by the removal of the `round` during quantization	2023-05-13 09:12:44 +03:00
Rinne	089b1c93ba	readme : add C#/.NET bindings repo (#1409 )	2023-05-12 08:39:40 +03:00
Georgi Gerganov	b9fd7eee57	ggml : remove bit shuffling (#1405 ) * ggml : remove Q4_0 bit shufling (ARM NEON) * ggml : remove Q4_1 bit shuffling (ARM NEON + reference) * ggml : nibbles_from_floats() + bytes_from_nibbles() (ARM NEON) * ggml : remove Q4_2 bit shuffling (WIP, BROKEN) * ggml : remove Q5_0 bit shuffling (ARM NEON) * ggml : 2x faster scalar implementations * ggml : remove Q5_1 bit shuffling (ARM NEON + scalar) * ggml : simplify scalar dot * ggml : remove WASM SIMD bit shuffling + remove vzip for ARM 32-bit * ggml : fix Q4_1 quantization * ggml : update cuBLAS + normalize variable names * ggml : remove Q4_2 mode * ggml : minor formatting * ggml : fix Q5_0 quantization * scripts : add script for measuring the time per token * AVX implementations (#1370) * ggml : uniform 5th bit extraction * llama : produce error upon loading old model files * llama : fix model magic/version write * ggml : speed-up Q5_0 + Q5_1 at 4 threads * ggml : preserve old Q4 and Q5 formats * ggml : simplify Q8_1 - no need for low / high sums anymore * ggml : fix Q8_0 and Q8_1 rounding * Revert "AVX implementations (#1370)" This reverts commit 948d124837f9d287d8490f41338e0e4cceb0814f. * ggml : fix AVX2 implementation * sha : update hashes for 7B and 13B * readme : update timings + remove warning banner * llama : update v2 PR number to 1405 * ggml : fix WASM comments * ggml : back to original bit order * readme : add note that Q4 and Q5 have been changed * llama : fix return for unknown version --------- Co-authored-by: Stephan Walter <stephan@walter.name>	2023-05-12 00:23:08 +03:00
Georgi Gerganov	56551bc11f	readme : add notice about upcoming breaking change	2023-05-08 22:52:18 +03:00
AlpinDale	fe60904eef	readme : add TOC and Pygmalion instructions (#1359 )	2023-05-08 19:33:30 +03:00
Georgi Gerganov	f9a6364912	llama : require first token to be BOS (#1303 ) * llama : require first token to be BOS * scripts : add ppl-run-all.sh * perplexity : add BOS for each chunk * readme : update perplexity values after BOS fix * perplexity : add clarifying comments	2023-05-08 17:41:54 +03:00
Johannes Gäßler	1f48b0abcf	Documented CUDA reproducibility, added warning (#1346 )	2023-05-08 02:42:01 +02:00
DaniAndTheWeb	173d0e6419	makefile: automatic Arch Linux detection (#1332 ) This commit is a port of a detection method used in koboldcpp's Makefile in order to automatically set the -lcblas option on Arch Linux	2023-05-05 23:57:14 +02:00
Pavol Rusnak	921dcee00a	readme: add missing info (#1324 )	2023-05-05 16:43:36 +02:00
44670	360cfe5bec	readme : add OpenBuddy link (#1321 )	2023-05-04 19:33:31 +03:00
Georgi Gerganov	bca9ad938a	minor : fix whitespaces (#1302 )	2023-05-03 20:09:42 +03:00
KASR	b0c71c7b6d	scripts : platform independent script to verify sha256 checksums (#1203 ) * python script to verify the checksum of the llama models Added Python script for verifying SHA256 checksums of files in a directory, which can run on multiple platforms. Improved the formatting of the output results for better readability. * Update README.md update to the readme for improved readability and to explain the usage of the python checksum verification script * update the verification script I've extended the script based on suggestions by @prusnak The script now checks the available RAM, is there is enough to check the file at once it will do so. If not the file is read in chunks. * minor improvment small change so that the available ram is checked and not the total ram * remove the part of the code that reads the file at once if enough ram is available based on suggestions from @prusnak i removed the part of the code that checks whether the user had enough ram to read the entire model at once. the file is now always read in chunks. * Update verify-checksum-models.py quick fix to pass the git check	2023-05-03 18:31:28 +03:00
Stephan Walter	36d19a603b	Remove Q4_3 which is no better than Q5 (#1218 )	2023-04-28 23:10:43 +00:00
Georgi Gerganov	7f15c5c477	readme : update hot topics	2023-04-28 21:32:52 +03:00
Folko-Ven	78ec543733	Correcting link to w64devkit (#1214 ) Correcting link to w64devkit (change seeto to skeeto).	2023-04-28 16:22:48 +02:00
Georgi Gerganov	f9be42add0	readme : add quantization info	2023-04-26 23:24:42 +03:00
DaniAndTheWeb	ea3ad7eb60	Updating build instructions to include BLAS support (#1183 ) * Updated build information First update to the build instructions to include BLAS. * Update README.md * Update information about BLAS * Better BLAS explanation Adding a clearer BLAS explanation and adding a link to download the CUDA toolkit. * Better BLAS explanation * BLAS for Mac Specifying that BLAS is already supported on Macs using the Accelerate Framework. * Clarify the effect of BLAS * Windows Make instructions Added the instructions to build with Make on Windows * Fixing typo * Fix trailing whitespace	2023-04-26 22:03:03 +02:00
Pavol Rusnak	859fee6dfb	quantize : use `map` to assign quantization type from `string` (#1191 ) instead of `int` (while `int` option still being supported) This allows the following usage: `./quantize ggml-model-f16.bin ggml-model-q4_0.bin q4_0` instead of: `./quantize ggml-model-f16.bin ggml-model-q4_0.bin 2`	2023-04-26 18:43:27 +02:00
mgroeber9110	9b0a4d4214	examples/main README improvements and some light refactoring (#1131 )	2023-04-24 15:45:32 +00:00
Pavol Rusnak	c6524f46eb	readme : update gpt4all instructions (#980 )	2023-04-23 10:21:26 +02:00
CRD716	834695fe3a	Minor: Readme fixed grammar, spelling, and misc updates (#1071 )	2023-04-19 19:52:14 +00:00
Georgi Gerganov	7cd5c4a3e9	readme : add warning about Q4_2 and Q4_3	2023-04-19 19:07:54 +03:00
Georgi Gerganov	7faa7460f0	readme : update hot topics about new LoRA functionality	2023-04-18 20:10:26 +03:00
Atsushi Tatsuma	e9298af389	readme : add Ruby bindings (#1029 )	2023-04-17 22:34:35 +03:00
comex	723dac55fa	py : new conversion script (#545 ) Current status: Working, except for the latest GPTQ-for-LLaMa format that includes `g_idx`. This turns out to require changes to GGML, so for now it only works if you use the `--outtype` option to dequantize it back to f16 (which is pointless except for debugging). I also included some cleanup for the C++ code. This script is meant to replace all the existing conversion scripts (including the ones that convert from older GGML formats), while also adding support for some new formats. Specifically, I've tested with: - [x] `LLaMA` (original) - [x] `llama-65b-4bit` - [x] `alpaca-native` - [x] `alpaca-native-4bit` - [x] LLaMA converted to 'transformers' format using `convert_llama_weights_to_hf.py` - [x] `alpaca-native` quantized with `--true-sequential --act-order --groupsize 128` (dequantized only) - [x] same as above plus `--save_safetensors` - [x] GPT4All - [x] stock unversioned ggml - [x] ggmh There's enough overlap in the logic needed to handle these different cases that it seemed best to move to a single script. I haven't tried this with Alpaca-LoRA because I don't know where to find it. Useful features: - Uses multiple threads for a speedup in some cases (though the Python GIL limits the gain, and sometimes it's disk-bound anyway). - Combines split models into a single file (both the intra-tensor split of the original and the inter-tensor split of 'transformers' format files). Single files are more convenient to work with and more friendly to future changes to use memory mapping on the C++ side. To accomplish this without increasing memory requirements, it has some custom loading code which avoids loading whole input files into memory at once. - Because of the custom loading code, it no longer depends in PyTorch, which might make installing dependencies slightly easier or faster... although it still depends on NumPy and sentencepiece, so I don't know if there's any meaningful difference. In any case, I also added a requirements.txt file to lock the dependency versions in case of any future breaking changes. - Type annotations checked with mypy. - Some attempts to be extra user-friendly: - The script tries to be forgiving with arguments, e.g. you can specify either the model file itself or the directory containing it. - The script doesn't depend on config.json / params.json, just in case the user downloaded files individually and doesn't have those handy. But you still need tokenizer.model and, for Alpaca, added_tokens.json. - The script tries to give a helpful error message if added_tokens.json is missing.	2023-04-14 10:03:03 +03:00
CRD716	ec29272175	readme : remove python 3.10 warning (#929 )	2023-04-13 16:59:53 +03:00
Genkagaku.GPT	7e941b95eb	readme : llama node binding (#911 ) * chore: add nodejs binding * chore: add nodejs binding	2023-04-13 16:54:27 +03:00
Judd	4579af95e8	zig : update build.zig (#872 ) * update * update readme * minimize the changes. --------- Co-authored-by: zjli2019 <zhengji.li@ingchips.com>	2023-04-13 16:43:22 +03:00
Georgi Gerganov	f76cb3a34d	readme : change "GPU support" link to discussion	2023-04-12 14:48:57 +03:00
Georgi Gerganov	782438070f	readme : update hot topics with link to "GPU support" issue	2023-04-12 14:31:12 +03:00
Nicolai Weitkemper	4dbbd40750	readme: link to sha256sums file (#902 ) This is to emphasize that these do not need to be obtained from elsewhere.	2023-04-12 08:46:20 +02:00
Pavol Rusnak	8b679987cd	Fix whitespace, add .editorconfig, add GitHub workflow (#883 )	2023-04-11 19:45:44 +00:00
qouoq	a0caa34b16	Add BAIR's Koala to supported models (#877 )	2023-04-10 22:41:53 +02:00
Pavol Rusnak	d2beca95dc	Make docker instructions more explicit (#785 )	2023-04-06 08:56:58 +02:00
Georgi Gerganov	3416298929	Update README.md	2023-04-05 19:54:30 +03:00
Georgi Gerganov	8d10406d6e	readme : change logo + add bindings + add uis + add wiki	2023-04-05 18:56:20 +03:00
Adithya Balaji	594cc95fab	readme : update with CMake and windows example (#748 ) * README: Update with CMake and windows example * README: update with code-review for cmake build	2023-04-05 17:36:12 +03:00
Thatcher Chamberlin	d8d4e865cd	Add a missing step to the gpt4all instructions (#690 ) `migrate-ggml-2023-03-30-pr613.py` is needed to get gpt4all running.	2023-04-02 12:48:57 +02:00
rimoliga	d0a7f742e7	readme: replace termux links with homepage, play store is deprecated (#680 )	2023-04-01 16:57:30 +02:00
Pavol Rusnak	9733104be5	drop quantize.py (now that models are using a single file)	2023-03-31 01:07:32 +02:00
Georgi Gerganov	3df890aef4	readme : update supported models	2023-03-30 22:31:54 +03:00
Georgi Gerganov	b467702b87	readme : fix typos	2023-03-29 19:38:31 +03:00
Georgi Gerganov	516d88e75c	readme : add GPT4All instructions (close #588 )	2023-03-29 19:37:20 +03:00
Stephan Walter	b391579db9	Update README and comments for standalone perplexity tool (#525 )	2023-03-26 16:14:01 +03:00
Georgi Gerganov	348d6926ee	Add logo to README.md	2023-03-26 10:20:49 +03:00
Georgi Gerganov	55ad42af84	Move chat scripts into "./examples"	2023-03-25 20:37:09 +02:00
Georgi Gerganov	4a7129acd2	Remove obsolete information from README	2023-03-25 16:30:32 +02:00
Gary Mulder	f4f5362edb	Update README.md (#444 ) Added explicit bolded instructions clarifying that people need to request access to models from Facebook and never through through this repo.	2023-03-24 15:23:09 +00:00


111 commits