llama.cpp/utils.h

// Various helper functions and utilities

#pragma once

#include "llama.h"

#include <algorithm> // std::min
#include <string>
#include <vector>
#include <random>
#include <thread>

//
// CLI argument parsing
//

struct gpt_params {
    int32_t seed          = -1;  // RNG seed
    int32_t n_threads     = std::min(4, (int32_t) std::thread::hardware_concurrency()); // threads to use, capped at 4 (hardware_concurrency() may return 0)
    int32_t n_predict     = 128; // new tokens to predict
    int32_t repeat_last_n = 64;  // last n tokens to penalize
    int32_t n_parts       = -1;  // number of model parts (-1 = determine from model dimensions)
    int32_t n_ctx         = 512; // context size

    // sampling parameters
    int32_t top_k = 40;
    float   top_p = 0.95f;
    float   temp  = 0.80f;
    float   repeat_penalty  = 1.10f;

    int32_t n_batch = 8; // batch size for prompt processing

    std::string model  = "models/llama-7B/ggml-model.bin"; // model path
    std::string prompt = "";


    std::vector<std::string> antiprompt; // reverse prompt(s): when one appears in the output, the user is asked for more input

    bool memory_f16        = false; // use f16 instead of f32 for the KV memory
    bool random_prompt     = false; // randomize the prompt if none is provided (off by default)
    bool use_color         = false; // use color to distinguish generations and inputs
    bool interactive       = false; // interactive mode

    bool embedding         = false; // get only sentence embedding
    bool interactive_start = false; // wait for user input immediately

    bool instruct          = false; // instruction mode (used for Alpaca models)
    bool ignore_eos        = false; // do not stop generating after eos
    bool perplexity        = false; // compute perplexity over the prompt
    bool use_mlock         = false; // use mlock to keep model in memory
};

bool gpt_params_parse(int argc, char ** argv, gpt_params & params);

void gpt_print_usage(int argc, char ** argv, const gpt_params & params);

std::string gpt_random_prompt(std::mt19937 & rng);

//
// Vocab utils
//

std::vector<llama_token> llama_tokenize(struct llama_context * ctx, const std::string & text, bool add_bos);
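
The parsing entry point declared above fills a parameter struct from `argv` and reports success as a `bool`. The sketch below illustrates that pattern in isolation. It is not the real `gpt_params_parse`: `mini_params` and `mini_params_parse` are hypothetical names, the struct is a trimmed-down stand-in for `gpt_params`, and only the flags `-i`, `-r`, `--mlock` and `--n_parts N` are handled, with `-r` assumed to imply `-i` and to be repeatable.

```cpp
#include <cassert>
#include <cstdint>
#include <exception>
#include <string>
#include <vector>

// Hypothetical, trimmed-down stand-in for gpt_params (illustration only).
struct mini_params {
    int32_t n_parts     = -1;    // -1 = determine from model dimensions
    bool    interactive = false; // interactive mode
    bool    use_mlock   = false; // use mlock to keep model in memory
    std::vector<std::string> antiprompt; // reverse prompts
};

// Sketch of a parser in the spirit of gpt_params_parse (not the real one).
// Returns false on an unknown flag, a missing value, or a non-numeric value.
bool mini_params_parse(int argc, char ** argv, mini_params & params) {
    for (int i = 1; i < argc; i++) {
        const std::string arg = argv[i];

        if (arg == "-i") {
            params.interactive = true;
        } else if (arg == "--mlock") {
            params.use_mlock = true;
        } else if (arg == "-r" || arg == "--n_parts") {
            // Both flags consume the next argument as their value.
            if (i + 1 >= argc) {
                return false;
            }
            const std::string value = argv[++i];
            if (arg == "-r") {
                params.interactive = true; // assumption: -r implies -i
                params.antiprompt.push_back(value);
            } else {
                try {
                    params.n_parts = std::stoi(value);
                } catch (const std::exception &) {
                    return false; // value was not a number
                }
            }
        } else {
            return false; // unknown flag
        }
    }
    return true;
}
```

A caller would typically default-construct the struct, pass it to the parser together with `argc` and `argv`, and print usage and exit when the parser returns `false`. Keeping the defaults in the struct itself means that flags the user omits keep their documented values.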