Use tokenizer.vocab_size() instead of hardcoding 32000 in convert-pth-to-ggml.py (#142)

Special tokens or other new tokens can be added to the tokenizer, so it is safer not to assume the vocabulary has exactly 32000 tokens.
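
A minimal sketch of the idea, not the converter's actual code: the loop bound comes from the tokenizer, so an added token is picked up without touching the script. `ToyTokenizer` and `dump_vocab` are hypothetical names used only for illustration. The stand-in class mimics the `vocab_size()` and `id_to_piece()` methods of a SentencePiece-style tokenizer so the example runs without a model file.

```python
class ToyTokenizer:
    """Hypothetical stand-in exposing two SentencePiece-style methods."""

    def __init__(self, pieces):
        self._pieces = list(pieces)

    def vocab_size(self):
        # Number of pieces the tokenizer actually knows about.
        return len(self._pieces)

    def id_to_piece(self, i):
        # Text of the piece with id i.
        return self._pieces[i]


def dump_vocab(tokenizer):
    """Collect every piece, asking the tokenizer for its size instead of assuming 32000."""
    return [tokenizer.id_to_piece(i) for i in range(tokenizer.vocab_size())]


# A base vocabulary and the same vocabulary with one added special token.
base = ToyTokenizer(["<unk>", "<s>", "</s>", "▁hello"])
extended = ToyTokenizer(["<unk>", "<s>", "</s>", "▁hello", "<pad>"])

base_vocab = dump_vocab(base)
extended_vocab = dump_vocab(extended)
```

With a hardcoded bound, the added `<pad>` entry would be skipped silently if the bound were too small, and the loop would index past the end of the vocabulary if it were too large.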
2024-11-09 15:29:43 +00:00 · 2023-03-15 12:37:50 -07:00 · 956dfda8ad
commit 956dfda8ad
parent 113e685d18
1 changed file with 1 addition and 1 deletion
--- a/convert-pth-to-ggml.py
+++ b/convert-pth-to-ggml.py
@@ -99,7 +99,7 @@ for p in range(n_parts):
    fout.write(struct.pack("i", ftype))

    # Is this correct??
-    for i in range(32000):
+    for i in range(tokenizer.vocab_size()):
        if tokenizer.is_unknown(i):
            # "<unk>" token (translated as ??)
            text = " \u2047 ".encode("utf-8")