Understanding LLaMA 2: Training Details and Performance Enhancements
Pretraining Data and Batch Size
LLaMA 2 models are trained solely on pretraining data; the reported token counts refer to that pretraining data only. All models use a global batch size of 4M tokens during training.
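As a rough back-of-the-envelope illustration (not a figure from the paper), a 4M-token global batch at LLaMA 2's 4096-token sequence length corresponds to on the order of a thousand sequences per optimizer step:

```python
# Back-of-the-envelope: sequences per global batch.
# Assumes "4M tokens" means 4 * 2^20 and full 4096-token sequences;
# the actual packing and accounting in training may differ.
GLOBAL_BATCH_TOKENS = 4 * 1024 * 1024  # 4M-token global batch
SEQ_LEN = 4096                         # LLaMA 2 context length

sequences_per_step = GLOBAL_BATCH_TOKENS // SEQ_LEN
print(f"~{sequences_per_step} sequences per global batch")  # ~1024
```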
Model-Specific Enhancements
* The larger 70B model employs Grouped-Query Attention (GQA), in which groups of query heads share a single key/value head, to improve inference scalability (see the sketch after this list).
* During testing, the LLaMA-2 70B q3_K_S model at 32k context used arguments tailored for a 16k context size.
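To make the GQA idea concrete, here is a minimal PyTorch sketch, not Meta's reference implementation: several query heads share each key/value head by repeating the K/V projections. Head counts, dimensions, and the function name are illustrative assumptions; RoPE, masking, caching, and the output projection are omitted.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=2):
    """Sketch of GQA: n_q_heads query heads share n_kv_heads K/V heads.

    x:      (batch, seq, dim) input
    wq:     (dim, n_q_heads * head_dim) query projection
    wk, wv: (dim, n_kv_heads * head_dim) key/value projections
    """
    bsz, seq, _ = x.shape
    head_dim = wq.shape[1] // n_q_heads
    group = n_q_heads // n_kv_heads  # query heads per K/V head

    q = (x @ wq).view(bsz, seq, n_q_heads, head_dim).transpose(1, 2)
    k = (x @ wk).view(bsz, seq, n_kv_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(bsz, seq, n_kv_heads, head_dim).transpose(1, 2)

    # Repeat each K/V head so every group of query heads attends to it.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)

    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    attn = F.softmax(scores, dim=-1)
    return (attn @ v).transpose(1, 2).reshape(bsz, seq, -1)

# Tiny usage example with random weights (shapes only).
dim, head_dim, n_q, n_kv = 64, 8, 8, 2
x = torch.randn(1, 5, dim)
wq = torch.randn(dim, n_q * head_dim)
wk = torch.randn(dim, n_kv * head_dim)
wv = torch.randn(dim, n_kv * head_dim)
out = grouped_query_attention(x, wq, wk, wv, n_q, n_kv)
print(out.shape)  # torch.Size([1, 5, 64])
```

The saving comes from the K/V side: with 2 K/V heads instead of 8, the key/value cache is a quarter of the size, which is what makes GQA attractive for large-context inference on the 70B model.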
Enhanced Context Length and Model Sizes
Compared to LLaMA 1, LLaMA 2 models double the context length (4096 tokens vs. 2048). All three available sizes (7B, 13B, and 70B) are trained on 2 trillion tokens.
Fine-Tuning Options and GPU Requirements
LLaMA 2 can be fine-tuned using Amazon SageMaker. The vocab_size parameter is optional and defaults to 32000 (see the sketch below). For LLaMA-65B and 70B, GPUs with at least 40GB of VRAM are recommended for optimal performance.
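As one place where such a vocab_size parameter shows up, Hugging Face's LlamaConfig exposes it with a default of 32000. The snippet below is a minimal sketch that just inspects and overrides that default; it is not tied to any particular SageMaker fine-tuning setup.

```python
# Minimal sketch: vocab_size on Hugging Face's LlamaConfig.
# Assumes the `transformers` library is installed.
from transformers import LlamaConfig

config = LlamaConfig()      # vocab_size is optional...
print(config.vocab_size)    # ...and defaults to 32000

# Overriding it explicitly, e.g. when fine-tuning with an extended tokenizer.
custom = LlamaConfig(vocab_size=32016)
print(custom.vocab_size)    # 32016
```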