AI - performance testing LLMs using LM Studio
Windows 11 laptop: Intel i9 CPU, 64 GB RAM, NVIDIA GPU with 8 GB VRAM
- LM Studio v0.2.12 (Jan 2024)
- settings: temperature 0.5, n_predict 1024, top_k 50, repeat_penalty 1.1, min_p 0.05, CPU threads 4, n_batch 512, experts to use 2 (Mixtral only)
- context length 32768 tokens (dropping to 4096 seems to speed up tokens/sec by ~10%, but otherwise no obvious speed benefit)
- system prompt requiring tree-of-knowledge analysis, etc.
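For reference, the settings above can be expressed as a request payload for LM Studio's OpenAI-compatible local server. This is a sketch only: the model identifier, prompts, and endpoint are placeholders, not the exact ones used in these tests.

```python
import json

# Inference settings from the tests above, packaged as a chat-completion
# payload for LM Studio's local server (commonly http://localhost:1234/v1).
# Model name and messages are illustrative placeholders.
payload = {
    "model": "mixtral-8x7b-instruct",  # placeholder model identifier
    "messages": [
        # stand-in for the tree-of-knowledge system prompt used in the tests
        {"role": "system", "content": "Analyse the question step by step."},
        {"role": "user", "content": "Explain quantization in LLMs."},
    ],
    "temperature": 0.5,
    "max_tokens": 1024,      # n_predict
    "top_k": 50,
    "repeat_penalty": 1.1,
    "min_p": 0.05,
}

print(json.dumps(payload, indent=2))
```

Note that top_k, repeat_penalty, and min_p are llama.cpp-style sampling parameters rather than standard OpenAI fields; LM Studio accepts some of these as extensions, so check the server's docs for the exact names in your version.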
| model | model size | GPU layers | load all into RAM | time to 1st token (s) | generation time (s) | tokens/sec | RAM used | response quality |
|---|---|---|---|---|---|---|---|---|
| Mistral 7B q6_K | 5.94 GB | all 32 | No | 35, 11 | 24, 37 | 12.7, 11.8 | 11 GB | OK - only just |
| Mixtral 8×7B q2 | 15.64 GB | 14 of 32 | No | 60, 36 | 76, 103 | 4.74, 4.5 | 20 GB | OK - only just |
| Mixtral 8×7B q3_K_M | 20.36 GB | 9 of 32 | No | 632, 78, 121 | 105, 96, 91 | 3.8 | 26 GB | excellent |
| Mixtral 8×7B q3_K_M + RAM | 20.36 GB | 9 of 32 | Yes | 826, 146, 91 | 105, 77, 77 | 3.8, 3.86, 3.89 | 25 GB | excellent |
| Mixtral 8×7B q4_K_M | 26.44 GB | 9 of 32 | No | 278, 236 | 37 | 4 | 33 GB | excellent |
| Mixtral 8×7B q4_K_M + RAM | 26.44 GB | 9 of 32 | Yes | 860, 151 | 54, 92 | 4, 3.8 | 32 GB | excellent |
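A rough way to compare these rows is to estimate end-to-end latency as time-to-first-token plus generation time (reply length divided by tokens/sec). A minimal sketch, using figures from the table; the 300-token reply length is an assumed example, not a measured value:

```python
def estimated_response_time(ttft_s: float, n_tokens: int, tokens_per_sec: float) -> float:
    """Rough end-to-end latency: time to first token plus token generation time."""
    return ttft_s + n_tokens / tokens_per_sec

# Mistral 7B q6_K, second run (TTFT 11 s, 11.8 tok/s), assumed 300-token reply:
print(f"{estimated_response_time(11, 300, 11.8):.0f} s")   # ≈ 36 s

# Mixtral 8×7B q3_K_M, second run (TTFT 78 s, 3.8 tok/s), same reply length:
print(f"{estimated_response_time(78, 300, 3.8):.0f} s")    # ≈ 157 s
```

This illustrates why the partially-offloaded Mixtral runs feel much slower in practice: the long time-to-first-token dominates even before the lower tokens/sec is factored in.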
it/ai_lmstudio_tests.txt · Last modified: 2024/01/31 04:32 by wh