Hi, appreciate your valuable contribution!
I'm evaluating my own model, which I added as a new model with a newly registered prompt template. While running the evaluation, I noticed a mismatch in the number of test questions.
Here is my evaluation script:
python -m lcb_runner.runner.main --model deepseek-coder-v1.5-instruct-7b-r2c \
--scenario codegeneration \
--local_model_path ../experiments/deepseek-coder-v1.5-ins.7b.r2c.sft_ps_test_case.iter2.dpo.H100.dp8.v1.0.s42/checkpoint-2400/ \
--release_version "release_v2" --not_fast --n 1 --evaluate --stop "<|EOT|>" --max_tokens 4096 --temperature 0.0
While running the model, I noticed that the tqdm bar shows 400 questions, but I expected 450 questions from 2023-09-01 to 2024-09-01. In addition, after I run
python -m lcb_runner.evaluation.compute_scores --eval_all_file output/DeepSeekR2C/Scenario.codegeneration_1_0.0_eval_all.json --start_date 2023-09-01 --end_date 2024-09-01
I get the following outputs:
238
Pass@1 = 0.24369747899159663
Easy Pass@1 = 0.5529411764705883
Medium Pass@1 = 0.11224489795918367
Hard Pass@1 = 0.0
Pass@5 = 1.0
Easy Pass@5 = 1.0
Medium Pass@5 = 1.0
Hard Pass@5 = 1.0
Pass@10 = 1.0
Easy Pass@10 = 1.0
Medium Pass@10 = 1.0
Hard Pass@10 = 1.0
Pass@25 = 1.0
Easy Pass@25 = 1.0
Medium Pass@25 = 1.0
Hard Pass@25 = 1.0
Pass@50 = 1.0
Easy Pass@50 = 1.0
Medium Pass@50 = 1.0
Hard Pass@50 = 1.0
Pass@100 = 1.0
Easy Pass@100 = 1.0
Medium Pass@100 = 1.0
Hard Pass@100 = 1.0
Pass@150 = 1.0
Easy Pass@150 = 1.0
Medium Pass@150 = 1.0
Hard Pass@150 = 1.0
Pass@200 = 1.0
Easy Pass@200 = 1.0
Medium Pass@200 = 1.0
Hard Pass@200 = 1.0
Pass@1: 0.24369747899159663
Easy Pass@1: 0.5529411764705883
Medium Pass@1: 0.11224489795918367
Hard Pass@1: 0.0
It seems that there are only 238 rows of results. Could you explain why? Did I make a mistake somewhere?
BTW, here is my registered model, following the README:
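To cross-check the 238 figure independently of compute_scores, a minimal sketch like the one below could count how many entries of the eval_all file fall inside the date window, broken down by difficulty. It rests on assumptions that are not confirmed here: that the file is a JSON list of records, that each record carries an ISO-formatted "contest_date" field and a "difficulty" field, and that both ends of the window are inclusive. The helper names count_in_window and load_records are hypothetical.

```python
import json
from collections import Counter
from datetime import datetime


def load_records(path):
    """Load the eval_all file, assumed to be a JSON list of dicts."""
    with open(path) as f:
        return json.load(f)


def count_in_window(records, start_date, end_date, date_key="contest_date"):
    """Count records dated within [start_date, end_date], grouped by difficulty.

    The field names "contest_date" and "difficulty" are assumptions about
    the file layout; adjust them if the actual keys differ.
    """
    start = datetime.strptime(start_date, "%Y-%m-%d")
    end = datetime.strptime(end_date, "%Y-%m-%d")
    counts = Counter()
    for rec in records:
        when = datetime.fromisoformat(rec[date_key])
        if start <= when <= end:  # inclusive on both ends (assumption)
            counts[rec.get("difficulty", "unknown")] += 1
    return sum(counts.values()), counts


# Example usage (path taken from the command above):
# records = load_records(
#     "output/DeepSeekR2C/Scenario.codegeneration_1_0.0_eval_all.json")
# print(len(records))  # total rows before date filtering
# print(count_in_window(records, "2023-09-01", "2024-09-01"))
```

Comparing the unfiltered length with the filtered total would show whether the gap comes from the date window or from the number of problems loaded in the first place.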
LanguageModel(
    "deepseek-coder-v1.5-instruct-7b-r2c",
    "DeepSeekR2C",
    LMStyle.DeepSeekR2C,
    datetime(2023, 1, 1),
    link="https://huggingface.co/chitanda/deepseek-coder-v1.5-instruct-7b-r2c",
)
Thank you very much for your help!