dspy gepa

import marimo

__generated_with = "0.17.6"
app = marimo.App(width="medium")


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    # DSPy Prompt Optimization Lab

    There are several tutorials on prompt optimization with DSPy. The best way to learn from a tutorial is to not follow it
    exactly as written, but to adapt it to try something slightly different.

    Today, we will read the [GEPA for AIME](https://dspy.ai/tutorials/gepa_aime/) tutorial, but adapt it to work with a
    different dataset of math problems and a different set of models. We will also try different optimization metrics
    that go beyond what's presented in the tutorial.

    You can download this notebook to run in Marimo (the *Run or Edit* link in the top-right corner). It will *not* run
    on the web with WebAssembly. Alternatively, you can copy the code to a Jupyter notebook or a Python script. We have not
    provided much code, and you won't need to write all that much code.
    """)
    return


@app.cell
def _():
    import marimo as mo
    import dspy
    import datasets
    return datasets, dspy, mo


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Dataset: Math Word Problems
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    We will use DSPy to solve the math word problems from GSM8K, as we have done several times before.
    In GSM8K, the answer that accompanies each problem is formatted as `{reasoning} #### {number_answer}`
    as shown below. In a few cases, the number has commas, e.g., `1,200`.
    """)
    return


@app.cell
def _(datasets):
    gsm8k = datasets.load_dataset("openai/gsm8k", "main")
    gsm8k["test"][1]
    return (gsm8k,)


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    Our first preprocessing step will be to split the answer into reasoning and a numeric answer.
    """)
    return


@app.cell
def _(gsm8k):
    def _process_gsm8k_item(item):
        reasoning, answer = item["answer"].split("####", maxsplit=1)
        return {
            "question": item["question"],
            "reasoning": reasoning,
            "answer": float(answer.strip().replace(",", ""))
        }

    cleaned_gsm8k = gsm8k.map(_process_gsm8k_item)
    return (cleaned_gsm8k,)


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    The DSPy prompt optimizers require a traditional train, test, and validation split, so we split the
    GSM8K test set into a test set and validation set below. We also format the GSM8K problems
    as [dspy.Example](https://dspy.ai/api/primitives/Example/?h=example#dspy.Example) objects, which DSPy
    requires. We can think of `dspy.Example` as a dictionary where some fields are clearly marked as
    inputs for inference. This allows us to pass a complete example -- with metadata and solutions -- to
    a DSPy program and know that the program will only "see" the input fields.
    """)
    return


@app.cell
def _(cleaned_gsm8k, dspy):
    train_set = [ dspy.Example(**x).with_inputs("question") for x in cleaned_gsm8k["train"] ]
    test_set = [ dspy.Example(**x).with_inputs("question") for x in cleaned_gsm8k["test"].select(range(50)) ]
    val_set = [ dspy.Example(**x).with_inputs("question") for x in cleaned_gsm8k["test"].select(range(50, 100)) ]
    return (test_set,)


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Models: SmolLM2 and Claude Haiku 4.5
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    DSPy uses LiteLLM under the hood, so we can use it with many different models. The
    [GEPA for AIME tutorial](https://dspy.ai/tutorials/gepa_aime/) that we are following uses
    GPT-4.1-mini and GPT-5. We are going to use a pair of weaker models:
    [SmolLM2 1.7B Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct) and
    [Claude Haiku 4.5](https://www.anthropic.com/news/claude-haiku-4-5).
    We are hosting access to both models. The code below constructs `dspy.LM` objects to
    reference the models and sends a chat query to each one. Both should work and return
    a response. Let us know if you have trouble.
    """)
    return


@app.cell
def _(dspy):
    smollm2 = dspy.LM(
        model=f"openai/smollm2",
        api_base="https://cloud.guha-anderson.com/v1",
        api_key="dummy",
        model_type="chat",
        max_tokens=2048,
        temperature=0.2,
    )

    smollm2("What is your name?")
    return (smollm2,)


@app.cell
def _(dspy):
    haiku = dspy.LM(
        model=f"openai/haiku",
        api_base="https://cloud.guha-anderson.com/v1",
        api_key="dummy",
        model_type="chat",
        temperature=0.7,
    )
    haiku("What is your name?")
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    Instead of calling the model directly, we can define a *DSPy signature*, which specifies the types of inputs and outputs.
    """)
    return


@app.cell
def _(dspy, smollm2):
    simple_solver_sig = dspy.Signature("question:str -> answer:float", instructions="Solve the given problem.")
    simple_solver = dspy.ChainOfThought(simple_solver_sig)
    simple_solver.set_lm(smollm2)
    return (simple_solver,)


@app.cell
def _(mo):
    mo.md(r"""
    We can apply the `simple_solver` function to an example from the test set. Each test set example has the answer, but
    the function will only read the input fields.
    """)
    return


@app.cell
def _(simple_solver, test_set):
    simple_solver(**test_set[0])
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Task 1: Evaluate the Unoptimized Prompt
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    Following the tutorial, you should now evaluate `simple_solver` on the entire test set using `dspy.Evaluate`.
    All you need to do is write a *metric* function and call `dspy.Evaluate` with the right arguments.
    We recommend setting `num_threads=50` to issue all queries concurrently. With a robust metric, you can expect
    to find that `simple_solver` correctly solves approximately 10 out of 50 problems.

    You will see about 3-4 warnings such as these:

    1. *LM response was truncated due to exceeding max_tokens=2048. You can inspect the latest LM interactions with `dspy.inspect_history()`. To avoid truncation, consider passing a larger max_tokens when setting up dspy.LM. You may also consider increasing the temperature (currently 0.2)  if the reason for truncation is repetition.*
    2. *Failed to use structured output format, falling back to JSON mode.*

    These are cases where SmolLM2 produces a degenerate response that DSPy cannot parse. Be assured that allowing the model
    to produce an even longer response will not help.
    """)
    return


@app.cell
def _():
    ### FILL
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Task 2: Reflective Prompt Optimization
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    Following the tutorial, use GEPA to optimize the prompt. You will need to write a new metric that produces feedback.

    You can run GEPA with this code:

    ```python
    optimizer = dspy.GEPA(
        metric=metric_with_feedback,
        num_threads=50,
        max_metric_calls=250,
        track_stats=True,
        reflection_minibatch_size=3,
        reflection_lm=haiku
    )
    optimized_solver = optimizer.compile(simple_solver, trainset=train_set, valset=val_set)
    ```

    With `max_metric_calls=250`, it takes about 2 minutes to run.
    """)
    return


@app.cell
def _():
    ### FILL
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    Notice that the GEPA algorithm ran on `val_set` and `train_set`, and did not have access to `test_set`. You should do a final
    evaluation on `test_set` to ensure that `optimized_solver` generalizes. I got 30% accuracy (up from 20%).
    """)
    return


@app.cell
def _():
    ### FILL
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    We manually run several training steps below. We can experiment with different teacher models.
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    ## Task 3: Metric Variations
    """)
    return


@app.cell(hide_code=True)
def _(mo):
    mo.md(r"""
    The problems in our dataset include answers and reasoning. But, for real problems, it can be very hard or
    expensive to gather this information. We recommend trying two variations of the metric:

    1. Define a new metric function that does not use the reasoning from the dataset. Instead, *use Haiku itself to
       produce the reasoning trace.*  You can use DSPy to write a program that asks a model to explain why the answer
       to a question is wrong.

    2. Define a new metric function that does not use the reasoning trace or the answer from the data. Instead,
       ask Haiku if the answer is correct and to explain why.


    You will find that both of these metrics are simpler to implement in code than the metric you wrote earlier
    that uses the reasoning and answers from the dataset.
    """)
    return


@app.cell
def _():
    ### FILL
    return


if __name__ == "__main__":
    app.run()
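The notebook leaves the metric functions for Tasks 1 and 2 as exercises. As a starting point, here is a minimal sketch of the comparison logic in plain Python, so it runs without DSPy or model access. The function names `parse_gsm8k_answer`, `exact_match_metric`, and `score_and_feedback` are hypothetical and do not appear in the notebook, and the `SimpleNamespace` objects are stand-ins for `dspy.Example` and `dspy.Prediction`. It assumes the prediction exposes an `answer` attribute, as the `question:str -> answer:float` signature suggests.

```python
import math
from types import SimpleNamespace


def parse_gsm8k_answer(raw):
    """Split a raw GSM8K answer into (reasoning, numeric answer)."""
    reasoning, answer = raw.split("####", maxsplit=1)
    # Some answers contain thousands separators such as "1,200".
    return reasoning.strip(), float(answer.strip().replace(",", ""))


def exact_match_metric(example, pred, trace=None):
    """Hypothetical Task 1 metric: True when the numbers agree."""
    try:
        predicted = float(pred.answer)
    except (AttributeError, TypeError, ValueError):
        # A missing or unparseable answer counts as wrong, not as a crash.
        return False
    return math.isclose(predicted, float(example.answer), abs_tol=1e-6)


def score_and_feedback(example, pred):
    """Hypothetical Task 2 helper: a score plus feedback text for reflection."""
    if exact_match_metric(example, pred):
        return 1.0, f"Correct. The answer is {example.answer}."
    feedback = (
        f"Incorrect. The expected answer is {example.answer}. "
        f"Reference solution:\n{example.reasoning}"
    )
    return 0.0, feedback


# Stand-in objects; in the notebook these would come from test_set and simple_solver.
reasoning, answer = parse_gsm8k_answer("She sold 1,200 eggs.\n#### 1,200")
example = SimpleNamespace(question="How many eggs?", reasoning=reasoning, answer=answer)
prediction = SimpleNamespace(answer=1000.0)
print(exact_match_metric(example, prediction))
print(score_and_feedback(example, prediction)[1])
```

To use this with GEPA, the tutorial's pattern is a metric that accepts `(example, prediction, trace=None, pred_name=None, pred_trace=None)` and returns `dspy.Prediction(score=..., feedback=...)`; the tuple returned by `score_and_feedback` could be wrapped that way. Catching parse failures matters here because SmolLM2 sometimes produces output that DSPy cannot parse.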