feat: Proposed SIMBAUQ Sampling Strategy #785

radum2275 wants to merge 23 commits into generative-computing:main from
Conversation
Signed-off-by: Radu Marinescu <radu.marinescu@ie.ibm.com>
The PR description has been updated. Please fill out the template for your PR to be reviewed.
planetf1 left a comment:
Also noticed we don't export SOFAISamplingStrategy in `__all__`; not an issue from this PR, but observed.
Co-authored-by: Nigel Jones <nigel.l.jones+git@gmail.com>
I made all the required changes. I also replaced the RITS backend in my example with the ollama one. However, I need some guidance with the following: the "classifier" confidence estimation method we developed requires a probabilistic (sklearn) classifier, which we either receive from the user or train on the fly from the training examples the user provides. I'd like to pre-train one using the datasets we already collected for our paper and ship it as the default option, but it needs to live somewhere in the package as a serialized object (e.g., a pickle file). What would be the best way to do that without disrupting the package structure too much? Thanks.
Do you have an estimate of how large this file would be? If it's tens of MBs that's probably not a problem, but if we're looking at hundreds of MBs or a GB+ then it could be a different story.
@psschwei it's actually not that big: the one we trained for our paper was about 250KB. It's a basic sklearn RandomForestClassifier.
cool, I don't think that will be a problem
I'm assuming then we would need to add sklearn as another dependency? If we're pickling to a file, we should probably consider how best to do so safely.
@psschwei yes, added
sure, any suggestion is welcome :). I usually do this:

```python
with open(filename, "wb") as f:
    pickle.dump(model, f)
# some time later...
with open(filename, "rb") as f:
    loaded_model = pickle.load(f)
```
```python
except ImportError:
    msg = (
        "scipy is required for harmonic mean aggregation. "
        "Please install with `pip install scipy`."
```
Correct command, but if this is the base library, should we have scipy as a core dependency? And what's the relationship to granite-retriever?
I added scipy to the core dependencies next to numpy and scikit-learn, which are required by this sampling strategy. However, as far as I know scikit-learn already requires scipy, so we probably only need numpy and scikit-learn. Please advise.

The granite_retriever dependency group defined in pyproject.toml already contains the sentence-transformers package required by the sbert similarity metric. Should we move sentence-transformers to the core dependencies?
Apologies, I noticed this after doing a per-line review. Also I'm not 100% sure of the intent: if this is in stdlib, is it core functionality? If so, shouldn't all of its dependencies be in our core dependencies?
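The snippet under review is the usual lazy-import guard for an optional dependency. A self-contained sketch of the full pattern (the function name is illustrative; the error message mirrors the diff above):

```python
def harmonic_mean_confidence(values):
    """Aggregate per-sample confidences with a harmonic mean, importing
    scipy lazily so the base install does not need it."""
    try:
        from scipy.stats import hmean
    except ImportError as exc:
        msg = (
            "scipy is required for harmonic mean aggregation. "
            "Please install with `pip install scipy`."
        )
        raise ImportError(msg) from exc
    return hmean(values)
```

Keeping the import inside the function means the module itself imports cleanly without scipy; the error only surfaces if a user actually selects harmonic mean aggregation.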
Agree, deps should be in the core deps if this is going into core.
So the key decision here is where this code belongs. What do you think @jakelorocco -- is this core stdlib, or an optional, additional strategy? I can see there might be three options; once we've agreed on this we can nail down the actual dependency config needed.
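For reference, here is one way the split could look in pyproject.toml. The group name and packages come from the discussion above; the exact layout is only a sketch, not the project's actual config:

```toml
[project]
dependencies = [
    "numpy",
    "scikit-learn",  # scikit-learn already depends on scipy transitively
]

[project.optional-dependencies]
granite_retriever = [
    "sentence-transformers",  # needed by the sbert similarity metric
]
```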
## Sampling Strategy PR

Use this template when adding or modifying sampling strategies in `mellea/stdlib/sampling/`.

### Description

### Implementation Checklist

**Base Class**
- `BaseSamplingStrategy` if your changes are mostly modifying the `repair` and/or `select_from_failure` functions
- `SamplingStrategy` if your changes involve a new `sample` method

**Return Value**
- Returns a `SamplingResult`. Specifically, this means the `ModelOutputThunk`s in `sample_generations` are properly typed from the Component and the `parsed_repr` is the expected type.

**Integration**
- Exported in `mellea/stdlib/sampling/__init__.py`

**Testing**
- Tests added in `tests/sampling/`