Add feature to drop last n keyframes for delta_timestamps #129
Conversation
```python
# we drop those indices pertaining to the last n frames of each episode.
self.index = []
for from_ix, to_ix in zip(*self.episode_data_index.values(), strict=True):
    self.index.extend(list(range(from_ix, to_ix - n_end_keyframes_dropped)))
```
This will leave self.hf_dataset["index"] and self.hf_dataset["episode_index"] in an inconsistent state with self.index and self.episode_data_index.
One problem is that we are duplicating information that already exists on the hf_dataset, namely hf_dataset["index"], so that we can mutate it. So instead of LeRobotDataset being a lightweight wrapper around an hf_dataset, we now introduce new state in between (self.index) that maintainers and users will have to know about and remember.
Another problem is: what should self.episode_data_index now reflect? It holds info for the data in hf_dataset but not for self.index, which can be confusing.
And the dependencies on this newly introduced state cascade down: num_samples no longer depends directly on self.hf_dataset, and so on.
So the intentions here are very good -- we want to make life better for our users (so that they can train better models), and in some sense this implementation adds the new functionality in as straightforward a way as possible. But the side effect is that the complexity of the LeRobotDataset grows quite significantly, and it is no longer clear what depends on what. This can lead to subtle bugs and will make extending the functionality much harder down the road. It will also make it much harder for people new to the codebase to understand how things work -- for instance, what is the difference between self.hf_dataset["index"] and self.index, and under what conditions do they differ? I know the answers to these questions, but only because I have spent a good bit of time with this code; for new folks this might be quite challenging.
One potential way of tackling this might be via sampling -- in the same way as we have drop_last on the PyTorch DataLoader. We could pass a custom sampler to the dataloader that drops the last n examples from each episode based on the episode_data_index.
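As a rough sketch of the idea (assuming episode_data_index is a dict with "from" and "to" index tensors, as the snippet above suggests; the name EpisodeAwareSampler and its constructor arguments are hypothetical, not existing LeRobot API):

```python
import torch
from torch.utils.data import Sampler


class EpisodeAwareSampler(Sampler):
    """Hypothetical sampler that skips the last n frames of each episode."""

    def __init__(self, episode_data_index, n_end_keyframes_dropped, shuffle=True):
        # Collect every frame index except the last n of each episode.
        self.indices = []
        for from_ix, to_ix in zip(
            episode_data_index["from"], episode_data_index["to"], strict=True
        ):
            self.indices.extend(range(int(from_ix), int(to_ix) - n_end_keyframes_dropped))
        self.shuffle = shuffle

    def __iter__(self):
        if self.shuffle:
            # Yield the retained indices in a random order each epoch.
            for i in torch.randperm(len(self.indices)).tolist():
                yield self.indices[i]
        else:
            yield from self.indices

    def __len__(self):
        return len(self.indices)
```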
The advantage here is that we would not be adding complexity to the LeRobotDataset. It would remain truer to its original intention: not a way to modify data, but a way to add to a HuggingFace dataset in order to cater to the needs of training policies for controlling robots. The dependency would be one-way -- LeRobotDataset would depend on the hf_dataset it gets passed, but the dependency would not go the other way. For all intents and purposes, the data contained in the hf_dataset would remain immutable: we would not be modifying it, nor would we be creating some state standing in for aspects of it. Our users would always know where the single source of truth is.
And the full functionality would be localized to a Sampler. Yes, if users wanted to avail themselves of this functionality they would need to know that the Sampler exists (a good place to introduce it might be as an example in the docstring for DiffusionPolicy), but we would not be adding extra complexity to the LeRobotDataset. Arguably, we might even be fighting complexity in the codebase by reducing coupling and doing something like dependency injection at the level of the dataloader, as sketched below.
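For illustration, the wiring might look something like this (the dataset variable and the value 7 are placeholders, not values from this PR):

```python
from torch.utils.data import DataLoader

# The dataset stays untouched; the sampler decides which indices get drawn.
sampler = EpisodeAwareSampler(dataset.episode_data_index, n_end_keyframes_dropped=7)
dataloader = DataLoader(dataset, batch_size=64, sampler=sampler)
```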
Sorry, these are just some loose thoughts -- I understand that this problem is not easy to solve, but maybe some of this might be helpful. I'll keep thinking about it, and if anything comes to mind I will comment again in this PR 🙂
Just for reference, here is a SubsetRandomSampler from PyTorch. It might be nice to inherit from it and instantiate it with episode_data_index and n_end_keyframes_dropped. A minimal implementation might be a utils function that returns an index from a LeRobotDataset with the n_end_keyframes_dropped excluded.
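A minimal sketch of that variant, assuming the same episode_data_index layout as above; the helper name drop_n_end_indices is hypothetical:

```python
from torch.utils.data import SubsetRandomSampler


def drop_n_end_indices(episode_data_index, n_end_keyframes_dropped):
    # Return every frame index except the last n of each episode.
    indices = []
    for from_ix, to_ix in zip(
        episode_data_index["from"], episode_data_index["to"], strict=True
    ):
        indices.extend(range(int(from_ix), int(to_ix) - n_end_keyframes_dropped))
    return indices


# dataset here is a placeholder for an existing LeRobotDataset instance.
sampler = SubsetRandomSampler(
    drop_n_end_indices(dataset.episode_data_index, n_end_keyframes_dropped=7)
)
```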
Anyhow, sorry 🙂 Not sure if any of this is helpful; I might be completely off the mark with the intention behind this PR. Either way, it is very good to know of this difference vs the original training on pusht, so thank you for pointing me to this PR, @alexander-soare! 🙌
@radekosmulski I find your arguments very convincing :) Would you be interested in having a crack at the sampler approach by any chance? cc @Cadene who had this on his TODO list at some point but it got shoved back by other prios.
sure thing @alexander-soare, will give it a go! 🙂
What does this PR do?
Fixes a discrepancy between data loading in the original diffusion policy work and LeRobot.
The functionality is summarized in diffusion.yaml. I will copy it here verbatim for reference:
How was it tested?
This change improves eval metrics slightly (blue = new).
I also ran a whole epoch of data loading with LeRobot and the original diffusion policy code side by side, checking batch equality with inter-process communication.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR. Try to avoid tagging more than 3 people.