
HumanEvalFix integration #1908

Merged: 26 commits merged into OpenDevin:main on May 23, 2024

Conversation

@Muennighoff (Contributor) commented May 20, 2024:

Integrates HumanEvalFix from https://arxiv.org/abs/2308.07124

@Muennighoff changed the title from "Preliminary HumanEvalFix integration" to "HumanEvalFix integration" on May 20, 2024
@li-boxuan (Collaborator) commented:

@Muennighoff This is awesome, thank you! Is it ready to try out, or still under development?

@yufansong (Collaborator) commented May 20, 2024:

@Muennighoff I see there are some TODOs in your doc. If it is done, could you please add more context to the PR description? If you are still developing, you could mark this PR as a draft.

[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
@li-boxuan (Collaborator) commented May 20, 2024:

You probably also want to include enable_auto_lint = true. Evaluation of CodeActAgent on SWE-bench-lite shows that this option can give the LLM a hint about indentation errors and thus boost the final score (if the language is Python).
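Applied to the excerpt above, the [core] section would then look roughly like this (a sketch; only the enable_auto_lint line is new, the other values are just the ones already shown):

[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
enable_auto_lint = true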

Contributor:

@li-boxuan fixed

@Muennighoff marked this pull request as draft May 20, 2024 04:28
@Muennighoff (Contributor, Author) commented:

> @Muennighoff I see there are some TODOs in your doc. If it is done, could you please add more context to the PR description? If you are still developing, you could mark this PR as a draft.

I've converted it to a draft, sorry! One TODO concerns programming languages: I'm unsure whether it makes sense to also add evaluation for the other programming languages in HumanEvalFix (Rust, C++, Java, JS, Go), or only Python.

@Muennighoff (Contributor, Author) commented:

Also cc @tangxiangru who is also working on the integration

@li-boxuan (Collaborator) commented May 20, 2024:

> I'm unsure whether it makes sense to also add evaluation for the other programming languages in HumanEvalFix (Rust, C++, Java, JS, Go), or only Python.

I haven't read your paper yet, so please take my thoughts with a grain of salt:

  1. We should prioritize whatever is easier to do first.
  2. Eventually, it's a great idea to include other programming languages. SWE-bench only involves Python repositories, which doesn't fully reflect the quality of OpenDevin in a broader scope. In reality, people use OpenDevin for more than just Python tasks.

@xingyaoww (Collaborator) left a comment:

Overall looks great to me! Without too much effort, I can actually run this and get some meaningful results!

[screenshot of results]

evaluation/humanevalfix/run_infer.py (review thread resolved)
evaluation/humanevalfix/scripts/run_infer.sh (review thread resolved)

You can replace `eval_gpt4_1106_preview` with any model you set up in `config.toml`.

## Evaluate Generated Patches
Collaborator:

Is this section necessary for HumanEvalFix? If not, we can remove it!
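As background for the eval_gpt4_1106_preview placeholder quoted above: the model is expected to be defined in config.toml. A hypothetical entry might look like the following; the [llm.eval_gpt4_1106_preview] group name and the exact keys here are assumptions, so check the top-level evaluation README for the authoritative format:

[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "sk-..."
temperature = 0.0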

evaluation/humanevalfix/README.md (review thread resolved)
evaluation/humanevalfix/README.md (outdated review thread, resolved)
instance.declaration + instance.buggy_solution + '\n' + instance.test
)
path = os.path.join(
workspace_mount_path, f'{instance.task_id.replace("/", "__")}.py'
Collaborator:

We need to do instance.task_id.replace("/", "__") because the instance ID in the task contains a "/", which would otherwise be interpreted as a new folder and can cause issues.
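As a minimal standalone illustration of the point (the workspace path and task ID below are made-up example values, not taken from the real config or dataset):

import os

workspace_mount_path = '/tmp/humanevalfix_workspace'  # example only
task_id = 'Python/0'  # HumanEvalFix task IDs contain a '/'

# Without the replacement, the '/' would be treated as a directory separator and the
# file would be written into a (possibly non-existent) 'Python/' subfolder;
# replacing it keeps a flat, unique file name.
path = os.path.join(workspace_mount_path, f'{task_id.replace("/", "__")}.py')
print(path)  # /tmp/humanevalfix_workspace/Python__0.py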


# reset workspace to config
config.workspace_base = workspace_mount_path
config.workspace_mount_path = workspace_mount_path
Collaborator:

I added these two lines so that the new mount path can be handled by the sandbox

echo "WARNING: You are about to enable the execution of untrusted model-generated code by setting the environment variable HF_ALLOW_CODE_EVAL to '1'."
echo "It is highly unlikely that model-generated code will do something overtly malicious in response to this test suite, however, it may act destructively due to a lack of model capability or alignment."
echo "Please confirm that you have read the disclaimer, taken the necessary precautions, and wish to proceed (y/n):"
read user_input
Collaborator:

I added this interactive prompt to set HF_ALLOW_CODE_EVAL to 1 after the user acknowledges the warning.
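The overall shape is roughly the following (a sketch of the pattern only; the exact wording and error handling in run_infer.sh may differ):

echo "Please confirm that you have read the disclaimer, taken the necessary precautions, and wish to proceed (y/n):"
read user_input
if [ "$user_input" != "y" ]; then
    echo "Aborting evaluation."
    exit 1
fi
export HF_ALLOW_CODE_EVAL=1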

@Muennighoff marked this pull request as ready for review May 21, 2024 21:32
@Muennighoff (Contributor, Author) commented:

> Overall looks great to me! Without too much effort, I can actually run this and get some meaningful results!
>
> [screenshot of results]

Amazing, thanks so much for taking a look and for your fixes! I have moved the PR out of draft mode.
Tagging @huybery & @tangxiangru in case they want to take a look 🤗

@yufansong (Collaborator) left a comment:

Left some nits; mostly LGTM.

evaluation/humanevalfix/README.md (outdated review thread, resolved)
evaluation/humanevalfix/README.md (outdated review thread, resolved)
evaluation/humanevalfix/run_infer.py (outdated review thread, resolved)
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>

Muennighoff and others added 7 commits May 21, 2024 20:48. Visible commit titles include:
  - Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
  - fix a bug: ERROR:concurrent.futures:exception calling callback for <Future at 0x309cbc470 state=finished raised NameError> (concurrent.futures.process._RemoteTraceback)
  - added an example
  - added: enable_auto_lint = true
test_result = {'result': {}, 'metadata': {}}
code_metric = load('Muennighoff/code_eval_octopack')
timeout = LANGUAGE_TO_TIMEOUT[language]
num_workers = LANGUAGE_TO_NUM_WORKERS[language]
Contributor:

I just added this; otherwise it fails with:

ERROR:concurrent.futures:exception calling callback for <Future at 0x309cbc470 state=finished raised NameError>
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/process.py", line 263, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/Clash/Documents/LLM-Repos/OpenDevin/evaluation/humanevalfix/run_infer.py", line 217, in process_instance
test_result = get_test_result(instance, path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/Clash/Documents/LLM-Repos/OpenDevin/evaluation/humanevalfix/run_infer.py", line 126, in get_test_result
num_workers=num_workers,
^^^^^^^^^^^
NameError: name 'num_workers' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 340, in _invoke_callbacks
callback(self)
File "/Users/Clash/Documents/LLM-Repos/OpenDevin/evaluation/humanevalfix/run_infer.py", line 343, in update_progress
output = future.result()
^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
NameError: name 'num_workers' is not defined
20:48:42 - opendevin:INFO: browser_env.py:105 - BrowserEnv already closed, no need to close again
20:48:43 - opendevin:INFO: browser_env.py:105 - BrowserEnv already closed, no need to close again
20:48:43 - opendevin:INFO: browser_env.py:105 - BrowserEnv already closed, no need to close again
20:48:43 - opendevin:INFO: browser_env.py:105 - BrowserEnv already closed, no need to close again
20:48:44 - opendevin:INFO: browser_env.py:105 - BrowserEnv already closed, no need to close again
ERROR:root: File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 340, in _invoke_callbacks
callback(self)
File "/Users/Clash/Documents/LLM-Repos/OpenDevin/evaluation/humanevalfix/run_infer.py", line 343, in update_progress
output = future.result()
^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/Users/Clash/Documents/LLM-Repos/OpenDevin/evaluation/humanevalfix/run_infer.py", line 374, in
future.result()
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception

ERROR:root:<class 'NameError'>: name 'num_workers' is not defined
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [07:13<00:00, 86.64s/it]
Exception ignored in: <function _ExecutorManagerThread.__init__.<locals>.weakref_cb at 0x309c5ede0>
Traceback (most recent call last):
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/process.py", line 310, in weakref_cb
AttributeError: 'NoneType' object has no attribute 'util'
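
A rough standalone sketch of how these values feed into the metric (this assumes the generic Hugging Face code_eval interface of references/predictions/k/timeout/num_workers; the octopack variant used here may take additional language-related arguments, and the per-language numbers below are placeholders):

import os
from evaluate import load

# Illustrative per-language limits; the real dictionaries live in run_infer.py
# and cover all HumanEvalFix languages.
LANGUAGE_TO_TIMEOUT = {'python': 10}
LANGUAGE_TO_NUM_WORKERS = {'python': 4}

def get_test_result_sketch(test_code, fixed_code, language='python'):
    # The metric refuses to execute model-generated code unless this is set.
    os.environ['HF_ALLOW_CODE_EVAL'] = '1'
    test_result = {'result': {}, 'metadata': {}}
    code_metric = load('Muennighoff/code_eval_octopack')
    # Defining these before compute() is exactly what avoids the NameError above.
    timeout = LANGUAGE_TO_TIMEOUT[language]
    num_workers = LANGUAGE_TO_NUM_WORKERS[language]
    pass_at_k, details = code_metric.compute(
        references=[test_code],
        predictions=[[fixed_code]],
        k=[1],
        timeout=timeout,
        num_workers=num_workers,
    )
    test_result['result'] = pass_at_k
    test_result['metadata'] = {'timeout': timeout, 'num_workers': num_workers}
    return test_result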

@xingyaoww enabled auto-merge (squash) May 23, 2024 12:56
@xingyaoww merged commit ef6cdb7 into OpenDevin:main May 23, 2024
20 checks passed