
HumanEvalFix integration #1908

Merged: 26 commits merged into OpenDevin:main on May 23, 2024

Conversation

@Muennighoff (Contributor) commented May 20, 2024:

Integrates HumanEvalFix from https://arxiv.org/abs/2308.07124

@Muennighoff changed the title from "Preliminary HumanEvalFix integration" to "HumanEvalFix integration" on May 20, 2024
@li-boxuan (Collaborator) commented:

@Muennighoff This is awesome, thank you! Is it ready to try out, or still under development?

@yufansong (Collaborator) commented May 20, 2024:

@Muennighoff I see there are some TODOs in your doc. If it is done, could you please add more context to the PR description? If you are still developing, you could mark this PR as a draft.

[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
@li-boxuan (Collaborator) commented May 20, 2024:

You probably also want to include enable_auto_lint = true. Evaluation of CodeActAgent on SWE-bench-lite shows that this option can give the LLM a hint about indentation errors and thus boost the final score (if the language is Python).
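Applied to the excerpt above, the [core] section would then look roughly like this (a sketch; only the enable_auto_lint line is new, the other values are just the ones already shown):

[core]
max_iterations = 100
cache_dir = "/tmp/cache"
ssh_hostname = "localhost"
enable_auto_lint = true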

Contributor:

@li-boxuan fixed

@Muennighoff marked this pull request as draft May 20, 2024 04:28
@Muennighoff (Contributor, Author) commented:

> @Muennighoff I see there are some TODOs in your doc. If it is done, could you please add more context to the PR description? If you are still developing, you could mark this PR as a draft.

I've converted it to a draft, sorry! One TODO concerns programming languages: I'm unsure whether it makes sense to also add evaluation for the other programming languages in HumanEvalFix (Rust, C++, Java, JS, Go), or only Python.

@Muennighoff (Contributor, Author) commented:

Also cc @tangxiangru who is also working on the integration

@li-boxuan (Collaborator) commented May 20, 2024:

> I'm unsure whether it makes sense to also add evaluation for the other programming languages in HumanEvalFix (Rust, C++, Java, JS, Go), or only Python.

I haven't read your paper yet, so please take my thoughts with a grain of salt:

  1. We should prioritize whatever is easier to do first.
  2. Eventually, it's a great idea to include other programming languages. SWE-bench only involves Python repositories, which doesn't fully reflect the quality of OpenDevin in a broader scope. In reality, people use OpenDevin for more than just Python tasks.

@xingyaoww (Collaborator) left a comment:

Overall looks great to me! Without too much effort, I can actually run this and get some meaningful results!

[screenshot of results]

evaluation/humanevalfix/run_infer.py (review thread resolved)
evaluation/humanevalfix/scripts/run_infer.sh (review thread resolved)

You can replace `eval_gpt4_1106_preview` with any model you set up in `config.toml`.

## Evaluate Generated Patches
Collaborator:

Is this section necessary for HumanEvalFix? If not, we can remove it!
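As background for the eval_gpt4_1106_preview placeholder quoted above: the model is expected to be defined in config.toml. A hypothetical entry might look like the following; the [llm.eval_gpt4_1106_preview] group name and the exact keys here are assumptions, so check the top-level evaluation README for the authoritative format:

[llm.eval_gpt4_1106_preview]
model = "gpt-4-1106-preview"
api_key = "sk-..."
temperature = 0.0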

evaluation/humanevalfix/README.md (review thread resolved)
evaluation/humanevalfix/README.md (outdated review thread, resolved)
instance.declaration + instance.buggy_solution + '\n' + instance.test
)
path = os.path.join(
workspace_mount_path, f'{instance.task_id.replace("/", "__")}.py'
Collaborator:

We need to do instance.task_id.replace("/", "__") because the instance ID in the task contains a "/", which would otherwise be interpreted as a new folder and can cause issues.
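As a minimal standalone illustration of the point (the workspace path and task ID below are made-up example values, not taken from the real config or dataset):

import os

workspace_mount_path = '/tmp/humanevalfix_workspace'  # example only
task_id = 'Python/0'  # HumanEvalFix task IDs contain a '/'

# Without the replacement, the '/' would be treated as a directory separator and the
# file would be written into a (possibly non-existent) 'Python/' subfolder;
# replacing it keeps a flat, unique file name.
path = os.path.join(workspace_mount_path, f'{task_id.replace("/", "__")}.py')
print(path)  # /tmp/humanevalfix_workspace/Python__0.py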


# reset workspace to config
config.workspace_base = workspace_mount_path
config.workspace_mount_path = workspace_mount_path
Collaborator:

I added these two lines so that the new mount path can be handled by the sandbox

echo "WARNING: You are about to enable the execution of untrusted model-generated code by setting the environment variable HF_ALLOW_CODE_EVAL to '1'."
echo "It is highly unlikely that model-generated code will do something overtly malicious in response to this test suite, however, it may act destructively due to a lack of model capability or alignment."
echo "Please confirm that you have read the disclaimer, taken the necessary precautions, and wish to proceed (y/n):"
read user_input
Collaborator:

I added this interactive prompt to set HF_ALLOW_CODE_EVAL to 1 after the user acknowledges the warning.
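The overall shape is roughly the following (a sketch of the pattern only; the exact wording and error handling in run_infer.sh may differ):

echo "Please confirm that you have read the disclaimer, taken the necessary precautions, and wish to proceed (y/n):"
read user_input
if [ "$user_input" != "y" ]; then
    echo "Aborting evaluation."
    exit 1
fi
export HF_ALLOW_CODE_EVAL=1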

@Muennighoff marked this pull request as ready for review May 21, 2024 21:32
@Muennighoff (Contributor, Author) commented:

> Overall looks great to me! Without too much effort, I can actually run this and get some meaningful results!
>
> [screenshot of results]

Amazing, thanks so much for taking a look and for your fixes! I have moved the PR out of draft mode.
Tagging @huybery & @tangxiangru in case they want to take a look 🤗

@yufansong (Collaborator) left a comment:

Left some nits; mostly LGTM.

evaluation/humanevalfix/README.md (outdated review thread, resolved)
evaluation/humanevalfix/README.md (outdated review thread, resolved)
evaluation/humanevalfix/run_infer.py (outdated review thread, resolved)
Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>

Muennighoff and others added 7 commits May 21, 2024 20:48. Visible commit titles include:
  - Co-authored-by: Yufan Song <33971064+yufansong@users.noreply.github.com>
  - fix a bug: ERROR:concurrent.futures:exception calling callback for <Future at 0x309cbc470 state=finished raised NameError> (concurrent.futures.process._RemoteTraceback)
  - added an example
  - added: enable_auto_lint = true
test_result = {'result': {}, 'metadata': {}}
code_metric = load('Muennighoff/code_eval_octopack')
timeout = LANGUAGE_TO_TIMEOUT[language]
num_workers = LANGUAGE_TO_NUM_WORKERS[language]
Contributor:

I just added this; otherwise it fails with:

ERROR:concurrent.futures:exception calling callback for <Future at 0x309cbc470 state=finished raised NameError>
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/process.py", line 263, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/Clash/Documents/LLM-Repos/OpenDevin/evaluation/humanevalfix/run_infer.py", line 217, in process_instance
test_result = get_test_result(instance, path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/Clash/Documents/LLM-Repos/OpenDevin/evaluation/humanevalfix/run_infer.py", line 126, in get_test_result
num_workers=num_workers,
^^^^^^^^^^^
NameError: name 'num_workers' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 340, in _invoke_callbacks
callback(self)
File "/Users/Clash/Documents/LLM-Repos/OpenDevin/evaluation/humanevalfix/run_infer.py", line 343, in update_progress
output = future.result()
^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
NameError: name 'num_workers' is not defined
20:48:42 - opendevin:INFO: browser_env.py:105 - BrowserEnv already closed, no need to close again
20:48:43 - opendevin:INFO: browser_env.py:105 - BrowserEnv already closed, no need to close again
20:48:43 - opendevin:INFO: browser_env.py:105 - BrowserEnv already closed, no need to close again
20:48:43 - opendevin:INFO: browser_env.py:105 - BrowserEnv already closed, no need to close again
20:48:44 - opendevin:INFO: browser_env.py:105 - BrowserEnv already closed, no need to close again
ERROR:root: File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 340, in _invoke_callbacks
callback(self)
File "/Users/Clash/Documents/LLM-Repos/OpenDevin/evaluation/humanevalfix/run_infer.py", line 343, in update_progress
output = future.result()
^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/Users/Clash/Documents/LLM-Repos/OpenDevin/evaluation/humanevalfix/run_infer.py", line 374, in
future.result()
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception

ERROR:root:<class 'NameError'>: name 'num_workers' is not defined
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [07:13<00:00, 86.64s/it]
Exception ignored in: <function _ExecutorManagerThread.__init__.<locals>.weakref_cb at 0x309c5ede0>
Traceback (most recent call last):
File "/opt/homebrew/Cellar/python@3.12/3.12.3/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/process.py", line 310, in weakref_cb
AttributeError: 'NoneType' object has no attribute 'util'
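
A rough standalone sketch of how these values feed into the metric (this assumes the generic Hugging Face code_eval interface of references/predictions/k/timeout/num_workers; the octopack variant used here may take additional language-related arguments, and the per-language numbers below are placeholders):

import os
from evaluate import load

# Illustrative per-language limits; the real dictionaries live in run_infer.py
# and cover all HumanEvalFix languages.
LANGUAGE_TO_TIMEOUT = {'python': 10}
LANGUAGE_TO_NUM_WORKERS = {'python': 4}

def get_test_result_sketch(test_code, fixed_code, language='python'):
    # The metric refuses to execute model-generated code unless this is set.
    os.environ['HF_ALLOW_CODE_EVAL'] = '1'
    test_result = {'result': {}, 'metadata': {}}
    code_metric = load('Muennighoff/code_eval_octopack')
    # Defining these before compute() is exactly what avoids the NameError above.
    timeout = LANGUAGE_TO_TIMEOUT[language]
    num_workers = LANGUAGE_TO_NUM_WORKERS[language]
    pass_at_k, details = code_metric.compute(
        references=[test_code],
        predictions=[[fixed_code]],
        k=[1],
        timeout=timeout,
        num_workers=num_workers,
    )
    test_result['result'] = pass_at_k
    test_result['metadata'] = {'timeout': timeout, 'num_workers': num_workers}
    return test_result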

@xingyaoww enabled auto-merge (squash) May 23, 2024 12:56
@xingyaoww merged commit ef6cdb7 into OpenDevin:main May 23, 2024
20 checks passed