
OLAB LLM Inference Service

OpenAI‑style, high‑performance inference on NYU Langone H100s for modern open‑source LLMs — free for internal research. Includes GUI and API access.

Introduction

The NYULH‑OLAB inference service offers an effortless, OpenAI‑compatible API and a simple GUI to access state‑of‑the‑art open‑source LLMs on H100 nodes. We target fast, research‑grade serving and will also include internal models (e.g., NYUTron, Lang‑One) over time.

GUI

On the NYU Langone Wi-Fi network:
http://10.189.26.12:31111

API: Account creation

  1. Email eric.oermann@nyulangone.org or jaden.stryker@nyulangone.org for access. We will provision an API token tied to your @nyulangone.org email.
  2. If you will handle PHI, tell us. We’ll disable logging and ensure requests are ephemeral.
  3. Do not share API keys; we track usage to improve reliability.

API: Making a request

  • You must be on‑site (NYU Langone network) or on an NYU server.
  • OpenAI‑compatible API surface — cURL, Python requests, or OpenAI SDKs will work.
  • Examples below demonstrate cURL, Python requests, and OpenAI SDK usage.

Examples

cURL
curl -X POST 'http://10.189.26.12:30080/model/llama3-3-70b-chat/v1/chat/completions' \
  -H 'apiKey: eric.oermann@nyulangone.org' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "llama3-3-70b-chat",
  "messages": [
    {"role": "user", "content": "Hello there how are you?"},
    {"role": "assistant", "content": "Good and you?"},
    {"role": "user", "content": "When was NYU Langone Hospital founded?"}
  ],
  "max_tokens": 50,
  "top_p": 1,
  "n": 1,
  "stream": false,
  "stop": null,
  "frequency_penalty": 0.0
}'
Python requests
import requests

url = "http://10.189.26.12:30080/model/llama3-3-70B-DSR1/v1/chat/completions"
headers = {
  "apiKey": "eric.oermann@nyulangone.org",  # your email (all lowercase)
  "accept": "application/json",
  "Content-Type": "application/json",
}
messages = [
  {"role":"user","content":"Hello there how are you?"},
  {"role":"assistant","content":"Good and you?"},
  {"role":"user","content":"When was NYU Langone Hospital founded?"},
]
data = {
  "model": "llama3-3-70B-DSR1",
  "messages": messages,
  "max_tokens": 200,
  "top_p": 1,
  "n": 1,
  "stream": False,
  "stop": "string",
  "frequency_penalty": 0.0,
}
resp = requests.post(url, headers=headers, json=data)
print(resp.status_code)
print(resp.json())
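If the deployment passes through the OpenAI-compatible `stream: true` option (an assumption worth confirming with the team), partial tokens arrive as server-sent events, one `data: {...}` line per chunk, terminated by `data: [DONE]`. A minimal sketch of consuming such a stream, with the network call left as a comment:

```python
import json

def parse_sse_line(line: str):
    """Extract the content delta from one OpenAI-style SSE line, or None.

    Lines look like: 'data: {"choices":[{"delta":{"content":"Hi"}}]}'
    and the stream ends with 'data: [DONE]'.
    """
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

# Usage (network call; same url/headers/data as the example above, but with
# "stream": True in the request body):
#
# import requests
# with requests.post(url, headers=headers, json=data, stream=True) as resp:
#     for raw in resp.iter_lines(decode_unicode=True):
#         piece = parse_sse_line(raw or "")
#         if piece:
#             print(piece, end="", flush=True)
```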
OpenAI SDK
from openai import OpenAI
client = OpenAI(
  base_url="http://10.189.26.12:30080/model/llama3-3-70b-chat/v1/",
  api_key="jaden.stryker@nyulangone.org",
)
resp = client.chat.completions.create(
  model="llama3-3-70b-chat",
  messages=[{"role":"user","content":"San Francisco is a"}],
  max_tokens=5,
  top_p=1,
  n=1,
  stream=False,
  stop="1111",
  frequency_penalty=0.0,
)
print(resp)
Swap models by changing the model id in both the URL and the request body (e.g., from llama3-3-70B-DSR1 to llama3-3-70b-chat).
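Since this is a shared service, 429s are possible under load. A simple exponential-backoff retry wrapper is a reasonable sketch (the exact rate limits and the SDK's exception type are assumptions; with the OpenAI SDK installed, `openai.RateLimitError` is the precise exception to catch):

```python
import time

def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0):
    """Exponential backoff delays in seconds: base * 2**attempt, capped."""
    return [min(base * (2 ** i), cap) for i in range(retries)]

def chat_with_retry(client, retries: int = 4, **kwargs):
    """Call client.chat.completions.create, retrying on rate-limit errors."""
    for delay in backoff_delays(retries):
        try:
            return client.chat.completions.create(**kwargs)
        except Exception as exc:
            if "429" not in str(exc) and "rate limit" not in str(exc).lower():
                raise  # not a rate-limit problem; surface it immediately
            time.sleep(delay)
    return client.chat.completions.create(**kwargs)  # final attempt
```

Usage mirrors the SDK example above: `chat_with_retry(client, model="llama3-3-70b-chat", messages=[...], max_tokens=50)`.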

Sampling params

See vLLM sampling parameters: docs.vllm.ai
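As one illustration, vLLM-backed OpenAI-compatible servers typically accept extra sampling fields such as `top_k` and `repetition_penalty` alongside the standard OpenAI ones; whether this deployment passes them through is an assumption worth verifying. A request body with both kinds of parameters might be assembled like this:

```python
def build_payload(model: str, messages: list, **sampling):
    """Assemble an OpenAI-style chat payload with extra sampling params.

    Fields like top_k and repetition_penalty are vLLM extensions; the
    official OpenAI SDK sends such fields via its extra_body argument.
    """
    payload = {"model": model, "messages": messages, "stream": False}
    payload.update(sampling)
    return payload

payload = build_payload(
    "llama3-3-70b-chat",
    [{"role": "user", "content": "Hello!"}],
    max_tokens=100,
    temperature=0.7,          # standard OpenAI parameter
    top_k=50,                 # vLLM extension
    repetition_penalty=1.1,   # vLLM extension
)
```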

Error messages

  • 401 Unauthorized — Check API key; ensure it’s lowercase.
  • 404 Not Found — Verify URL and route; remove trailing slash if present.
  • 500 Internal Server Error — Keep messages under ~4000 tokens (~3000 words). If it persists, contact jaden.stryker@nyulangone.org.
  • 429 API rate limit exceeded — You exceeded requests/second quota.
  • no Route matched with those values — Remove trailing slash in the model URL.
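Two of the errors above (404 and "no Route matched") are usually caused by a trailing slash in the URL. A tiny helper (hypothetical, not part of any SDK) sidesteps them when building URLs programmatically:

```python
def normalize_base_url(url: str) -> str:
    """Strip trailing slashes so routes like .../v1/chat/completions match."""
    return url.rstrip("/")
```

For example, `normalize_base_url("http://10.189.26.12:30080/model/llama3-3-70b-chat/v1/")` drops the trailing slash before paths are appended.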