
OLAB LLM Inference Service

OpenAI‑style, high‑performance inference on NYU Langone H100s for modern open‑source LLMs — free for internal research. Includes GUI and API access.

Introduction

The NYULH‑OLAB inference service offers an effortless, OpenAI‑compatible API and a simple GUI to access state‑of‑the‑art open‑source LLMs on H100 nodes. We target fast, research‑grade serving and will also include internal models (e.g., NYUTron, Lang‑One) over time.

GUI

On the NYU Langone Wi-Fi network:
http://10.189.26.12:31111

API: Account creation

  1. Email eric.oermann@nyulangone.org or jaden.stryker@nyulangone.org for access. We will provision an API token tied to your @nyulangone.org email.
  2. If you will handle PHI, tell us. We’ll disable logging and ensure requests are ephemeral.
  3. Do not share API keys; we track usage to improve reliability.

API: Making a request

  • You must be on‑site (NYU Langone network) or on an NYU server.
  • OpenAI‑compatible API surface — cURL, Python requests, or OpenAI SDKs will work.
  • Examples below demonstrate cURL, Python requests, and OpenAI SDK usage.

Examples

cURL
curl -X POST 'http://10.189.26.12:30080/model/llama3-3-70b-chat/v1/chat/completions' \
  -H 'apiKey: eric.oermann@nyulangone.org' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "model": "llama3-3-70b-chat",
  "messages": [
    {"role": "user", "content": "Hello there how are you?"},
    {"role": "assistant", "content": "Good and you?"},
    {"role": "user", "content": "When was NYU Langone Hospital founded?"}
  ],
  "max_tokens": 50,
  "top_p": 1,
  "n": 1,
  "stream": false,
  "stop": null,
  "frequency_penalty": 0.0
}'
Python requests
import requests

url = "http://10.189.26.12:30080/model/llama3-3-70B-DSR1/v1/chat/completions"
headers = {
  "apiKey": "eric.oermann@nyulangone.org",  # your email (all lowercase)
  "accept": "application/json",
  "Content-Type": "application/json",
}
messages = [
  {"role":"user","content":"Hello there how are you?"},
  {"role":"assistant","content":"Good and you?"},
  {"role":"user","content":"When was NYU Langone Hospital founded?"},
]
data = {
  "model": "llama3-3-70B-DSR1",
  "messages": messages,
  "max_tokens": 200,
  "top_p": 1,
  "n": 1,
  "stream": False,
  "stop": "string",
  "frequency_penalty": 0.0,
}
resp = requests.post(url, headers=headers, json=data)
print(resp.status_code)
print(resp.json())
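If the deployment passes through the OpenAI-compatible `stream: true` option (an assumption worth confirming with the team), partial tokens arrive as server-sent events, one `data: {...}` line per chunk, terminated by `data: [DONE]`. A minimal sketch of consuming such a stream, with the network call left as a comment:

```python
import json

def parse_sse_line(line: str):
    """Extract the content delta from one OpenAI-style SSE line, or None.

    Lines look like: 'data: {"choices":[{"delta":{"content":"Hi"}}]}'
    and the stream ends with 'data: [DONE]'.
    """
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload.strip() == "[DONE]":
        return None
    chunk = json.loads(payload)
    return chunk["choices"][0]["delta"].get("content")

# Usage (network call; same url/headers/data as the example above, but with
# "stream": True in the request body):
#
# import requests
# with requests.post(url, headers=headers, json=data, stream=True) as resp:
#     for raw in resp.iter_lines(decode_unicode=True):
#         piece = parse_sse_line(raw or "")
#         if piece:
#             print(piece, end="", flush=True)
```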
OpenAI SDK
from openai import OpenAI
client = OpenAI(
  base_url="http://10.189.26.12:30080/model/llama3-3-70b-chat/v1/",
  api_key="jaden.stryker@nyulangone.org",
)
resp = client.chat.completions.create(
  model="llama3-3-70b-chat",
  messages=[{"role":"user","content":"San Francisco is a"}],
  max_tokens=5,
  top_p=1,
  n=1,
  stream=False,
  stop="1111",
  frequency_penalty=0.0,
)
print(resp)
Swap models by changing the model id in both the URL and the request body (e.g., from llama3-3-70B-DSR1 to llama3-3-70b-chat).
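Since this is a shared service, 429s are possible under load. A simple exponential-backoff retry wrapper is a reasonable sketch (the exact rate limits and the SDK's exception type are assumptions; with the OpenAI SDK installed, `openai.RateLimitError` is the precise exception to catch):

```python
import time

def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0):
    """Exponential backoff delays in seconds: base * 2**attempt, capped."""
    return [min(base * (2 ** i), cap) for i in range(retries)]

def chat_with_retry(client, retries: int = 4, **kwargs):
    """Call client.chat.completions.create, retrying on rate-limit errors."""
    for delay in backoff_delays(retries):
        try:
            return client.chat.completions.create(**kwargs)
        except Exception as exc:
            if "429" not in str(exc) and "rate limit" not in str(exc).lower():
                raise  # not a rate-limit problem; surface it immediately
            time.sleep(delay)
    return client.chat.completions.create(**kwargs)  # final attempt
```

Usage mirrors the SDK example above: `chat_with_retry(client, model="llama3-3-70b-chat", messages=[...], max_tokens=50)`.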

Sampling params

See vLLM sampling parameters: docs.vllm.ai
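As one illustration, vLLM-backed OpenAI-compatible servers typically accept extra sampling fields such as `top_k` and `repetition_penalty` alongside the standard OpenAI ones; whether this deployment passes them through is an assumption worth verifying. A request body with both kinds of parameters might be assembled like this:

```python
def build_payload(model: str, messages: list, **sampling):
    """Assemble an OpenAI-style chat payload with extra sampling params.

    Fields like top_k and repetition_penalty are vLLM extensions; the
    official OpenAI SDK sends such fields via its extra_body argument.
    """
    payload = {"model": model, "messages": messages, "stream": False}
    payload.update(sampling)
    return payload

payload = build_payload(
    "llama3-3-70b-chat",
    [{"role": "user", "content": "Hello!"}],
    max_tokens=100,
    temperature=0.7,          # standard OpenAI parameter
    top_k=50,                 # vLLM extension
    repetition_penalty=1.1,   # vLLM extension
)
```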

Error messages

  • 401 Unauthorized — Check API key; ensure it’s lowercase.
  • 404 Not Found — Verify URL and route; remove trailing slash if present.
  • 500 Internal Server Error — Keep messages under ~4000 tokens (~3000 words). If it persists, contact jaden.stryker@nyulangone.org.
  • 429 API rate limit exceeded — You exceeded requests/second quota.
  • no Route matched with those values — Remove trailing slash in the model URL.
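Two of the errors above (404 and "no Route matched") are usually caused by a trailing slash in the URL. A tiny helper (hypothetical, not part of any SDK) sidesteps them when building URLs programmatically:

```python
def normalize_base_url(url: str) -> str:
    """Strip trailing slashes so routes like .../v1/chat/completions match."""
    return url.rstrip("/")
```

For example, `normalize_base_url("http://10.189.26.12:30080/model/llama3-3-70b-chat/v1/")` drops the trailing slash before paths are appended.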