Skip to main content
Open In ColabOpen on GitHub

Azure Cosmos DB No SQL

This notebook shows you how to leverage this integrated vector database to store documents in collections, create indicies and perform vector search queries using approximate nearest neighbor algorithms such as COS (cosine distance), L2 (Euclidean distance), and IP (inner product) to locate documents close to the query vectors.

Azure Cosmos DB is the database that powers OpenAI's ChatGPT service. It offers single-digit millisecond response times, automatic and instant scalability, along with guaranteed speed at any scale.

Azure Cosmos DB for NoSQL now offers vector indexing and search in preview. This feature is designed to handle high-dimensional vectors, enabling efficient and accurate vector search at any scale. You can now store vectors directly in the documents alongside your data. This means that each document in your database can contain not only traditional schema-free data, but also high-dimensional vectors as other properties of the documents. This colocation of data and vectors allows for efficient indexing and searching, as the vectors are stored in the same logical unit as the data they represent. This simplifies data management, AI application architectures, and the efficiency of vector-based operations.

Please refer here for more details:

Sign Up for lifetime free access to get started today.

%pip install --upgrade --quiet azure-cosmos langchain-openai langchain-community
Note: you may need to restart the kernel to use updated packages.
OPENAI_API_KEY = "YOUR_KEY"
OPENAI_API_TYPE = "azure"
OPENAI_API_VERSION = "2023-05-15"
OPENAI_API_BASE = "YOUR_ENDPOINT"
OPENAI_EMBEDDINGS_MODEL_NAME = "text-embedding-ada-002"
OPENAI_EMBEDDINGS_MODEL_DEPLOYMENT = "text-embedding-ada-002"

Insert Data

from langchain_community.document_loaders import PyPDFLoader

# Load the PDF
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
API Reference:PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)
print(docs[0])
page_content='GPT-4 Technical Report
OpenAI∗
Abstract
We report the development of GPT-4, a large-scale, multimodal model which can
accept image and text inputs and produce text outputs. While less capable than
humans in many real-world scenarios, GPT-4 exhibits human-level performance
on various professional and academic benchmarks, including passing a simulated
bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-
based model pre-trained to predict the next token in a document. The post-training
alignment process results in improved performance on measures of factuality and
adherence to desired behavior. A core component of this project was developing
infrastructure and optimization methods that behave predictably across a wide
range of scales. This allowed us to accurately predict some aspects of GPT-4’s
performance based on models trained with no more than 1/1,000th the compute of
GPT-4.
1 Introduction' metadata={'source': 'https://arxiv.org/pdf/2303.08774.pdf', 'page': 0}
indexing_policy = {
"indexingMode": "consistent",
"includedPaths": [{"path": "/*"}],
"excludedPaths": [{"path": '/"_etag"/?'}],
"vectorIndexes": [{"path": "/embedding", "type": "diskANN"}],
"fullTextIndexes": [{"path": "/text"}],
}

vector_embedding_policy = {
"vectorEmbeddings": [
{
"path": "/embedding",
"dataType": "float32",
"distanceFunction": "cosine",
"dimensions": 1536,
}
]
}

full_text_policy = {
"defaultLanguage": "en-US",
"fullTextPaths": [{"path": "/text", "language": "en-US"}],
}
from azure.cosmos import CosmosClient, PartitionKey
from langchain_community.vectorstores.azure_cosmos_db_no_sql import (
AzureCosmosDBNoSqlVectorSearch,
)
from langchain_openai import OpenAIEmbeddings

HOST = "AZURE_COSMOS_DB_ENDPOINT"
KEY = "AZURE_COSMOS_DB_KEY"

cosmos_client = CosmosClient(HOST, KEY)
database_name = "langchain_python_db"
container_name = "langchain_python_container"
partition_key = PartitionKey(path="/id")
cosmos_container_properties = {"partition_key": partition_key}

openai_embeddings = OpenAIEmbeddings(
deployment="smart-agent-embedding-ada",
model="text-embedding-ada-002",
chunk_size=1,
openai_api_key="OPENAI_API_KEY",
)

# insert the documents in AzureCosmosDBNoSql with their embedding
vector_search = AzureCosmosDBNoSqlVectorSearch.from_documents(
documents=docs,
embedding=openai_embeddings,
cosmos_client=cosmos_client,
database_name=database_name,
container_name=container_name,
vector_embedding_policy=vector_embedding_policy,
full_text_policy=full_text_policy,
indexing_policy=indexing_policy,
cosmos_container_properties=cosmos_container_properties,
cosmos_database_properties={},
full_text_search_enabled=True,
)
# Perform a similarity search between the embedding of the query and the embeddings of the documents
import json

query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)

print(results[0].page_content)
performance based on models trained with no more than 1/1,000th the compute of
GPT-4.
1 Introduction
This technical report presents GPT-4, a large multimodal model capable of processing image and
text inputs and producing text outputs. Such models are an important area of study as they have the
potential to be used in a wide range of applications, such as dialogue systems, text summarization,
and machine translation. As such, they have been the subject of substantial interest and progress in
recent years [1–34].
One of the main goals of developing such models is to improve their ability to understand and generate
natural language text, particularly in more complex and nuanced scenarios. To test its capabilities
in such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In
these evaluations it performs quite well and often outscores the vast majority of human test takers.

Vector Search with Score

query = "What were the compute requirements for training GPT 4"

results = vector_search.similarity_search_with_score(
query=query,
k=5,
)

# Display results
for i in range(0, len(results)):
print(f"Result {i+1}: ", results[i][0].json())
print(f"Score {i+1}: ", results[i][1])
print("\n")
Result 1:  {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"9d59c3ed-deac-48cb-9382-a8ab079334e5"},"page_content":"performance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GPT-4, a large multimodal model capable of processing image and\ntext inputs and producing text outputs. Such models are an important area of study as they have the\npotential to be used in a wide range of applications, such as dialogue systems, text summarization,\nand machine translation. As such, they have been the subject of substantial interest and progress in\nrecent years [1–34].\nOne of the main goals of developing such models is to improve their ability to understand and generate\nnatural language text, particularly in more complex and nuanced scenarios. To test its capabilities\nin such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In\nthese evaluations it performs quite well and often outscores the vast majority of human test takers.","type":"Document"}
Score 1: 0.8394796122122777


Result 2: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":43,"id":"e5610de3-8af6-43b9-8266-51c26d76eaa3"},"page_content":"2 GPT-4 Observed Safety Challenges\nGPT-4 demonstrates increased performance in areas such as reasoning, knowledge retention, and\ncoding, compared to earlier models such as GPT-2[ 22] and GPT-3.[ 10] Many of these improvements\nalso present new safety challenges, which we highlight in this section.\nWe conducted a range of qualitative and quantitative evaluations of GPT-4. These evaluations\nhelped us gain an understanding of GPT-4’s capabilities, limitations, and risks; prioritize our\nmitigation efforts; and iteratively test and build safer versions of the model. Some of the specific\nrisks we explored are:6\n•Hallucinations\n•Harmful content\n•Harms of representation, allocation, and quality of service\n•Disinformation and influence operations\n•Proliferation of conventional and unconventional weapons\n•Privacy\n•Cybersecurity\n•Potential for risky emergent behaviors\n•Interactions with other systems\n•Economic impacts\n•Acceleration\n•Overreliance","type":"Document"}
Score 2: 0.8299261339098007


Result 3: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"cddfb7ac-d953-46f4-8a48-76655f116bcf"},"page_content":"GPT-4 Technical Report\nOpenAI∗\nAbstract\nWe report the development of GPT-4, a large-scale, multimodal model which can\naccept image and text inputs and produce text outputs. While less capable than\nhumans in many real-world scenarios, GPT-4 exhibits human-level performance\non various professional and academic benchmarks, including passing a simulated\nbar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-\nbased model pre-trained to predict the next token in a document. The post-training\nalignment process results in improved performance on measures of factuality and\nadherence to desired behavior. A core component of this project was developing\ninfrastructure and optimization methods that behave predictably across a wide\nrange of scales. This allowed us to accurately predict some aspects of GPT-4’s\nperformance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction","type":"Document"}
Score 3: 0.8286253517208215


Result 4: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":3,"id":"4f3152cd-c543-4a4f-b94e-c52c4139c4a8"},"page_content":"plan to refine these methods and register performance predictions across various capabilities before\nlarge model training begins, and we hope this becomes a common goal in the field.\n4 Capabilities\nWe tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally\ndesigned for humans.4We did no specific training for these exams. A minority of the problems in the\nexams were seen by the model during training; for each exam we run a variant with these questions\nremoved and report the lower score of the two. We believe the results to be representative. For further\ndetails on contamination (methodology and per-exam statistics), see Appendix C.\nExams were sourced from publicly-available materials. Exam questions included both multiple-\nchoice and free-response questions; we designed separate prompts for each format, and images were\nincluded in the input for questions which required it. The evaluation setup was designed based","type":"Document"}
Score 4: 0.8278858118680015


Result 5: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":28,"id":"18b4f43d-27d2-404e-9d66-f9328a3588c6"},"page_content":"overall GPT-4 training budget. When mixing in data from these math benchmarks, a portion of the\ntraining data was held back, so each individual training example may or may not have been seen by\nGPT-4 during training.\nWe conducted contamination checking to verify the test set for GSM-8K is not included in the training\nset (see Appendix D). We recommend interpreting the performance results reported for GPT-4\nGSM-8K in Table 2 as something in-between true few-shot transfer and full benchmark-specific\ntuning.\nF Multilingual MMLU\nWe translated all questions and answers from MMLU [ 49] using Azure Translate. We used an\nexternal model to perform the translation, instead of relying on GPT-4 itself, in case the model had\nunrepresentative performance for its own translations. We selected a range of languages that cover\ndifferent geographic regions and scripts, we show an example question taken from the astronomy","type":"Document"}
Score 5: 0.8272138555588135

Vector Search with filtering

query = "What were the compute requirements for training GPT 4"

pre_filter = {
"conditions": [
{"property": "metadata.page", "operator": "$eq", "value": 0},
],
}

results = vector_search.similarity_search_with_score(
query=query,
k=5,
pre_filter=pre_filter,
)

# Display results
for i in range(0, len(results)):
print(f"Result {i+1}: ", results[i][0].json())
print(f"Score {i+1}: ", results[i][1])
print("\n")
Result 1:  {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"9d59c3ed-deac-48cb-9382-a8ab079334e5"},"page_content":"performance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GPT-4, a large multimodal model capable of processing image and\ntext inputs and producing text outputs. Such models are an important area of study as they have the\npotential to be used in a wide range of applications, such as dialogue systems, text summarization,\nand machine translation. As such, they have been the subject of substantial interest and progress in\nrecent years [1–34].\nOne of the main goals of developing such models is to improve their ability to understand and generate\nnatural language text, particularly in more complex and nuanced scenarios. To test its capabilities\nin such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In\nthese evaluations it performs quite well and often outscores the vast majority of human test takers.","type":"Document"}
Score 1: 0.8394796122122777


Result 2: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"cddfb7ac-d953-46f4-8a48-76655f116bcf"},"page_content":"GPT-4 Technical Report\nOpenAI∗\nAbstract\nWe report the development of GPT-4, a large-scale, multimodal model which can\naccept image and text inputs and produce text outputs. While less capable than\nhumans in many real-world scenarios, GPT-4 exhibits human-level performance\non various professional and academic benchmarks, including passing a simulated\nbar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-\nbased model pre-trained to predict the next token in a document. The post-training\nalignment process results in improved performance on measures of factuality and\nadherence to desired behavior. A core component of this project was developing\ninfrastructure and optimization methods that behave predictably across a wide\nrange of scales. This allowed us to accurately predict some aspects of GPT-4’s\nperformance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction","type":"Document"}
Score 2: 0.8286253517208215


Result 3: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"ba814d15-2c12-40d2-8934-db58b393ecb8"},"page_content":"model capability results, as well as model safety improvements and results, in more detail in later\nsections.\nThis report also discusses a key challenge of the project, developing deep learning infrastructure and\noptimization methods that behave predictably across a wide range of scales. This allowed us to make\npredictions about the expected performance of GPT-4 (based on small runs trained in similar ways)\nthat were tested against the final run to increase confidence in our training.\nDespite its capabilities, GPT-4 has similar limitations to earlier GPT models [ 1,37,38]: it is not fully\nreliable (e.g. can suffer from “hallucinations”), has a limited context window, and does not learn\n∗Please cite this work as “OpenAI (2023)\". Full authorship contribution statements appear at the end of the\ndocument. Correspondence regarding this technical report can be sent to gpt4-report@openai.comarXiv:2303.08774v6 [cs.CL] 4 Mar 2024","type":"Document"}
Score 3: 0.8215997601854081


Result 4: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"dd040c08-6ae1-4c73-8b85-7f034d337891"},"page_content":"these evaluations it performs quite well and often outscores the vast majority of human test takers.\nFor example, on a simulated bar exam, GPT-4 achieves a score that falls in the top 10% of test takers.\nThis contrasts with GPT-3.5, which scores in the bottom 10%.\nOn a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models\nand most state-of-the-art systems (which often have benchmark-specific training or hand-engineering).\nOn the MMLU benchmark [ 35,36], an English-language suite of multiple-choice questions covering\n57 subjects, GPT-4 not only outperforms existing models by a considerable margin in English, but\nalso demonstrates strong performance in other languages. On translated variants of MMLU, GPT-4\nsurpasses the English-language state-of-the-art in 24 of 26 languages considered. We discuss these\nmodel capability results, as well as model safety improvements and results, in more detail in later\nsections.","type":"Document"}
Score 4: 0.8209972517303962
from langchain_community.vectorstores.azure_cosmos_db_no_sql import CosmosDBQueryType

query = "What were the compute requirements for training GPT 4"
pre_filter = {
"conditions": [
{
"property": "text",
"operator": "$full_text_contains_any",
"value": "What were the compute requirements for training GPT 4",
},
],
}
results = vector_search.similarity_search_with_score(
query=query,
k=5,
query_type=CosmosDBQueryType.FULL_TEXT_SEARCH,
pre_filter=pre_filter,
)

# Display results
for i in range(0, len(results)):
print(f"Result {i+1}: ", results[i][0].json())
print("\n")
API Reference:CosmosDBQueryType
Result 1:  {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"cddfb7ac-d953-46f4-8a48-76655f116bcf"},"page_content":"GPT-4 Technical Report\nOpenAI∗\nAbstract\nWe report the development of GPT-4, a large-scale, multimodal model which can\naccept image and text inputs and produce text outputs. While less capable than\nhumans in many real-world scenarios, GPT-4 exhibits human-level performance\non various professional and academic benchmarks, including passing a simulated\nbar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-\nbased model pre-trained to predict the next token in a document. The post-training\nalignment process results in improved performance on measures of factuality and\nadherence to desired behavior. A core component of this project was developing\ninfrastructure and optimization methods that behave predictably across a wide\nrange of scales. This allowed us to accurately predict some aspects of GPT-4’s\nperformance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction","type":"Document"}


Result 2: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"9d59c3ed-deac-48cb-9382-a8ab079334e5"},"page_content":"performance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GPT-4, a large multimodal model capable of processing image and\ntext inputs and producing text outputs. Such models are an important area of study as they have the\npotential to be used in a wide range of applications, such as dialogue systems, text summarization,\nand machine translation. As such, they have been the subject of substantial interest and progress in\nrecent years [1–34].\nOne of the main goals of developing such models is to improve their ability to understand and generate\nnatural language text, particularly in more complex and nuanced scenarios. To test its capabilities\nin such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In\nthese evaluations it performs quite well and often outscores the vast majority of human test takers.","type":"Document"}


Result 3: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"dd040c08-6ae1-4c73-8b85-7f034d337891"},"page_content":"these evaluations it performs quite well and often outscores the vast majority of human test takers.\nFor example, on a simulated bar exam, GPT-4 achieves a score that falls in the top 10% of test takers.\nThis contrasts with GPT-3.5, which scores in the bottom 10%.\nOn a suite of traditional NLP benchmarks, GPT-4 outperforms both previous large language models\nand most state-of-the-art systems (which often have benchmark-specific training or hand-engineering).\nOn the MMLU benchmark [ 35,36], an English-language suite of multiple-choice questions covering\n57 subjects, GPT-4 not only outperforms existing models by a considerable margin in English, but\nalso demonstrates strong performance in other languages. On translated variants of MMLU, GPT-4\nsurpasses the English-language state-of-the-art in 24 of 26 languages considered. We discuss these\nmodel capability results, as well as model safety improvements and results, in more detail in later\nsections.","type":"Document"}


Result 4: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"ba814d15-2c12-40d2-8934-db58b393ecb8"},"page_content":"model capability results, as well as model safety improvements and results, in more detail in later\nsections.\nThis report also discusses a key challenge of the project, developing deep learning infrastructure and\noptimization methods that behave predictably across a wide range of scales. This allowed us to make\npredictions about the expected performance of GPT-4 (based on small runs trained in similar ways)\nthat were tested against the final run to increase confidence in our training.\nDespite its capabilities, GPT-4 has similar limitations to earlier GPT models [ 1,37,38]: it is not fully\nreliable (e.g. can suffer from “hallucinations”), has a limited context window, and does not learn\n∗Please cite this work as “OpenAI (2023)\". Full authorship contribution statements appear at the end of the\ndocument. Correspondence regarding this technical report can be sent to gpt4-report@openai.comarXiv:2303.08774v6 [cs.CL] 4 Mar 2024","type":"Document"}


Result 5: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":1,"id":"9edf6760-a2d0-4a0b-a652-25fc89de1d34"},"page_content":"from experience. Care should be taken when using the outputs of GPT-4, particularly in contexts\nwhere reliability is important.\nGPT-4’s capabilities and limitations create significant and novel safety challenges, and we believe\ncareful study of these challenges is an important area of research given the potential societal impact.\nThis report includes an extensive system card (after the Appendix) describing some of the risks we\nforesee around bias, disinformation, over-reliance, privacy, cybersecurity, proliferation, and more.\nIt also describes interventions we made to mitigate potential harms from the deployment of GPT-4,\nincluding adversarial testing with domain experts, and a model-assisted safety pipeline.\n2 Scope and Limitations of this Technical Report\nThis report focuses on the capabilities, limitations, and safety properties of GPT-4. GPT-4 is a\nTransformer-style model [ 39] pre-trained to predict the next token in a document, using both publicly","type":"Document"}

Full Text Search BM 25 Ranking

query = "What were the compute requirements for training GPT 4"

results = vector_search.similarity_search_with_score(
query=query,
k=5,
query_type=CosmosDBQueryType.FULL_TEXT_RANK,
)

# Display results
for i in range(0, len(results)):
print(f"Result {i+1}: ", results[i][0].json())
print("\n")
Result 1:  {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":2,"id":"f2746fd3-bbcb-4197-b2d5-ee7b355b6009"},"page_content":"the HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted\nline; this fit accurately predicts GPT-4’s performance. The x-axis is training compute normalized so that\nGPT-4 is 1.\n3","type":"Document"}


Result 2: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":1,"id":"20153a6c-7c2c-4328-9b0e-e3502d7ac4dd"},"page_content":"safety considerations above against the scientific value of further transparency.\n3 Predictable Scaling\nA large focus of the GPT-4 project was building a deep learning stack that scales predictably. The\nprimary reason is that for very large training runs like GPT-4, it is not feasible to do extensive\nmodel-specific tuning. To address this, we developed infrastructure and optimization methods that\nhave very predictable behavior across multiple scales. These improvements allowed us to reliably\npredict some aspects of the performance of GPT-4 from smaller models trained using 1,000×–\n10,000×less compute.\n3.1 Loss Prediction\nThe final loss of properly-trained large language models is thought to be well approximated by power\nlaws in the amount of compute used to train the model [41, 42, 2, 14, 15].\nTo verify the scalability of our optimization infrastructure, we predicted GPT-4’s final loss on our","type":"Document"}


Result 3: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":2,"id":"6d88f369-4147-4530-9bfb-0ed008413211"},"page_content":"Observed\nPrediction\ngpt-4\n100p 10n 1µ 100µ 0.01 1\nCompute1.02.03.04.05.06.0Bits per wordOpenAI codebase next word predictionFigure 1. Performance of GPT-4 and smaller models. The metric is final loss on a dataset derived\nfrom our internal codebase. This is a convenient, large dataset of code tokens which is not contained in\nthe training set. We chose to look at loss because it tends to be less noisy than other measures across\ndifferent amounts of training compute. A power law fit to the smaller models (excluding GPT-4) is\nshown as the dotted line; this fit accurately predicts GPT-4’s final loss. The x-axis is training compute\nnormalized so that GPT-4 is 1.\nObserved\nPrediction\ngpt-4\n1µ 10µ 100µ 0.001 0.01 0.1 1\nCompute012345– Mean Log Pass RateCapability prediction on 23 coding problems\nFigure 2. Performance of GPT-4 and smaller models. The metric is mean log pass rate on a subset of\nthe HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted","type":"Document"}


Result 4: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":1,"id":"90e4bafe-55bb-406b-afba-a0143c810842"},"page_content":"which measures the ability to synthesize Python functions of varying complexity. We successfully\npredicted the pass rate on a subset of the HumanEval dataset by extrapolating from models trained\nwith at most 1,000×less compute (Figure 2).\nFor an individual problem in HumanEval, performance may occasionally worsen with scale. Despite\nthese challenges, we find an approximate power law relationship −EP[log(pass _rate(C))] = α∗C−k\n2In addition to the accompanying system card, OpenAI will soon publish additional thoughts on the social\nand economic implications of AI systems, including the need for effective regulation.\n2","type":"Document"}


Result 5: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":71,"id":"10ff5a7f-6638-4446-85b2-6e4314eca938"},"page_content":"Unsupervised Multitask Learners,” 2019.\n[23]G. C. Bowker and S. L. Star, Sorting Things Out . MIT Press, Aug. 2000.\n[24]L. Weidinger, J. Uesato, M. Rauh, C. Griffin, P.-S. Huang, J. Mellor, A. Glaese, M. Cheng,\nB. Balle, A. Kasirzadeh, C. Biles, S. Brown, Z. Kenton, W. Hawkins, T. Stepleton, A. Birhane,\nL. A. Hendricks, L. Rimell, W. Isaac, J. Haas, S. Legassick, G. Irving, and I. Gabriel, “Taxonomy\nof Risks posed by Language Models,” in 2022 ACM Conference on Fairness, Accountability,\nand Transparency , FAccT ’22, (New York, NY, USA), pp. 214–229, Association for Computing\nMachinery, June 2022.\n72","type":"Document"}
query = "What were the compute requirements for training GPT 4"

results = vector_search.similarity_search_with_score(
query=query,
k=5,
query_type=CosmosDBQueryType.HYBRID,
)

# Display results
for i in range(0, len(results)):
print(f"Result {i+1}: ", results[i][0].json())
print(f"Score {i+1}: ", results[i][1])
print("\n")
Result 1:  {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":97,"id":"36dfcd6c-d3cf-4e34-a5d6-cc4d63013cba"},"page_content":"Figure 11: Results on IF evaluations across GPT3.5, GPT3.5-Turbo, GPT-4-launch\n98","type":"Document"}
Score 1: 0.8173275975778744


Result 2: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":7,"id":"3d6e4715-4a38-40b1-89f1-e768bad5f9c8"},"page_content":"Preliminary results on a narrow set of academic vision benchmarks can be found in the GPT-4 blog\npost [ 65]. We plan to release more information about GPT-4’s visual capabilities in follow-up work.\n8","type":"Document"}
Score 2: 0.8176419674549597


Result 3: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":2,"id":"f2746fd3-bbcb-4197-b2d5-ee7b355b6009"},"page_content":"the HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted\nline; this fit accurately predicts GPT-4’s performance. The x-axis is training compute normalized so that\nGPT-4 is 1.\n3","type":"Document"}
Score 3: 0.8053881702559759


Result 4: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"9d59c3ed-deac-48cb-9382-a8ab079334e5"},"page_content":"performance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GPT-4, a large multimodal model capable of processing image and\ntext inputs and producing text outputs. Such models are an important area of study as they have the\npotential to be used in a wide range of applications, such as dialogue systems, text summarization,\nand machine translation. As such, they have been the subject of substantial interest and progress in\nrecent years [1–34].\nOne of the main goals of developing such models is to improve their ability to understand and generate\nnatural language text, particularly in more complex and nuanced scenarios. To test its capabilities\nin such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In\nthese evaluations it performs quite well and often outscores the vast majority of human test takers.","type":"Document"}
Score 4: 0.8394796122122777


Result 5: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":1,"id":"20153a6c-7c2c-4328-9b0e-e3502d7ac4dd"},"page_content":"safety considerations above against the scientific value of further transparency.\n3 Predictable Scaling\nA large focus of the GPT-4 project was building a deep learning stack that scales predictably. The\nprimary reason is that for very large training runs like GPT-4, it is not feasible to do extensive\nmodel-specific tuning. To address this, we developed infrastructure and optimization methods that\nhave very predictable behavior across multiple scales. These improvements allowed us to reliably\npredict some aspects of the performance of GPT-4 from smaller models trained using 1,000×–\n10,000×less compute.\n3.1 Loss Prediction\nThe final loss of properly-trained large language models is thought to be well approximated by power\nlaws in the amount of compute used to train the model [41, 42, 2, 14, 15].\nTo verify the scalability of our optimization infrastructure, we predicted GPT-4’s final loss on our","type":"Document"}
Score 5: 0.8213247840132897

Hybrid Search with filtering

query = "What were the compute requirements for training GPT 4"

pre_filter = {
"conditions": [
{
"property": "text",
"operator": "$full_text_contains_any",
"value": "compute requirements",
},
{"property": "metadata.page", "operator": "$eq", "value": 0},
],
"logical_operator": "$and",
}

results = vector_search.similarity_search_with_score(
query=query,
k=5,
query_type=CosmosDBQueryType.HYBRID,
)

# Display results
for i in range(0, len(results)):
print(f"Result {i+1}: ", results[i][0].json())
print(f"Score {i+1}: ", results[i][1])
print("\n")
Result 1:  {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":97,"id":"36dfcd6c-d3cf-4e34-a5d6-cc4d63013cba"},"page_content":"Figure 11: Results on IF evaluations across GPT3.5, GPT3.5-Turbo, GPT-4-launch\n98","type":"Document"}
Score 1: 0.8173275975778744


Result 2: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":7,"id":"3d6e4715-4a38-40b1-89f1-e768bad5f9c8"},"page_content":"Preliminary results on a narrow set of academic vision benchmarks can be found in the GPT-4 blog\npost [ 65]. We plan to release more information about GPT-4’s visual capabilities in follow-up work.\n8","type":"Document"}
Score 2: 0.8176419674549597


Result 3: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":2,"id":"f2746fd3-bbcb-4197-b2d5-ee7b355b6009"},"page_content":"the HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted\nline; this fit accurately predicts GPT-4’s performance. The x-axis is training compute normalized so that\nGPT-4 is 1.\n3","type":"Document"}
Score 3: 0.8053881702559759


Result 4: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"9d59c3ed-deac-48cb-9382-a8ab079334e5"},"page_content":"performance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GPT-4, a large multimodal model capable of processing image and\ntext inputs and producing text outputs. Such models are an important area of study as they have the\npotential to be used in a wide range of applications, such as dialogue systems, text summarization,\nand machine translation. As such, they have been the subject of substantial interest and progress in\nrecent years [1–34].\nOne of the main goals of developing such models is to improve their ability to understand and generate\nnatural language text, particularly in more complex and nuanced scenarios. To test its capabilities\nin such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In\nthese evaluations it performs quite well and often outscores the vast majority of human test takers.","type":"Document"}
Score 4: 0.8394796122122777


Result 5: {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":1,"id":"20153a6c-7c2c-4328-9b0e-e3502d7ac4dd"},"page_content":"safety considerations above against the scientific value of further transparency.\n3 Predictable Scaling\nA large focus of the GPT-4 project was building a deep learning stack that scales predictably. The\nprimary reason is that for very large training runs like GPT-4, it is not feasible to do extensive\nmodel-specific tuning. To address this, we developed infrastructure and optimization methods that\nhave very predictable behavior across multiple scales. These improvements allowed us to reliably\npredict some aspects of the performance of GPT-4 from smaller models trained using 1,000×–\n10,000×less compute.\n3.1 Loss Prediction\nThe final loss of properly-trained large language models is thought to be well approximated by power\nlaws in the amount of compute used to train the model [41, 42, 2, 14, 15].\nTo verify the scalability of our optimization infrastructure, we predicted GPT-4’s final loss on our","type":"Document"}
Score 5: 0.8213247840132897

Was this page helpful?