Long Text Summarization, RetrievalQA and Vector Databases - LangChain Arxiv Tutor

RAGoon · 10 min read · Jul 7

In this second article we will talk about semantic search and populate a vector database.

Vector Database

First, note that there are many ways to perform similarity/semantic search across textual data, for example ElasticSearch with BM25. Here we will embed our documents and queries with ada and use a vector database, storing all of our passages in it. In a vector database we can store, alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search over those passages, we embed the query with the same embedding model and retrieve the most similar passages through a nearest-neighbor search. There are several vector databases worth considering; here we will use FAISS, from Facebook.

How does FAISS work (from a high level)?

As the number of documents in our vector DB grows, search can get slower, because a plain nearest-neighbor search is brute force: it computes the distance between the query vector and every other vector in the vector store. Why is FAISS efficient?

- It parallelizes the computation on GPUs.
- It uses three steps:
  - Normalize the vectors with PCA and L2 normalization.
  - Inverted File Index (IVF): form clusters from sets of vectors, based on their similarity. Instead of computing the distance between the query and every vector, we compute the distance between the query and each cluster, then search inside the closest cluster. Note that this is a non-exhaustive search.
  - Quantization, to decrease memory size.
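To make the inverted-file idea concrete, here is a minimal sketch using the faiss library directly, outside LangChain. The dimension, nlist and nprobe values are illustrative assumptions, not values taken from this tutorial.

import numpy as np
import faiss  # pip install faiss-cpu (or faiss-gpu)

d = 1536  # e.g. the dimension of ada embeddings (assumption for the sketch)
xb = np.random.rand(10_000, d).astype("float32")  # stand-in for passage embeddings
xq = np.random.rand(5, d).astype("float32")       # stand-in for query embeddings

quantizer = faiss.IndexFlatL2(d)               # exact index used to assign vectors to clusters
index = faiss.IndexIVFFlat(quantizer, d, 100)  # 100 clusters (nlist)
index.train(xb)                                # learn the cluster centroids
index.add(xb)                                  # add the passage vectors

index.nprobe = 4                     # clusters visited at query time (non-exhaustive search)
distances, ids = index.search(xq, 3) # top-3 neighbors for each query

The LangChain wrapper used below hides this kind of index construction behind a much simpler interface.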
FAISS with LangChain

As said previously, we need a vector store and an embedding model:

import os
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
embeddings = OpenAIEmbeddings()

We init the vector store with the documents. We do not include the "Title" chunks, as they won't bring really useful information for retrieval:

vdb_chunks = FAISS.from_documents(
    [doc for doc in docs if doc.metadata["category"] != "Title"],
    embedding=embeddings,
)

Add the adjacent papers' content

From the previously gathered adjacent papers' Arxiv numbers, we can add their content to the vector DB:

for pdf_number in adjacents_papers_numbers:
    docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load()
    docs = text_splitter.split_documents(docs)
    vdb_chunks.add_documents(docs)

Save the FAISS index:

vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent")

The vector DB is now functional; we can retrieve the most similar documents given a query:

vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent")
vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?")

This returns a list of Document objects. The first one looks like this (the text is raw PDF extraction, so some characters are garbled), and three more chunks follow, drawn from papers/2306.08302.pdf and papers/1909.03193.pdf:

[Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), ...]

Retrieval Question Answering

From this we can build a retrieval question answering system. You could build it with farm.haystack and others, but here we will, of course, use LangChain. There are multiple types of question-answering systems, such as Retriever-Reader or Generator systems. With LangChain and LLMs we will build a Retriever-Generator system: the Retriever takes a query and finds a set of relevant documents from an external source (here, the FAISS vector DB), and the Generator (the LLM) takes those contexts and outputs an answer based on them. A Reader, by contrast, would output the span of the context that answers the question. Because it has access to external data (source knowledge), this kind of QA system is called Open-Book; a Generator-only QA system would be Closed-Book, relying entirely on the generator's (here, the LLM's) parametric knowledge.

RetrievalQA with LangChain

Building a Retriever-Generator QA system with LangChain is easy. First we define the LLM we'll use:

from langchain.llms import OpenAI

llm = OpenAI(temperature=0.0)  # temperature 0.0: we don't want creative answers

Then we init the RetrievalQA object. Here we specify chain_type="stuff", but you can also use map_reduce or refine; we will cover those for summarization.

from langchain.chains import RetrievalQAWithSourcesChain

qa = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vdb_chunks.as_retriever(),
)

That's it, we can use the qa chain:

qa({"question": "What are Knowledge Graphs"})

{'question': 'What are Knowledge Graphs',
 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n',
 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'}

As we can see, the qa chain uses two passages from two different papers to generate the answer. We can also inspect the prompt this QA chain uses, as sketched below.
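One way to read the prompt off the chain object is sketched here. The attribute path is an assumption based on how RetrievalQAWithSourcesChain with chain_type="stuff" is assembled (a StuffDocumentsChain wrapping an LLMChain); if your LangChain version differs, printing the chain object itself also reveals the templates.

# Hedged sketch: inspect the prompts behind the QA chain (attribute path assumed).
print(qa.combine_documents_chain.llm_chain.prompt.template)  # the QA-with-sources prompt
print(qa.combine_documents_chain.document_prompt.template)   # how each retrieved Document is formatted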
Summarization

Summarization with LangChain is trickier than RetrievalQA. RetrievalQA only depends on the chunk size, whereas summarization with LangChain, by default, depends on the length of the whole text. Our text contains around 22,000 tokens with GPT-3.5. That means we cannot use the "stuff" summarization chain, since it passes the whole text as is. Of course we could use models with larger context lengths (GPT-4, ...), but you can still run into the same problem with big documents. The first thing we can do, and we already started doing it, is to exclude unnecessary text data (like the references), which got us from 28k to 22k tokens. We can also exclude the titles (even though one can argue they are useful). While the text is still above the token limit, we have at least decreased the cost of the API calls.
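If you want to reproduce that token count, a minimal sketch with tiktoken follows; it assumes the cl100k_base encoding used by gpt-3.5-turbo and that docs still holds the chunked Documents from earlier.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by gpt-3.5-turbo
n_tokens = sum(len(enc.encode(doc.page_content)) for doc in docs)
print(n_tokens)  # roughly 22k once the references are excluded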
Using Map-Reduce

Map-Reduce works as follows:

- It makes an API call to the LLM, with the same prompt, for each chunk, to get a summary of each chunk. This is the Mapping step. It can be parallelized, as each doc is treated independently from the others.
- Every summary from the previous step is passed to a last LLM call, which outputs the final summary. This is the Reducing step.

With LangChain:

from langchain.chains.summarize import load_summarize_chain

summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce")

Using Refine

Refine works in the following way:

1. We init an initial value (the global summary).
2. We iterate over the chunks and, at each step, set the global summary to the summary of the current chunk combined with the global summary.
3. We continue until every document has been combined.

The method can be seen as:

summarize_chain = load_summarize_chain(llm)

global_summary = ""  # init the summary value
for doc in docs:
    # combine each doc with the running global summary
    global_summary = summarize_chain([global_summary, doc])

Using LangChain:

summarize_refine = load_summarize_chain(llm=llm, chain_type="refine")
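The post builds these chains but does not show them running; here is a minimal usage sketch, assuming docs is the list of chunked Documents and keeping in mind that each chunk triggers an LLM call.

# Each of these makes one LLM call per chunk, plus the final reduce/refine calls.
map_reduce_summary = summarize_map_reduce.run(docs)
refine_summary = summarize_refine.run(docs)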
It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. 
Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. 
Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. 
Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. 
qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
Embeddings + KMeans over chunks

Note that we can apply the same concept with LangChain and KMeans, this time over the chunks rather than the sentences, using ada for long-context embeddings.

import numpy as np
from sklearn.cluster import KMeans

# Get ada's embeddings for each chunk
embeds = np.array(embeddings.embed_documents([doc.page_content for doc in docs]))

nclusters = 8  # Select a number of clusters

# KMeans over the embeddings for clustering
kmeans = KMeans(n_clusters=nclusters).fit(embeds)

We then get, for each cluster, the index of the chunk closest to its centroid :

closest_to_centroid_indices = []
for i in range(nclusters):
    # Distance between each embedding and the cluster's centroid
    distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1)
    # Index of the embedding with the minimum distance to the centroid
    closest_to_centroid_indices.append(np.argmin(distances))
closest_to_centroid_indices = sorted(closest_to_centroid_indices)
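As a side note (not in the original walkthrough), scikit-learn can do the same centroid-to-chunk lookup in a single call, which avoids the explicit loop :

from sklearn.metrics import pairwise_distances_argmin_min

# For each centroid, the index of the closest embedding (the distances themselves are discarded).
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, embeds)
closest_to_centroid_indices = sorted(closest.tolist())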
Then perform a Mapping Step, where we map each selected chunk through an API call to get a summary of that chunk :

from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document

summary_prompt = """
You will be given a text. Give a concise and understanding summary.

```{text}```

CONCISE SUMMARY :
"""

summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"])
summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template)

chunk_summaries = []
for doc in [docs[idx] for idx in closest_to_centroid_indices]:
    chunk_summaries.append(summary_chain.run([doc]))
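Since each chunk is summarized independently, this mapping loop can also be parallelized, in the same spirit as the map_reduce chain seen earlier. Here is a minimal sketch with a thread pool; the worker count is arbitrary and API rate limits may apply, so treat it as an optional optimization rather than part of the original method :

from concurrent.futures import ThreadPoolExecutor

selected_docs = [docs[idx] for idx in closest_to_centroid_indices]

# Each per-chunk summary is an independent API call, so we can run them concurrently.
with ThreadPoolExecutor(max_workers=4) as executor:
    chunk_summaries = list(executor.map(lambda d: summary_chain.run([d]), selected_docs))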
We then combine all those chunk summaries into a final summary through one last API call. This is the Reducing Step :

summaries = Document(page_content="\n".join(chunk_summaries))

global_summary_prompt = """
You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries.

```{text}```

SUMMARY:
"""

global_summary_prompt_template = PromptTemplate(template=global_summary_prompt, input_variables=["text"])
global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template)

summary = global_summary_chain.run([summaries])

We get the following global summary :

'\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.'
You can improve this method by :

Selecting a higher number of clusters
Using a different clustering algorithm (a sketch of this follows below)
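For the second point, here is a hypothetical variation that swaps KMeans for agglomerative clustering. AgglomerativeClustering does not expose centroids, so we compute each cluster's mean embedding ourselves and keep the chunk closest to it :

import numpy as np
from sklearn.cluster import AgglomerativeClustering

n_clusters = 8
labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(embeds)

representative_indices = []
for cluster_id in range(n_clusters):
    member_idx = np.where(labels == cluster_id)[0]
    centroid = embeds[member_idx].mean(axis=0)  # mean embedding of the cluster
    distances = np.linalg.norm(embeds[member_idx] - centroid, axis=1)
    representative_indices.append(int(member_idx[np.argmin(distances)]))
representative_indices = sorted(representative_indices)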
The next article will focus on Test Generation, to enhance engagement, comprehension, and memorization.
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
Then run the results through an API call, as before:

llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result))

>>> """'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'"""

Embeddings + KMeans over chunks

Note that we can apply the same idea with LangChain and KMeans, this time over chunks rather than sentences, using ada embeddings to handle longer contexts.

import numpy as np
from sklearn.cluster import KMeans

# Get the ada embeddings for each chunk
embeds = embeddings.embed_documents([doc.page_content for doc in docs])

nclusters = 8  # Select a number of clusters

# KMeans over the embeddings for clustering
kmeans = KMeans(n_clusters=nclusters).fit(embeds)

We then get, for each cluster, the index of the chunk closest to its centroid:

closest_to_centroid_indices = []
for i in range(nclusters):
    # Distance of every embedding to this cluster's centroid
    distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1)
    # Keep the index of the embedding with the minimum distance to the centroid
    closest_to_centroid_indices.append(np.argmin(distances))
closest_to_centroid_indices = sorted(closest_to_centroid_indices)

Then we perform a Mapping step, where each selected chunk is summarized through an API call:

summary_prompt = """
You will be given a text. Give a concise and understandable summary.

```{text}```

CONCISE SUMMARY :
"""
summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"])
summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template)

chunk_summaries = []
for doc in [docs[idx] for idx in closest_to_centroid_indices]:
    chunk_summaries.append(summary_chain.run([doc]))

We then combine all those summaries into a final summary through one last API call. This is the Reducing step.

from langchain.schema import Document

summaries = Document(page_content="\n".join(chunk_summaries))

global_summary_prompt = """
You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries.

```{text}```

SUMMARY:
"""
global_summary_prompt_template = PromptTemplate(template=global_summary_prompt, input_variables=["text"])
global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template)

summary = global_summary_chain.run([summaries])

'\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.'

You can improve this method by:
- Selecting a higher number of clusters
- Using a different clustering algorithm (see the sketch below)
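As an illustration of that last point, here is a minimal sketch that swaps KMeans for scikit-learn's AgglomerativeClustering, reusing the embeds list from above. Since agglomerative clustering exposes no centroids, we compute one per cluster from its members; the agglo and selected_indices names are my own.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

nclusters = 8
embeds_arr = np.asarray(embeds)
agglo = AgglomerativeClustering(n_clusters=nclusters).fit(embeds_arr)

selected_indices = []
for label in range(nclusters):
    # Indices of the chunks assigned to this cluster
    member_indices = np.where(agglo.labels_ == label)[0]
    # No cluster_centers_ here, so use the members' mean embedding as the centroid
    centroid = embeds_arr[member_indices].mean(axis=0)
    distances = np.linalg.norm(embeds_arr[member_indices] - centroid, axis=1)
    # Keep the member closest to that mean as the cluster's representative chunk
    selected_indices.append(int(member_indices[np.argmin(distances)]))
selected_indices = sorted(selected_indices)

The Mapping and Reducing steps stay exactly the same; only the way representative chunks are selected changes.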
The next article will focus on Tests Generation, to enhance engagement, comprehension, and memorization.
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let's implement LexRank in Python through sumy.

from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

# We need to pass a string, not LangChain's Documents.
# We only select "Text"-labeled data, as "Title" chunks would most likely bring noise.
full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text")

Init sumy's Tokenizer, Parser and Summarizer:

tokenizer = Tokenizer("english")
parser = PlaintextParser(full_text, tokenizer)
lex_rank = LexRankSummarizer()

We perform an extractive summary of 40 sentences, then turn the output into a list of strings:

most_important_sents = lex_rank(parser.document, sentences_count=40)
most_important_sents = [str(sent) for sent in most_important_sents if str(sent)]

Now we make an API call to GPT-3.5:

from langchain.prompts import PromptTemplate

prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper.
The sentences will be enclosed in triple backticks (```).
sentences :
```{most_important_sents}```
SUMMARY :"""

prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"])
llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents))

We get the following output:

"""\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs."""

This method is quite fast and less costly.

BERT Extractive Summarizer

The BERT Extractive Summarizer is based on BERT and works the following way:
- Embeds the sentences (through BERT).
- Runs a clustering algorithm on the embeddings; sentences that are "close" in the embedding space end up in the same cluster.
- Selects the sentences closest to the centroid of each cluster; these are considered the most representative of their cluster.
- Performs coreference resolution.
- Combines the selected sentences.

While we could implement this ourselves fairly easily, there is already a Python library that takes care of it (bert-extractive-summarizer):

from summarizer import Summarizer

model = Summarizer()
result = model(full_text, num_sentences=40)  # We specify a number of sentences

Note that the model used by default is bert-large-uncased, but you can specify any model (from Hugging Face, AllenNLP, etc.), so you can use domain-specific models for specific types of documents (a sketch follows after the output below). Then run the results through an API call:

llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result))

>>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'"""
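As mentioned above, a different encoder can be passed to the summarizer. A minimal sketch of swapping in a domain-specific Hugging Face checkpoint; the model name here is only an example, not a recommendation from the original setup:

```python
from summarizer import Summarizer

# Sketch: use a scientific-domain encoder instead of the default bert-large-uncased.
# "allenai/scibert_scivocab_uncased" is just an example checkpoint name.
sci_model = Summarizer("allenai/scibert_scivocab_uncased")
sci_result = sci_model(full_text, num_sentences=40)
```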
Embeddings + KMeans over chunks

Note that we can apply the same concept with LangChain by clustering the chunks rather than the sentences, using ada for long-context embeddings.

import numpy as np
from sklearn.cluster import KMeans

# Get ada's embeddings for each chunk
embeds = np.array(embeddings.embed_documents([doc.page_content for doc in docs]))
nclusters = 8  # Select a number of clusters
# KMeans over the embeddings for clustering
kmeans = KMeans(n_clusters=nclusters).fit(embeds)

We then get, for each cluster, the index of the chunk closest to its centroid:

closest_to_centroid_indices = []
for i in range(nclusters):
    # Distances between each embedding and the cluster's centroid
    distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1)
    # Select the idx of the embedding with the minimum distance to the centroid
    closest_to_centroid_indices.append(np.argmin(distances))
closest_to_centroid_indices = sorted(closest_to_centroid_indices)

Then comes the Mapping Step, where we pass each selected chunk through an API call to get a summary of that chunk:

from langchain.schema import Document

summary_prompt = """
You will be given a text. Give a concise and understandable summary.
```{text}```
CONCISE SUMMARY :
"""
summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"])
summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template)

chunk_summaries = []
for doc in [docs[idx] for idx in closest_to_centroid_indices]:
    chunk_summaries.append(summary_chain.run([doc]))

We then combine all those summaries into a final summary through one more API call. This is the Reducing Step.

summaries = Document(page_content="\n".join(chunk_summaries))

global_summary_prompt = """
You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries.
```{text}```
SUMMARY:
"""
global_summary_prompt_template = PromptTemplate(template=global_summary_prompt, input_variables=["text"])
global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template)

summary = global_summary_chain.run([summaries])

'\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.'

You can improve this method by:
- Selecting a higher number of clusters
- Using a different clustering algorithm (a sketch of this swap is given below)

The next article will focus on Tests Generation, to enhance engagement, comprehension, and memorization.
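As a quick sketch of the second improvement suggested above, the KMeans step can be swapped for another clustering algorithm; here, agglomerative clustering from scikit-learn. The cluster count and the "one representative chunk per cluster" selection mirror the KMeans version, and the variable names are only illustrative:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Assumes `embeds` (np.array of chunk embeddings) from the KMeans section above.
nclusters = 8
agglo = AgglomerativeClustering(n_clusters=nclusters).fit(embeds)

# Agglomerative clustering exposes no centroids, so compute the mean embedding
# of each cluster and pick the chunk closest to it as that cluster's representative.
representative_indices = []
for label in range(nclusters):
    member_idx = np.where(agglo.labels_ == label)[0]
    centroid = embeds[member_idx].mean(axis=0)
    distances = np.linalg.norm(embeds[member_idx] - centroid, axis=1)
    representative_indices.append(member_idx[np.argmin(distances)])
representative_indices = sorted(representative_indices)
```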
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
Summarization

Summarization with LangChain is trickier than RetrievalQA. RetrievalQA only depends on the chunk size, while summarization with LangChain, by default, depends on the length of the whole text. Our text contains around 22,000 tokens (counted with the GPT-3.5 tokenizer), which means we cannot use the "stuff" summarization chain, as it passes the whole text as-is. Of course we can use models with larger context lengths (GPT-4, …), but we can still hit the same problem with big documents.

The first thing we can do, and we already started to do, is exclude unnecessary text data (like the references), which got us from 28k to 22k tokens. We can also exclude the Title chunks (even though one can argue they can be useful). While the text is still above the token limit, we at least decreased the cost of the API calls.

Using Map-Reduce

Map-Reduce works in two steps:

- Mapping Step: make one LLM call per chunk, with the same prompt, to get a summary of each chunk. This step can be parallelized, as each chunk is treated independently from the others.
- Reducing Step: every summary from the previous step is passed to a final LLM call that outputs the global summary.

With LangChain:

from langchain.chains.summarize import load_summarize_chain

summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce")

Using Refine

Refine works in the following way:

1. We init an initial value (the global summary).
2. We iterate over the chunks and, at each step, set the global summary to the summary of the current chunk combined with the global summary.
3. We continue until every document has been combined.

Conceptually, the method can be seen as:

summarize_chain = load_summarize_chain(llm)

global_summary = ""  # Init the summary value
for doc in docs:
    # Pseudocode: each iteration folds the next chunk into the running summary
    global_summary = summarize_chain([global_summary, doc])

Using LangChain:

summarize_refine = load_summarize_chain(llm=llm, chain_type="refine")

But these two approaches are still costly and can take a lot of time for long documents, due to how they work.
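Before choosing between them, it helps to measure how big the document actually is in tokens, both to see why "stuff" won't fit and to estimate what the map-reduce or refine calls will cost. A minimal sketch using tiktoken; the model name and the all_text variable are illustrative assumptions, not part of the original pipeline:

import tiktoken

# Tokenizer used by gpt-3.5-turbo (cl100k_base)
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Rough total size of the document (assumes the `docs` list built earlier)
all_text = " ".join(doc.page_content for doc in docs)
print(f"Full text: {len(encoding.encode(all_text))} tokens")

# Per-chunk sizes, useful to sanity-check the mapping step
chunk_sizes = [len(encoding.encode(doc.page_content)) for doc in docs]
print(f"Largest chunk: {max(chunk_sizes)} tokens over {len(chunk_sizes)} chunks")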
Extractive then Abstractive Summarization

To resolve this cost problem, we can first run an extractive summarization algorithm and then get an abstractive summary with GPT-3.5. There are a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …); I will present two of them, one transformer-based and one that is not.

LexRank

LexRank is a graph-based summarization algorithm, built on the fundamental idea that sentences "recommend" each other, just like webpages link to each other on the internet. Each sentence is a node in a graph represented by a similarity matrix, and the algorithm scores each sentence by computing the principal eigenvector of that matrix (its centrality). The sentences with the highest centrality scores are kept for the summary.

Let's implement LexRank in Python through sumy:

from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

# We need to pass a string, not LangChain Documents.
# We only keep "Text"-labeled chunks, as "Title" chunks would most likely bring noise.
full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text")

We init sumy's Tokenizer, Parser and Summarizer:

tokenizer = Tokenizer("english")
parser = PlaintextParser(full_text, tokenizer)
lex_rank = LexRankSummarizer()

We perform an extractive summary of 40 sentences, then turn the output into a list of strings:

most_important_sents = lex_rank(parser.document, sentences_count=40)
most_important_sents = [str(sent) for sent in most_important_sents if str(sent)]

Now we make an API call to GPT-3.5:

from langchain.prompts import PromptTemplate

prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper.
The sentences will be enclosed in triple backticks (```).
sentences :
```{most_important_sents}```
SUMMARY :"""

prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"])
llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents))

We get the following output:

"""\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs."""

This method is quite fast and less costly.

BERT Extractive Summarizer

BERT Extractive Summarizer is based on BERT and works the following way:

- Embed the sentences (through BERT).
- Run a clustering algorithm on the embeddings; sentences that are "close" in the embedding space end up in the same cluster.
- Select the sentences closest to the centroid of each cluster; these are considered the most representative of each cluster.
- Apply coreference resolution.
- Combine the selected sentences.

While we could implement this ourselves, there is already a Python library that takes care of it (bert-extractive-summarizer):

from summarizer import Summarizer

model = Summarizer()
result = model(full_text, num_sentences=40)  # We specify the number of sentences to keep

Note that the model used by default is bert-large-uncased, but you can specify any model (from Hugging Face, AllenNLP, etc.), so you can use domain-specific models for specific types of documents (a sketch of this follows at the end of this section).

Then we run the result through an API call:

llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result))

>>> """'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'"""
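As a concrete illustration of swapping in a domain-specific encoder, here is a minimal sketch following bert-extractive-summarizer's custom-model interface; the SciBERT checkpoint is only an example choice, not something used in the original pipeline:

from transformers import AutoConfig, AutoModel, AutoTokenizer
from summarizer import Summarizer

# Example domain-specific checkpoint (assumption): SciBERT, pretrained on scientific text
checkpoint = "allenai/scibert_scivocab_uncased"

custom_config = AutoConfig.from_pretrained(checkpoint)
custom_config.output_hidden_states = True  # The summarizer uses hidden states as sentence embeddings
custom_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
custom_model = AutoModel.from_pretrained(checkpoint, config=custom_config)

sci_summarizer = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)
result_scibert = sci_summarizer(full_text, num_sentences=40)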
Embeddings + KMeans over chunks

Note that we can apply the same concept with LangChain and KMeans over the chunks, rather than over sentences this time, using ada for long-context embeddings.

import numpy as np
from sklearn.cluster import KMeans

# Get ada's embeddings for each chunk
embeds = embeddings.embed_documents([doc.page_content for doc in docs])

nclusters = 8  # Select a number of clusters

# KMeans over the embeddings for clustering
kmeans = KMeans(n_clusters=nclusters).fit(embeds)

We then get the index of the chunk closest to the centroid of each cluster:

closest_to_centroid_indices = []
for i in range(nclusters):
    # Get the distance between each embedding and the cluster's centroid
    distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1)
    # Select the idx of the embedding with the minimum distance to the centroid
    closest_to_centroid_indices.append(np.argmin(distances))
closest_to_centroid_indices = sorted(closest_to_centroid_indices)

Then we perform a Mapping Step, where we summarize each selected chunk through an API call:

summary_prompt = """
You will be given a text. Give a concise and understanding summary.
```{text}```
CONCISE SUMMARY :
"""

summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"])
summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template)

chunk_summaries = []
for doc in [docs[idx] for idx in closest_to_centroid_indices]:
    chunk_summaries.append(summary_chain.run([doc]))

We then combine all those summaries into a final summary through one last API call. This is the Reducing Step.

from langchain.docstore.document import Document

summaries = Document(page_content="\n".join(chunk_summaries))

global_summary_prompt = """
You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries.
```{text}```
SUMMARY:
"""

global_summary_prompt_template = PromptTemplate(template=global_summary_prompt, input_variables=["text"])
global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template)

summary = global_summary_chain.run([summaries])

'\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.'

You can improve this method by:

- Selecting a higher number of clusters
- Using a different clustering algorithm

The next article will focus on Tests Generation, to enhance engagement, comprehension, and memorization.
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
First we define the LLM we will use:

from langchain.llms import OpenAI

llm = OpenAI(temperature=0.0)  # Temperature set to 0.0, as we don't want creative answers

Then we initialize the RetrievalQA object. Here we specify chain_type="stuff", but you can also use map_reduce or refine; we will cover those for summarization.

from langchain.chains import RetrievalQAWithSourcesChain

qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever())

That's it, we can now use the qa chain:

qa({"question": "What are Knowledge Graphs"})

{'question': 'What are Knowledge Graphs',
 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n',
 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'}

As we can see, the qa chain used two passages from two different papers to generate the answer.
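We can also inspect the prompt this QA chain sends to the LLM. A minimal sketch, assuming the chain exposes its combine_documents_chain and llm_chain attributes as in the LangChain releases of this period:

# Print the prompt template the "stuff" chain wraps around the retrieved passages
# (the attribute path is an assumption about LangChain's internals, not a documented API)
print(qa.combine_documents_chain.llm_chain.prompt.template)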
Summarization

Summarization with LangChain is trickier than RetrievalQA. RetrievalQA only depends on the chunk size, whereas summarization with LangChain depends, by default, on the whole text length. Our text contains around 22,000 tokens for GPT-3.5, which means we cannot use the "stuff" summarization chain, since it passes the whole text as is. Of course we could use models with larger context lengths (GPT-4, …), but we would still run into the same problem with big enough documents.

The first thing we can do, and we already started doing it, is exclude unnecessary text data (like the references), which got us from 28k down to 22k tokens. We can also exclude Titles (even though one can argue they are useful). The text is still above the token limit, but we have at least decreased the cost of the API call.

Using Map-Reduce

Map-Reduce makes an API call to the LLM, with the same prompt, for each chunk, to get a summary of each chunk. This is the Mapping step. It can be parallelized, as each chunk is treated independently of the others. Every summary from the previous step is then passed to a final LLM call, which outputs the final summary. This is the Reducing step.

With LangChain:

from langchain.chains.summarize import load_summarize_chain

summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce")

Using Refine

Refine works in the following way:

1. We initialize a value (the global summary).
2. We iterate over the chunks and, at each step, set the global summary to the summary of the current chunk combined with the running global summary.
3. We continue until every document has been combined.

The method can be sketched as (pseudocode):

summarize_chain = load_summarize_chain(llm)

global_summary = ""  # Init the summary value
for doc in docs:
    # Combine each doc with the running global summary
    global_summary = summarize_chain([global_summary, doc])

With LangChain:

summarize_refine = load_summarize_chain(llm=llm, chain_type="refine")

But these two approaches are still costly and can take a lot of time on long documents, due to how they work.

Extractive then Abstractive Summarization

To address this, we can first run an extractive summarization algorithm and then get an abstractive summary of the extracted sentences with GPT-3.5. There are a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …); I will present two of them, one transformer-based and one that is not.

LexRank

LexRank is a graph-based summarization algorithm, built on the idea that sentences "recommend" each other, much like webpages link to each other on the internet. Each sentence is a node in a graph represented by a similarity matrix, and the algorithm scores each sentence by computing the principal eigenvector of that matrix. The sentences with the highest centrality scores are kept for the summary. A toy sketch of this centrality computation follows, before we move to sumy's implementation.
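To make the centrality idea concrete, here is a minimal, self-contained sketch (not sumy's implementation): it scores sentences by power iteration over a row-normalized similarity matrix. The TF-IDF cosine similarity and the absence of LexRank's thresholding and damping details are simplifying assumptions.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Knowledge graphs store structured knowledge as triples.",
    "Large language models encode knowledge implicitly in their parameters.",
    "Knowledge graphs can be combined with large language models.",
    "The weather was nice during the conference.",
]

# Cosine similarity between TF-IDF sentence vectors (the sentence graph as a matrix)
tfidf = TfidfVectorizer().fit_transform(sentences)
sim = (tfidf @ tfidf.T).toarray()

# Row-normalize so each row sums to 1, then power-iterate to approximate
# the principal (left) eigenvector, i.e. the stationary centrality scores
transition = sim / sim.sum(axis=1, keepdims=True)
scores = np.full(len(sentences), 1.0 / len(sentences))
for _ in range(50):
    scores = scores @ transition

# Keep the top-k most central sentences as the extractive summary
top_k = 2
summary_idx = sorted(np.argsort(scores)[-top_k:])
print([sentences[i] for i in summary_idx])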
Let's implement LexRank in Python through sumy:

from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

# We need to pass a string, not LangChain Documents
# We only keep "Text" labeled chunks, as "Title" chunks would most likely bring noise
full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text")

Init sumy's Tokenizer, Parser and Summarizer:

tokenizer = Tokenizer("english")
parser = PlaintextParser(full_text, tokenizer)
lex_rank = LexRankSummarizer()

We perform an extractive summary of 40 sentences, then turn the output into a list of strings:

most_important_sents = lex_rank(parser.document, sentences_count=40)
most_important_sents = [str(sent) for sent in most_important_sents if str(sent)]

Now we make an API call to GPT-3.5:

from langchain.prompts import PromptTemplate

prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper.
The sentences will be enclosed in triple backticks (```).

sentences :
```{most_important_sents}```

SUMMARY :"""

prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"])
llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents))

We get the following output:

"""\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs."""

This method is quite fast and less costly.

BERT Extractive Summarizer

The BERT Extractive Summarizer is based on BERT and works the following way:

1. It embeds the sentences (through BERT).
2. It runs a clustering algorithm on the embeddings; sentences that are "close" in the embedding space end up in the same cluster.
3. It selects the sentences closest to the centroid of each cluster, which are considered the most representative of that cluster.
4. Coreference resolution.
5. It combines the selected sentences.

While we could implement this ourselves, there is already a Python library that takes care of it (bert-extractive-summarizer):

from summarizer import Summarizer

model = Summarizer()
result = model(full_text, num_sentences=40)  # We specify a number of sentences

Note that the model used by default is bert-large-uncased, but you can specify any model (from Hugging Face, AllenNLP, etc.), so you can use domain-specific models for specific types of documents.
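For instance, for scientific papers you could swap in a domain-specific encoder. A minimal sketch, assuming bert-extractive-summarizer accepts a Hugging Face model name and using a SciBERT checkpoint purely as an illustration:

from summarizer import Summarizer

# Illustrative domain-specific setup: SciBERT as the sentence encoder
# (the model name is passed as a string; any Hugging Face checkpoint could be substituted)
sci_model = Summarizer("allenai/scibert_scivocab_uncased")
sci_result = sci_model(full_text, num_sentences=40)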
Then run the extracted sentences through an API call:

llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result))

>>> """\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications."""

Embeddings + KMeans over chunks

Note that we can apply the same concept with LangChain and KMeans over the chunks, rather than the sentences this time, using ada for long-context embeddings:

import numpy as np
from sklearn.cluster import KMeans

# Get ada's embeddings for each chunk
embeds = np.array(embeddings.embed_documents([doc.page_content for doc in docs]))

nclusters = 8  # Select a number of clusters

# KMeans over the embeddings for clustering
kmeans = KMeans(n_clusters=nclusters).fit(embeds)

We then get, for each cluster, the index of the chunk whose embedding is closest to the centroid:

closest_to_centroid_indices = []
for i in range(nclusters):
    # Distance between each embedding and the cluster's centroid
    distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1)
    # Index of the embedding with the minimum distance to the centroid
    closest_to_centroid_indices.append(np.argmin(distances))
closest_to_centroid_indices = sorted(closest_to_centroid_indices)

Then comes the Mapping step, where we summarize each selected chunk through an API call:

from langchain.docstore.document import Document

summary_prompt = """
You will be given a text. Give a concise and understandable summary.

```{text}```

CONCISE SUMMARY :
"""

summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"])
summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template)

chunk_summaries = []
for doc in [docs[idx] for idx in closest_to_centroid_indices]:
    chunk_summaries.append(summary_chain.run([doc]))

We then combine all those summaries into a final summary, through one more API call. This is the Reducing step:

summaries = Document(page_content="\n".join(chunk_summaries))

global_summary_prompt = """
You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries.

```{text}```

SUMMARY:
"""

global_summary_prompt_template = PromptTemplate(template=global_summary_prompt, input_variables=["text"])
global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template)

summary = global_summary_chain.run([summaries])

'\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.'

You can improve this method by:

- Selecting a higher number of clusters
- Using a different clustering algorithm (see the sketch below)
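As an example of the second point, here is a hedged sketch that swaps KMeans for agglomerative clustering and picks, for each cluster, the chunk closest to the cluster's mean embedding. The choice of algorithm is an illustration, not something evaluated here, and it reuses the embeds array computed above.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

n_clusters = 8
# Assign each chunk embedding to a cluster
labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(embeds)

representative_indices = []
for cluster_id in range(n_clusters):
    member_idx = np.where(labels == cluster_id)[0]
    centroid = embeds[member_idx].mean(axis=0)  # mean embedding of the cluster
    distances = np.linalg.norm(embeds[member_idx] - centroid, axis=1)
    representative_indices.append(int(member_idx[np.argmin(distances)]))
representative_indices = sorted(representative_indices)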
The next article will focus on Tests Generation, to enhance engagement, comprehension, and memorization.
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations

0
14 k
14 k

Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. 
[5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. 
First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. 
The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. 
It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. 
Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. 
Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. 
Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. 
qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
Embeddings + KMeans over chunks

Note that we can apply the same concept with LangChain, this time clustering the chunks rather than the sentences, and using ada for long-context embeddings.

import numpy as np
from sklearn.cluster import KMeans

# Get the ada embeddings for each chunk
embeds = embeddings.embed_documents([doc.page_content for doc in docs])
embeds = np.array(embeds)  # as a NumPy array, for the distance computations below

nclusters = 8  # Select a number of clusters

# KMeans over the embeddings for clustering
kmeans = KMeans(n_clusters=nclusters).fit(embeds)

We then get, for each cluster, the index of the chunk closest to its centroid:

closest_to_centroid_indices = []
for i in range(nclusters):
    # Distance between each embedding and the cluster's centroid
    distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1)
    # Select the index of the embedding with the minimum distance to the centroid
    closest_to_centroid_indices.append(np.argmin(distances))
closest_to_centroid_indices = sorted(closest_to_centroid_indices)

Then we perform a Mapping Step, where we run each selected chunk through an API call to get a summary of each chunk.

summary_prompt = """
You will be given a text. Give a concise and understandable summary.

```{text}```

CONCISE SUMMARY :
"""

summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"])
summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template)

chunk_summaries = []
for doc in [docs[idx] for idx in closest_to_centroid_indices]:
    chunk_summaries.append(summary_chain.run([doc]))

We then combine all those summaries into a final summary, through one last API call. This is the Reducing Step.

from langchain.schema import Document

summaries = Document(page_content="\n".join(chunk_summaries))

global_summary_prompt = """
You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries.

```{text}```

SUMMARY:
"""

global_summary_prompt_template = PromptTemplate(template=global_summary_prompt, input_variables=["text"])
global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template)

summary = global_summary_chain.run([summaries])

'\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.'

You can improve this method by:

Selecting a higher number of clusters
Using a different clustering algorithm (a sketch follows below)
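As a sketch of the second suggestion, here is what the chunk-selection step could look like with scikit-learn's AgglomerativeClustering instead of KMeans (a hypothetical variant: hierarchical clustering exposes no cluster_centers_, so we compute each cluster's centroid ourselves; embeds is the array of ada embeddings computed above, and the Mapping and Reducing Steps stay unchanged).

import numpy as np
from sklearn.cluster import AgglomerativeClustering

nclusters = 8
embeds_arr = np.array(embeds)  # ada embeddings of the chunks, computed above

# Hierarchical clustering as an alternative to KMeans
agglo = AgglomerativeClustering(n_clusters=nclusters).fit(embeds_arr)

closest_to_centroid_indices = []
for i in range(nclusters):
    member_idx = np.where(agglo.labels_ == i)[0]
    # No cluster_centers_ here, so use the mean embedding of the cluster as its centroid
    centroid = embeds_arr[member_idx].mean(axis=0)
    distances = np.linalg.norm(embeds_arr[member_idx] - centroid, axis=1)
    closest_to_centroid_indices.append(int(member_idx[np.argmin(distances)]))

closest_to_centroid_indices = sorted(closest_to_centroid_indices)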
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank in Python through sumy :

from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

# We need to pass a string, not LangChain Documents
# We only select "Text" labeled data, as "Title" chunks would most likely bring noise
full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text")

Init sumy’s Tokenizer, Parser and Summarizer :

tokenizer = Tokenizer("english")
parser = PlaintextParser(full_text, tokenizer)
lex_rank = LexRankSummarizer()

We perform an extractive summary of 40 sentences, then transform the output into a list of strings :

most_important_sents = lex_rank(parser.document, sentences_count=40)
most_important_sents = [str(sent) for sent in most_important_sents if str(sent)]

Now we make an API call to GPT-3.5 :

from langchain.prompts import PromptTemplate

prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backticks (```).

sentences :
```{most_important_sents}```

SUMMARY :"""

prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"])
llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents))

We get the following output :

"""\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs."""

This method is quite fast and less costly.

BERT Extractive Summarizer

BERT Extractive Summarizer is based on BERT and works the following way :

1. Embeds the sentences (through BERT).
2. Runs a clustering algorithm on the embeddings; the sentences that are “close” in the embedding vector space end up in the same cluster.
3. Selects the sentences that are closest to the centroid of each cluster. These sentences are considered the most representative of each cluster.
4. Applies coreference resolution.
5. Combines the selected sentences.

While we could easily implement this ourselves, there is already a Python library that takes care of it (bert-extractive-summarizer) :

from summarizer import Summarizer

model = Summarizer()
result = model(full_text, num_sentences=40)  # We specify a number of sentences

Note that the model used by default is bert-large-uncased, but you can specify any model (from huggingface, allennlp, etc.). Thus you can use domain-specific models for specific types of documents.

Then we run the result through an API call :

llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result))

>>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more.
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'"""

Embeddings + KMeans over chunks

Note that we can apply the same concept with LangChain and KMeans over the chunks, and not the sentences this time, using ada for long-context embeddings.

import numpy as np
from sklearn.cluster import KMeans

# Get ada's embeddings for each chunk
embeds = embeddings.embed_documents([doc.page_content for doc in docs])

nclusters = 8  # Select a number of clusters

# KMeans over the embeddings for clustering
kmeans = KMeans(n_clusters=nclusters).fit(embeds)

We then get, for each cluster, the index of the chunk closest to its centroid :

closest_to_centroid_indices = []
for i in range(nclusters):
    # Get the distance between each embedding and the cluster's centroid
    distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1)
    # Select the idx of the embedding with the minimum distance to the centroid
    closest_to_centroid_indices.append(np.argmin(distances))
closest_to_centroid_indices = sorted(closest_to_centroid_indices)

Then we perform a Mapping step, where we run each selected chunk through an API call to get a summary of each chunk :

summary_prompt = """
You will be given a text. Give a concise and understandable summary.

```{text}```

CONCISE SUMMARY :
"""
summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"])
summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template)

chunk_summaries = []
for doc in [docs[idx] for idx in closest_to_centroid_indices]:
    chunk_summaries.append(summary_chain.run([doc]))

We then combine all those summaries into a final summary, through one last API call. This is the Reducing step :

from langchain.docstore.document import Document

summaries = Document(page_content="\n".join(chunk_summaries))

global_summary_prompt = """
You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries.

```{text}```

SUMMARY :
"""
global_summary_prompt_template = PromptTemplate(template=global_summary_prompt, input_variables=["text"])
global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template)

summary = global_summary_chain.run([summaries])

'\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.'

You can improve this method by : selecting a higher number of clusters, or using a different clustering algorithm (see the sketch at the end of this article).

The next article will focus on Test Generation, to enhance engagement, comprehension, and memorization.
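As an illustration of the second suggestion, here is a minimal sketch (not from the original article) that swaps KMeans for scikit-learn’s AgglomerativeClustering. Since agglomerative clustering exposes no cluster_centers_, we compute each cluster’s centroid ourselves; the sketch reuses the embeds and nclusters variables defined above :

import numpy as np
from sklearn.cluster import AgglomerativeClustering

embeds_arr = np.array(embeds)
labels = AgglomerativeClustering(n_clusters=nclusters).fit_predict(embeds_arr)

closest_to_centroid_indices = []
for c in range(nclusters):
    members = np.where(labels == c)[0]           # chunk indices belonging to cluster c
    centroid = embeds_arr[members].mean(axis=0)  # centroid of the cluster
    dists = np.linalg.norm(embeds_arr[members] - centroid, axis=1)
    closest_to_centroid_indices.append(int(members[np.argmin(dists)]))
closest_to_centroid_indices = sorted(closest_to_centroid_indices)

The Mapping and Reducing steps then stay exactly the same.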
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
Summarization

Summarization with LangChain is trickier than RetrievalQA: RetrievalQA depends only on the chunk size, while summarization with LangChain, by default, depends on the length of the whole text. Our text contains around 22,000 tokens (counted with the GPT-3.5 tokenizer), which means we cannot use the "stuff" summarization chain, as it passes the whole text as is. Of course we can use models with larger context lengths (GPT-4, …), but you can still hit the same problem with big documents. The first thing we can do, and we already started to do, is exclude unnecessary text data (like the references), which got us from 28k to 22k tokens. We can also exclude Titles (even though one can argue they are useful). This is still above the token limit, but it does decrease the cost of the API calls.

Using Map-Reduce

Map-Reduce makes an API call to the LLM, with the same prompt, for each chunk, to get a summary of each chunk. This is the Mapping Step, and it can be parallelized since each document is treated independently of the others. Every summary from the previous step is then passed to one last LLM call, which outputs the final summary. This is the Reducing Step. With LangChain:

from langchain.chains.summarize import load_summarize_chain

summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce")

Using Refine

Refine works in the following way:
1. We init an initial value (the global summary).
2. We iterate over the chunks and, at each step, set the global summary to the summary of the current chunk combined with the global summary.
3. We continue until every document has been combined.

The method can be sketched as:

summarize_chain = load_summarize_chain(llm)
global_summary = "" # Init the summary value
for doc in docs:
    # Combine each doc with the running global summary (pseudocode)
    global_summary = summarize_chain([global_summary, doc])

Using LangChain:

summarize_refine = load_summarize_chain(llm=llm, chain_type="refine")

But these two approaches are still costly and can take a lot of time on long documents, because of how they work.

Extractive then Abstractive Summarization

To get around this, we can run an extractive summarization algorithm first and then get an abstractive summary of its output with GPT-3.5. There are a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …); I will present two of them, one transformer-based and one that is not.

LexRank

LexRank is a graph-based summarization algorithm built on the idea that sentences “recommend” each other, much like webpages link to each other on the internet. Each sentence is a node in a graph represented by a sentence-similarity matrix, and the algorithm scores each sentence by computing the principal eigenvector of that matrix (its eigenvector centrality). The sentences with the highest centrality scores are kept for the summary.
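To make the "eigenvector of the similarity matrix" idea concrete, here is a toy sketch of the underlying computation, not the actual LexRank implementation (which also thresholds the similarity graph and adds damping); the similarity values are made up for illustration.

import numpy as np

# Toy example: 4 sentences, symmetric pairwise-similarity matrix (made-up values)
S = np.array([
    [1.0, 0.6, 0.1, 0.2],
    [0.6, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.5],
    [0.2, 0.3, 0.5, 1.0],
])
P = S / S.sum(axis=1, keepdims=True)   # row-normalize so each row sums to 1

scores = np.ones(len(S)) / len(S)      # uniform initial centrality
for _ in range(50):                    # power iteration converges to the principal eigenvector
    scores = P.T @ scores

ranking = np.argsort(scores)[::-1]     # sentences ranked by centrality score
print(np.round(scores, 3), ranking)

The sentences with the highest scores would be the ones kept for the extractive summary.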
Let’s implement LexRank in Python with sumy:

from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

# We need to pass a string, not LangChain Documents.
# We only keep "Text"-labeled chunks, as "Title" chunks would mostly bring noise.
full_text = "\n".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text")

Init sumy’s Tokenizer, Parser and Summarizer:

tokenizer = Tokenizer("english")
parser = PlaintextParser(full_text, tokenizer)
lex_rank = LexRankSummarizer()

We perform an extractive summary of 40 sentences, then turn the output into a list of strings:

most_important_sents = lex_rank(parser.document, sentences_count=40)
most_important_sents = [str(sent) for sent in most_important_sents if str(sent)]

Now we make an API call to GPT-3.5:

from langchain.prompts import PromptTemplate

prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backticks (```).
sentences : ```{most_important_sents}```
SUMMARY :"""
prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"])
llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents))

We get the following output:

"""\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs."""

This method is quite fast and less costly.

BERT Extractive Summarizer

The BERT Extractive Summarizer is based on BERT and works the following way:
- Embeds the sentences (through BERT).
- Runs a clustering algorithm on the embeddings; sentences that are “close” in the embedding space end up in the same cluster.
- Selects the sentences that are closest to the centroid of each cluster, as the most representative of that cluster.
- Performs coreference resolution.
- Combines the selected sentences.

While we could implement this ourselves, there is already a Python library that takes care of it: bert-extractive-summarizer.

from summarizer import Summarizer

model = Summarizer()
result = model(full_text, num_sentences=40) # We specify a number of sentences

Note that the model used by default is bert-large-uncased, but you can specify any model (from Hugging Face, AllenNLP, etc.), so you can use domain-specific models for specific types of documents, as in the sketch below.
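For instance, to swap in a domain-specific encoder, you can pass a Hugging Face model name to the Summarizer. A sketch (the SciBERT checkpoint is only an illustrative choice, not something used in this article):

from summarizer import Summarizer

# Sketch: use a domain-specific encoder from the Hugging Face hub.
# "allenai/scibert_scivocab_uncased" is just an illustrative pick for scientific papers.
sci_model = Summarizer(model="allenai/scibert_scivocab_uncased")
sci_result = sci_model(full_text, num_sentences=40)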
Then run the results through an API call:

llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result))

>>> """'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'"""

Embeddings + KMeans over chunks

Note that we can apply the same idea with LangChain and KMeans over the chunks rather than the sentences, using ada for long-context embeddings.

import numpy as np
from sklearn.cluster import KMeans

# Get ada's embeddings for each chunk
embeds = embeddings.embed_documents([doc.page_content for doc in docs])
nclusters = 8 # Select a number of clusters
# KMeans over the embeddings for clustering
kmeans = KMeans(n_clusters=nclusters).fit(embeds)

We then get, for each cluster, the index of the chunk closest to the centroid:

closest_to_centroid_indices = []
for i in range(nclusters):
    # Distances between each embedding and the cluster's centroid
    distances = np.linalg.norm(np.asarray(embeds) - kmeans.cluster_centers_[i], axis=1)
    # Index of the embedding with the minimum distance to the centroid
    closest_to_centroid_indices.append(np.argmin(distances))
closest_to_centroid_indices = sorted(closest_to_centroid_indices)

Then we perform a Mapping Step, where each selected chunk is summarized through an API call:

summary_prompt = """
You will be given a text. Give a concise and comprehensible summary.
```{text}```
CONCISE SUMMARY :
"""
summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"])
summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template)

chunk_summaries = []
for doc in [docs[idx] for idx in closest_to_centroid_indices]:
    chunk_summaries.append(summary_chain.run([doc]))

We then combine all those summaries into a final summary through one last API call. This is the Reducing Step.

from langchain.docstore.document import Document

summaries = Document(page_content="\n".join(chunk_summaries))
global_summary_prompt = """
You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries.
```{text}```
SUMMARY:
"""
global_summary_prompt_template = PromptTemplate(template=global_summary_prompt, input_variables=["text"])
global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template)
summary = global_summary_chain.run([summaries])

'\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.'

You can improve this method by:
- Selecting a higher number of clusters.
- Using a different clustering algorithm.

The next article will focus on Tests Generation, to enhance engagement, comprehension, and memorization.
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. 
We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. 
Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. 
It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. 
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. 
By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. [5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). 
First we define the LLM we'll use:

from langchain.llms import OpenAI

llm = OpenAI(temperature=0.0)  # Temperature set to 0.0, as we don't want creative answers

Then we init the RetrievalQA object. Here we specify chain_type="stuff", but you can also use map_reduce or refine; we will cover those for summarization.

from langchain.chains import RetrievalQAWithSourcesChain

qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever())

That's it, we can use the qa chain:

qa({"question": "What are Knowledge Graphs"})

{'question': 'What are Knowledge Graphs',
 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n',
 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'}

As we can see, the qa chain used passages from two different papers to generate the answer. We can also inspect the prompt that this QA chain passes to the LLM.
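A minimal sketch of one way to do that, assuming a 2023-era LangChain release in which the "stuff" combine-documents chain exposes its inner LLMChain and prompt as attributes (the exact attribute path may vary between versions):

# Sketch only: attribute names follow the 2023-era LangChain object layout
# and may differ in other releases.
stuff_chain = qa.combine_documents_chain      # the combine-documents ("stuff") chain built by from_chain_type
print(stuff_chain.llm_chain.prompt.template)  # the template the retrieved passages are stuffed into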
Summarization

Summarization with LangChain is trickier than RetrievalQA. RetrievalQA depends on the chunk size, while summarization with LangChain, by default, depends on the whole text length. Our text contains around 22,000 tokens with the GPT-3.5 tokenizer. That means we cannot use the "stuff" summarization chain, as it passes the whole text as is. Of course we could use models with larger context lengths (GPT-4, …), but we can still run into the same problem with big documents.

The first thing we can do, and we already started to do it, is exclude unnecessary text data (like the references); that got us from 28k to 22k tokens. We can also exclude Titles (even though one could argue they can be useful). While the text is still above the token limit, we have at least decreased the cost of the API call.

Using Map-Reduce

Map-Reduce works in two steps:
1. Make an API call to the LLM, with the same prompt for each chunk, to get a summary of each chunk. This is the Mapping step. It can be parallelized, as each doc is treated as independent from the others.
2. Pass every summary from the previous step to a last LLM call, which outputs the final summary. This is the Reducing step.

With LangChain:

from langchain.chains.summarize import load_summarize_chain

summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce")

Using Refine

Refine works in the following way:
1. We init an initial value (the global summary).
2. We iterate over the chunks, and at each step set the global summary to the summary of the current chunk combined with the global summary.
3. We continue until every document has been combined.

The method can be sketched as (pseudocode):

summarize_chain = load_summarize_chain(llm)
global_summary = ""  # Init the summary value
for doc in docs:
    # Combine each doc with the running global summary
    global_summary = summarize_chain([global_summary, doc])

Using LangChain:

summarize_refine = load_summarize_chain(llm=llm, chain_type="refine")

But these two approaches are still costly and can take a lot of time for long documents, due to how they work.

Extractive then Abstractive Summarization

To get around this, we can use an extractive summarization algorithm first, and then produce an abstractive summary with GPT-3.5. There are a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …); I will present two of them, one transformer-based and one that is not.

LexRank

LexRank is a graph-based summarization algorithm built on the fundamental idea that sentences "recommend" each other, much like webpages link to each other on the internet. Each sentence is a node in a graph represented by a sentence-similarity matrix. The algorithm scores each sentence by computing the principal eigenvector of this matrix, and the sentences with the highest centrality scores are kept for the summary. A simplified sketch of this scoring step is shown below, before we turn to a ready-made implementation.
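To make the scoring step concrete, here is a simplified, self-contained sketch. It is an illustration only: real LexRank uses IDF-modified cosine similarity with a threshold, whereas this toy version uses plain TF-IDF cosine similarity plus a PageRank-style damping factor, and the three sentences are made up for the example.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy sentences standing in for the sentences of a paper
sentences = [
    "Knowledge graphs store structured knowledge as triples.",
    "Each triple relates a head entity to a tail entity through a relation.",
    "Large language models are trained on raw text and store knowledge in their parameters.",
]

# Sentence-similarity matrix (LexRank proper uses IDF-modified cosine + a threshold)
sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))

# Row-normalize to make the matrix stochastic, add damping, then power-iterate:
# the vector we converge to is the principal (left) eigenvector, i.e. the
# centrality score of each sentence.
d = 0.15
P = d / len(sentences) + (1 - d) * (sim / sim.sum(axis=1, keepdims=True))
scores = np.ones(len(sentences)) / len(sentences)
for _ in range(50):
    scores = scores @ P

top = np.argsort(scores)[::-1][:2]  # keep the 2 most central sentences
print([sentences[i] for i in top])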
Let's implement LexRank in Python with sumy:

from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

# We need to pass a string, not LangChain Documents
# We only keep "Text"-labeled chunks, as "Title" chunks would most likely bring noise
full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text")

Init sumy's Tokenizer, Parser and Summarizer:

tokenizer = Tokenizer("english")
parser = PlaintextParser(full_text, tokenizer)
lex_rank = LexRankSummarizer()

We perform an extractive summary of 40 sentences, then turn the output into a list of strings:

most_important_sents = lex_rank(parser.document, sentences_count=40)
most_important_sents = [str(sent) for sent in most_important_sents if str(sent)]

Now we make an API call to GPT-3.5:

from langchain.prompts import PromptTemplate

prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper.
The sentences will be enclosed in triple backticks (```).

sentences :
```{most_important_sents}```

SUMMARY :"""

prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"])
llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents))

We get the following output:

"This paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs."

This method is quite fast and less costly.

BERT Extractive Summarizer

The BERT Extractive Summarizer is based on BERT and works the following way:
1. Embed the sentences (through BERT).
2. Run a clustering algorithm on the embeddings; sentences that are "close" in the embedding vector space end up in the same cluster.
3. Select the sentences that are closest to the centroid of each cluster. These sentences are considered the most representative of each cluster.
4. Apply coreference resolution.
5. Combine the selected sentences.

While we could implement this ourselves fairly easily, there is already a Python library that takes care of it (bert-extractive-summarizer):

from summarizer import Summarizer

model = Summarizer()
result = model(full_text, num_sentences=40)  # We specify a number of sentences

Note that the model used by default is bert-large-uncased, but you can specify any model (from Hugging Face, AllenNLP, etc.), so you can use domain-specific models for specific types of documents; a sketch of doing so is shown at the end of this subsection. Then we run the results through an API call:

llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result))

"This article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications."
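As mentioned just above, the default encoder is bert-large-uncased. Here is a minimal sketch of plugging in a different Hugging Face model, following the pattern from the bert-extractive-summarizer README; the model name is only an example, and the custom_model / custom_tokenizer keyword arguments are assumed to match the library version you have installed.

from transformers import AutoConfig, AutoModel, AutoTokenizer
from summarizer import Summarizer

# Example domain-specific encoder; swap in whatever suits your documents
model_name = "allenai/scibert_scivocab_uncased"

custom_config = AutoConfig.from_pretrained(model_name)
custom_config.output_hidden_states = True  # the summarizer reads hidden states
custom_tokenizer = AutoTokenizer.from_pretrained(model_name)
custom_model = AutoModel.from_pretrained(model_name, config=custom_config)

# Assumes the library accepts custom_model / custom_tokenizer, as shown in its README
model = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)
result = model(full_text, num_sentences=40)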
Embeddings + KMeans over chunks

Note that we can apply the same idea with LangChain and KMeans, this time over the chunks rather than the sentences, using ada for long-context embeddings.

import numpy as np
from sklearn.cluster import KMeans

# Get ada's embedding for each chunk
embeds = embeddings.embed_documents([doc.page_content for doc in docs])

nclusters = 8  # Select a number of clusters

# KMeans over the embeddings for clustering
kmeans = KMeans(n_clusters=nclusters).fit(embeds)

We then get, for each cluster, the index of the chunk closest to its centroid:

closest_to_centroid_indices = []
for i in range(nclusters):
    # Distance between each embedding and the cluster's centroid
    distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1)
    # Index of the embedding with the minimum distance to the centroid
    closest_to_centroid_indices.append(np.argmin(distances))
closest_to_centroid_indices = sorted(closest_to_centroid_indices)

Then we perform a Mapping step, where we run each selected chunk through an API call to get a summary of that chunk:

summary_prompt = """
You will be given a text. Give a concise and understandable summary.

```{text}```

CONCISE SUMMARY :
"""
summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"])
summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template)

chunk_summaries = []
for doc in [docs[idx] for idx in closest_to_centroid_indices]:
    chunk_summaries.append(summary_chain.run([doc]))

We then combine all those summaries into a final summary, through one more API call. This is the Reducing step.

from langchain.docstore.document import Document

summaries = Document(page_content="\n".join(chunk_summaries))

global_summary_prompt = """
You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries.

```{text}```

SUMMARY:
"""
global_summary_prompt_template = PromptTemplate(template=global_summary_prompt, input_variables=["text"])
global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template)

summary = global_summary_chain.run([summaries])

"This paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function."

You can improve this method by selecting a higher number of clusters, or by using a different clustering algorithm.

The next article will focus on test generation, to enhance engagement, comprehension, and memorization.
That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations

0
14 k
14 k

Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. 
[5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. 
First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. 
The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. 
It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. 
Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations

0
14 k
14 k

Write Sign up Sign in Long Text Summarization, RetrievalQA and Vector Databases -LangChain Arxiv Tutor RAGoon RAGoon · Follow 10 min read · Jul 7 64 1 In this second Article we will talk about semantic search and populate a Vector Database. Vector Database First note that you have many possibilities to perform similarity/semantic search across textual data. For example with ElasticSearch + BM25. Here we will embed our documents & queries with ada and use a Vector Database. We will Store all of our passages in a Vector Database. In vector Databases we can store alongside the paragraphs/passages, their associated embeddings (from any embedding model). To perform semantic search on those passages from a query. By embedding the query with the same embedding model, then retrieve the most similar passages through an Nearest Neighbor search. There are different Vector Databases that you must consider. Here we will use FAISS, from facebook. How does FAISS works (From a high Level) ? As we scale the number of Documents in our VectorDB, the search can be slower. Due to the utilization of Nearest Neighbor algorithm, which is a brute force algorithm (Computes the distance between the Query vector and every other vectors in the Vector Store). Why FAISS is efficient ? Parallelize the computes through GPUs Uses 3 steps : Normalize the vectors with : PCA & L2 Normalization Inverted File Index : Form clusters with sets of vectors, based on the similarity of those vectors. Thus we go from computing the distance between the query and each vector, to computing the distance between the query and each Cluster, then search in the minimum distance cluster. Note that this is a non-exhaustive search ! Quantization to decrease memory-size FAISS with LangChain As said previously we need a VectorStore and an embedding model from langchain.vectorstores import FAISS from langchain.embeddings import OpenAIEmbeddings os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY" embeddings = OpenAIEmbeddings() We Init the Vector Store with the documents : We do not include the Titles chunks as they won’t bring really useful info for Retrieval. vdb_chunks = FAISS.from_documents([doc for doc in docs if doc.metadata["category"] != "Title"], embeddings=embeddings) Add the adjacents paper’s content From the previously gathered adjacent paper’s Arxiv numbers we can add their content to the VectorDB : for pdf_number in adjacents_papers_numbers: docs = ArxivLoader(query=pdf_number) docs = PDFMinerLoader(f"papers/{pdf_number}.pdf").load() docs = text_splitter.split_documents(docs) vdb_chunks.add_documents(docs) Save the FAISS index vdb_chunks.save_local("vdb_chunks", index_name="base_and_adjacent") The VectorDB is now functional, we can retrieve the most similar documents based on a query. vdb_chunks = FAISS.load_local("vdb_chunks", embeddings, index_name="base_and_adjacent") vdb_chunks.as_retriever().get_relevant_documents("What are Knowledge Graphs ?") [Document(page_content='Knowledge graphs (KGs) store structured knowledge as a\ncollection of triples KG = {(h,r,t) C E x R x E}, where €\nand R respectively denote the set of entities and relations.\nExisting knowledge graphs (KGs) can be classified into four\ngroups based on the stored information: 1) encyclopedic KGs,\n2) commonsense KGs, 3) domain-specific KGs, and 4) multi-\nmodal KGs. We illustrate the examples of KGs of different\ncategories in Fig. 
[5]\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Introduction\nLarge-scale knowledge graphs (KG) such as FreeBase (Bol-\nlacker et al. 2008), YAGO (Suchanek, Kasneci, and Weikum\n2007) and WordNet (Miller 1995) provide effective basis for\nmany important AI tasks such as semantic search, recom-\nmendation (Zhang et al. 2016) and question answering (Cui\net al. 2017). A KG is typically a multi-relational graph con-\ntaining entities as nodes and relations as edges. Each edge\nis represented as a triplet (head entity, relation, tail entity)\n((h, r, t) for short), indicating the relation between two enti-\nties, e.g., (Steve Jobs, founded, Apple Inc.). Despite their ef-\nfectiveness, knowledge graphs are still far from being com-', metadata={'source': 'papers/1909.03193.pdf'}), Document(page_content='Encyclopedic knowledge graphs are the most ubiquitous\nKGs, which represent the general knowledge in real-world.\nEncyclopedic knowledge graphs are often constructed by\nintegrating information from diverse and extensive sources,\nincluding human experts, encyclopedias, and databases.\nWikidata is one of the most widely used encyclopedic\nknowledge graphs, which incorporates varieties of knowl-\nedge extracted from articles on Wikipedia. Other typical\nencyclopedic knowledge graphs, like Freebase (671, Dbpedia\n[68], and YAGO are also derived from Wikipedia. In\naddition, NELL is a continuously improving encyclope-\ndic knowledge graph, which automatically extracts knowl-\nedge from the web, and uses that knowledge to improve\nits performance over time. There are several encyclope-\ndic knowledge graphs available in languages other than\nEnglish such as CN-DBpedia and Vikidia 70}. The\nlargest knowledge graph, named Knowledge Occean (KO)\ncurrently contains 4,8784,3636 entities and 17,3115,8349\nrelations in both English and Chinese.\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'}), Document(page_content='Commonsense knowledge graphs formulate the knowledge\nabout daily concepts, e.g., objects, and events, as well\nas their relationships (Z}. Compared with encyclopedic\nknowledge graphs, commonsense knowledge graphs often\nmodel the tacit knowledge extracted from text such as (Car,\n', metadata={'page_number': 3, 'category': 'Text', 'source': 'papers/2306.08302.pdf'})] Retrieval Question Answering From this we can build a Retrieval Question Answering System. You can build this with farm.haystack and others, but here we will, of course, use LangChain. There is multiple types of question-answering systems, like Retriever-Reader or Generator Systems. With LangChain and LLMs we will build a Retriever-Generator system. The Retriever will get a query and find a set of relevant Documents from an external source (here it’s the FAISS VDB). The Generator (the LLM) will take the contexts and output an answer based on the contexts. A Reader would output the span that answers the question from the context. This makes the QA system called Open-Book, it has access to external data (source knowledge). An only Generator QA system would be Closed-Book, fully based on the generator’s (here the LLM) parametric-knowledge. RetrievalQA with LangChain To build a Reader-Generator QA system with LangChain is easy. 
First we define the llm we’ll use from langchain.llms import OpenAI llm = OpenAI(temperature=0.0) # Set temperature to 0.0 as we don't want creative answer Then init the RetrievalQA object : Here we specify chain_type="stuff" but you can also use map_reduce or refine. We will cover those for summarization. qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever()) That’s it, we can use the qa chain : qa({"question" : "What are Knowledge Graphs"}) {'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'} As we can see the qa chain uses 2 passages from 2 different papers to generate the answer. We can see the given prompt for this QA chain : Summarization Summarization with LangChain is more tricky than RetrievalQA. As RetrievalQA is dependent of the chunk size. Summarization with LangChain by default, is dependent of the whole text length. As we can see our text contains around 22000 tokens with GPT3.5. That means we cannot use the "stuff" summarization chain. As it passes the whole text as is. Of course we can use larger context-lengths models (GPT4, …), but you can still encounter the same problems with big documents. The first thing we can do, and we already started to do is exclude unnecessary text data (like the references), that got us from 28k to 22k tokens. We can also exclude Titles (even though, one can argue that it can be useful). While it’s still above the token limit, we decreased the cost to the API call. Using Map-Reduce Map-Reduce : Makes an API call to the LLM, with the same prompt for each chunk, to get a summary for each chunk. This is the Mapping Step. This step can be paralleled, as each doc is treated as Independent from one another. Every summary from the previous step is passed to a last LLM Call, that will output the final summary. This is the Reducing Step. With LangChain : summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce") Using Refine Refine works in the following way : We init an Initial Value (the global summary) 2. We iterate over the chunks and : at each time set the global summary to the summary of the chunk with the global summary. 3. Continue until we combined every documents The method can be seen as : summarize_chain = load_summarize_chain(llm) global_summary = "" # Init the summary value for doc in docs: # Combine each doc with the global summary global_summary = summarize_chain([global_summary, doc]) Using LangChain : summarize_refine = load_summarize_chain(llm=llm, chain_type="refine") But using these two tricks are still costly, and can take a lot of time for long documents, due to how they work. Extractive then Abstractive Summarization To resolve this we can use an extractive Summarization Algorithm and then get an abstractive summarization with GPT3.5 There’s a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …), I will present 2 of those, one is transformer-based and the other is not. LexRank LexRank is a Graph-Based summarization Algorithm, that is based on the fundamental idea that sentences “recommend” each other, like webpages link to each other on internet. Each sentence is considered as a node in a Graph represented by a similarity matrix. 
The algorithm scores each sentence by computing the eigenvector of the matrix. The highest centrality score’s sentences are kept for the summary. Let’s implement LexRank with Python through sumy from sumy.summarizers.lex_rank import LexRankSummarizer from sumy.parsers.plaintext import PlaintextParser from sumy.nlp.tokenizers import Tokenizer # We need to pass string not LangChain's Document # We only select "Text" labeled data as "Title" will much likely brings noise full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text") Init the sumy’s Tokenize, Parser and Summarizer tokenizer = Tokenizer("english") parser = PlaintextParser(full_text, tokenizer) lex_rank = LexRankSummarizer() We select perform an extractive summary of 40 sentences, then transform the output in a list of string most_important_sents = lex_rank(parser.document, sentences_count=40) most_important_sents = [str(sent) for sent in most_important_sents if str(sent)] Now we make an API call to GPT3.5 from langchain.prompts import PromptTemplate prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backtrips (```). sentences : ```{most_important_sents}``` SUMMARY :""" prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"]) llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents)) We get the following output : """\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs.""" This method is quite fast, and is less costly. BERT Extractive Summarizer BERT Extractive Summarizer is based on BERT, and works the following way : Embeds the sentences (through BERT) Uses a clustering Algorithm on the embeddings. The sentences that are “close” in the embedding vector space gets into the same cluster Selects the sentences that are closest to the centroid of each cluster. These sentences are considered as the most representative of each clusters. Coref Resolution Combine each selected sentences While we can implement it ourselves easily. There’s already a python library that takes care of it (bert-extractive-summarizer ) from summarizer import Summarizer model = Summarizer() result = model(full_text, num_sentences=40) # We specify a number of sentences Note that the model use by default is bert-large-uncased but you can specify any model (from huggingface, allennlp, etc…). Thus you can use domain-specific models, for specific types of Documents. Then run the results through an API Call : llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result)) >>>"""'\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. 
It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications.'""" Embeddings + KMeans over chunks Note that we can apply the same concept with LangChain and KMeans over the chunk, and not the sentences this time. Using ada for long-context embeddings. import numpy as np from sklearn.cluster import KMeans # Get the ada's embeddings for each chunk embeds = embeddings.embed_documents([doc.page_content for doc in docs]) nclusters = 8 # Select a number of clusters # KMeans over the embeddings for Clustering kmeans = KMeans(n_clusters=nclusters).fit(embeds) We then get the closest indices to the centroid of each clusters : closest_to_centroid_indices = [] for i in range(nclusters): # Get the distances of each embedding and the cluster's centroid distances = np.linalg.norm(embeds - kmeans.cluster_centers_[i], axis=1) # Select the idx of the embedding with the minimum distance to the centroid closest_to_centroid_indices.append(np.argmin(distances)) closest_to_centroid_indices = sorted(closest_indices) Then perform a Mapping Step where we map each selected chunk through an API Call, to get a summary of each chunk. summary_prompt = """ You will be given a text. Give a concise and understanding summary. ```{text}``` CONCISE SUMMARY : """ summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"]) summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template) chunk_summaries = [] for doc in [docs[idx] for idx in indices]: chunk_summaries.append(summary_chain.run([doc])) We then combine all those summaries into a final summary, through an API Call. This is the Reducing Step. summaries = Document(page_content="\n".join(summaries)) global_summary_promp = """ You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries. ```{text}``` SUMMARY: """ global_summary_prompt_template = PromptTemplate(template=global_summary_promp, input_variables=["text"]) global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template) summary = global_summary_chain.run([summaries]) '\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.' You can improve this method by : Selecting a higher number of clusters Using a different clustering Algorithm The next article will focus on Tests Generation, to enhance the engagement, comprehension, and memorization. 
Gpt ChatGPT Langchain Deep Learning Data Science 64 1 RAGoon Written by RAGoon 59 Followers Passionated about Semantic Search, NLP, Deep Learning & Graphs Follow More from RAGoon LangChain Arxiv Tutor : Data Loading RAGoon RAGoon LangChain Arxiv Tutor : Data Loading This Series of Articles covers the usage of LangChain, to create an Arxiv Tutor. That will allow anyone to interact in different ways with… 8 min read · Jul 6 86 2 Training a Contextual Compression Model RAGoon RAGoon Training a Contextual Compression Model For RAG, a Contextual Compression step can ameliorate the results in multiple ways : 4 min read · Aug 28 22 Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine RAGoon RAGoon Synthetic Queries to Doc Dataset Generation — A News Semantic Search Engine This is the first article of a Series that will cover advanced Semantic Search Methods & Algorithms. In order to create a Newspaper… 17 min read · Aug 12 11 Training a Retriever on Synthetic Data RAGoon RAGoon Training a Retriever on Synthetic Data Today we are going to use the previously generated synthetic queries to train a retriever on specific type of documents. 6 min read · Aug 24 18 See all from RAGoon Recommended from Medium How to summarize text with OpenAI and LangChain Johni Douglas Marangon Johni Douglas Marangon How to summarize text with OpenAI and LangChain This is the first of three posts about my recent study of summarization using Large Language Models — LLM. 7 min read · Aug 17 51 1 🦜️✂️ Text Splitters: Smart Text Division with Langchain Gustavo Espíndola Gustavo Espíndola 🦜️✂️ Text Splitters: Smart Text Division with Langchain In the fascinating world of natural language processing, tools for transforming documents and splitting texts have become essential, thanks… 2 min read · Sep 5 6 Lists Image by vectorjuice on FreePik The New Chatbots: ChatGPT, Bard, and Beyond 12 stories · 233 saves This One Prompt Will 10X Your Chat GPT Results ChatGPT prompts 31 stories · 785 saves AI-generated image of a cute tiny robot in the backdrop of ChatGPT’s logo ChatGPT 23 stories · 308 saves What is ChatGPT? 9 stories · 248 saves How to Chunk Text Data — A Comparative Analysis Solano Todeschini Solano Todeschini in Towards Data Science How to Chunk Text Data — A Comparative Analysis Exploring and comparing distinct approaches to text chunking. 17 min read · Jul 20 478 6 Using langchain for Question Answering on own data Onkar Mishra Onkar Mishra Using langchain for Question Answering on own data Step-by-step guide to using langchain to chat with own data 23 min read · Aug 7 905 10 Chunking Strategies for LLM Applications Dr. Ernesto Lee Dr. Ernesto Lee Chunking Strategies for LLM Applications Making sense of the complex world of language model applications with the powerful tool of text chunking. 8 min read · Jun 29 23 Harnessing Retrieval Augmented Generation With Langchain Amogh Agastya Amogh Agastya in Better Programming Harnessing Retrieval Augmented Generation With Langchain Implementing RAG using Langchain 19 min read · 6 days ago 414 6 See more recommendations

0
14 k
14 k

First we define the LLM we will use:

from langchain.llms import OpenAI

llm = OpenAI(temperature=0.0)  # temperature set to 0.0, as we don't want creative answers

Then we initialize the retrieval QA chain (here RetrievalQAWithSourcesChain, so the sources are returned as well). We specify chain_type="stuff", but you can also use map_reduce or refine; we will cover those for summarization.

from langchain.chains import RetrievalQAWithSourcesChain

qa = RetrievalQAWithSourcesChain.from_chain_type(llm=llm, chain_type="stuff", retriever=vdb_chunks.as_retriever())

That's it, we can use the qa chain:

qa({"question": "What are Knowledge Graphs"})

{'question': 'What are Knowledge Graphs', 'answer': ' Knowledge Graphs are multi-relational graphs containing entities as nodes and relations as edges, typically used for AI tasks such as semantic search, recommendation and question answering.\n', 'sources': 'papers/2306.08302.pdf, papers/1909.03193.pdf'}

As we can see, the qa chain uses two passages from two different papers to generate the answer. We can also inspect the prompt this QA chain uses under the hood.

Summarization

Summarization with LangChain is trickier than RetrievalQA. RetrievalQA only depends on the chunk size, whereas summarization with LangChain, by default, depends on the length of the whole text. Our text contains around 22,000 tokens with GPT-3.5, which means we cannot use the "stuff" summarization chain, since it passes the whole text as-is. Of course we could use models with larger context lengths (GPT-4, …), but the same problem comes back with big enough documents.

The first thing we can do, and we already started to do, is exclude unnecessary text data (like the references), which got us from 28k down to 22k tokens. We can also exclude the Title chunks (even though one can argue they are useful). While the text is still above the token limit, we have at least decreased the cost of the API calls.

Using Map-Reduce

Map-Reduce works in two steps:

1. Mapping step: an API call is made to the LLM with the same prompt for each chunk, producing one summary per chunk. This step can be parallelized, as each chunk is treated independently of the others.
2. Reducing step: every summary from the previous step is passed to a final LLM call, which outputs the final summary.

With LangChain:

from langchain.chains.summarize import load_summarize_chain

summarize_map_reduce = load_summarize_chain(llm=llm, chain_type="map_reduce")

Using Refine

Refine works in the following way:

1. We initialize the global summary with an initial value.
2. We iterate over the chunks: at each step, the global summary becomes the summary of the current chunk combined with the previous global summary.
3. We continue until every document has been combined.

The method can be seen as (conceptual sketch, not runnable as-is):

summarize_chain = load_summarize_chain(llm)
global_summary = ""  # init the summary value
for doc in docs:
    # combine each doc with the current global summary
    global_summary = summarize_chain([global_summary, doc])

Using LangChain:

summarize_refine = load_summarize_chain(llm=llm, chain_type="refine")
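Both chains can then be run directly on the chunked documents. A minimal usage sketch, assuming docs still holds the chunked LangChain Documents from earlier (the output variable names are mine; runtime and cost grow with the number of chunks):

# Map-Reduce: one summary call per chunk, plus a final combine call
map_reduce_summary = summarize_map_reduce.run(docs)

# Refine: one call per chunk, each refining the running summary
refine_summary = summarize_refine.run(docs)

print(map_reduce_summary)
print(refine_summary)

Conceptually, map_reduce lends itself to parallel or batched mapping calls, while refine is strictly sequential but keeps more cross-chunk context as it goes.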
But these two approaches are still costly and can take a lot of time for long documents, given how they work.

Extractive then Abstractive Summarization

To get around this, we can run an extractive summarization algorithm first, then produce an abstractive summary of the extracted sentences with GPT-3.5. There are a number of libraries and algorithms for extractive summarization (Luhn, PageRank, TextRank, …); I will present two of them, one transformer-based and one that is not.

LexRank

LexRank is a graph-based summarization algorithm, built on the idea that sentences "recommend" each other, much like web pages link to each other on the internet. Each sentence is a node in a graph defined by a sentence-similarity matrix. The algorithm scores each sentence by computing eigenvector centrality on this matrix; the sentences with the highest centrality scores are kept for the summary.

Let's implement LexRank in Python through sumy:

from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer

# We need to pass a string, not LangChain Documents
# We only keep "Text" labeled chunks, as "Title" chunks would most likely bring noise
full_text = "".join(doc.page_content for doc in docs if doc.metadata["category"] == "Text")

We initialize sumy's Tokenizer, Parser and Summarizer:

tokenizer = Tokenizer("english")
parser = PlaintextParser(full_text, tokenizer)
lex_rank = LexRankSummarizer()

We perform an extractive summary of 40 sentences, then turn the output into a list of strings:

most_important_sents = lex_rank(parser.document, sentences_count=40)
most_important_sents = [str(sent) for sent in most_important_sents if str(sent)]

Now we make an API call to GPT-3.5:

from langchain.prompts import PromptTemplate

prompt = """You will be given a series of sentences from a paper. Your goal is to give a summary of the paper. The sentences will be enclosed in triple backticks (```).

sentences : ```{most_important_sents}```

SUMMARY :"""

prompt_summarize_most_important_sents = PromptTemplate(template=prompt, input_variables=["most_important_sents"])
llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=most_important_sents))

We get the following output:

"""\nThis paper discusses the unification of Knowledge Graphs (KGs) and Large Language Models (LLMs) for knowledge representation. It outlines three frameworks for this unification, including KG-enhanced LLMs, LLM-augmented KGs, and Synergized LLMs + KGs. It also discusses the use of KGs for various tasks such as embedding, completion, construction, graph-to-text generation, and question answering. Additionally, it covers the use of medical knowledge graphs for medical knowledge representation and the alignment of knowledge in text corpus and KGs."""

This method is quite fast, and less costly.

BERT Extractive Summarizer

BERT Extractive Summarizer is based on BERT and works the following way:

1. Embeds the sentences (through BERT).
2. Runs a clustering algorithm on the embeddings; sentences that are "close" in the embedding space end up in the same cluster.
3. Selects the sentences closest to the centroid of each cluster. These sentences are considered the most representative of their cluster.
4. Applies coreference resolution.
5. Combines the selected sentences.

While we could implement this ourselves, there is already a Python library that takes care of it: bert-extractive-summarizer.

from summarizer import Summarizer

model = Summarizer()
result = model(full_text, num_sentences=40)  # we specify a number of sentences

Note that the model used by default is bert-large-uncased, but you can specify any model (from Hugging Face, AllenNLP, etc.), so you can use domain-specific models for specific types of documents, as sketched below.
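For arXiv papers, a scientific-text encoder may work better than the default. The following is only a sketch of that idea, assuming the library's custom_model / custom_tokenizer arguments; the choice of SciBERT and the variable names are mine, not part of the original pipeline:

from transformers import AutoConfig, AutoTokenizer, AutoModel
from summarizer import Summarizer

# Load a domain-specific encoder (SciBERT, pretrained on scientific text)
custom_config = AutoConfig.from_pretrained("allenai/scibert_scivocab_uncased")
custom_config.output_hidden_states = True  # the summarizer needs the hidden states
custom_tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
custom_model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased", config=custom_config)

# Same extractive step as before, but with the domain-specific encoder
domain_model = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)
domain_result = domain_model(full_text, num_sentences=40)

The rest of the pipeline stays the same: the selected sentences are passed to GPT-3.5 exactly as shown next.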
Then we run the result through an API call:

llm(prompt=prompt_summarize_most_important_sents.format(most_important_sents=result))

>>> """\nThis article presents a roadmap for unifying large language models (LLMs) and knowledge graphs (KGs) to leverage their respective strengths and overcome the limitations of each approach for various downstream tasks. It provides a categorization of the research in this field, including encoder and decoder modules, large-scale decoder-only LLMs, NELL, TransOMCS, CausalBanK, Copilot, New Bing, Shop.ai, and more. It also discusses methods to improve the interpretability of LLMs, and how LLMs and KGs can be integrated to address various real-world applications."""

Embeddings + KMeans over chunks

Note that we can apply the same concept with LangChain and KMeans over the chunks instead of the sentences, using ada for long-context embeddings.

import numpy as np
from sklearn.cluster import KMeans

# Get ada's embedding for each chunk
embeds = embeddings.embed_documents([doc.page_content for doc in docs])

nclusters = 8  # select a number of clusters

# KMeans over the embeddings for clustering
kmeans = KMeans(n_clusters=nclusters).fit(embeds)

We then get the indices of the chunks closest to the centroid of each cluster:

closest_to_centroid_indices = []
for i in range(nclusters):
    # Distances between each embedding and the cluster's centroid
    distances = np.linalg.norm(np.array(embeds) - kmeans.cluster_centers_[i], axis=1)
    # Index of the embedding with the minimum distance to the centroid
    closest_to_centroid_indices.append(np.argmin(distances))
closest_to_centroid_indices = sorted(closest_to_centroid_indices)

Then we perform a Mapping step, where we summarize each selected chunk through an API call:

summary_prompt = """
You will be given a text. Give a concise and understandable summary.

```{text}```

CONCISE SUMMARY : """

summary_prompt_template = PromptTemplate(template=summary_prompt, input_variables=["text"])
summary_chain = load_summarize_chain(llm=llm, prompt=summary_prompt_template)

chunk_summaries = []
for doc in [docs[idx] for idx in closest_to_centroid_indices]:
    chunk_summaries.append(summary_chain.run([doc]))

We then combine all those summaries into a final summary through one more API call. This is the Reducing step.

from langchain.docstore.document import Document

summaries = Document(page_content="\n".join(chunk_summaries))

global_summary_prompt = """
You will be given a series of summaries from a text. Your goal is to write a general summary from the given summaries.

```{text}```

SUMMARY: """

global_summary_prompt_template = PromptTemplate(template=global_summary_prompt, input_variables=["text"])
global_summary_chain = load_summarize_chain(llm=llm, prompt=global_summary_prompt_template)

summary = global_summary_chain.run([summaries])

'\nThis paper discusses the use of Language Models (LLMs) and Knowledge Graphs (KGs) to improve performance in various applications. It covers the synergy of LLMs and KGs from two perspectives: knowledge representation and reasoning, and provides a summary of representative works in a table. Encoder-decoder Long-Short Term Memory (LSTM) models are used to read the input sequence and generate the output sequence, while Encyclopedic Knowledge Graphs are used to store information in a structured way. LLM-augmented KGs are also discussed, as well as methods to integrate KGs for LLMs. Finally, two parts of a triple are encoded separately by LSTM models, and the representations of the two parts are used to predict the possibility of the triple using a scoring function.'

You can improve this method by selecting a higher number of clusters, or by using a different clustering algorithm (a quick sketch of the first idea is given after the wrap-up below).

The next article will focus on Tests Generation, to enhance engagement, comprehension, and memorization.
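As mentioned above, here is a minimal, hedged sketch of choosing the number of clusters automatically with a silhouette score instead of hard-coding nclusters = 8. This is my own addition, not part of the original pipeline; it assumes embeds (the ada chunk embeddings computed earlier) is available and that there are clearly more chunks than candidate cluster counts:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array(embeds)  # ada embeddings of the chunks, computed earlier

# Score a few candidate cluster counts and keep the best one
scores = {}
for k in range(4, 13):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher is better

nclusters = max(scores, key=scores.get)
print(f"Selected number of clusters: {nclusters}")

The selected nclusters can then be plugged into the KMeans step above; a different algorithm (agglomerative clustering, for instance) could be swapped in the same way.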