99国产精品欧美一区二区三区,成人国产欧美大片一区

在本節(jié)中，我們關(guān)注代理AI如何從刮擦數(shù)據(jù)中提取相關(guān)的產(chǎn)品信息，以確保其已準(zhǔn)備好利益相關(guān)者的消費(fèi)。我們將分解涉及的關(guān)鍵步驟，從加載數(shù)據(jù)到處理數(shù)據(jù)，最后以結(jié)構(gòu)化格式保存結(jié)果。

代碼代碼片段：

>步驟2：提取原始文本內(nèi)容

>步驟1：設(shè)置環(huán)境

步驟4：生成文檔塊的嵌入

>步驟5：設(shè)置獵犬

>步驟6：創(chuàng)建提示模板

>一旦建立了公司情報系統(tǒng)，下一步就是部署和擴(kuò)展其供生產(chǎn)使用。您可以在AWS或GCP等云平臺上部署該系統(tǒng)以靈活性和可擴(kuò)展性，或者如果數(shù)據(jù)隱私為優(yōu)先級，則選擇本地解決方案。為了使系統(tǒng)更加用戶友好，請考慮構(gòu)建一個簡單的API或UI，允許用戶與平臺進(jìn)行交互并毫不費(fèi)力地檢索見解。隨著系統(tǒng)開始處理較大的數(shù)據(jù)集和更高的查詢負(fù)載，必須有效擴(kuò)展。

鑰匙要點(diǎn)

1。?在此設(shè)置中使用檢索增強(qiáng)生成（RAG）的目的是什么？ RAG通過將信息檢索與生成AI相結(jié)合，增強(qiáng)了AI模型提供上下文感知響應(yīng)的能力。它可以對大型數(shù)據(jù)集進(jìn)行更明智的查詢，從而更容易檢索精確的，相關(guān)的答案，而不僅僅是執(zhí)行基本的關(guān)鍵字搜索。

首頁

科技周邊

人工智能

在組織中建立用于智能決策的破布系統(tǒng)

Jack chen

Mar 07, 2025 am 09:11 AM

在當(dāng)今快節(jié)奏的商業(yè)環(huán)境中，組織被驅(qū)動決策，優(yōu)化運(yùn)營并保持競爭力的數(shù)據(jù)所淹沒。但是，從這些數(shù)據(jù)中提取可行的見解仍然是一個重大障礙。與代理AI集成時，檢索功能增強(qiáng)的一代（RAG）系統(tǒng)不僅可以通過檢索相關(guān)信息，還可以實(shí)時處理和交付上下文感知的見解來應(yīng)對這一挑戰(zhàn)。這種組合允許企業(yè)創(chuàng)建智能代理，以自主查詢數(shù)據(jù)集，適應(yīng)和提取有關(guān)產(chǎn)品功能，集成和操作的見解。

>通過將抹布與代理AI合并，企業(yè)可以增強(qiáng)決策并將分散的數(shù)據(jù)轉(zhuǎn)換為有價值的智能。該博客探討了使用Agentic AI構(gòu)建RAG管道的過程，提供技術(shù)見解和代碼示例，以增強(qiáng)組織中明智的決策能力。

學(xué)習(xí)目標(biāo)

>學(xué)習(xí)如何使用Python和刮擦工具自動從多個Web來源提取和刮擦相關(guān)數(shù)據(jù)，從而為任何公司智能平臺構(gòu)成基礎(chǔ)。
通過提取諸如產(chǎn)品功能，集成和使用AI驅(qū)動的技術(shù)諸如產(chǎn)品功能，集成和故障排除步驟之類的關(guān)鍵點(diǎn)，了解如何將數(shù)據(jù)構(gòu)造和處理刮擦數(shù)據(jù)。
>

>本文是> > data Science Blogathon的一部分。內(nèi)容表

>使用BFS提取數(shù)據(jù)，并用AI Agent

import requests from bs4 import BeautifulSoup from collections import deque # Function to extract links using BFS def bfs_link_extraction(start_url, max_depth=3): visited = set() # To track visited links queue = deque([(start_url, 0)]) # Queue to store URLs and current depth all_links = [] while queue: url, depth = queue.popleft() if depth > max_depth: continue # Fetch the content of the URL try: response = requests.get(url) soup = BeautifulSoup(response.content, 'html.parser') # Extract all links in the page links = soup.find_all('a', href=True) for link in links: full_url = link['href'] if full_url.startswith('http') and full_url not in visited: visited.add(full_url) queue.append((full_url, depth + 1)) all_links.append(full_url) except requests.exceptions.RequestException as e: print(f"Error fetching {url}: {e}") return all_links # Start the BFS from the homepage start_url = 'https://www.example.com' # Replace with the actual homepage URL all_extracted_links = bfs_link_extraction(start_url) print(f"Extracted {len(all_extracted_links)} links.")
使用BFS提取數(shù)據(jù)并刮擦數(shù)據(jù)
>為公司情報構(gòu)建強(qiáng)大的抹布系統(tǒng)的第一步是收集必要的數(shù)據(jù)。由于數(shù)據(jù)可能來自各種網(wǎng)絡(luò)來源，因此有效地刮擦和組織它是關(guān)鍵。發(fā)現(xiàn)和收集相關(guān)頁面的一種有效技術(shù)是廣度優(yōu)先搜索（BFS）。 BFS幫助我們遞歸地發(fā)現(xiàn)從主頁開始的鏈接，從而逐漸將搜索擴(kuò)展到更深的級別。這樣可以確保我們收集所有相關(guān)頁面，而不會用不必要的數(shù)據(jù)壓倒系統(tǒng)。在本節(jié)中，我們將研究如何使用BFS從網(wǎng)站提取鏈接，然后將這些頁面的內(nèi)容刪除。使用BFS，我們會系統(tǒng)地遍歷網(wǎng)站，收集數(shù)據(jù)并創(chuàng)建一個有意義的數(shù)據(jù)集，用于在RAG管道中處理。

步驟1：使用BFS
鏈接提取為了開始，我們需要從給定網(wǎng)站收集所有相關(guān)鏈接。使用BFS，我們可以探索主頁上的鏈接，然后從那里遵循其他頁面上的鏈接，直到指定的深度。此方法可確保我們捕獲所有可能包含相關(guān)公司數(shù)據(jù)的必要頁面，例如產(chǎn)品功能，集成或其他關(guān)鍵細(xì)節(jié)。
下面的代碼使用BFS從啟動URL中進(jìn)行鏈接提取。它首先獲取主頁，提取所有鏈接（＆lt; a＆gt;帶有HREF屬性的標(biāo)簽），然后遵循這些鏈接到后續(xù)頁面，遞歸根據(jù)給定深度限制。
這是執(zhí)行鏈接提取的代碼：>
Extracted 1500 links.

>我們保持隊(duì)列以跟蹤訪問的URL及其相應(yīng)的深度，以確保有效的遍歷。訪問的集合用于防止多次重新訪問相同的URL。對于每個URL，我們使用BeautifulSoup來解析HTML并提取所有鏈接（帶有HREF屬性的標(biāo)簽）。該過程使用BFS遍歷，遞歸獲取每個URL的內(nèi)容，提取鏈接并進(jìn)一步探索直到達(dá)到深度極限。這種方法可確保我們在沒有冗余的情況下有效地探索網(wǎng)絡(luò)。
> >輸出此代碼輸出從網(wǎng)站提取的鏈接列表，直到指定的深度。 >輸出表明該系統(tǒng)從啟動網(wǎng)站及其鏈接的頁面最高為3的鏈接中找到并收集了1500個鏈接。您將用實(shí)際的目標(biāo)URL替換https://www.example.com。以下是原始代碼的輸出屏幕截圖。敏感信息已被掩蓋以維持完整性。
>步驟2：從提取的鏈接中刮擦數(shù)據(jù)

>使用BFS提取相關(guān)鏈接后，下一步就是從這些頁面中刮擦內(nèi)容。我們將尋找關(guān)鍵信息，例如產(chǎn)品功能，集成和任何其他相關(guān)數(shù)據(jù)，這些數(shù)據(jù)將幫助我們?yōu)镽AG系統(tǒng)構(gòu)建結(jié)構(gòu)化數(shù)據(jù)集。
在此步驟中，我們循環(huán)瀏覽提取的鏈接列表和刮擦密鑰內(nèi)容，例如頁面標(biāo)題及其主要內(nèi)容。您可以根據(jù)需要調(diào)整此代碼以刮擦其他數(shù)據(jù)點(diǎn)（例如，產(chǎn)品功能，定價或常見問題解答信息）。
>

import requests from bs4 import BeautifulSoup from collections import deque # Function to extract links using BFS def bfs_link_extraction(start_url, max_depth=3): visited = set() # To track visited links queue = deque([(start_url, 0)]) # Queue to store URLs and current depth all_links = [] while queue: url, depth = queue.popleft() if depth > max_depth: continue # Fetch the content of the URL try: response = requests.get(url) soup = BeautifulSoup(response.content, 'html.parser') # Extract all links in the page links = soup.find_all('a', href=True) for link in links: full_url = link['href'] if full_url.startswith('http') and full_url not in visited: visited.add(full_url) queue.append((full_url, depth + 1)) all_links.append(full_url) except requests.exceptions.RequestException as e: print(f"Error fetching {url}: {e}") return all_links # Start the BFS from the homepage start_url = 'https://www.example.com' # Replace with the actual homepage URL all_extracted_links = bfs_link_extraction(start_url) print(f"Extracted {len(all_extracted_links)} links.")
>對于列表中的每個URL，我們發(fā)送HTTP請求以獲取頁面的內(nèi)容，并使用BeautifulSoup解析其提取標(biāo)題和主要內(nèi)容。我們將提取的數(shù)據(jù)存儲在字典列表中，每個詞典都包含URL，標(biāo)題和內(nèi)容。最后，我們將刮擦數(shù)據(jù)保存到JSON文件中，以確?？梢栽赗AG管道中進(jìn)行以后的處理。此過程可確保有效收集和存儲相關(guān)數(shù)據(jù)以進(jìn)一步使用。
>
>輸出

>該代碼的輸出將是一個保存的JSON文件（Scraped_data.json），其中包含來自鏈接的刮擦數(shù)據(jù)。數(shù)據(jù)結(jié)構(gòu)的一個示例可能是這樣的：

Extracted 1500 links.
此JSON文件包含我們刮擦的每個頁面的URL，標(biāo)題和內(nèi)容?，F(xiàn)在可以將這些結(jié)構(gòu)化數(shù)據(jù)用于進(jìn)一步處理，例如在抹布系統(tǒng)中嵌入生成和提問。以下是原始代碼的輸出屏幕截圖。敏感信息已被掩蓋以維持完整性。

用AI代理自動化信息提取
> 在上一節(jié)中，我們使用廣度優(yōu)先搜索（BFS）策略介紹了刮擦鏈接和收集原始Web內(nèi)容的過程。一旦刮擦了必要的數(shù)據(jù)，我們就需要一個可靠的系統(tǒng)來組織和從此原始內(nèi)容中提取可行的見解。在這里，Agesic AI介入：通過處理刮擦數(shù)據(jù)，它會自動將信息構(gòu)造為有意義的部分。>
在本節(jié)中，我們關(guān)注代理AI如何從刮擦數(shù)據(jù)中提取相關(guān)的產(chǎn)品信息，以確保其已準(zhǔn)備好利益相關(guān)者的消費(fèi)。我們將分解涉及的關(guān)鍵步驟，從加載數(shù)據(jù)到處理數(shù)據(jù)，最后以結(jié)構(gòu)化格式保存結(jié)果。

>步驟1：加載刮擦數(shù)據(jù)

>此過程的第一步是將原始的刮擦內(nèi)容加載到我們的系統(tǒng)中。正如我們之前看到的，刮擦數(shù)據(jù)以JSON格式存儲，每個條目都包含一個URL和相關(guān)內(nèi)容。我們需要確保此數(shù)據(jù)以適當(dāng)?shù)母袷焦〢I處理。

代碼代碼片段：

import requests from bs4 import BeautifulSoup from collections import deque # Function to extract links using BFS def bfs_link_extraction(start_url, max_depth=3): visited = set() # To track visited links queue = deque([(start_url, 0)]) # Queue to store URLs and current depth all_links = [] while queue: url, depth = queue.popleft() if depth > max_depth: continue # Fetch the content of the URL try: response = requests.get(url) soup = BeautifulSoup(response.content, 'html.parser') # Extract all links in the page links = soup.find_all('a', href=True) for link in links: full_url = link['href'] if full_url.startswith('http') and full_url not in visited: visited.add(full_url) queue.append((full_url, depth + 1)) all_links.append(full_url) except requests.exceptions.RequestException as e: print(f"Error fetching {url}: {e}") return all_links # Start the BFS from the homepage start_url = 'https://www.example.com' # Replace with the actual homepage URL all_extracted_links = bfs_link_extraction(start_url) print(f"Extracted {len(all_extracted_links)} links.")

>在這里，我們使用Python的內(nèi)置JSON庫將整個數(shù)據(jù)集加載到存儲器中。數(shù)據(jù)集中的每個條目都包含源的URL和text_content字段，該字段容納了原始的刮擦文本。此內(nèi)容是我們將在接下來的一步中處理的內(nèi)容。
>
>步驟2：提取原始文本內(nèi)容

接下來，我們通過數(shù)據(jù)集迭代以提取每個條目的相關(guān)text_content。這樣可以確保我們僅處理包含必要內(nèi)容的有效條目。跳過無效或不完整的條目以維持該過程的完整性。

代碼代碼片段：
在這一點(diǎn)上，Input_text變量包含我們將發(fā)送到AI模型以進(jìn)行進(jìn)一步處理的原始文本內(nèi)容。至關(guān)重要的是，我們在處理每個條目之前確保存在必要的鍵。
Extracted 1500 links.
步驟3：將數(shù)據(jù)發(fā)送給AI代理進(jìn)行處理

提取原始內(nèi)容后，我們將其發(fā)送到用于結(jié)構(gòu)化提取的代理AI模型。我們與GROQ API進(jìn)行交互，以根據(jù)預(yù)定義的提示請求結(jié)構(gòu)化的見解。 AI模型處理內(nèi)容并返回有組織的信息，這些信息涵蓋了關(guān)鍵方面，例如產(chǎn)品功能，集成和故障排除步驟。

代碼代碼片段：

>在這里，該代碼啟動了對GROQ的API調(diào)用，將Input_Text和說明作為消息有效負(fù)載的一部分發(fā)送。系統(tǒng)消息指示執(zhí)行確切任務(wù)的AI模型，而用戶消息則提供要處理的內(nèi)容。我們使用溫度，max_tokens和top_p參數(shù)來控制生成的輸出的隨機(jī)性和長度。

import json # Function to scrape and extract data from the URLs def scrape_data_from_links(links): scraped_data = [] for link in links: try: response = requests.get(link) soup = BeautifulSoup(response.content, 'html.parser') # Example: Extract 'title' and 'content' (modify according to your needs) title = soup.find('title').get_text() content = soup.find('div', class_='content').get_text() # Adjust selector # Store the extracted data scraped_data.append({ 'url': link, 'title': title, 'content': content }) except requests.exceptions.RequestException as e: print(f"Error scraping {link}: {e}") return scraped_data # Scrape data from the extracted links scraped_contents = scrape_data_from_links(all_extracted_links) # Save scraped data to a JSON file with open('/content/scraped_data.json', 'w') as outfile: json.dump(scraped_contents, outfile, indent=4) print("Data scraping complete.")
API調(diào)用配置：

模型：
指定要使用的模型。在這種情況下，選擇語言模型以確保它可以處理文本數(shù)據(jù)并生成響應(yīng)。
溫度：>控制響應(yīng)的創(chuàng)造力。更高的價值會導(dǎo)致更具創(chuàng)造性的響應(yīng)，而較低的價值使它們更具確定性。

max_tokens：設(shè)置生成的響應(yīng)的最大長度。

top_p：>確定令牌選擇的累積概率分布，控制響應(yīng)中的多樣性。步驟4：處理和收集結(jié)果
> AI模型處理內(nèi)容后，它將返回大量結(jié)構(gòu)化信息。我們收集并加入這些塊以創(chuàng)建完整的結(jié)果，確保不會丟失數(shù)據(jù)并且最終輸出完成。
代碼代碼片段：
import requests from bs4 import BeautifulSoup from collections import deque # Function to extract links using BFS def bfs_link_extraction(start_url, max_depth=3): visited = set() # To track visited links queue = deque([(start_url, 0)]) # Queue to store URLs and current depth all_links = [] while queue: url, depth = queue.popleft() if depth > max_depth: continue # Fetch the content of the URL try: response = requests.get(url) soup = BeautifulSoup(response.content, 'html.parser') # Extract all links in the page links = soup.find_all('a', href=True) for link in links: full_url = link['href'] if full_url.startswith('http') and full_url not in visited: visited.add(full_url) queue.append((full_url, depth + 1)) all_links.append(full_url) except requests.exceptions.RequestException as e: print(f"Error fetching {url}: {e}") return all_links # Start the BFS from the homepage start_url = 'https://www.example.com' # Replace with the actual homepage URL all_extracted_links = bfs_link_extraction(start_url) print(f"Extracted {len(all_extracted_links)} links.")

此代碼段將每個塊的內(nèi)容串聯(lián)到> pm_points 變量，從而產(chǎn)生完整的結(jié)構(gòu)化洞察集。它以利益相關(guān)者可以輕松消費(fèi)或用于進(jìn)一步分析的格式提取這些見解。以下是原始代碼的輸出屏幕截圖，并掩蓋了敏感信息以保持完整性。
>步驟5：錯誤處理和維護(hù)數(shù)據(jù)完整性

在處理時，總是有可能遇到錯誤的可能性，例如不完整的內(nèi)容或網(wǎng)絡(luò)問題。通過使用錯誤處理機(jī)制，我們確保該過程對于所有有效條目都可以順利進(jìn)行。
代碼代碼片段：

這個嘗試 - 外觀塊捕獲并記錄所有錯誤，從而確保系統(tǒng)繼續(xù)處理其他條目。如果特定條目引起問題，則該系統(tǒng)將其標(biāo)記用于審核而無需停止整體過程。
Extracted 1500 links.
步驟6：保存處理的數(shù)據(jù)
AI處理內(nèi)容并返回結(jié)構(gòu)化的見解后，最后一步是保存此數(shù)據(jù)以供以后使用。我們將結(jié)構(gòu)化結(jié)果寫回JSON文件，以確保每個條目都有自己的已處理信息存儲以進(jìn)行進(jìn)一步分析。

代碼代碼片段：

>該代碼可有效存儲處理后的數(shù)據(jù)，并允許以后輕松訪問。它及其各自的結(jié)構(gòu)化點(diǎn)保存了每個條目，從而簡單地檢索提取的信息并分析。 >輸出
運(yùn)行上述代碼后，處理后的JSON文件將包含每個條目的提取點(diǎn)。 pm_points將持有與產(chǎn)品功能，集成，故障排除步驟等相關(guān)的結(jié)構(gòu)化信息，并準(zhǔn)備進(jìn)一步分析或集成到您的工作流程中。
import json # Function to scrape and extract data from the URLs def scrape_data_from_links(links): scraped_data = [] for link in links: try: response = requests.get(link) soup = BeautifulSoup(response.content, 'html.parser') # Example: Extract 'title' and 'content' (modify according to your needs) title = soup.find('title').get_text() content = soup.find('div', class_='content').get_text() # Adjust selector # Store the extracted data scraped_data.append({ 'url': link, 'title': title, 'content': content }) except requests.exceptions.RequestException as e: print(f"Error scraping {link}: {e}") return scraped_data # Scrape data from the extracted links scraped_contents = scrape_data_from_links(all_extracted_links) # Save scraped data to a JSON file with open('/content/scraped_data.json', 'w') as outfile: json.dump(scraped_contents, outfile, indent=4) print("Data scraping complete.")
>

以下是原始代碼的輸出屏幕截圖。敏感信息已被掩蓋以維持完整性。

[ { "url": "https://www.example.com/page1", "title": "Page 1 Title", "content": "This is the content of the first page. It contains information about integrations and features." }, { "url": "https://www.example.com/page2", "title": "Page 2 Title", "content": "Here we describe the functionalities of the product. It includes various use cases and capabilities." } ]
>檢索增強(qiáng)的生成管道實(shí)施
在上一節(jié)中，我們專注于從網(wǎng)頁中提取數(shù)據(jù)，并將其轉(zhuǎn)換為JSON等結(jié)構(gòu)化格式。我們還實(shí)施了提取和清潔相關(guān)數(shù)據(jù)的技術(shù)，從而使我們能夠生成一個可以進(jìn)行更深入分析的數(shù)據(jù)集。
>在本節(jié)中，我們將實(shí)施一個檢索功能（RAG）管道，該管道結(jié)合了文檔檢索和語言模型生成，以根據(jù)提取的信息回答問題。通過整合我們先前刮擦和處理的結(jié)構(gòu)化數(shù)據(jù)，該破布管道不僅將檢索最相關(guān)的文檔塊，而且還會基于該上下文產(chǎn)生準(zhǔn)確，有見地的響應(yīng)。
>步驟1：設(shè)置環(huán)境

開始，讓我們安裝抹布管道的所有必要依賴項(xiàng)：>

import requests from bs4 import BeautifulSoup from collections import deque # Function to extract links using BFS def bfs_link_extraction(start_url, max_depth=3): visited = set() # To track visited links queue = deque([(start_url, 0)]) # Queue to store URLs and current depth all_links = [] while queue: url, depth = queue.popleft() if depth > max_depth: continue # Fetch the content of the URL try: response = requests.get(url) soup = BeautifulSoup(response.content, 'html.parser') # Extract all links in the page links = soup.find_all('a', href=True) for link in links: full_url = link['href'] if full_url.startswith('http') and full_url not in visited: visited.add(full_url) queue.append((full_url, depth + 1)) all_links.append(full_url) except requests.exceptions.RequestException as e: print(f"Error fetching {url}: {e}") return all_links # Start the BFS from the homepage start_url = 'https://www.example.com' # Replace with the actual homepage URL all_extracted_links = bfs_link_extraction(start_url) print(f"Extracted {len(all_extracted_links)} links.")
這些軟件包對于在Langchain中集成文檔處理，矢量化和OpenAI模型至關(guān)重要。 JQ是輕量級的JSON處理器，而Langchain則是構(gòu)建語言模型管道的核心框架。 Langchain-Openai促進(jìn)了諸如GPT之類的OpenAI模型的整合，Langchain-Chroma提供了基于色度的矢量商店，用于管理文檔嵌入。
>此外，我們使用句子轉(zhuǎn)換器來生成具有預(yù)訓(xùn)練的變壓器模型的文本嵌入，從而實(shí)現(xiàn)有效的文檔處理和檢索。
>
步驟2：加載提取的數(shù)據(jù)

現(xiàn)在，我們將加載使用JSONLOADER在上一節(jié)中提取和處理的結(jié)構(gòu)化數(shù)據(jù)。例如，這些數(shù)據(jù)本可以從網(wǎng)頁中刮掉為結(jié)構(gòu)化的JSON，其中鍵值對與特定主題或問題有關(guān)。
在此步驟中，已加載了先前提取的數(shù)據(jù)（可能包含產(chǎn)品功能，集成和功能）以進(jìn)行進(jìn)一步處理。
。
步驟3：將文檔分成較小的塊
Extracted 1500 links.

>現(xiàn)在我們有了原始數(shù)據(jù)，我們使用recursivecharactertextsplitter將文檔分解為較小的塊。這樣可以確保沒有單個塊超過語言模型的令牌限制。>

步驟4：生成文檔塊的嵌入

現(xiàn)在，我們使用sencencetransformer將每個文本塊轉(zhuǎn)換為嵌入式。這些嵌入代表了高維矢量空間中文本的含義，這對于以后搜索和檢索相關(guān)文檔很有用。

import json # Function to scrape and extract data from the URLs def scrape_data_from_links(links): scraped_data = [] for link in links: try: response = requests.get(link) soup = BeautifulSoup(response.content, 'html.parser') # Example: Extract 'title' and 'content' (modify according to your needs) title = soup.find('title').get_text() content = soup.find('div', class_='content').get_text() # Adjust selector # Store the extracted data scraped_data.append({ 'url': link, 'title': title, 'content': content }) except requests.exceptions.RequestException as e: print(f"Error scraping {link}: {e}") return scraped_data # Scrape data from the extracted links scraped_contents = scrape_data_from_links(all_extracted_links) # Save scraped data to a JSON file with open('/content/scraped_data.json', 'w') as outfile: json.dump(scraped_contents, outfile, indent=4) print("Data scraping complete.")
> sencencetransformer用于生成文本塊的嵌入，創(chuàng)建捕獲語義信息的密集矢量表示。函數(shù)embed_documents處理多個文檔并返回其嵌入，而Embed_query生成了用戶查詢的嵌入式。 Chroma是一個矢量存儲，管理這些嵌入并根據(jù)相似性實(shí)現(xiàn)有效的檢索，從而允許快速準(zhǔn)確的文檔或查詢匹配。
>步驟5：設(shè)置獵犬

>現(xiàn)在我們配置了獵犬。該組件根據(jù)用戶查詢搜索最相關(guān)的文本塊。它將最類似的文檔塊檢索到查詢中。
>
import requests from bs4 import BeautifulSoup from collections import deque # Function to extract links using BFS def bfs_link_extraction(start_url, max_depth=3): visited = set() # To track visited links queue = deque([(start_url, 0)]) # Queue to store URLs and current depth all_links = [] while queue: url, depth = queue.popleft() if depth > max_depth: continue # Fetch the content of the URL try: response = requests.get(url) soup = BeautifulSoup(response.content, 'html.parser') # Extract all links in the page links = soup.find_all('a', href=True) for link in links: full_url = link['href'] if full_url.startswith('http') and full_url not in visited: visited.add(full_url) queue.append((full_url, depth + 1)) all_links.append(full_url) except requests.exceptions.RequestException as e: print(f"Error fetching {url}: {e}") return all_links # Start the BFS from the homepage start_url = 'https://www.example.com' # Replace with the actual homepage URL all_extracted_links = bfs_link_extraction(start_url) print(f"Extracted {len(all_extracted_links)} links.")

獵犬使用相似性搜索來從矢量商店找到最相關(guān)的塊。

參數(shù)k = 6意味著它將返回與查詢最相關(guān)的前6個塊。

>步驟6：創(chuàng)建提示模板

接下來，我們創(chuàng)建一個提示模板，該模板將格式化語言模型的輸入。該模板既包含上下文（檢索到的塊）和用戶的查詢，從而指導(dǎo)模型以僅基于提供的上下文生成答案。

Extracted 1500 links.

> ChatPromptTemplate格式格式化模型的輸入，以強(qiáng)調(diào)答案僅基于給定上下文的需求。
{context}將被相關(guān)的文本塊替換，{Quartion}將被用戶的查詢替換。

>步驟7：設(shè)置語言模型
在此步驟中，我們初始化OpenAI GPT模型。該模型將根據(jù)獵犬提供的結(jié)構(gòu)化上下文生成答案。
>

import json # Function to scrape and extract data from the URLs def scrape_data_from_links(links): scraped_data = [] for link in links: try: response = requests.get(link) soup = BeautifulSoup(response.content, 'html.parser') # Example: Extract 'title' and 'content' (modify according to your needs) title = soup.find('title').get_text() content = soup.find('div', class_='content').get_text() # Adjust selector # Store the extracted data scraped_data.append({ 'url': link, 'title': title, 'content': content }) except requests.exceptions.RequestException as e: print(f"Error scraping {link}: {e}") return scraped_data # Scrape data from the extracted links scraped_contents = scrape_data_from_links(all_extracted_links) # Save scraped data to a JSON file with open('/content/scraped_data.json', 'w') as outfile: json.dump(scraped_contents, outfile, indent=4) print("Data scraping complete.")
我們初始化Chatopenai模型，該模型將處理提示并生成答案。>
>我們使用較小的模型“ GPT-4O-Mini”進(jìn)行有效的處理，盡管較大的模型可以用于更復(fù)雜的任務(wù)。

步驟8：構(gòu)建抹布管道

>在這里，我們將所有組件（Retriever，Prest，llm）集成到一個內(nèi)聚的RAG管道中。該管道將??進(jìn)行查詢，檢索相關(guān)上下文，通過模型并生成響應(yīng)。

> RunnablePassThrough確保查詢直接傳遞到提示。
> stroutputparser用于清潔和格式化模型的輸出為字符串格式。

[ { "url": "https://www.example.com/page1", "title": "Page 1 Title", "content": "This is the content of the first page. It contains information about integrations and features." }, { "url": "https://www.example.com/page2", "title": "Page 2 Title", "content": "Here we describe the functionalities of the product. It includes various use cases and capabilities." } ]

>步驟9：測試抹布管道
最后，我們使用各種用戶查詢來測試管道。對于每個查詢，系統(tǒng)都會檢索相關(guān)文檔塊，通過語言模型將它們傳遞并生成響應(yīng)。

系統(tǒng)通過每個查詢進(jìn)行迭代，調(diào)用管道并打印生成的答案。

對于每個查詢，模型都會處理檢索到的上下文，并提供了基于上下文的答案。
下面是原始代碼的抹布輸出的屏幕截圖。敏感信息已被掩蓋以維持完整性。

import json # Load the scraped JSON file containing the web data with open('/content/scraped_contents_zluri_all_links.json', 'r') as file: data = json.load(file)

通過結(jié)合Web刮擦，數(shù)據(jù)提取和高級檢索功能（RAG）技術(shù)，我們?yōu)楣局悄軇?chuàng)建了一個強(qiáng)大而可擴(kuò)展的框架。提取鏈接和刮擦數(shù)據(jù)的第一步確保我們從網(wǎng)絡(luò)中收集相關(guān)和最新信息。第二部分重點(diǎn)是查明與產(chǎn)品相關(guān)的特定細(xì)節(jié)，從而更容易對數(shù)據(jù)進(jìn)行分類和處理數(shù)據(jù)。

>最后，利用抹布使我們能夠通過從廣泛的數(shù)據(jù)集中檢索和綜合上下文信息來動態(tài)響應(yīng)復(fù)雜的查詢。這些組件共同構(gòu)成了一個綜合設(shè)置，可用于構(gòu)建能夠收集，處理和提供可行的公司的代理平臺。該框架可以成為開發(fā)高級情報系統(tǒng)的基礎(chǔ)，使組織能夠自動進(jìn)行競爭分析，監(jiān)控市場趨勢并了解其行業(yè)。
>部署和縮放

>一旦建立了公司情報系統(tǒng)，下一步就是部署和擴(kuò)展其供生產(chǎn)使用。您可以在AWS或GCP等云平臺上部署該系統(tǒng)以靈活性和可擴(kuò)展性，或者如果數(shù)據(jù)隱私為優(yōu)先級，則選擇本地解決方案。為了使系統(tǒng)更加用戶友好，請考慮構(gòu)建一個簡單的API或UI，允許用戶與平臺進(jìn)行交互并毫不費(fèi)力地檢索見解。隨著系統(tǒng)開始處理較大的數(shù)據(jù)集和更高的查詢負(fù)載，必須有效擴(kuò)展。

這可以通過利用分布式向量存儲和優(yōu)化檢索過程來實(shí)現(xiàn)，從而確保管道在大量使用下保持響應(yīng)迅速和快速。借助正確的基礎(chǔ)架構(gòu)和優(yōu)化技術(shù)，代理平臺可以增長以支持大規(guī)模運(yùn)營，實(shí)現(xiàn)實(shí)時見解并保持公司情報的競爭優(yōu)勢。
>
結(jié)論
在當(dāng)今數(shù)據(jù)驅(qū)動的世界中，
從非結(jié)構(gòu)化公司數(shù)據(jù)中提取可行的見解至關(guān)重要。檢索功能生成（RAG）系統(tǒng)結(jié)合了數(shù)據(jù)刮擦，指針提取和智能查詢，以創(chuàng)建一個強(qiáng)大的公司情報平臺。通過組織關(guān)鍵信息并實(shí)現(xiàn)實(shí)時，特定于上下文的響應(yīng)，RAG系統(tǒng)賦予組織中智能決策的能力，幫助企業(yè)做出數(shù)據(jù)支持，適應(yīng)性的決策。
這種可擴(kuò)展的解決方案隨您的需求而生長，處理復(fù)雜的查詢和較大的數(shù)據(jù)集，同時保持準(zhǔn)確性。借助正確的基礎(chǔ)設(shè)施，這個AI驅(qū)動的平臺成為了更智能運(yùn)營的基石，使組織能夠利用其數(shù)據(jù)，保持競爭力并通過組織中的智能決策來推動創(chuàng)新。

鑰匙要點(diǎn)

通過從多個來源啟用自動，有效的數(shù)據(jù)收集，以最小的努力啟用自動，有效的數(shù)據(jù)收集，從而增強(qiáng)了公司的智能。

提取關(guān)鍵數(shù)據(jù)點(diǎn)將非結(jié)構(gòu)化的內(nèi)容轉(zhuǎn)換為有組織的，可行的知識，增強(qiáng)了對AI驅(qū)動的見解的公司智能。

>將抹布與自定義矢量商店相結(jié)合并優(yōu)化的檢索器可實(shí)現(xiàn)智能的，上下文感知的響應(yīng)，以獲得更好的決策。
基于云的解決方案和分布式向量商店確保有效縮放，處理較大的數(shù)據(jù)集和查詢負(fù)載而不會損失績效。
rag管道處理實(shí)時查詢，直接從知識庫中傳遞準(zhǔn)確的，按需的見解。

獎勵：在以下鏈接中提供了此處討論的所有代碼。共有4個筆記本，每個筆記本都有自我解釋的名稱。隨時探索，發(fā)展和革新企業(yè)！
常見問題
q
1。?在此設(shè)置中使用檢索增強(qiáng)生成（RAG）的目的是什么？ RAG通過將信息檢索與生成AI相結(jié)合，增強(qiáng)了AI模型提供上下文感知響應(yīng)的能力。它可以對大型數(shù)據(jù)集進(jìn)行更明智的查詢，從而更容易檢索精確的，相關(guān)的答案，而不僅僅是執(zhí)行基本的關(guān)鍵字搜索。
q 2。?需要哪些工具和庫來構(gòu)建博客中描述的系統(tǒng)？使用的主要工具和庫包括Python，用于網(wǎng)絡(luò)刮擦的美麗套件，用于管理文檔檢索的Langchain，用于自然語言處理的OpenAI模型以及用于存儲矢量化文檔的Chroma。這些組件共同創(chuàng)建一個全面的公司情報平臺。?指針提取過程如何在此系統(tǒng)中起作用？指針提取涉及從刮擦內(nèi)容中識別特定信息，例如產(chǎn)品功能，集成和故障排除提示。數(shù)據(jù)是使用及時驅(qū)動系統(tǒng)處理的，該系統(tǒng)將信息組織到結(jié)構(gòu)化的，可行的見解中。這是通過AI模型和自定義提示的組合來實(shí)現(xiàn)的。Q 4。?RAG和AI代理如何改善公司的智能？ RAG和AI代理通過自動化數(shù)據(jù)檢索，處理和分析來增強(qiáng)公司智能
，使企業(yè)能夠提取實(shí)時，可行的見解。為什么數(shù)據(jù)刮擦對于公司情報很重要？數(shù)據(jù)刮擦有助于通過從多個來源收集和構(gòu)造有價值的信息來建立強(qiáng)大的公司智能
系統(tǒng)以進(jìn)行明智的決策。
>本文所示的媒體不歸Analytics Vidhya擁有，并由作者的酌情決定使用。

以上是在組織中建立用于智能決策的破布系統(tǒng)的詳細(xì)內(nèi)容。更多信息請關(guān)注PHP中文網(wǎng)其他相關(guān)文章！

本站聲明

本文內(nèi)容由網(wǎng)友自發(fā)貢獻(xiàn)，版權(quán)歸原作者所有，本站不承擔(dān)相應(yīng)法律責(zé)任。如您發(fā)現(xiàn)有涉嫌抄襲侵權(quán)的內(nèi)容，請聯(lián)系admin@php.cn