Identify overfitting and underfitting through learning curves
Apr 29, 2024, 06:50 PM
This article introduces how to effectively identify overfitting and underfitting in machine learning models through learning curves.
Underfitting and overfitting
1. Overfitting
If a model is trained on data to the point that it learns the noise in it, the model is said to be overfitted. An overfitted model memorizes the training examples so closely that it generalizes poorly and often misclassifies unseen examples. For an overfitted model, we get a perfect or near-perfect training score and a much worse validation/test score.
Causes of overfitting: using a complex model to solve a simple problem, so that the model extracts noise from the data; and using a small data set, which may not be representative of all the data.
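The train/validation gap described above can be reproduced in a few lines. This is a minimal sketch, not part of the original walkthrough: it assumes synthetic data from scikit-learn's make_classification with 20% of the labels flipped at random, and uses an unpruned decision tree as a stand-in for "a complex model". The names X_demo, y_demo, and tree are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration): flip_y=0.2 randomly
# reassigns about 20% of the labels, which is pure noise.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# An unpruned tree keeps splitting until every training leaf is pure,
# so it memorizes the noisy labels as well as the real signal.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)
val_acc = tree.score(X_val, y_val)
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```

The training accuracy is perfect because the tree has memorized every example, while the validation accuracy is noticeably lower, which is the signature of overfitting.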
2. Underfitting
If a model cannot correctly learn the patterns in the data, we say it is underfitted. An underfitted model does not fully learn the examples in the data set. In this case, the errors on both the training and validation sets are high. This may be because the model is too simple and does not have enough parameters to fit the data. We can try to increase the complexity of the model, for example by adding layers or neurons, to solve the underfitting problem. Note, however, that increasing model complexity also increases the risk of overfitting.
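As a minimal sketch of underfitting (an illustration, not taken from the original article), assume two noisy concentric circles generated with scikit-learn's make_circles. A plain logistic regression can only draw a straight line, so it scores poorly on both the training and validation sets; adding degree-2 polynomial features is one way to increase model complexity. The names linear and quadratic are illustrative.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric circles: not separable by a straight line (assumed toy data).
X_demo, y_demo = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# Too simple: a linear decision boundary underfits on train AND validation data.
linear = LogisticRegression().fit(X_tr, y_tr)
linear_train = linear.score(X_tr, y_tr)
linear_val = linear.score(X_val, y_val)

# More capacity: squared terms let the model represent a circular boundary.
quadratic = make_pipeline(
    PolynomialFeatures(degree=2), LogisticRegression(max_iter=1000)
).fit(X_tr, y_tr)
quad_train = quadratic.score(X_tr, y_tr)
quad_val = quadratic.score(X_val, y_val)

print(f"linear:    train={linear_train:.3f}, validation={linear_val:.3f}")
print(f"quadratic: train={quad_train:.3f}, validation={quad_val:.3f}")
```

The linear model is poor on both sets, not just on unseen data, which is what distinguishes underfitting from overfitting.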
Causes of underfitting: using a simple model to solve a complex problem, so that the model cannot learn all the patterns in the data, or learns the patterns of the underlying data incorrectly. In data analysis and machine learning, model selection is very important. Choosing the right model for your problem can improve the accuracy and reliability of your predictions. For complex problems, more complex models may be needed to capture all patterns in the data. In addition, you also need to consider the
Learning curve
A learning curve plots the training loss and the validation loss as new training samples are added incrementally. It helps us determine whether adding more training examples would improve the validation score (the score on unseen data). If the model is overfitted, adding training examples may improve its performance on unseen data. Likewise, if a model is underfitted, adding training examples may not help. The 'learning_curve' function can be imported from Scikit-Learn's 'model_selection' module.
from sklearn.model_selection import learning_curve
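Before wrapping it in a helper, it may help to see what 'learning_curve' returns on its own. The sketch below assumes scikit-learn's built-in copy of the Iris data and a scaled logistic regression; the names X_iris, y_iris, and model are illustrative. Setting shuffle=True matters here because the built-in Iris rows are sorted by class, so unshuffled small subsets could contain a single class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

# The model is refit on growing subsets of each CV training fold.
train_sizes, train_scores, val_scores = learning_curve(
    model, X_iris, y_iris,
    train_sizes=np.linspace(0.2, 1.0, 5),  # 20% ... 100% of each training fold
    cv=5, scoring="accuracy",
    shuffle=True, random_state=0)

print(train_sizes)                           # absolute number of samples per step
print(train_scores.shape, val_scores.shape)  # (n_sizes, n_folds) each
```

Each row of the score arrays corresponds to one training size and each column to one cross-validation fold, so averaging over axis=1 gives one point per training size.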
We will demonstrate using logistic regression and Iris data. Create a function called "learn_curve" that will fit a logistic regression model and return cross-validation scores, training scores, and learning curve data.
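The article does not show how X and y are created. One possible way, assuming scikit-learn's built-in copy of the Iris data (the original may have loaded it from a file instead), is sketched below. The labels are converted to species names so that label encoding them afterwards is meaningful.

```python
from sklearn.datasets import load_iris

# Assumption: the built-in Iris data set stands in for the article's data.
iris = load_iris()
X = iris.data                        # feature matrix, one row per flower
y = iris.target_names[iris.target]   # species names as strings

print(X.shape, sorted(set(y)))
```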
# Imports needed by the function below
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, learning_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

# The function below builds the model and returns the cross-validation scores,
# the training score, and the learning curve data
def learn_curve(X, y, c):
    '''
    param X: matrix of input features
    param y: vector of targets/labels
    param c: inverse regularization strength, used to control overfitting
             (a high value causes overfitting, a low value causes underfitting)
    '''
    # We do not split the data into train and test sets because we use
    # StratifiedKFold cross-validation. K-fold CV is preferred over hold-out CV
    # because the model is tested on all the examples. Hold-out CV is preferred
    # when the model takes too long to train and we have a huge test set that
    # truly represents the whole population.
    le = LabelEncoder()      # label encoder for the target
    sc = StandardScaler()    # scaler for the input features
    y = le.fit_transform(y)  # label encoding the target
    log_reg = LogisticRegression(max_iter=200, random_state=11, C=c)
    # Pipeline with scaling and classification as steps. A pipeline is needed
    # with k-fold CV so that the scaler is refit inside every fold.
    lr = Pipeline(steps=[("scaler", sc), ("classifier", log_reg)])
    # StratifiedKFold object with 5 folds
    cv = StratifiedKFold(n_splits=5, random_state=11, shuffle=True)
    # CV scores (accuracy) of each fold
    cv_scores = cross_val_score(lr, X, y, scoring="accuracy", cv=cv)
    lr.fit(X, y)                  # fitting the model on all the data
    train_score = lr.score(X, y)  # scoring the model on the training set
    # Building the learning curve. random_state only takes effect when
    # shuffle=True, so shuffling is enabled explicitly here.
    train_size, train_scores, test_scores = learning_curve(
        estimator=lr, X=X, y=y, cv=cv, scoring="accuracy",
        shuffle=True, random_state=11)
    # Converting the accuracy scores to misclassification rates
    train_scores = 1 - np.mean(train_scores, axis=1)
    test_scores = 1 - np.mean(test_scores, axis=1)
    lc = pd.DataFrame({"Training_size": train_size,
                       "Training_loss": train_scores,
                       "Validation_loss": test_scores}).melt(id_vars="Training_size")
    return {"cv_scores": cv_scores, "train_score": train_score, "learning_curve": lc}
The code above is straightforward; it is the usual training process. Next, we look at how to use the learning curve.
1. Learning curve of a well-fitted model
We will use the 'learn_curve' function to obtain a well-fitted model by setting the inverse regularization parameter 'c' to 1 (scikit-learn's default value for C, which applies a moderate amount of L2 regularization).
import matplotlib.pyplot as plt
import seaborn as sns

lc = learn_curve(X, y, 1)
print(f'Cross Validation Accuracies:\n{"-"*25}\n{list(lc["cv_scores"])}\n\n'
      f'Mean Cross Validation Accuracy:\n{"-"*25}\n{np.mean(lc["cv_scores"])}\n\n'
      f'Standard Deviation of Cross Validation Accuracy:\n{"-"*25}\n{np.std(lc["cv_scores"])}\n\n'
      f'Training Accuracy:\n{"-"*15}\n{lc["train_score"]}\n\n')
sns.lineplot(data=lc["learning_curve"], x="Training_size", y="value", hue="variable")
plt.title("Learning Curve of Good Fit Model")
plt.ylabel("Misclassification Rate/Loss")
plt.show()
In the results above, the cross-validation accuracy is close to the training accuracy.
Training loss (blue): the training loss of a well-fitted model gradually decreases as the number of training examples increases and then flattens out, indicating that adding more training examples does not further improve the model's performance on the training data.
Validation loss (yellow): the validation loss of a well-fitted model is high at the beginning, gradually decreases as the number of training examples increases, and then flattens out. This indicates that more samples let the model learn more patterns, and these patterns help on "unseen" data.
Finally, after a reasonable number of training examples have been added, the training loss and the validation loss approach each other.
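The statement that the two losses approach each other can be checked numerically rather than by eye. The sketch below is an illustration under assumptions: it rebuilds a long-format table with the same column names as the one returned by 'learn_curve' (using scikit-learn's built-in Iris data and C=1), pivots it back to one row per training size, and adds a hypothetical "gap" column.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200, C=1))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)

sizes, train_scores, val_scores = learning_curve(
    model, X_iris, y_iris, cv=cv, scoring="accuracy",
    train_sizes=np.linspace(0.2, 1.0, 5), shuffle=True, random_state=11)

# Same long ("melted") layout as the learning curve table used for plotting.
long_df = pd.DataFrame({
    "Training_size": sizes,
    "Training_loss": 1 - train_scores.mean(axis=1),
    "Validation_loss": 1 - val_scores.mean(axis=1),
}).melt(id_vars="Training_size")

# Back to one row per training size, then the validation-minus-training gap.
wide = long_df.pivot(index="Training_size", columns="variable", values="value")
wide["gap"] = wide["Validation_loss"] - wide["Training_loss"]
print(wide)
```

For a well-fitted model the gap in the last row should be small compared with the gap in the first rows.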
2. Learning curve of an overfitted model
We will use the 'learn_curve' function with the inverse regularization parameter 'c' set to 10000 to get an overfitted model (high values of 'c' result in overfitting).
lc = learn_curve(X, y, 10000)
print(f'Cross Validation Accuracies:\n{"-"*25}\n{list(lc["cv_scores"])}\n\n'
      f'Mean Cross Validation Accuracy:\n{"-"*25}\n{np.mean(lc["cv_scores"])}\n\n'
      f'Standard Deviation of Cross Validation Accuracy:\n{"-"*25}\n{np.std(lc["cv_scores"])} (High Variance)\n\n'
      f'Training Accuracy:\n{"-"*15}\n{lc["train_score"]}\n\n')
sns.lineplot(data=lc["learning_curve"], x="Training_size", y="value", hue="variable")
plt.title("Learning Curve of an Overfit Model")
plt.ylabel("Misclassification Rate/Loss")
plt.show()
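Compared with the well-fitted model, the standard deviation of the cross-validation accuracy is higher.
The training loss of an overfitted model is very low at the beginning; it gradually increases as training examples are added, but does not flatten out. The validation loss is high at the beginning; it gradually decreases as training examples are added and also does not flatten out, which indicates that adding more training examples can improve the model's performance on unseen data. The training loss and the validation loss are also far apart, and they may move closer together as more training data is added.
3. Learning curve of an underfitted model
Set the inverse regularization parameter 'c' to 1/10000 to obtain an underfitted model (low values of 'c' result in underfitting).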
lc = learn_curve(X, y, 1/10000)
print(f'Cross Validation Accuracies:\n{"-"*25}\n{list(lc["cv_scores"])}\n\n'
      f'Mean Cross Validation Accuracy:\n{"-"*25}\n{np.mean(lc["cv_scores"])}\n\n'
      f'Standard Deviation of Cross Validation Accuracy:\n{"-"*25}\n{np.std(lc["cv_scores"])} (Low variance)\n\n'
      f'Training Accuracy:\n{"-"*15}\n{lc["train_score"]}\n\n')
sns.lineplot(data=lc["learning_curve"], x="Training_size", y="value", hue="variable")
plt.title("Learning Curve of an Underfit Model")
plt.ylabel("Misclassification Rate/Loss")
plt.show()
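Compared with the overfitted and well-fitted models, the standard deviation of the cross-validation accuracy is lower.
The training loss of an underfitted model is low at the beginning, gradually increases as training examples are added, and suddenly drops to an arbitrary minimum at the end (minimum does not mean zero loss). This sudden drop at the end does not always happen. The curve indicates that adding more training examples does not improve the model's performance on unseen data.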
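The three settings of 'c' can also be compared side by side without plotting. This is a minimal sketch, assuming scikit-learn's built-in Iris data and using cross_validate in place of the 'learn_curve' helper; the summary dictionary and its keys are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)

summary = {}
for c in (1e-4, 1, 1e4):  # strong, default, and very weak regularization
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000, C=c))
    res = cross_validate(model, X_iris, y_iris, cv=cv,
                         scoring="accuracy", return_train_score=True)
    summary[c] = {
        "train_mean": res["train_score"].mean(),  # mean accuracy on training folds
        "val_mean": res["test_score"].mean(),     # mean accuracy on held-out folds
        "val_std": res["test_score"].std(),       # spread across folds
    }

for c, row in summary.items():
    print(f"C={c:g}: train={row['train_mean']:.3f}, "
          f"validation={row['val_mean']:.3f} (std {row['val_std']:.3f})")
```

Reading the mean and the standard deviation together gives the same kind of comparison that the printed outputs in the three sections above provide one at a time.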
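Summary
In machine learning and statistical modeling, overfitting and underfitting are two common problems. They describe how the degree to which a model fits the training data affects its performance on new data.
When analyzing the resulting learning curves, pay attention to the following:
- Underfitting: if the learning curve shows that performance on both the training set and the validation set is low, or that both improve only slowly as the number of training samples grows, this usually indicates underfitting. In this case the model may be too simple to capture the basic patterns in the data.
- Overfitting: if training-set performance improves as the number of samples increases, while validation-set performance starts to decline or stagnate after a certain point, this usually indicates overfitting. In this case the model may be too complex and has adapted to the noise in the training data rather than to the underlying patterns.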
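The two rules of thumb above can be written down as a small helper. This is only a rough sketch: the function name diagnose, the thresholds high_loss and big_gap, and the example loss values are all hypothetical and chosen purely for illustration; sensible cut-offs depend on the problem.

```python
def diagnose(train_loss, val_loss, high_loss=0.15, big_gap=0.10):
    """Rough reading of a learning curve from its last point.

    train_loss, val_loss: losses ordered by increasing training size.
    high_loss, big_gap: hypothetical thresholds, not universal constants.
    """
    train_final = float(train_loss[-1])
    val_final = float(val_loss[-1])
    # Validation loss far above training loss: the model fits noise.
    if val_final - train_final > big_gap:
        return "overfitting"
    # Small gap, but the training loss itself is still high: too simple.
    if train_final > high_loss:
        return "underfitting"
    return "good fit"


# Made-up loss values, used only to exercise the three branches.
print(diagnose([0.00, 0.01, 0.02], [0.40, 0.30, 0.25]))
print(diagnose([0.10, 0.25, 0.30], [0.45, 0.35, 0.32]))
print(diagnose([0.00, 0.03, 0.04], [0.20, 0.08, 0.05]))
```

Looking only at the last point is a simplification; the shape of the whole curve, as discussed in the sections above, carries more information.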
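Based on the analysis of the learning curve, you can adjust the model with the following strategies:
- For underfitting:
  - Increase model complexity, for example by using more features, a deeper network, or more parameters.
  - Improve feature engineering by trying different feature combinations or transformations.
  - Increase the number of training iterations or adjust the learning rate.
- For overfitting:
  - Use regularization techniques (such as L1 or L2 regularization).
  - Reduce model complexity, for example by reducing the number of parameters, layers, or features.
  - Add more training data.
  - Apply data augmentation.
  - Use techniques such as early stopping to avoid overtraining.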
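For the regularization item in particular, one way to choose the strength is to sweep it and compare training and validation scores. The sketch below assumes scikit-learn's validation_curve, the built-in Iris data, and a logistic regression pipeline similar to the one used earlier; the grid c_values and the step name "classifier" are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
pipe = Pipeline([("scaler", StandardScaler()),
                 ("classifier", LogisticRegression(max_iter=1000))])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)

# Sweep the inverse regularization strength C over nine orders of magnitude.
c_values = np.logspace(-4, 4, 9)
train_scores, val_scores = validation_curve(
    pipe, X_iris, y_iris,
    param_name="classifier__C", param_range=c_values,
    cv=cv, scoring="accuracy")

# Pick the C whose mean validation accuracy is highest.
best_c = c_values[np.argmax(val_scores.mean(axis=1))]
print(f"C with the best mean validation accuracy: {best_c:g}")
```

Very small C values sit on the underfitting side of the sweep and very large ones on the overfitting side, so the validation score usually peaks somewhere in between.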
Through this kind of analysis and adjustment, learning curves can help you optimize your model more effectively and improve its ability to generalize to unseen data.