To eliminate outliers effectively from biological near-infrared spectra and to build high-accuracy predictive models, this paper proposes a novel outlier elimination method based on X-Y variance and leverage analysis. First, the characteristics of near-infrared spectra are summarized; then the residual sample X-variance, the leverage, and the residual sample Y-variance are concatenated as a divergence measurement. We further compare the proposed method with X-Y variance, Mahalanobis distance, and Hotelling T^{2} statistical analysis; the experimental results demonstrate that the proposed method achieves competitive outlier elimination and better performance in time complexity and accuracy. The proposed method can also be adopted for other outlier elimination tasks.
1. Introduction
In recent years, with the rapid growth of the national economy, the quality and safety of food and related supplies have become an increasingly serious concern in China [1–5]. The demand for rapid quality detection is rising in industrial manufacturing, agricultural production, and commerce. Near-infrared spectroscopy has attracted close attention as a new approach to rapid detection and has developed quickly in agriculture, food, medicine, the chemical industry, and other fields because of its high analysis speed, efficiency, freedom from pollution, and ease of online detection [6–15]. However, as an indirect analysis technology, near-infrared spectroscopy requires an analysis model that relates the spectral information to the properties of the samples; the resulting calibration model is then used to predict unknown samples and thereby achieve rapid detection [16–20]. The accuracy of the selected calibration data therefore determines whether the prediction results are satisfactory.
However, during spectral data collection, experimental errors or uneven sample collection and classification are likely to produce samples that are not representative of the model, that is, outlier samples [21–25]. Such stray samples affect, and may even change, the distribution trend of the overall data, and thus they reduce the accuracy of the calibration model. Removing outlier samples quickly and efficiently is therefore key to establishing the calibration model. At present, several commonly used methods rely on multivariate statistical analysis to judge whether a statistic exceeds a specific threshold [26–28]. These methods have a certain effect, but their results are not reliable after the material under test is replaced, and they sometimes fail to give satisfactory results.
In this paper, after analyzing several existing methods, we put forward a 3D outlier elimination method based on X-Y variance versus leverage, where X represents the spectral information and Y represents the chemical values of the samples. The residual sample X-variance, the leverage, and the residual sample Y-variance are used as the three axes of a 3D view, from which the distribution of outlier samples is judged as a whole for a soybean oil acid value model [29, 30] and a wheat straw biomass model [30, 31]. Compared with X-Y variance, Mahalanobis distance, and Hotelling T^{2} statistics, the traditional methods for removing outliers, the proposed method gives a clear improvement when the 3D view analysis is repeated several times. The R^{2} values of the final calibration models for soybean oil acid value and wheat straw biomass rose from 0.8653109 and 0.843431 to 0.966022 and 0.934227, respectively. The relative standard deviations (RSD) also decreased to 5% and 8%.
2. Materials and Methods
2.1. Samples Collection
Wheat straw and soybean oil were the objects of experimental verification; the quantities measured were the biomass of wheat straw (during fermentation) and the acid value of soybean oil, respectively. A total of 123 wheat bran samples were collected from different places in Heilongjiang Province, and soybean oil samples with various acid values were prepared by a manufacturing enterprise. The Kennard-Stone algorithm, which is based on the Euclidean distance among the absorbance spectra of the samples, was used to select the most representative samples as the calibration set; the sample distribution is shown in Table 1, and the sample classification is shown in Tables 2 and 3. The unit of soybean oil acid value is mg KOH/g, which represents the mass of potassium hydroxide required to neutralize the free fatty acids in 1 g of fat. The unit of wheat bran straw biomass (fermenting microorganisms) is mg/g, which represents the microbial cell density per unit volume.
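The Kennard-Stone selection mentioned above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the implementation used in the study; the function name `kennard_stone` and the parameter `n_select` are hypothetical, and the rows of `X` are assumed to be absorbance spectra.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone rule."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise Euclidean distances between spectra (rows of X).
    sq = (X ** 2).sum(axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    # Start from the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select:
        # Distance from each remaining sample to its nearest selected sample.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        # Add the sample that is farthest from everything selected so far.
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected
```

Each round adds the sample whose nearest already-selected neighbour is farthest away, so the calibration set spreads evenly over the spectral space instead of clustering around the densest region.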
In the experiment, a Thermo Antaris II near-infrared spectrometer was used to scan the soybean oil and wheat bran samples over the range from 830 nm to 2500 nm (12000 cm^{−1}–4000 cm^{−1}) at a resolution of 8 cm^{−1}. Soybean oil was scanned in liquid transmission mode and wheat bran with the integrating sphere; an empty transparent glass cell served as the reference for transmission, and air served as the reference for the sphere scan. Before the samples were scanned, the number of background scans and the number of sample scans were both set to 32. The scan results are shown in Figure 1, for soybean oil and wheat straw biomass in sequence.
Figure 1: Soybean oil and wheat straw biomass spectra. (a) Soybean oil spectrum. (b) Wheat straw spectrum.
2.3. Samples’ Chemical Calibration
(1) Oil samples are dissolved in a neutral mixed ether-ethanol solvent, and the free fatty acids are then titrated with a standard alkali solution. The acid value is calculated from the mass of oil and the amount of alkali consumed. The reagents, instruments and appliances, and operation are as follows.
① Reagents. The following are used:
0.1 mol/L KOH (or sodium hydroxide) standard solution;
neutral ether-ethanol (2 : 1) mixed solvent, titrated to neutral with 0.1 mol/L alkali before use;
1 g/100 mL phenolphthalein-ethanol indicator.
② Instruments and Appliances. The following are used:
a 25 or 50 mL burette;
a 250 mL Erlenmeyer flask;
a balance with a sensitivity of 0.001 g;
a volumetric flask, pipette, weighing bottle, reagent bottle, graduated flask, and beakers.
③ Operational Approach. The main steps are as follows. First, a uniform specimen (3–5 g) is weighed into the conical flask. Second, neutral mixed solvent (50 mL) is added and the flask is shaken until the specimen dissolves completely. Then 3 drops of phenolphthalein indicator are added and mixed. Last, the solution is titrated with 0.1 N potassium hydroxide solution until a reddish color appears and persists for 30 s, and the volume (mL) of potassium hydroxide solution consumed is recorded. Finally, the acid value of the oil is calculated as
$$\text{Acid value (mg KOH/g)} = \frac{V \times N \times 56.1}{W}, \qquad (1)$$
where V = volume of potassium hydroxide solution consumed in the titration (mL), N = concentration of the KOH solution (mol/L), 56.1 = molar mass of KOH (g/mol), and W = sample mass (g).
The results of repeated determinations are allowed to differ by no more than 0.2 mg KOH/g, and their average is taken as the determination result. The measured acid values of the 52 oil samples range from 0.473 to 3.102 mg KOH/g.
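As a worked illustration of Eq. (1), the short Python sketch below computes the acid value from a titration and averages repeated determinations. The function names are hypothetical, the numbers in the usage note are made up for illustration, and treating the 0.2 mg KOH/g tolerance as the largest allowed spread between repeats is an assumption.

```python
KOH_MOLAR_MASS = 56.1  # g/mol, the constant in Eq. (1)

def acid_value(volume_ml, koh_conc_mol_l, sample_mass_g):
    """Acid value in mg KOH/g: V * N * 56.1 / W, as in Eq. (1)."""
    if sample_mass_g <= 0:
        raise ValueError("sample mass must be positive")
    return volume_ml * koh_conc_mol_l * KOH_MOLAR_MASS / sample_mass_g

def mean_acid_value(values, max_diff=0.2):
    """Average repeated determinations.

    Assumption: max_diff is read as the largest allowed spread
    (mg KOH/g) between the repeated determinations.
    """
    values = list(values)
    if max(values) - min(values) > max_diff:
        raise ValueError("repeated determinations differ by more than max_diff")
    return sum(values) / len(values)
```

For example, a hypothetical titration that consumes 2.0 mL of 0.1 mol/L KOH for a 4.0 g specimen gives 2.0 × 0.1 × 56.1 / 4.0 = 2.805 mg KOH/g, which lies inside the measured range of the oil samples.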
(2) The biomass of the fermentation microorganisms is chosen as the research object in the experiment. Monitoring the biomass of the bacterial colony, which is measured by the glucosamine method, is vital. The biomass stands for the microbial density per unit volume. The preparation and determination process is as follows.
① Reagents. Concentrated sulfuric acid, acetyl acetone, sodium hydroxide, peptone, sodium nitrate, magnesium sulfate, potassium dihydrogen phosphate, and dibenzaldehyde are used.
② Instruments and Appliances. The following are used: UV3000 UV-visible spectrophotometer, FA1004 electronic balance, and SPX-250B-Z-type incubator.
③ Operation and Calculation. First, dry cells (0.1 g) and solid-state fermentation substrate (0.5 g) are weighed precisely and soaked in sulfuric acid (2 mL, 60%) for 24 hours; the acid is then diluted to 1 mol/L, and the mixture is placed in a flask (250 mL) and heated for one hour at a pressure of 9.8 × 10^{4} Pa. After cooling, the diluted solution is neutralized with sodium hydroxide (1 mol/L) to pH 7.0 and made up to a volume of 100 mL. Second, following the Elson-Morgan method, the absorbance is measured at 530 nm on five parallel samples per specimen, and the average is taken as the absorbance of the sample. Distilled water (2 mL) is used as the blank when measuring the glucosamine content of the solid-state fermentation matrix.
Biomass is estimated as follows:
$$\text{Biomass (mg/mg dry medium)} = \frac{C\ (\text{mg/mg dry medium})}{B\ (\text{mg/mg dry medium})} \times 1000, \qquad (2)$$
where C = glucosamine content per unit mass of culture (%) and B = glucosamine content per unit mass of bacteria (%).
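Eq. (2) is a simple ratio, and a small Python sketch makes the arithmetic explicit. The function and argument names are hypothetical, and both glucosamine contents are assumed to be expressed in the same unit so that the ratio is dimensionless before scaling.

```python
def biomass(c_culture, b_bacteria):
    """Biomass estimate from Eq. (2): C / B * 1000.

    c_culture:  glucosamine content per unit mass of culture
    b_bacteria: glucosamine content per unit mass of bacteria
    Both are assumed to share the same unit.
    """
    if b_bacteria <= 0:
        raise ValueError("glucosamine content of the bacteria must be positive")
    return c_culture / b_bacteria * 1000.0
```

With made-up values C = 0.002 and B = 0.05, the estimate is 0.002 / 0.05 × 1000 = 40.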
3. Results and Discussion
For the soybean oil acid value model and the wheat straw biomass model, we compare the 3D view of X-Y variance and leverage with the X-Y variance, Mahalanobis distance, and Hotelling T^{2} methods. The band 4450 cm^{−1}–5000 cm^{−1} is selected as the characteristic band of the soybean acid value model, and the band 9000 cm^{−1}–10000 cm^{−1} as the characteristic band of the wheat straw biomass model. The analysis uses Unscrambler 10.2 software together with our own Matlab programs. The waveband selection is shown in Tables 4 and 5.
First, the calibration samples were analyzed by X-Y variance, where X (residual sample X-variance) represents the sample spectrum and Y (residual sample Y-variance) represents the chemical values of the samples. Normally the residuals are calculated for Y: the greater the residual Y-variance of a sample, the more weakly the calibration model fits it and the lower the explanatory power of the model for that sample. Figure 2 shows the results (Figure 1(a) shows the spectrum for the soybean acid value and Figure 1(b) the spectrum of wheat straw). In the soybean acid value model, two samples, numbers 44 and 46, have significantly higher variance; in the wheat straw bran biomass model, four samples, numbers 15, 18, 43, and 99, have relatively high variance. These samples can be regarded as outliers (abnormal samples) to be removed.
Figure 2: Y-variance distribution of the model. (a) Y-variance distribution of the soybean oil model. (b) Y-variance distribution of the wheat straw biomass model.
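To make the residual variances concrete, the sketch below fits a single-response PLS model with the NIPALS algorithm in NumPy and returns the per-sample residuals. It is an illustration only: the study used Unscrambler and Matlab, and the definitions used here (mean squared X residual per sample, squared Y residual per sample) are assumptions about how those quantities are computed. The function name `pls1_residuals` is hypothetical.

```python
import numpy as np

def pls1_residuals(X, y, n_components):
    """Fit PLS1 by NIPALS; return scores and per-sample residual variances."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    E = X - X.mean(axis=0)   # X residual matrix, starts as centered spectra
    f = y - y.mean()         # y residual vector, starts as centered values
    T = np.zeros((X.shape[0], n_components))
    for a in range(n_components):
        w = E.T @ f
        w /= np.linalg.norm(w)      # weight vector of unit length
        t = E @ w                   # scores for this factor
        tt = t @ t
        p = E.T @ t / tt            # X loading
        q = (f @ t) / tt            # y loading
        E = E - np.outer(t, p)      # deflate X
        f = f - q * t               # deflate y
        T[:, a] = t
    x_res_var = (E ** 2).mean(axis=1)   # assumed residual sample X-variance
    y_res_var = f ** 2                  # assumed residual sample Y-variance
    return T, x_res_var, y_res_var
```

A sample whose chemical value disagrees with its spectrum keeps a large entry in `y_res_var` after fitting, which is the behaviour the Y-variance plot relies on.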
3.2. Mahalanobis Distance Analysis
In near-infrared spectroscopy, the Euclidean distance and the Mahalanobis distance are important measures for identifying abnormal samples. Compared with the Euclidean distance, the Mahalanobis distance takes the correlations among the various features into account and is therefore widely used. In this experiment, the Mahalanobis distance was calculated for both calibration models; the results are shown in Figure 3. In the soybean acid value model, the 52 samples are strongly representative and the Mahalanobis distances differ little overall; the two samples with the largest distances, numbers 40 and 49, are removed. In the wheat bran straw biomass model, the Mahalanobis distances fluctuate much more because the sample distribution is scattered (minimum 2.9, maximum 99.2); samples 47, 52, and 90 differ significantly from the rest and should be rejected.
Figure 3: Mahalanobis distance calculations. (a) Mahalanobis distance of the soybean oil model. (b) Mahalanobis distance of the wheat straw biomass model.
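A minimal NumPy sketch of the Mahalanobis distance of each sample from the sample mean is given below. The text does not state which variables the distance was computed on; because the covariance of raw spectra is singular when there are more wavelengths than samples, the sketch assumes a low-dimensional input such as a few PCA or PLS scores and uses a pseudo-inverse as a safeguard. The function name is hypothetical.

```python
import numpy as np

def mahalanobis_distances(scores):
    """Mahalanobis distance of every row from the mean of all rows."""
    Z = np.asarray(scores, dtype=float)
    centered = Z - Z.mean(axis=0)
    # Sample covariance of the columns (features).
    cov = np.atleast_2d(np.cov(centered, rowvar=False))
    # Pseudo-inverse keeps the computation defined if cov is singular.
    inv_cov = np.linalg.pinv(cov)
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    return np.sqrt(np.maximum(d2, 0.0))
```

Unlike the Euclidean distance, the result is unchanged when a feature is rescaled, because the inverse covariance normalizes each direction by its spread.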
The Hotelling T^{2} statistic is an important statistic in multivariate statistical analysis. It defines a two-dimensional elliptical model based on Principal Component Analysis (PCA) and is mainly used to test the stability of multivariate data. If the principal components of all observations are stable, the T^{2} statistic stays at a stable level, so abnormal samples can be detected by comparing it with the associated critical limit. As Figure 4 shows, at a limit of 5%, samples 48 and 49 in the soybean oil acid value model deviate markedly from the center and lie far beyond the limit; in the wheat straw bran biomass model, samples 13, 14, and 17 are also beyond the limit. These samples can be considered for exclusion.
Figure 4: Hotelling T^{2} statistic. (a) Hotelling T^{2} statistic of the soybean oil model. (b) Hotelling T^{2} statistic of the wheat straw biomass model.
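The PCA-based T^{2} statistic can be sketched in NumPy as below. This is an illustration under assumptions, not the Unscrambler computation: PCA is done by singular value decomposition of the centered data, two components are used to match the two-dimensional ellipse described above, and the critical limit (typically derived from an F distribution at the chosen significance level) is passed in as a number rather than computed. Function names are hypothetical.

```python
import numpy as np

def hotelling_t2(X, n_components=2):
    """T^2 of each sample: squared PCA scores scaled by the score variances."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    # Variance of each score column over the samples.
    score_var = s[:n_components] ** 2 / (X.shape[0] - 1)
    return (scores ** 2 / score_var).sum(axis=1)

def flag_beyond_limit(t2, limit):
    """Indices of samples whose T^2 exceeds a user-supplied critical limit."""
    return np.flatnonzero(np.asarray(t2) > limit)
```

By construction the T^{2} values of n calibration samples with A components sum to (n − 1) × A, which is a convenient sanity check on any implementation.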
3.4. 3D View Analysis of X-Y Variance and Leverage Value
The leverage [32–34] and the X-Y variance [35] are calculated for the soybean acid value model and the wheat straw biomass model. The leverage is very useful for detecting whether a sample lies far from the center of the model space. Samples with high leverage are likely to be outliers, and they also have a great influence on the accuracy of the model. To identify the outlier (abnormal) samples, a 3D view of the leverage and the X-Y variance is established, with the residual sample X-variance as the x-axis, the leverage as the y-axis, and the residual sample Y-variance as the z-axis, and the three are judged together, as shown in Figure 5. Over the entire model fitting, most of the samples are distributed uniformly around the center of the 3D view, but a small portion of the samples lies far from the center, with very large X-Y variance and leverage. In the soybean oil acid value model, samples 44 and 46 deviate significantly from the center, which is consistent with the X-Y variance analysis; in the wheat straw bran biomass model, samples 15, 23, 70, and 83 show obvious abnormalities. Judging from the combined values in the three directions, these points can be excluded.
Figure 5: 3D view of X-Y variance and leverage. (a) 3D view of X-Y variance and leverage of soybean oil model. (b) 3D view of X-Y variance and leverage of wheat straw biomass model.
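In the paper the three axes are judged visually in the 3D view. The Python sketch below shows one possible numerical counterpart, offered as an assumption rather than the authors' procedure: the leverage is computed from a score matrix `T` with orthogonal columns (as produced by PCA or NIPALS PLS), and the three quantities are scaled robustly and combined into a single distance from the center of the cloud. All function names are hypothetical.

```python
import numpy as np

def leverage(T):
    """Leverage of each sample from a score matrix with orthogonal columns."""
    T = np.asarray(T, dtype=float)
    n = T.shape[0]
    # 1/n accounts for the mean; each factor adds t_ia^2 / sum_i(t_ia^2).
    return 1.0 / n + (T ** 2 / (T ** 2).sum(axis=0)).sum(axis=1)

def divergence(x_res_var, lev, y_res_var):
    """Distance of each sample from the center of the 3D view.

    Assumption: each axis is centered on its median and scaled by its
    median absolute deviation so that no single axis dominates.
    """
    axes = np.column_stack([x_res_var, lev, y_res_var]).astype(float)
    center = np.median(axes, axis=0)
    spread = np.median(np.abs(axes - center), axis=0)
    spread[spread == 0] = 1.0   # avoid division by zero on a constant axis
    z = (axes - center) / spread
    return np.sqrt((z ** 2).sum(axis=1))

def top_candidates(score, k):
    """Indices of the k samples with the largest divergence."""
    return np.argsort(score)[::-1][:k]
```

A ranking like this only proposes candidates; the final decision to exclude a sample still rests on inspecting where it lies along all three axes.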
3.5. Model Demonstration
The outlier exclusion methods above are verified by cross validation. Figure 6 shows the PLSR (Partial Least Squares Regression) models established with all 52 soybean oil samples and all 123 wheat straw bran samples, without rejecting any outliers. The coefficient of determination R^{2} is 0.8653109 for the soybean oil acid value model and 0.870438 for the wheat straw bran biomass model; neither value is high. From the marked area in Figure 7 we can see that samples 44 and 46 in the soybean oil acid value model are separated from the rest of the calibration curve, which is consistent with our previous conclusions. In the wheat straw bran biomass model, because of some stray samples, the fitted value of sample 1 is 0, which differs greatly from its original chemical value.
Figure 6: Original spectral modeling. (a) Original spectral modeling of the soybean oil model. (b) Original spectral modeling of the wheat straw biomass model.
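The two figures of merit used to compare the models, R^{2} and RMSE, together with a cross-validation loop, can be sketched as follows. The text does not say which cross-validation scheme was used, so leave-one-out is an assumption here, and an ordinary least-squares model stands in for PLSR only to keep the example self-contained. All names are hypothetical.

```python
import numpy as np

def r2_rmse(y_true, y_pred):
    """Coefficient of determination and root mean square error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot, np.sqrt(ss_res / y_true.size)

def loo_predictions(X, y, fit_predict):
    """Leave-one-out predictions: each sample is predicted by a model
    trained on all the other samples."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = y.size
    y_cv = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        y_cv[i] = fit_predict(X[keep], y[keep], X[i:i + 1])[0]
    return y_cv

def ols_fit_predict(X_train, y_train, X_test):
    """Stand-in regression model (least squares with intercept), not PLSR."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef
```

Any model with the same `fit_predict(X_train, y_train, X_test)` shape, including a PLSR implementation, can be passed to `loo_predictions` in place of the stand-in.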
Figure 7: Modeling after excluding the outlier sample. (a) 3D view excluding samples 18 and 21. (b) 3D view excluding sample 27. (c) 3D view excluding samples 9 and 7. (d) 3D view excluding sample 91.
The accuracy of the models after outliers are excluded by the various methods is shown in Table 6. In the soybean oil acid value model, the outliers selected by the X-Y variance method are the same as those selected by the 3D view analysis, and the optimal number of main factors is 5 for both. The accuracy of the calibration model after Mahalanobis distance analysis or Hotelling T^{2} statistics, however, is even slightly lower than that of the original model. In the wheat straw bran biomass model, the best method is the 3D view analysis, with an optimal number of main factors of 27; the X-Y variance, Hotelling T^{2}, and Mahalanobis distance methods also improve on the original spectral model.
Table 6: Soybean oil acid value model after excluding outliers.
| Method | Excluded outliers | Factors | R^{2} | RMSE |
| --- | --- | --- | --- | --- |
| X-Y variance | 44, 46 | 5 | 0.966022 | 0.136879 |
| Mahalanobis distance | 48, 49 | 4 | 0.846159 | 0.299031 |
| Hotelling T^{2} | 48, 49 | 4 | 0.847478 | 0.292275 |
| 3D view analysis | 44, 46 | 5 | 0.966022 | 0.136879 |
The analysis of Tables 6 and 7 shows that, for both models, the best result after excluding outliers is obtained with the 3D view analysis of X-Y variance and leverage. In the wheat straw biomass model, the calibration coefficient of determination R^{2} rises to 0.911626 and the root mean square error (RMSE) is 9.060135. In practice, we found that the accuracy of the model can be improved further by repeating the 3D view analysis. Following this idea, we continued the analysis in the same way: Figure 7(a) shows the wheat straw bran biomass model with samples 18 and 21 excluded, and in Figures 7(b)–7(d) sample 27, samples 9 and 7, and sample 91 are excluded in turn. The calibration coefficient of determination R^{2} rises to 0.934227 and the RMSE falls to 8.4943037. The accuracy of the calibration model thus improves significantly after several rounds of outlier exclusion.
Table 7: Wheat straw biomass model after excluding outliers.
| Method | Excluded outliers | Factors | R^{2} | RMSE |
| --- | --- | --- | --- | --- |
| X-Y variance | 15, 18, 43, 99 | 27 | 0.905218 | 10.11064 |
| Mahalanobis distance | 47, 52, 90 | 27 | 0.872461 | 11.78504 |
| Hotelling T^{2} | 13, 14, 17 | 27 | 0.881279 | 11.43546 |
| 3D view analysis | 18, 21 | 27 | 0.911626 | 9.60135 |
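The repeated exclusion described for Figure 7 can be written as a simple loop. The sketch below is a hypothetical outline, not the authors' procedure: `score_fn` is a placeholder for any function that refits the model on the remaining samples and returns one divergence value per sample, and the number of rounds and of samples removed per round are left to the caller, since in the paper those choices were made by inspecting the 3D view.

```python
import numpy as np

def iterative_exclusion(X, y, score_fn, n_rounds, per_round=1):
    """Repeatedly rescore the remaining samples and drop the worst ones.

    Returns the indices that are kept and the indices removed, both
    expressed in the numbering of the original data set.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.arange(y.size)
    removed = []
    for _ in range(n_rounds):
        # Scores are recomputed on the reduced data set in every round.
        score = np.asarray(score_fn(X[keep], y[keep]))
        worst = np.argsort(score)[::-1][:per_round]
        removed.extend(int(i) for i in keep[worst])
        keep = np.delete(keep, worst)
    return keep, removed
```

Rescoring after every removal matters because a strong outlier can mask a weaker one: once the first is gone, the model center shifts and the second becomes visible, which matches the observation that several rounds improve the model further.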
4. Conclusions
The experiments and comparative analysis on the soybean oil acid value model and the wheat straw bran biomass model show that, in near-infrared spectroscopy analysis, excluding outliers by 3D view analysis of X-Y variance versus leverage is effective. Judging the residual sample X-variance, the leverage, and the residual sample Y-variance together gives a more comprehensive and accurate identification of abnormal samples, and the accuracy of the model improves significantly after several rounds of outlier exclusion. The method also offers a new direction for excluding outlier samples in near-infrared spectroscopy.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The authors would like to acknowledge the financial support from the National High-tech R&D Program of China (863 Program) (2013AA102303); Natural Science Foundation of Heilongjiang Province of China (F201402); and Key Technologies R&D Program of Harbin (2013AA6BN010).
References
[1] Vaknin Y., Ghanim M., Samra S., Dvash L., Hendelsman E., Eisikowitch D., Samocha Y. Predicting Jatropha curcas seed-oil content, oil composition and protein content using near-infrared spectroscopy - a quick and non-destructive method.
[2] Agelet L. E., Armstrong P. R., Clariana I. R., Hurburgh C. R. Measurement of single soybean seed attributes by near-infrared technologies. A comparative study.
[3] Hou S., Li L. Rapid characterization of woody biomass digestibility and chemical composition using near-infrared spectroscopy.
[4] Vogel K. P., Dien B. S., Jung H. G., Casler M. D., Masterson S. D., Mitchell R. B. Quantifying actual and theoretical ethanol yields for switchgrass strains using NIRS analyses.
[5] Hacisalihoglu G., Larbi B., Mark Settles A. Near-infrared reflectance spectroscopy predicts protein, starch, and seed weight in intact seeds of common bean (Phaseolus vulgaris L.).
[6] Blanke M. M. Non-invasive assessment of firmness and NIR sugar (TSS) measurement in apple, pear and kiwi fruit.
[7] Liu Y., Sun X., Zhang H., Aiguo O. Nondestructive measurement of internal quality of Nanfeng mandarin fruit by charge coupled device near infrared spectroscopy.
[8] Yang H. Remote sensing technique for predicting harvest time of tomatoes.
[9] Liu L., Ye X. P., Womac A. R., Sokhansanj S. Variability of biomass chemical composition and rapid analysis using FT-NIR techniques.
[10] Esteve Agelet L., Ellis D. D., Duvick S., Goggi A. S., Hurburgh C. R., Gardner C. A. Feasibility of near infrared spectroscopy for analyzing corn kernel damage and viability of soybean and corn kernels.
[11] Anderson P. V., Kerr B. J., Weber T. E., Ziemer C. J., Shurson G. C. Determination and prediction of digestible and metabolizable energy from chemical analysis of corn coproducts fed to finishing pigs.
[12] Armstrong P. R., Tallada J. G., Hurburgh C., Hildebrand D. F., Specht J. E. Development of single-seed near-infrared spectroscopic predictions of corn and soybean constituents using bulk reference values and mean spectra.
[13] Chen G. L., Zhang B., Wu J. G., Shi C. H. Nondestructive assessment of amino acid composition in rapeseed meal based on intact seeds by near-infrared reflectance spectroscopy.
[14] Chen L., Yang Z., Han L. A review on the use of near-infrared spectroscopy for analyzing feed protein materials.
[15] Cozannet P., Primot Y., Gady C., Métayer J. P., Lessire M., Skiba F., Noblet J. Energy value of wheat distillers grains with solubles for growing pigs and adult sows.
[16] Zhang P. F., Zhang Q., Deines T. W., Pei Z. J., Wang D. H. Ultrasonic vibration-assisted pelleting of wheat straw: a designed experimental investigation on pellet quality and sugar yield.
[17] Wang H., Wu X., Wang D. Acid-functionalized magnetic nanoparticle as catalyst for biodiesel synthesis.
[18] Jin S., Chen H. Near-infrared analysis of the chemical composition of rice straw.
[19] Qi G., Li N., Wang D., Sun X. S. Physicochemical properties of soy protein adhesives obtained by in situ sodium bisulfite modification during acid precipitation.
[20] Kerr B. J., Dozier W. A. III, Shurson G. C. Effects of reduced-oil corn distillers dried grains with solubles composition on digestible and metabolizable energy value and prediction in growing pigs.
[21] Yan S., Wu X., Faubion J., Bean S. R., Cai L., Shi Y.-C., Sun X. S., Wang D. Ethanol-production performance of ozone-treated tannin grain sorghum flour.
[22] Al-Alawi A., Van De Voort F. R., Sedman J. A new Fourier transform infrared method for the determination of moisture in edible oils.
[23] van de Voort F. R., Sedman J., Sherazi S. T. Improved FTIR trans analysis in edible oils using spectral reconstitution.
[24] Lomborg C. J., Thomsen M. H., Jensen E. S., Esbensen K. H. Power plant intake quantification of wheat straw composition for 2nd generation bioethanol optimization - a Near Infrared Spectroscopy (NIRS) feasibility study.
[25] Jiang H., Liu G., Xiao X., Mei C., Ding Y., Yu S. Monitoring of solid-state fermentation of wheat straw in a pilot scale using FT-NIR spectroscopy and support vector data description.
[26] Kaparaju P., Felby C. Characterization of lignin during oxidative and hydrothermal pre-treatment processes of wheat straw and corn stover.
[27] Xu F., Yu J., Tesso T., Dowell F., Wang D. Qualitative and quantitative analysis of lignocellulosic biomass using infrared techniques: a mini-review.
[28] Bruun S., Jensen J. W., Magid J., Lindedam J., Engelsen S. B. Prediction of the degradability and ash content of wheat straw from different cultivars using near infrared spectroscopy.
[29] Luo H., Luo Q., Ding H. Southern jujube quality near-infrared spectroscopy online model parameters.
[30] Xu F., Wang D. Rapid determination of sugar content in corn stover hydrolysates using near infrared spectroscopy.
[31] Dowell F., Wang D., Wu X., Dowell K. Detecting the antimalarial artemisinin in plant extracts using near-infrared spectroscopy.
[32] Belanche A., Weisbjerg M. R., Allison G. G., Newbold C. J., Moorby J. M. Estimation of feed crude protein concentration and rumen degradability by Fourier-transform infrared spectroscopy.
[33] Yan S., Wu X., Bean S. R., Pedersen J. F., Tesso T., Chen Y. R., Wang D. Evaluation of waxy grain sorghum for ethanol production.
[34] Ai Y., Medic J., Jiang H., Wang D., Jane J.-L. Starch characterization and ethanol production of sorghum.
[35] Li N., Wang Y., Tilley M., Bean S. R., Wu X., Sun X. S., Wang D. Adhesive performance of sorghum protein extracted from sorghum DDGS and flour.