**What happens if you get bit by a killer hornet?**

#### Science

Killer hornets, also known as Asian giant hornets, have a venomous sting that can cause severe pain, swelling, and redness at the site of the sting. If you are stung by a killer hornet, you may also experience other symptoms, such as fever, dizziness, and difficulty breathing. In some cases, killer hornet stings can be life-threatening, particularly if you have an allergic reaction to the venom.

If stung by a killer hornet, it is important to seek medical attention immediately, especially if you experience any severe symptoms or have a history of severe allergic reactions. Treatment may include medications to reduce pain and swelling, as well as antihistamines or epinephrine to treat an allergic reaction. In rare cases, surgery may be necessary to remove damaged tissue or relieve pressure caused by swelling.


Are “Murder Hornets” Coming? An Automatic Identification System Based on Online Data Mining

**Summary **

In 2019, Asian giant hornets, also known as “murder hornets” and considered one of the most dangerous insects, were first observed in Washington state. To ease public anxiety and ensure public safety, the state government in Washington must take action to determine the distribution of Asian giant hornets and prevent their continued spread. We carried out the following analysis based on the data set provided by the MCM competition.

First, we process the data using **image recognition and text mining**. For image recognition, we conduct similarity analysis based on **the cosine similarity algorithm**. For text mining, we conduct **sentiment analysis** to collect polarity and sensitivity.

Second, we build **the diffusion prediction model** to predict the spread of Asian giant hornets. Our model shows that: (a) the **latitude distribution** of Asian giant hornets **is consistent** before and after a certain period, but the **longitude distribution is inconsistent**. In other words, Asian giant hornets tend to spread along longitude as time passes. (b) The annual spread of Asian giant hornets is **884 square kilometers**, with a **95% confidence interval of [768, 975]**.

Third, we set up **a discriminant model** based on **random forest**. To deal with imbalanced data, we optimized our model by applying **the cost-sensitive method**. Our optimized model shows a high level of accuracy. With our model, the government can distinguish potentially positive reports from negative ones. In this way, the government can **prioritize its limited resources**. In addition, we analyze the importance of each influencing factor. The result shows that **Latitude** has **the greatest impact** on the proliferation of Asian giant hornets.

After training and testing our model, we use **incremental learning** to update it as additional new reports arrive over time. Considering the **biological characteristics** of Asian giant hornets and the completeness of collected data, we choose to update the model **in January of each year**.

Additionally, we implement **the survival estimate model** to study whether the Asian giant hornets have been eradicated. We first use **the Kaplan-Meier survival estimate model** and then **the Cox proportional hazards model** for analysis. After building the model, we test the research value of the model. Lastly, we study the factors that affect the survival of Asian giant hornets and take these as indicators for the eradication problem. For example, given a large number of **high-longitude reports**, we can **make a judgment** that Asian giant hornets have been eliminated.

**Keywords**: Asian giant hornets; Image recognition; Random forest model; Incremental learning; Survival analysis

**1.1 Background **

In today’s world, biological invasion has become a major concern because of its impact on agriculture, commerce and many other areas. One important consequence of biological invasions is their enormous impact on local biodiversity, which is often impossible to estimate. Recently, an invasion of Vespa mandarinia in Washington State has attracted public attention. Vespa mandarinia, also known as the Asian giant hornet, is, as its name suggests, an insect found in the eastern and southeastern parts of Asia, especially in Japan.

Team # 2109118 Page 3 of 24

A Vespa mandarinia nest was discovered on Vancouver Island, British Columbia, in 2019 and was destroyed as soon as it was found. A short time later, the species was found in Washington state, the first time it had been found in this area. Vespa mandarinia is famous for being the world’s largest wasp species. The hornets prefer to build their nests underground, where they can be hard to find.

Asian giant hornets usually feed on other insects, such as honey bees. Because of their size advantage, most types of honey bees are no match for them. The Asian giant hornet is also a threat to humans. In Japan, Asian giant hornets are responsible for the deaths of nearly 40 people every year. Because of their similar appearance, size and color, people tend to mistake other kinds of wasps, such as European hornets and Eastern cicada killers, for Asian giant hornets. Sending experts to check whether every sighting reported by the public is correct would consume a lot of resources. Therefore, we build new models to interpret the relevant information and allocate resources more efficiently.

**1.2 Clarification and problem restatement **

We have obtained a document containing sighting reports, a data set containing photos submitted with sighting reports, and a background knowledge document about Vespa mandarinia, which will help us determine the insect category. We are supposed to only use the given data to address the following problems:

• Process and analyze the data. We need to extract valid data and convert all kinds of data into appropriate forms, which will help us describe these data qualitatively or quantitatively.

• Conduct a test to determine whether the spread of Asian giant hornets is related to time. If so, find an appropriate way to predict the spread of Asian giant hornets.

• Establish a new model to estimate the probability that a public sighting report is misclassified.

Figure 1: Workflow


• Discuss how to propose the optimal solution to the relevant departments to maximize the use of resources based on the model.

• Find ways to update the model in the case of changes in time and other factors.

• Give evidence of the elimination of the pest in Washington State, based on the models or conclusions established above.

Our modeling framework is illustrated in Figure 1.

**2 Assumptions and nomenclature **

**2.1 Assumptions **

When using models to solve problems, it is necessary for us to put forward some reasonable assumptions about the current situation to ensure the rigor of logic. In order to solve the problems raised in this article, we propose the following hypotheses.

• There is no abnormal climate change during the prediction period.

• The notes depict reporters’ real experience and feelings towards the insect they have witnessed.

• Apart from some abnormal detection data that we noticed and left out, all reporters can remember and report the detection data correctly.

• The cost of modifying a trained model is lower than the cost of retraining a model.

• An insect that has not been eradicated will be witnessed and reported within finite time.

**2.2 Nomenclature **

Table 1: Nomenclature

| Symbol | Definition |
|---|---|
| $v_i$ | the plane vector corresponding to the images |
| $R$ | the rank of each data point |
| $N$ | nest density, which depends on time and spatial location |
| $D$ | the diffusion coefficient |
| $r$ | growth rate |
| $K$ | carrying capacity |
| $Q_{xy}$ | the proportion of $x$ in the voting category of $y$ |
| $F_k$ | group $k$ |
| $f_{ki}$ | single family decision tree model |


**3 Data processing and analysis **

**3.1 Data cleaning **

After careful inspection, we deleted the erroneous records whose detection date is earlier than September 2019, because Asian giant hornets were first found in this area in September 2019. These erroneous records account for about 3.5% of the entire data set. Then, we deleted the files that were corrupted and could not be opened, such as the record whose global ID is A8F5AA22-3F29-4533-A0DD-21204DA91E70. We also deleted files whose type does not correspond to their content. For example, the record whose global ID is 787C861E-E4B1-4359-A46C-F812846F09DE is supposed to contain an insect picture but instead contains a picture of an email. Finally, we deleted the records whose submission date is earlier than the detection date. Such records are not plausible and account for about 0.4% of all the data.
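The two date rules above can be sketched as a simple filter. This is only an illustration under stated assumptions: the field names `detection_date` and `submission_date`, the exact cutoff of 1 September 2019, and the sample records are all hypothetical and are not taken from the competition data.

```python
from datetime import date

# Assumption: "September 2019" is taken to mean 1 September 2019.
FIRST_SIGHTING = date(2019, 9, 1)

def is_valid(record):
    """Return True if a report passes both date rules described above."""
    detected = record["detection_date"]
    submitted = record["submission_date"]
    # Rule 1: no detection before the first known sighting.
    # Rule 2: a report cannot be submitted before the insect was detected.
    return detected >= FIRST_SIGHTING and submitted >= detected

def clean(records):
    """Keep only the reports that pass both rules."""
    return [r for r in records if is_valid(r)]

# Made-up example records (hypothetical field names and values).
reports = [
    {"id": "a", "detection_date": date(2019, 10, 3), "submission_date": date(2019, 10, 5)},
    {"id": "b", "detection_date": date(2018, 7, 1), "submission_date": date(2019, 10, 5)},
    {"id": "c", "detection_date": date(2020, 5, 9), "submission_date": date(2020, 5, 1)},
]
kept = clean(reports)
```

Here record `b` fails the first rule and record `c` fails the second, so only record `a` would be kept.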

**3.2 Data preprocessing **

Our data preprocessing is presented in Figure 2.

**3.2.1 Image data preprocessing **

The image file we got contains a total of nine formats of files. Our image preprocessing is divided into the following steps:

First, we converted files in all formats into images. We divided the nine formats into four categories. The first category contains four formats: “Jfif”, “jpg”, “png” and “octet-stream”. These files can be used directly. The second category contains “Vnd” and “pdf” files. After opening these files, we obtained the pictures by saving them as images. The third category is compressed files containing multiple images, from which we obtained the pictures after decompression. The last category is video files in two formats: “mp4” and “quicktime”.

We converted the videos into images with the help of the opencv-python package in Python. The specific process includes: loading the video, obtaining the total number of frames, obtaining the height and width of the video frames, obtaining the frame rate, and saving the frames as pictures.

Figure 2: Data processing flow chart


Then, we extracted the information that had been confirmed as positive, and there were 14 pieces in total. A total of 11 of these messages were accompanied by pictures. We used these images as specimens for subsequent analysis.

Finally, we computed the similarity of all the pictures in the file and obtained the corresponding similarity index. Specifically, we compared each picture with the 11 specimen images above and calculated the 11 similarity values. We took the average of these similarities as the similarity index. The similarity computation is based on the cosine similarity algorithm. First, we converted the image and the specimen into two plane vectors. Next, we expressed the angle between the two vectors by calculating its cosine. The greater the angle is (that is, the smaller the cosine), the greater the difference between the two vectors (pictures). The cosine is calculated by the following formula:

$$\cos(v_i, v_j) = \frac{v_i \cdot v_j}{|v_i|\,|v_j|}, \tag{1}$$

where $v_i$ and $v_j$ are the plane vectors corresponding to the two images.
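As a minimal sketch of Equation (1) and of the averaging step, the functions below compute the cosine similarity of two vectors and the mean similarity against a list of specimen vectors. How an image is turned into a vector is not specified here, so the inputs are assumed to be plain lists of numbers of equal length; the function names are illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors, as in Equation (1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: similarity with an all-zero vector is defined as 0.
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_index(image_vec, specimen_vecs):
    """Average cosine similarity between one image and all specimen images."""
    sims = [cosine_similarity(image_vec, s) for s in specimen_vecs]
    return sum(sims) / len(sims)
```

With the 11 confirmed specimen images, `specimen_vecs` would hold 11 vectors and `similarity_index` would return the similarity index of one submitted picture.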

**3.2.2 Text data preprocessing **

We used September 1, 2019 as the base date and represented Detection Date and Submission Date as the difference between the corresponding date and the base date. For example, October 1, 2019 was represented by 30.
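The date encoding can be sketched in a few lines. The function name `days_since_base` is hypothetical; only the base date and the worked example come from the text above.

```python
from datetime import date

BASE_DATE = date(2019, 9, 1)  # base date stated in the text

def days_since_base(d):
    """Encode a date as the number of days elapsed since the base date."""
    return (d - BASE_DATE).days

october_first = days_since_base(date(2019, 10, 1))
```

The same subtraction applied to Submission Date and Detection Date would give the interval between detection and submission in days.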

We quantified the emotional and subjective attitudes in Notes. With the help of the TextBlob toolkit, we extracted the emotionally inclined words (such as “very good”) and subjectively inclined words (such as “probably think”) in Notes. Then, according to the degree of emotional inclination expressed by each word, we assigned different values to different words. Finally, we obtained Polarity, an indicator of emotional orientation, with values in the range [−1, 1]. We also obtained Sensitivity, an index of subjectivity, with values in the range [0, 1].

**3.2.3 Word frequency analysis **

To better understand the text data, we carried out word frequency analysis on Notes and Lab Comments separately. First, we removed the punctuation marks and stop words from these two types of text. Then, we drew a word cloud for each, as shown in Figure 3. From the word cloud of Notes, we can see that “hornet”, “large” and “long” appear most frequently; these are mainly descriptions of the insects. From the word cloud of Lab Comments, we can see that “thanks” and “submission” appear most frequently, which indicates that the Lab’s replies mainly express gratitude to reporters.

**4 Task 1: Prediction of the spread of Asian giant hornets **

In this part, we used the Wilcoxon rank sum test to examine the relationship between the spread of Asian giant hornets and time. Then we created a Diffusion prediction model to predict the spread of Asian giant hornets.

(a) Word cloud of notes (b) Word cloud of lab comments

Figure 3: Word cloud

**4.1 Overview of location **

Figure 4 shows the heat map of reports of all lab statuses. Figure 5 shows that lab status is correlated with geographic location.

**4.2 Wilcoxon rank sum test**

**4.2.1 Data classification **

Based on the analysis of the data set, we divided the 14 months of positive sighting reports evenly into two categories. The categories are shown in Table 2.

Figure 4: The heat map of all lab status


(a) The heat map of positive lab status (b) The heat map of unprocessed lab status

(c) The heat map of negative lab status (d) The heat map of unverified lab status

Figure 5: The heat map of separate lab status

Table 2: Classification of Eyewitness Reports

| Type | The time range of sighting reports |
|---|---|
| 1 | from September 2019 to March 2020 |
| 2 | from April 2020 to October 2020 |


We used longitude and latitude to describe the spread of Asian giant hornets. To explore whether the latitude and longitude distributions of the two types of sighting reports at different times are significantly different, we used the Wilcoxon rank sum test.

We assume that

$$H_0: M_X = M_Y, \qquad H_1: M_X \neq M_Y,$$

$$W_X = \sum_{i=1}^{m} R_i, \qquad W_Y = \sum_{j=1}^{n} R_j, \tag{2}$$

where $R$ stands for the rank of each data point.

**4.2.2 The process of Wilcoxon rank sum test **

The result of the Wilcoxon rank sum test is that the p-value for the latitude groups is 0.2398, while the p-value for the longitude groups is 0.0070.
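The rank sum statistic of Equation (2) can be sketched as follows. This version uses a normal approximation for the two-sided p-value and applies no tie correction to the variance, which is an assumption of this sketch; it is not necessarily the exact procedure that produced the p-values reported above, and the function name is illustrative.

```python
import math

def rank_sum_test(x, y):
    """Wilcoxon rank sum test (normal approximation, no tie correction).

    Returns (W_X, z, two-sided p-value).
    """
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Find the run of tied values and give each the average rank.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    w_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    m, n = len(x), len(y)
    mean = m * (m + n + 1) / 2
    sd = math.sqrt(m * n * (m + n + 1) / 12)
    z = (w_x - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under N(0, 1)
    return w_x, z, p
```

In the setting above, `x` and `y` would be the longitudes (or latitudes) of the type 1 and type 2 reports; a small p-value suggests that the two distributions differ.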

**4.2.3 The result of test **

Based on the above results, at the ninety-five percent confidence level, we can conclude that the latitude distribution of Asian giant hornets is consistent before and after a certain period of time, but the longitude distribution is not. In other words, the spread of Asian giant hornets tends to change longitude but not latitude.

**4.3 The diffusion prediction model **

**4.3.1 Model construction **

To describe the diffusion of the Asian giant hornets, we should consider multiple factors such as growth rate and human mediation when building the model. We selected some important factors. Following C. Robinet’s method, Equation (3) and Equation (4) were established:

$$\frac{\partial N}{\partial t} = D\left(\frac{\partial^2 N}{\partial x^2} + \frac{\partial^2 N}{\partial y^2}\right) + rN\left(1 - \frac{N}{K}\right), \tag{3}$$

where *N* is nest density and depends on time *t* and spatial location (*x*, *y*); *D* is the diffusion coefficient; *r* is growth rate; and *K* is carrying capacity.

$$c = 2\sqrt{rD}, \tag{4}$$

where *c* is the spread rate, estimated by measuring the radial distances between the range edges observed in consecutive years. We treated *K* as constant, under the assumption that the nests of Asian giant hornets in Washington state are uniformly distributed. *r* (the growth rate) is a variable related to climate. Based on the research of Loomans, we reasonably estimated the value of *r*. Because Asian giant hornets have been present in the United States for too short a time, some of the parameters required for estimation cannot be obtained from the data. We learned that France was invaded by a related species, the Asian hornet (Vespa velutina), in 2004. Considering that the two biological invasions are similar, we selected part of the previous research data to help build this new model.
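Rearranging Equation (4) gives $D = c^2 / (4r)$, so the diffusion coefficient can be recovered once the spread rate and growth rate are estimated. The sketch below shows only this arithmetic; the function names are illustrative, and any numbers passed in are placeholders rather than the estimates used in this paper.

```python
import math

def front_speed(r, D):
    """Spread rate c = 2 * sqrt(r * D), as in Equation (4)."""
    return 2.0 * math.sqrt(r * D)

def diffusion_coefficient(c, r):
    """Invert Equation (4): D = c**2 / (4 * r)."""
    return c ** 2 / (4.0 * r)
```

For example, a spread rate measured in kilometers per year together with a growth rate per year yields a diffusion coefficient in square kilometers per year.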

**4.3.2 Model application **

Using the model to calculate the corresponding variables, we found that the diffusion coefficient *D* is equal to 884 square kilometers per year. This shows that the annual spread of Asian giant hornets is 884 square kilometers. We also obtained a ninety-five percent confidence interval of [768, 975].

The above *D *was calculated as an average value. Asian giant hornets may spread over long distances through human activities. Hence, the *D *value in areas with intensive human activities (high population density) is likely to be greater than this value.

In addition, the elimination of Asian giant hornets also has an impact on its spread. After locals find Asian giant hornets, they will report to the relevant department who will kill Asian giant hornets. Beekeepers will also eliminate these threatening Asian giant hornets. Under the influence of these control methods, the spread of Asian giant hornets will be effectively curbed.

**5 Task2: The random forest model **

In this part, we completed the discriminant analysis by building a random forest. Discriminant analysis finds the regularities of different categories by analyzing known data, and then establishes a discriminant criterion. When presented with a new sample, we can classify it according to the established criterion. The random forest algorithm was created by Breiman and Cutler in 2001. Compared with other methods, random forest has the advantage of higher accuracy and is widely used. A random forest is made up of many randomized decision trees. The structure of a random forest is shown in Figure 6.

**5.1 Model construction **

First, we used the CART algorithm to build the decision trees, because decision trees are the foundation of a random forest. Recursive partitioning and an impurity measure are the core of the CART algorithm. Its main distinguishing feature is that, when a node is split, CART selects the split that minimizes the Gini index.
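The Gini minimum principle can be illustrated with a small sketch that scores candidate thresholds on a single numeric feature. This is a simplified illustration, not the CART implementation used for the results in this paper; all function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_c ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_gini(left, right):
    """Size-weighted Gini impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best split 'value <= threshold'.

    Returns None when the feature takes a single value and cannot be split.
    """
    best = None
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = split_gini(left, right)
        if best is None or score < best[1]:
            best = (t, score)
    return best
```

A full tree repeats this search over all features at every node and recurses into the two child nodes.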

Then, we constructed a random forest from many decision trees. In the process of constructing a random forest, we can use $Q$ as an estimate of the probability of correct classification:

$$Q(x, y_j) = \frac{\sum_k I\left(h_k(x) = y_j,\ (x, y) \in O_k(x)\right)}{\sum_k I\left(h_k(x),\ (x, y) \in O_k(x)\right)}, \tag{5}$$

where $Q$ represents the proportion of $x$ in the voting category of $y$, $I(\cdot)$ is the indicator function, and $O$ is the data set that is not extracted, that is, the samples not drawn when the corresponding tree was trained.

Next, we used $h_k$ ($k = 1, 2, 3, \ldots$) to represent the decision trees that constitute a random forest. Furthermore, we used the test set data for validation. We input the factor $X$ related to the diffusion $Y$ of Asian giant hornets into each decision tree to get its prediction result. The prediction results of the random forest were obtained by voting. Equation (6) shows the prediction method of random forest.

Figure 6: The structure of the random forest model

Then, we have:

$$F_k(X_k) = \arg\max_{Y_k} \sum_{i=1}^{w} I\left(f_{ki}(X_k) = Y_k\right), \quad k = 1, 2, \cdots, n, \tag{6}$$

where $F_k$ is the model designed for group $G_k$, $f_{ki}$ is the single family decision tree model, and $I(\ast)$ is the indicator function.
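The voting rule of Equation (6) can be sketched as a majority vote over the predictions of the individual trees. The three "trees" below are hypothetical stand-ins (simple threshold rules with made-up feature names and cutoffs), used only so that the example is self-contained.

```python
from collections import Counter

def forest_predict(trees, x):
    """Return the label that receives the most votes from the trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained trees: each maps a feature dict to a label.
trees = [
    lambda x: "positive" if x["latitude"] > 48.5 else "negative",
    lambda x: "positive" if x["similarity"] > 0.5 else "negative",
    lambda x: "positive" if x["interval"] < 3 else "negative",
]
report = {"latitude": 48.9, "similarity": 0.1, "interval": 1}
label = forest_predict(trees, report)
```

Two of the three stand-in trees vote "positive" for this made-up report, so the forest predicts "positive".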

**5.2 Model application **

We separated the data that had been determined as positive and negative as samples to carry out training on the established model. We divided the sample information into two categories based on whether it contains pictures or not, and then conducted discriminant analysis with models respectively. To describe the accuracy and sensitivity of the model, we define new indicators Accuracy and Sensitivity, as shown in equations below, respectively:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{7}$$

where $TP$ is the number of positive reports correctly judged as positive, $FP$ is the number of negative reports incorrectly judged as positive, $TN$ is the number of negative reports correctly judged as negative, and $FN$ is the number of positive reports incorrectly judged as negative.

$$\text{Sensitivity} = \frac{TP}{TP + FN}. \tag{8}$$

After discrimination and calculation, we obtained the results shown in Table 3.

Table 3: Results of improved pre-training model

| Data | Actual class | Predicted Negative ID | Predicted Positive ID | Class error | Accuracy | Sensitivity |
|---|---|---|---|---|---|---|
| contain photos | negative | 2005 | 0 | 0.0000 | 0.996 | 1.000 |
| contain photos | positive | 8 | 3 | 0.7273 | | |
| not contain photos | negative | 24 | 0 | 0.0000 | 0.926 | 1.000 |
| not contain photos | positive | 2 | 1 | 0.6667 | | |

After analysis, we found that the information confirmed as positive has a high probability of being incorrectly judged. Although the Accuracy and Sensitivity of the model were high, we still believed that this model needed improvement.

**5.3 Model optimization: the cost-sensitive method **

After careful inspection, we found that the reason for the unsatisfactory experimental results was that the data are highly imbalanced. In the data we got, the number of positive reports was far smaller than the number of negative reports. Therefore, we adopted the cost-sensitive method to improve the model.

The cost-sensitive method provided a new way for us to solve the problem. For the binary classification problem, we changed the weights so that the model is punished more severely if it misjudges a certain type of sample. As a result, the final model can judge this type of sample with higher accuracy.

$$D(k) = (w_{k1}, w_{k2}, \ldots, w_{km}); \quad w_{1i} = \frac{1}{m}; \quad i = 1, 2, \ldots, m \tag{9}$$

For binary classification problems, the weight coefficient of the $k$-th weak classifier $G_k(x)$ is

$$\alpha_k = \frac{1}{2} \log \frac{1 - e_k}{e_k} \tag{10}$$

And the final strong classifier is

$$f(x) = \operatorname{sign}\left(\sum_{k=1}^{K} \alpha_k G_k(x)\right) \tag{11}$$

where $\alpha_k$ depends on the weighted classification error rate $e_k$:

$$e_k = P(G_k(x_i) \neq y_i) = \sum_{i=1}^{m} w_{ki} I(G_k(x_i) \neq y_i) \tag{12}$$
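Equations (9), (10) and (12) can be sketched as follows. The `positive_cost` parameter is a hypothetical way to express "a higher weight for positive information" and is an assumption of this sketch, not the weighting scheme actually used; with `positive_cost=1.0` the initial weights reduce to the uniform $1/m$ of Equation (9). Labels are assumed to be +1 (positive) and -1 (negative).

```python
import math

def initial_weights(labels, positive_cost=1.0):
    """Normalized sample weights; positives are scaled by positive_cost."""
    raw = [positive_cost if y == 1 else 1.0 for y in labels]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_error(weights, predictions, labels):
    """Weighted error rate e_k of Equation (12)."""
    return sum(w for w, p, y in zip(weights, predictions, labels) if p != y)

def classifier_weight(e_k):
    """Weight coefficient alpha_k of Equation (10); requires 0 < e_k < 1."""
    return 0.5 * math.log((1.0 - e_k) / e_k)
```

A weak classifier with a small weighted error receives a large coefficient, so it contributes more to the final vote in Equation (11).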

In the problem we studied, misjudgment of positive information would bring more serious consequences. In other words, it was more meaningful to increase the recognition rate of positive information than to increase the recognition rate of negative information. Therefore, we gave positive information a higher weight. According to the existing data, we used 70% of the data as training data to generate a new model. Finally, we tested the improved model with another 30% as test data.

**5.4 The Result and accuracy of model **

Our test results for the improved model, based on the test data (30% of the total data), are shown in Table 4. We can see that the new model has been greatly improved, especially in the prediction accuracy for positive information. The new model can be used more efficiently in practice.

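Table 4: The results of the improved training model

| Data | Actual class | Predicted Negative ID | Predicted Positive ID | Class error | Accuracy | Sensitivity |
|---|---|---|---|---|---|---|
| contain photos | negative | 600 | 1 | 0.0017 | 0.997 | 0.750 |
| contain photos | positive | 1 | 3 | 0.2500 | | |
| not contain photos | negative | 22 | 1 | 0.4348 | 0.889 | 0.667 |
| not contain photos | positive | 2 | 2 | 0.5000 | | |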
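As a quick check of Equations (7) and (8), the sketch below recomputes Accuracy and Sensitivity from the counts in the "contain photos" rows of Table 4. The function names are illustrative.

```python
def accuracy(tp, tn, fp, fn):
    """Equation (7): share of all reports that are judged correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    """Equation (8): share of positive reports that are judged positive."""
    return tp / (tp + fn)

# Counts read from the "contain photos" rows of Table 4.
tp, fn = 3, 1      # positive reports judged positive / judged negative
tn, fp = 600, 1    # negative reports judged negative / judged positive

acc = round(accuracy(tp, tn, fp, fn), 3)
sens = round(sensitivity(tp, fn), 3)
```

These counts give an Accuracy of 0.997 and a Sensitivity of 0.75, in line with the values listed for the reports with photos.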

At the same time, we learned how various factors influenced the spread of Asian giant hornets from this model. We measured the influence in two ways. In the figure, the larger the value is, the greater the influence of the factor is.

For the reports with photos, we got the results shown in Figure 7. The importance of each influencing factor obtained by the two measurement methods is similar. Latitude is considered to be the factor that has the greatest impact on the proliferation of Asian giant hornets.

Here, Interval is the time difference between Submission Date and Detection Date.

Figure 7: Importance of each factor (including photos)

The results for the data set that does not include photos are shown in Figure 8. These results are slightly different from those obtained above. In the analysis of the data set that does not contain photos, we learned that the four factors that have the greatest impact on the spread of Asian giant hornets are: Latitude, Longitude, Interval, and Detection Date.

**6 Task3: Priority in the allocation of resources **

In this section, we made further judgments on the two types of unverified and unprocessed data on the basis of the model established above. Combining all the data, we conducted follow-up investigations on information that was more likely to be positive. In the analysis, the accuracy of our prediction also decreased because part of the information did not contain pictures. In order to better allocate resources, we conducted k-means clustering analysis on the data.

**6.1 Analysis object **

Since records with a high positive probability deserve more attention, we mainly studied this part of the records. Because the consequences of misjudging a positive report are much more serious than those of misjudging a negative one, to be cautious, we selected records whose predicted positive probability is above 0.3 for this analysis.

**6.2 The analysis process **

First, to calculate the within sum of squares for different cluster numbers K, we have:

$$S_A = \sum_{i=1}^{r} n_i \left(\bar{X}_i - \bar{X}\right)^2, \qquad S_E = \sum_{i=1}^{r} \sum_{j=1}^{n_i} \left(X_{ij} - \bar{X}_i\right)^2 \tag{13}$$

$$\bar{X}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} X_{ij}, \quad i = 1, 2, \ldots, r \tag{14}$$

Here $S_A$ is the sum of squares between groups and $S_E$ is the sum of squares within groups. Then, for different clustering numbers K, we used Equation (13) and Equation (14) to calculate the total sum of squares:

$$S_T = S_E + S_A \tag{15}$$

In addition, *w* is used as the index to measure the clustering effect, and its calculation method is shown below:

$$w = \frac{\text{within sum of squares}}{\text{total sum of squares}} \tag{16}$$

Figure 8: Importance of each factor (excluding photos)

We used different *K *to test *w *and got an optimal *K*. We constructed a scree plot with different *w *and *K*. In the figure, once the value of *w *tends to be flat, this indicates that a larger *K *value is not needed. From Figure 9, we can get that the optimal number of clusters K is 2.

**6.3 Results of the Analysis **

With limited resources, we give priority to the follow-up investigation of group 2. Table 5 shows the two cluster centers. From Table 5, we can see that: (1) group 1 contains 196 records and group 2 contains 400 records; (2) the probability of positive of group 2 is greater than that of group 1; (3) the picture similarity of group 2 is higher; (4) the interval of group 2 is smaller. Therefore, it is wiser to choose group 2.

Table 5: Cluster centers

| Group | Numbers | Detection date | Submission date | Latitude | Longitude | Similarity | Polarity | Subjectivity | Interval | Probability of positive |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 196 | 243.628 | 255.878 | 47.863 | -122.768 | 0.004 | 0.039 | 0.326 | 12.250 | 0.442 |
| 2 | 400 | 352.140 | 355.058 | 47.651 | -122.816 | 0.000 | 0.032 | 0.308 | 2.918 | 0.380 |

With the help of the t-SNE algorithm, we performed dimensionality reduction processing on the clustering results and visualized the results, as shown in Figure 10.

Figure 9: Choosing the optimal number of cluster


Figure 10: The result of cluster analysis (t-SNE algorithm)

**7 Task4: Model updating **

In this section, we used incremental learning to update the model. With the increase of time, the number of sighting reports we collect will also increase. At this point, it is necessary to update the model. We explained our solution in the following aspects.

**7.1 Method introduction **

Incremental learning learns only from the newly added information and gradually updates the entire model. Batch learning, the traditional update method, requires retraining the entire model whenever new information arrives. Batch learning requires a lot of resources and places higher demands on computing equipment, while incremental learning requires very few resources in the update process. Compared with the traditional solution, incremental learning has a significant advantage in practical applications.

**7.2 How to update the models **

In this part, we update the model based on the data acquired over a period of time. We update the following two components.

**7.3 Incremental learning in the data processing model **

Based on the original model, we get the probability that each newly added report is positive. We extract the part of the new data whose positive probability is greater than 0.5 and judge this part of the data manually. Then, we extract the pictures that are manually judged to be Asian giant hornets. We use these pictures as specimens and repeat the model-building process above to construct new corresponding indicators. We take the length of the time span covered by each of the two data sets as its weight, and calculate the weighted average of the new model indicators and the original model indicators. The index obtained after weighted averaging is used as the final index of the updated model. At this point, we have completed the update of the data processing model. In Figure 11, we show our incremental learning model.
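The time-weighted averaging step can be sketched in one function. The function and parameter names are hypothetical, and the numbers in the usage line are placeholders (an original data set spanning 14 months and a new batch spanning 7 months), not values computed in this paper.

```python
def update_indicator(old_value, old_span, new_value, new_span):
    """Weighted average of an indicator, weighted by the time span of each data set."""
    return (old_value * old_span + new_value * new_span) / (old_span + new_span)

# Placeholder numbers: old indicator 0.2 over 14 months, new indicator 0.5 over 7 months.
updated = update_indicator(0.2, 14, 0.5, 7)
```

Because the original data set covers the longer period in this example, the updated value stays closer to the old indicator than to the new one.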

**7.4 Incremental learning in the random forest model **

First, we extract the factors (such as Longitude and Latitude) used to build the random forest model from the newly added data. Then, we obtain a series of new indicators by building a random forest model with the method above. Finally, we apply the weighted-average method used in the data processing model update to get a new random forest model. This new random forest model can be used for decision analysis in the next time period.

**7.5 The time we should update the model **

We know that the queen is the most important member of the colony. The queen of Asian giant hornets is an annual insect. According to life cycle theory, the number of colonies of Asian giant hornets goes through four stages in a year: introduction, growth, maturity and decline. Therefore, the number of Asian giant hornets changes on a yearly basis, and we only need to collect data once a year.

To select the optimal updating time of the model, we plot Detection Date and Submission Date on calendars, as shown in Figure 12. As can be seen from Figure 12, July, August and September are the peak periods for the public to discover suspected Asian giant hornets and submit reports. January is the month when the number of suspected Asian giant hornets discovered and reported by the public is the lowest. Therefore, we can get the most complete data by collecting data in January.

To sum up, considering the biological characteristics of Asian giant hornets and the completeness of collected data, we chose to update the model in January of each year.
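The choice of update month can be illustrated with a small Python sketch that counts reports per calendar month and picks the quietest one. The helper names and the sample dates below are hypothetical; the real analysis would use the Detection Date and Submission Date columns of the sighting reports.

```python
from collections import Counter
from datetime import date


def reports_per_month(report_dates):
    """Count reports in each calendar month (1-12), including empty months."""
    counts = Counter(d.month for d in report_dates)
    return {month: counts.get(month, 0) for month in range(1, 13)}


def quietest_month(report_dates):
    """Return the month with the fewest reports (the earliest month wins ties)."""
    counts = reports_per_month(report_dates)
    return min(counts, key=counts.get)


# Hypothetical submission dates, for illustration only.
sample_dates = (
    [date(2020, m, 15) for m in range(2, 13)]    # one report in each of Feb..Dec
    + [date(2020, 8, d) for d in range(1, 11)]   # a summer peak in August
    + [date(2020, 9, d) for d in range(1, 6)]    # and a smaller one in September
)
print(quietest_month(sample_dates))
```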

Figure 11: The incremental learning model


(a) Calendar figure of detection date in 2019 (b) Calendar figure of detection date in 2020

(c) Calendar figure of submission date in 2019 (d) Calendar figure of submission date in 2020

Figure 12: Calendar figure

**8 Task 5: The eradication of Asian giant hornets **

In this part, we use survival analysis to study the eradication of Asian giant hornets. After building the model, we test the research value of the model. Finally, we find out the factors that affect the survival of Asian giant hornets and take these as indicators of the eradication problem of Asian giant hornets.

**8.1 Model Construction **

Survival analysis refers to a series of analytical methods used to explore the time of occurrence of events that we are interested in. In our research, we make the following definitions: the event of interest is the discovery of Asian giant hornets, the initial event is the discovery of insects, the failure event is the report of insects and the survival time is the difference between the submission date and the detection date.

We use the Kaplan-Meier survival estimate model and the Cox proportional hazards model for the analysis. The survival probability is the probability that the event of interest has not yet occurred by a given time.

The Kaplan-Meier survival estimate model is a non-parametric method. We use it to calculate the survival probability, which is defined recursively by

$$S(t_n) = S(t_{n-1})\left(1 - \frac{d_n}{r_n}\right)$$

where, in the standard formulation, $d_n$ is the number of events observed at time $t_n$ and $r_n$ is the number of subjects still at risk just before $t_n$.

The Cox proportional hazards model describes the relationship between multiple variables and the event of interest. It is defined by

$$h(t, X_i) = h_0(t) \times \exp(X_i \beta)$$

where, in the standard formulation, $h_0(t)$ is the baseline hazard, $X_i$ is the covariate vector of observation $i$, and $\beta$ is the vector of regression coefficients.
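As a minimal sketch of the Kaplan-Meier recursion $S(t_n) = S(t_{n-1})(1 - d_n / r_n)$, the following Python function computes the survival curve from a list of survival times. In our setting a survival time would be the number of days between the detection date and the submission date. The function name `kaplan_meier` and the sample durations are hypothetical, and this is not the statistical software used to produce the tables in this section.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time where an event occurred.

    durations: survival times (for example, days from detection to submission)
    observed:  True if the event happened at that time, False if censored
    """
    events = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)   # every subject leaves the risk set at its own time
    at_risk = len(durations)     # r_n: subjects still at risk
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)     # d_n: events at time t
        if d:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical durations in days; every report is an observed event.
curve = kaplan_meier([1, 2, 2, 3, 5], [True] * 5)
print(curve)
```

Censored observations reduce the risk set without lowering the curve, which is why the function tracks exits and events separately.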


**8.2 The Kaplan-Meier survival estimate model **

We use the KM model to analyze the data and obtain Table 6 and Figure 13. From Table 6, the 95% confidence interval of the mean survival time is [232.045, 261.924], and that of the median survival time is [178.243, 339.757]. From Figure 13, we find that the longer the survival time, the fewer occurrences of Asian giant hornets. This also shows that it is meaningful to study the elimination of Asian giant hornets by means of survival analysis.

Table 6: Survival analysis: mean and median time

| Statistic | Estimate | Standard error | 95% confidence interval: lower limit | 95% confidence interval: upper limit |
|---|---|---|---|---|
| Mean | 246.984 | 7.622 | 232.045 | 261.924 |
| Median | 259.000 | 41.203 | 178.243 | 339.757 |
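The intervals in Table 6 appear consistent with a normal approximation, estimate ± 1.96 × standard error. The short Python check below assumes that approximation (the table itself does not state how the intervals were computed) and reproduces the limits up to rounding.

```python
Z_95 = 1.96  # approximate 97.5th percentile of the standard normal distribution


def normal_ci(estimate, standard_error, z=Z_95):
    """Two-sided confidence interval under a normal approximation."""
    half_width = z * standard_error
    return estimate - half_width, estimate + half_width


# Estimates and standard errors taken from Table 6.
mean_ci = normal_ci(246.984, 7.622)
median_ci = normal_ci(259.000, 41.203)
print(mean_ci, median_ci)
```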

**8.3 Cox regression **

In addition, we plot the survival analysis function in several ways, as shown in Figure 14. These figures all confirm that survival analysis is an effective way to study this problem.

With the help of the Cox proportional hazards model, we try to show that these attributes are important for the elimination of Asian giant hornets. We perform an omnibus test on the coefficients of the model, and the results are shown in Table 7. The significance of the model coefficients is reported as 0.000, which is less than 0.05. Therefore, the Cox model is effective for this analysis.

Table 7: Omnibus test of model coefficients

-2 log likelihood: 49.340

| Test | Chi-square | Degrees of freedom | Significance |
|---|---|---|---|
| Overall (score) | 34.937 | 5.000 | 0.000 |
| Change from the previous step | 47.447 | 5.000 | 0.000 |
| Change from the previous block | 47.447 | 5.000 | 0.000 |

The analysis results obtained by using the Cox model are shown in Table 8.

Figure 13: Survival analysis function of Asian giant hornets

(a) Survival function at the mean of the covariates (b) One-minus-survival function at the mean of the covariates

(c) Hazard function at the mean of the covariates (d) LML function at the mean of the covariates

Figure 14: Survival analysis function by covariate

We find that the significance of Latitude is 0.010, which is less than 0.05; the significance of Similarity is 0.116, which is less than 0.2; and the significance of Subjectivity is 0.054, which is less than 0.1.

Table 8: Variables in Equations

| Variable | B | SE | Wald | Degrees of freedom | Significance | Exp(B) | 95.0% CI of Exp(B): lower limit | 95.0% CI of Exp(B): upper limit |
|---|---|---|---|---|---|---|---|---|
| Latitude | 6.801 | 2.623 | 6.725 | 1 | 0.01 | 899.134 | 5.264 | 153576.087 |
| Longitude | -0.476 | 1.307 | 0.133 | 1 | 0.715 | 0.621 | 0.048 | 8.04 |
| Similarity | 15.675 | 9.965 | 2.475 | 1 | 0.116 | 6421197.984 | 0.021 | 1.94771E+15 |
| Polarity | 1.344 | 4.254 | 0.1 | 1 | 0.752 | 3.833 | 0.001 | 16015.358 |
| Subjectivity | -3.562 | 1.849 | 3.709 | 1 | 0.054 | 0.028 | 0.001 | 1.065 |
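The columns of Table 8 are linked by standard relations: the Wald statistic is $(B / SE)^2$ and the hazard ratio is $\mathrm{Exp}(B) = e^{B}$. The Python sketch below recomputes both from the rounded B and SE values in the table, so small differences from the printed figures are expected. The names `cox_rows` and `summarize` are ours and are used only for illustration.

```python
import math

# (B, SE) pairs copied from Table 8.
cox_rows = {
    "Latitude": (6.801, 2.623),
    "Longitude": (-0.476, 1.307),
    "Similarity": (15.675, 9.965),
    "Polarity": (1.344, 4.254),
    "Subjectivity": (-3.562, 1.849),
}


def summarize(b, se, z=1.96):
    """Return (Wald statistic, hazard ratio, approximate 95% CI of the hazard ratio)."""
    wald = (b / se) ** 2
    hazard_ratio = math.exp(b)
    ci = (math.exp(b - z * se), math.exp(b + z * se))
    return wald, hazard_ratio, ci


for name, (b, se) in cox_rows.items():
    wald, hr, (low, high) = summarize(b, se)
    print(f"{name:12s} Wald={wald:.3f} Exp(B)={hr:.3f} CI=[{low:.3f}, {high:.3f}]")
```

A hazard ratio above 1 (positive B) means that larger values of the variable are associated with a shorter time to the failure event, and a ratio below 1 means the opposite.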

Based on the results of the survival analysis (the significance of each factor), we divide the indicators that measure the elimination of Asian giant hornets into the following three categories:

Key indicator (high credibility): Latitude. If there are a large number of low-latitude reports, it means that Asian giant hornets have been eliminated.

General indicators (average credibility): Similarity and Subjectivity. If there are many low-similarity reports, it means that Asian giant hornets have been eliminated. If there are a large number of reports with higher Subjectivity values, it means that Asian giant hornets have been eliminated.

Reference indicators (low credibility): Longitude and Polarity. If there are a large number of high-longitude reports, it means that Asian giant hornets have been eliminated. If there are many reports with high Polarity values, it means that Asian giant hornets have been eliminated.
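The three-way grouping can be expressed as a simple rule on the significance values of Table 8. The cut-offs used below (0.05 for key indicators and 0.2 for general indicators) are our reading of the thresholds quoted in the text, and the function name `credibility` is hypothetical.

```python
def credibility(p_value, key_cutoff=0.05, general_cutoff=0.2):
    """Map a significance value to an indicator category (assumed cut-offs)."""
    if p_value < key_cutoff:
        return "key"
    if p_value < general_cutoff:
        return "general"
    return "reference"


# Significance values from Table 8.
significance = {
    "Latitude": 0.010,
    "Longitude": 0.715,
    "Similarity": 0.116,
    "Polarity": 0.752,
    "Subjectivity": 0.054,
}
categories = {name: credibility(p) for name, p in significance.items()}
print(categories)
```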

**9 Sensitivity analysis **

We carry out the sensitivity analysis on two basic parameters of our optimized random forest model: *m* and *α*. Here *m* is the number of iterations for which boosting is run, and *α* is the weighting coefficient of the weak learners. First, we vary *m* from 30 to 60. The resulting classification errors are listed below. The classification error rate stays at a consistently low level as *m* changes. Second, we change how *α* is calculated. There are three ways to calculate *α*. By default, we calculate *α* in the way proposed by Breiman:

$$\alpha_k = \frac{1}{2}\log\frac{1 - e_k}{e_k} \tag{17}$$

The other two equations were proposed by Freund and Zhu:

$$\alpha_{\text{Freund}} = \ln\left(\frac{1 - err}{err}\right) \tag{18}$$

$$\alpha_{\text{Zhu}} = \ln\left(\frac{1 - err}{err}\right) \tag{19}$$

Applying Equation (17), Equation (18), and Equation (19), we obtain the errors in Table 9 and the error rates in Table 10. Table 10 shows that the effect of *m* and *α* is not significant.

Table 9: The error of 31 models

error

0.0016529 0.0049587 0.0016529 0.0049587 0.0033058 0.0049587 0.0033058 0 0.0033058 0.0033058 0.0033058 0.0016529 0.0016529 0.0033058 0 0.0049587 0.0049587 0.0049587 0.0049587 0.0033058 0.0033058

0.0049587 0.0033058 0.0033058 0 0.0016529 0.0033058 0.0033058 0.0033058 0.0033058 0.0033058

Table 10: The error rate of 31 models

| Calculation of *α* | Error rate | Accuracy | Sensitivity |
|---|---|---|---|
| Breiman | 0.0038 | 0.997 | 0.75 |
| Freund | 0.0037 | 0.997 | 0.5 |
| Zhu | 0.0026 | 0.998 | 0.75 |
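The three weighting rules of Equations (17) to (19) can be written as small Python functions. This sketch is ours and is not the code used for the experiments. One labelled assumption: in the usual multi-class formulation, Zhu's coefficient carries an extra term $\ln(K - 1)$ for $K$ classes, which is zero in our two-class problem, so it reduces to Equation (19).

```python
import math


def _log_odds(err):
    """ln((1 - err) / err) for a weak learner with error rate err."""
    if not 0.0 < err < 1.0:
        raise ValueError("err must lie strictly between 0 and 1")
    return math.log((1.0 - err) / err)


def alpha_breiman(err):
    """Equation (17): half the log-odds of a correct classification."""
    return 0.5 * _log_odds(err)


def alpha_freund(err):
    """Equation (18): the full log-odds."""
    return _log_odds(err)


def alpha_zhu(err, n_classes=2):
    """Equation (19), plus the multi-class term ln(K - 1) (assumed; zero for K = 2)."""
    return _log_odds(err) + math.log(n_classes - 1)


for err in (0.1, 0.2, 0.4):
    print(err, alpha_breiman(err), alpha_freund(err), alpha_zhu(err))
```

All three rules give larger weights to weak learners with lower error; they differ only in scale, which is consistent with the small differences between the rows of Table 10.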


**10 Strengths and weaknesses **

**10.1 Strengths **

• **Avoid overfitting**: instead of a convolutional neural network (CNN), we represent each image as a vector and obtain the similarity by calculating the cosine similarity. This approach is better suited to a relatively small data set and helps avoid overfitting.

• **Sensitive to true reports**: we use a cost-sensitive method to optimize the random forest model, considering the sample data is imbalanced. Our final prediction results reach a high degree of accuracy. The model can accurately determine the report result, and more importantly, it will hardly miss the true report of the Asian giant hornet.

• **Dynamic adjustment**: the incremental learning model dynamically adjusts the parameters of our model so that it remains useful as time passes. Moreover, we use survival analysis to explore and classify the factors that affect the Asian giant hornet, which gives us a dynamic picture of the ecology of Washington State.

**10.2 Weaknesses **

• **Lack of relevant data**: with the lack of information referring to the climate and population of Washington, we are unable to give a precise prediction of the spread of Asian giant hornets.

• **Limited in theory**: because we could not apply the model to real scenarios or obtain real-time data from WSDA, our incremental learning model has not been tested and adjusted in practice.

**11 Memorandum **

Date: February 9, 2021

To: Chief Administrator The Washington State Department of Agriculture

From: Team # 2109118

Subject: An Automatic Identification System Based on Online Data Mining

The first discovery of Asian giant hornets in Washington State has sparked much discussion. Asian giant hornets are known not only as the world’s largest hornet species, but also as extremely dangerous insects, and they are widely called murder hornets. Not only can they destroy a hive in a short period of time, but they also cause dozens of deaths every year. Therefore, the Asian giant hornet is a huge threat to the residents of Washington and to other insects. The appearance of Asian giant hornets is unwelcome and has even caused many people to panic. In order to appease people’s anxiety and ensure their safety, it is necessary for the state government in Washington to take action to understand the distribution of Asian giant hornets and prevent their continued spread.


In order to better understand the transmission status of Asian giant hornets and help the government make better decisions with limited resources, we have established many models in this paper. At the same time, we set up many assumptions to better integrate with real life. Based on the information of sighting reports on the website established by the government, we obtained the following conclusions through rational use of the mathematical model:

• The government should update the model in January every year, using the incremental learning method we have established. This timing reflects both the biological characteristics of the hornets and the completeness of data collection. Our analysis found that the number of Asian giant hornet colonies follows a one-year cycle. In addition, January is the best time to collect data because people rarely submit new reports in that month.

• We obtain a series of indicators (such as Longitude and Polarity) that help us judge whether Asian giant hornets have been eliminated. The reliability of these indicators differs. Among them, the most reliable is Latitude: if there are a large number of low-latitude reports, we can judge that Asian giant hornets have been eliminated.

• Asian giant hornets spread fast: their range can grow by 884 square kilometers per year under ideal conditions. Since Asian giant hornets are native to Asia, they are an alien species in Washington. In America, where they lack natural enemies, their rate of spread will be close to that under ideal conditions. This figure is alarming.

• Asian giant hornets are concentrated in the northwest of Washington, because the data confirmed as positive are concentrated in this area. In addition, the number of public sighting reports received in the northwest of Washington is also the largest. There are various signs that this area is dangerous.

• Asian giant hornets may already be distributed throughout Washington State. We have received unverified sighting reports from various places in Washington. Therefore, no region can afford to relax its vigilance.

• Asian giant hornets may spread along rivers. The number of sighting reports received in riverside areas is greater than in other areas. This pattern is also typical of how many species disperse.

Based on our research conclusions and our observations, we give the following suggestions.

• We hope you can allocate relatively more resources in the northwest of Washington. Because the northwest of Washington should be a concentrated area of Asian giant hornets, it is the best choice to allocate more resources here.

• We recommend that you adopt our model. We used a lot of optimization algorithms in the model. Our model will help government departments save a lot of resources in practice. Not only will it save a lot of human resources, but it will also save the government a lot of money by reducing the need for computing equipment.


• We hope you can emphasize to the staff that once Asian giant hornets are found, they need to act quickly to eliminate them, because Asian giant hornets spread very fast. In addition, according to our experience, strong human control measures can effectively inhibit the proliferation of Asian giant hornets.

• We expect you to optimize the website for sighting submission. We suggest that you ask your experts to list the characteristics of Asian giant hornets and look-alike species and ask the public to select the characteristics of their insect sightings. Compared with simply leaving a note in Notes, this method not only enables the staff to obtain more effective information, but also improves the accuracy of the model for better judgment, analysis and prediction.

**References**

[1] Bertolino, S., Lioy, S., Laurino, D., Manino, A., & Porporato, M. (2016). Spread of the invasive yellow-legged hornet Vespa velutina (Hymenoptera: Vespidae) in Italy. Applied Entomology and Zoology, 51(4), 589-597.

[2] Robinet, C., Suppo, C., & Darrouzet, E. (2017). Rapid spread of the invasive yellow-legged hornet in France: the role of human-mediated dispersal and the effects of control measures. Journal of Applied Ecology, 54(1), 205-215.

[3] Ibáñez-Justicia, A., & Loomans, A. J. (2011). Mapping the potential occurrence of an invasive species by using CLIMEX: case of the Asian hornet (Vespa velutina nigrithorax) in The Netherlands. In Proceedings of the Netherlands Entomological Society Meeting (Vol. 22, pp. 39-46).

[4] Nguyen, H. V., & Bai, L. (2010, November). Cosine similarity metric learning for face verification. In Asian conference on computer vision (pp. 709-720). Springer, Berlin, Heidelberg.

[5] Li, X. (2013). Application of random forest model in classification and regression analysis. Journal of Applied Entomology, 50(04), 1190-1197.

[6] Zhu, M., Xia, J., Yan, M. L., Zhang, S. Y., Cai, G. L., Yan, J., & Ning, G. M. (2014). Feature selection and optimization of random forest modeling. In Applied Mechanics and Materials (Vol. 687, pp. 1416-1419). Trans Tech Publications Ltd.

[7] Ma, L., Fu, T., Blaschke, T., Li, M., Tiede, D., Zhou, Z., … & Chen, D. (2017). Evaluation of feature selection methods for object-based land cover mapping of unmanned aerial vehicle imagery using random forest and support vector machine classifiers. ISPRS International Journal of Geo-Information, 6(2), 51.

[8] Sheng, V. S., & Ling, C. X. (2006, July). Thresholding for making classifiers cost-sensitive. In AAAI (Vol. 6, pp. 476-481).

[9] Liu, D., & Sun, K. (2019). Random forest solar power forecast based on classification optimization. Energy, 187, 115940.

[10] Mueller, D. C. (1972). A life cycle theory of the firm. The Journal of Industrial Economics, 199-219.

Human Health | Washington State Department of Agriculture

This is What Happens When a Murder Hornet Stings You (newsweek.com)


Asian giant hornet – Wikipedia
