Wednesday, June 5, 2019
Internet of Things Paradigm
Introduction

According to a 2016 statistical forecast, there are almost 4.77 billion mobile phone users globally, and the number is expected to pass five billion by 2019 [1]. This significant upward trend is mainly due to the increasing popularity of smartphones. In 2012, about a quarter of all mobile users were smartphone users, and this share is expected to double by 2018, which means there will be more than 2.6 billion smartphone users. More than a quarter of these smartphone users use Samsung or Apple smartphones.

As of 2016, there are 2.2 million apps in the Google app store and 2 million in the Apple store. Such explosive growth of apps offers potential benefits to developers and companies. The mobile application market generates about $88.3 billion in revenue.

Prominent exponents of the IT industry estimated that the IoT paradigm will generate $1.7 trillion in value added to the global economy in 2019. By 2020, the Internet of Things device market will be more than double the size of the smartphone, PC, tablet, connected car, and wearable markets combined.

Technologies and services belonging to the Internet of Things generated global revenues of $4.8 trillion in 2012 and will reach $8.9 trillion by 2020, growing at a compound annual growth rate (CAGR) of 7.9%.

Alongside this impressive market growth, malicious attacks have also been increasing dramatically. According to a Kaspersky Security Network (KSN) data report, there have been more than 171,895,830 malicious attacks from online resources worldwide. In the second quarter of 2016, KSN detected 3,626,458 malicious installation packages, which is 1.7 times more than in the first quarter of 2016.
These attacks come in many types, such as RiskTool, AdWare, Trojan-SMS, Trojan-Dropper, Trojan, Trojan-Ransom, Trojan-Spy, Trojan-Banker, Trojan-Downloader, and Backdoor.

http://resources.infosecinstitute.com/internet-things-much-exposed-cyber-threats/gref

Unfortunately, the rapid diffusion of the Internet of Things paradigm is not accompanied by a rapid improvement of efficient security solutions for those smart objects, while the criminal ecosystem is exploring the technology as a new attack vector.

Technological solutions belonging to the Internet of Things are forcefully entering our daily life. Think, for example, of wearable devices or the Smart TV. The greatest problem for the development of the paradigm is the low perception of cyber threats and of their practical impact on privacy.

Cybercriminals are aware of the difficulties the IT community faces in defining a shared strategy to mitigate cyber threats, and for this reason it is plausible that the number of cyber attacks against smart devices will rapidly increase.

As long as there is money to be made, criminals will continue to take advantage of opportunities to pick our pockets. While the battle with cybercriminals can seem daunting, it is a fight we can win. We only need to break one link in their chain to stop them dead in their tracks. Some tips for success:

Apply patches quickly
Eliminate unnecessary applications
Run as a non-privileged user
Increase employee awareness
Recognize our weak points
Reduce the threat surface

Currently, the two major app store companies, Google and Apple, take different positions on spam app detection. One takes an active approach and the other a passive one. There is strong global demand for malware detection.

Background (Previous Study)

The paper Early Detection of Spam Mobile Apps was published by Dr. Suranga Seneviratne and his colleagues at the 2015 International World Wide Web Conference.
At this conference, he emphasised the importance of early detection of malware and introduced a unique idea of how to detect spam apps. Every market applies its own policies to delete applications from its store, and this is done through continuous human intervention. The authors wanted to find the reasons and patterns behind the deleted apps and to identify spam apps.

The diagram simply illustrates how they approach early spam detection using manual labelling.

Data Preparation

A new dataset was prepared from a previous study [53]. The initial seed of 94,782 apps was curated from the list of apps obtained from more than 10,000 smartphone users. Over about 5 months, the researchers collected metadata from the Google Play Store (application name, application description, and application category) for all the apps, and discarded apps with non-English descriptions from the metadata.

Sampling and Labelling Process

One of the important processes of their research was manual labelling, which was the first methodology proposed and which makes it possible to identify the reason behind each removal.

Manual labelling took about 1.5 months with 3 reviewers at NICTA. Each reviewer labelled apps using heuristic checkpoints, and the reasons chosen by majority vote are shown in Graph 3. They identified 9 key reasons with heuristic checkpoints. The full list of checkpoints can be found in their technical report (http://qurinet.ucdavis.edu/pubs/conf/www15.pdf).

In this report, we only list the checkpoints for the reason "spam".

Graph 3. Labelled spam data with checkpoint reason.

Checkpoint S1: Does the app description describe the app function clearly and concisely?

In previous studies, 100 word bigrams and trigrams that describe app functionality were manually constructed. There is a high probability that spam apps do not have a clear description.
The 100 bigrams and trigrams were therefore compared with each description, and their frequency of occurrence was counted.

Checkpoint S2: Does the app description contain too much detail, incoherent text, or unrelated text?

Writing style analysis, known as stylometry, was used to map checkpoint S2. In the study, 16 features were listed in Table 2.

Table 2. Features associated with Checkpoint 2

No.  Feature
1    Total number of characters in the description
2    Total number of words in the description
3    Total number of sentences in the description
4    Average word length
5    Average sentence length
6    Percentage of upper case characters
7    Percentage of punctuations
8    Percentage of numeric characters
9    Percentage of common English words
10   Percentage of personal pronouns
11   Percentage of emotional words
12   Percentage of misspelled words
13   Percentage of words with alphabet and numeric characters
14   Automatic readability index (AR)
15   Flesch readability score (FR)

For feature selection, a greedy method was used with a decision tree classifier of maximum depth 10. The performance was optimized by the asymmetric F-Measure [55].

They found that features 2, 3, 8, 9, and 10 were the most discriminative and that spam apps tend to have less wordy app descriptions than non-spam apps. About 30% of spam apps had descriptions of fewer than 100 words.

Checkpoint S3: Does the app description contain a noticeable repetition of words or keywords?

They used vocabulary richness to detect spam apps.

Vocabulary Richness (VR) = [equation not reproduced]

The researchers expected a low VR for spam apps because of the repetition of keywords. However, the result was the opposite of this expectation. Surprisingly, apps with a VR close to 1 were likely to be spam apps, and none of the non-spam apps had a high VR. This might be due to the terse style of app descriptions among spam apps.

Checkpoint S4: Does the app description contain unrelated keywords or references?

A common spamming technique is to add unrelated keywords so that an app appears in more search results; the topics of such keywords can vary significantly.
To address these limitations, a new strategy was proposed: counting mentions of popular application names in an app's description.

In the previous research, the names of the top-100 apps were used to count the number of mentions.

Only 20% of spam apps mentioned popular apps more than once in their description, whereas 40% to 60% of non-spam apps did so. The authors found that many of the top apps have social media interfaces and fan pages to stay connected with users. These mentions can therefore serve as one identifier to discriminate spam from non-spam apps.

Checkpoint S5: Does the app description contain excessive references to other applications from the same developer?

The feature is the number of times a developer's other app names appear. Only 10 spam apps were assigned this checkpoint, because the descriptions contained links to the applications rather than the app names.

Checkpoint S6: Does the developer have multiple apps with approximately the same description?

For this checkpoint, 3 features were considered:

The total number of other apps developed by the same developer.
The total number of apps with an English description, used to measure description similarity.
The number of apps from the same developer whose description has a cosine similarity (s) of over 60%, 70%, 80%, and 90%.

Pre-processing was required to calculate the cosine similarity:

First, convert the words to lower case and remove punctuation symbols.
Then represent each document as a word frequency vector.

Cosine similarity equation:

\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}

http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/

They observed that the most discriminative feature was the similarity between app descriptions. Only 10% to 15% of the non-spam apps had a description similarity of 60% with 5 other apps developed by the same developer. On the other hand, more than 27% of the spam apps had a description similarity of 60%.
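The pre-processing and cosine similarity steps for checkpoint S6 can be sketched in Python as follows. This is only a minimal illustration, not the code used in the original study: the function names are hypothetical, and treating every run of letters and digits as a word is an assumption about how punctuation was removed.

```python
import math
import re
from collections import Counter


def word_frequency_vector(text):
    """Lower-case the text, drop punctuation, and count each word."""
    # Assumption: a "word" is any run of ASCII letters or digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(words)


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-frequency vectors of two texts."""
    vec_a = word_frequency_vector(text_a)
    vec_b = word_frequency_vector(text_b)
    # Only words present in both descriptions contribute to the dot product.
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description is similar to nothing
    return dot / (norm_a * norm_b)
```

Two descriptions that differ only in case and punctuation score 1.0, and descriptions with no words in common score 0.0, so a threshold such as 0.6 corresponds to the 60% similarity level mentioned in the text.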
The similarity results indicate that spam apps tend to have multiple clones with similar app descriptions.

Checkpoint S7: Does the app identifier (appid) make sense and have some relevance to the functionality of the application, or does it appear to be auto-generated?

The application identifier (appid) is a unique identifier in the Google Play Store, named following the Java package naming convention. For example, the appid for Facebook is com.facebook.katana.

For 10% of the spam apps, the average word length in the appid is higher than 10, which was the case for only 2% to 3% of the non-spam apps. None of the non-spam apps had more than 20% non-letter bigrams in the appid, whereas 5% of the spam apps did.

Training and Result

From a random sample of 1,500 apps, 551 apps (36.73%) were suspected to be spam.

Methods

Automation

We used checkpoints S1 and S2 for our analysis because of their comparability and because they had the highest level of agreement among reviewers. Because access to the descriptions was limited, only 100 samples were used for testing.

We automated checkpoints S1 and S2 according to the following algorithms. A log transformation was applied to the collected data. This can be valuable both for making patterns in the data more interpretable and for helping to meet the assumptions of inferential statistics.

The most time-consuming part of writing the code was collecting the descriptions, which took more than two weeks to find and store. The raw data pointed to a description link for each appID. However, many of these links could not be found, because the app version was old or the app was no longer available. We therefore searched for this information manually on the web, and each description found was saved as a file named after its appID. (Diagram.) This allowed us to retrieve the descriptions more efficiently in the automation code.

S1 was automated by identifying 100 word bigrams and word trigrams that describe the functionality of applications.
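Counting how often such functionality phrases occur in a description can be sketched in Python as follows. This is a hedged illustration rather than the code used in this study: the function name is hypothetical, and both the tokenisation and the removal of apostrophes (so that "it's easy" matches the phrase "its easy") are assumptions.

```python
import re


def count_phrase_occurrences(description, phrases):
    """Count occurrences of known bigrams and trigrams in a description."""
    # Assumption: apostrophes are dropped and words are runs of letters/digits.
    text = re.sub(r"['\u2019]", "", description.lower())
    words = re.findall(r"[a-z0-9]+", text)
    total = 0
    for n in (2, 3):  # bigrams first, then trigrams
        ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
        total += sum(1 for gram in ngrams if gram in phrases)
    return total
```

Passing the phrase list as a Python set keeps each lookup cheap, which matters little for 100 phrases but keeps the sketch usable with longer lists.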
Because spam apps are likely not to contain these phrases in their description, we counted the number of occurrences in each application's description.

The full list of these bigrams and trigrams is given in Table 1.

Table 1. Bigrams and trigrams using the description of top apps

play games, are available, is the game, app for android, you can,
get notified, to find, learn how, get your, is used to,
your phone, to search, way to, core functionality, a simple,
match your, is a smartphone, available for, app for, to play,
key features, stay in touch, this app, is available, that allows,
to enjoy, take care of, you have to, you to, can you beat,
buy your, is effortless, its easy, to use, try to,
allows you, keeps you, action game, take advantage, tap the,
take a opinion, save your, makes it easy, follow what, is the free,
is a global, brings together, choose from, is a free, publish more,
play as, on the go, more information, learn more, turns on,
is an app, face the challenges, game from, in your pocket, your device,
on your phone, make your life, with android, it helps, delivers the,
offers essential, is a tool, full of features, for android, lets you,
is a simple, it gives, support for, need your help, enables your,
game of, how to play, at your fingertips, to discover, brings you,
to learn, this game, play with, it brings, navigation app,
makes mobile, is a fun, your answer, drives you, strategy game,
is an easy, game on, your way, app which, on android,
application which, train your, game which, helps you, make your

S2 had the second highest level of agreement among the three reviewers in the previous study. Among the 551 identified spam apps, 144 apps were confirmed by S2: 63 with agreement from 3 reviewers and 81 with agreement from 2 reviewers.

We knew from the results of the previous research that the total number of words in the description, the percentage of numeric characters, the percentage of non-alphabet characters, and the percentage of common English words would be the most distinctive features. We therefore automated the total number of words in the description and the percentage of common English words using C++.

Algorithm 1.
Counting the total number of bi/tri-grams in the description

In the literature, 16 features were used to extract information for checkpoint S2. This characterization was done with a wrapper method using a decision tree classifier. The authors found that 30% of spam apps have fewer than 100 words in their description, while only 15% of the most popular apps have fewer than 100 words. We extracted a simple but key point from their result: the number of words in the description and the percentage of common English words. This was developed in C++ as follows.

Algorithm 2. Counting the total number of words in the description

    #include <string>

    // Count words by counting the single spaces that separate them.
    int count_Words(std::string input_text) {
        int number_of_words = 1;
        for (std::size_t i = 0; i < input_text.length(); i++) {
            if (input_text[i] == ' ')
                number_of_words++;
        }
        return number_of_words;
    }

The percentage of common English words was not implemented properly because of the difficulty of selecting a standard word list. However, the following is a sketch of the code that we will develop in a future study, assuming the common-word list is supplied as a stream with one word per line.

Algorithm 3. Calculate the percentage of common English words (CEW) in the description

    #include <istream>
    #include <sstream>
    #include <string>

    // Count the words of input_text found in the common-word list readFile.
    int count_CEW(const std::string& input_text, std::istream& readFile) {
        int number_of_words = 0;  // start at 0: only matches are counted
        std::string CEW, word;
        while (std::getline(readFile, CEW)) {
            std::istringstream words(input_text);
            while (words >> word)
                if (word == CEW) number_of_words++;
        }
        return number_of_words;
    }

    // Multiply before dividing so integer division does not truncate to 0.
    int percentage(int c_words, int words) {
        return words == 0 ? 0 : (c_words * 100) / words;
    }

Normalization

The S1 and S2 variables ranged between a minimum and a maximum value. Because the data were highly skewed, normalization was strongly required. Note that this is the rescaling of feature values, which is distinct from database normalization (the process of organizing data into tables so that the results of using the database are always unambiguous and as intended, which is intrinsic to relational database theory).

Using Excel, we normalized the data as shown in the following diagram.

Through normalization, the transformed data fall between 0 and 1. The range of 0 to 1 was important for the later LVQ step.

Diagram. Excel spreadsheet of automated data (left) and normalized data (right)

After the transformation, we wanted to test the data to show how the LVQ algorithm works with the modified attributes.
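The log transformation and the rescaling to the range 0 to 1 were carried out in Excel, so no code exists for them in this report. A minimal Python sketch of the same idea is given here for illustration. The function names are hypothetical, and the use of log(1 + x) (so that a count of zero stays defined) and of min-max scaling are assumptions about the exact formulas used.

```python
import math


def log_transform(values):
    """Apply log(1 + x) to reduce skew; zero counts map to zero."""
    return [math.log1p(v) for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


# Example: word counts are log-transformed first, then scaled to [0, 1].
word_counts = [84, 180, 692]
scaled = min_max_normalize(log_transform(word_counts))
```

Keeping both attributes in the same 0 to 1 range prevents the attribute with the larger raw values from dominating the Euclidean distance used by LVQ.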
We therefore sampled only 100 records from the modified data set. Even though the result was not significant, it was important to run this test, because after this step we can add more attributes in a future study and it becomes possible to adjust the calibration. We randomly sampled 50 entries each from the top-100 ranked apps and from the pre-identified spam data. The top-100 ranked apps were assumed to be highly likely to be non-spam apps.

Diagram.

Initial Results

We used Python to perform Learning Vector Quantization (LVQ).

LVQ is a prototype-based supervised classification algorithm that belongs to the field of artificial neural networks. It can be applied to multi-class classification problems, and the prototypes are modified during the training process.

The information processing objective of the algorithm is to prepare a set of codebook (or prototype) vectors in the domain of the observed input data samples and to use these vectors to classify unseen examples.

An initially random pool of vectors is prepared, which are then exposed to training samples. A winner-take-all strategy is employed, where one or more of the most similar vectors to a given input pattern are selected and adjusted to be closer to the input vector and, in some cases, further away from the winner for runners-up. The repetition of this process results in a distribution of codebook vectors in the input space which approximates the underlying distribution of samples from the test dataset.

Our experiments were done using only the sampled records described above because of the data size. We performed 10-fold cross validation on the data. It gave an average accuracy of 56%, which we consider quite high relative to the previous study, given that only two attributes were used to distinguish spam from non-spam.

The LVQ program was built in 3 steps: Euclidean Distance, Best Matching Unit, and Training Codebook Vectors.
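The three steps can be combined into a single training update, sketched here in Python. This is a simplified illustration under stated assumptions, not the exact program used for the results above: it assumes the last element of every row is the class label, uses the basic LVQ1 rule with a fixed learning rate, and the name lvq_update is hypothetical.

```python
import math


def euclidean_distance(row1, row2):
    """Distance over the feature columns; the last column is the label."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row1[:-1], row2[:-1])))


def best_matching_unit(codebooks, row):
    """Return the codebook vector closest to the given row."""
    return min(codebooks, key=lambda cb: euclidean_distance(cb, row))


def lvq_update(codebooks, row, learning_rate):
    """One LVQ1 step: move the winner toward the row if the labels agree,
    and away from it if they differ. The winner is modified in place."""
    bmu = best_matching_unit(codebooks, row)
    sign = 1.0 if bmu[-1] == row[-1] else -1.0
    for i in range(len(row) - 1):
        bmu[i] += sign * learning_rate * (row[i] - bmu[i])
    return bmu


# Example with the two attributes used here and labels b (spam) / g (not spam).
codebooks = [[0.2, 0.0, "b"], [0.3, 1.0, "g"]]
lvq_update(codebooks, [0.25, 1.0, "g"], learning_rate=0.5)
```

In a full training run this update would be repeated over every training row for several epochs, usually with a learning rate that decays over time.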
1. Euclidean Distance

The distance between two rows in a dataset is required; it is computed across all dimensions of the dataset.

The formula for calculating the distance between two rows x and y is

d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}

where the difference between the two rows is taken, squared, and summed over the p variables.

    from math import sqrt

    def euclidean_distance(row1, row2):
        distance = 0.0
        # The last column holds the class label, so it is skipped.
        for i in range(len(row1) - 1):
            distance += (row1[i] - row2[i]) ** 2
        return sqrt(distance)

2. Best Matching Unit

Once the Euclidean distance from a row to every codebook vector has been calculated, the codebook vectors are sorted by their distance and the closest one is returned.

    def get_best_matching_unit(codebooks, test_row):
        distances = list()
        for codebook in codebooks:
            dist = euclidean_distance(codebook, test_row)
            distances.append((codebook, dist))
        # Sort by distance and return the closest codebook vector.
        distances.sort(key=lambda tup: tup[1])
        return distances[0][0]

3. Training Codebook Vectors

Codebook patterns were constructed from random features in the training dataset.

    from random import randrange

    def random_codebook(train):
        n_records = len(train)
        n_features = len(train[0])
        # Pick each feature value from a randomly chosen training record.
        codebook = [train[randrange(n_records)][i] for i in range(n_features)]
        return codebook

Future work

While writing this report, I found that data collection from the Google Play Store can be automated using a Java client. This will increase the size of the dataset and could improve accuracy while saving a great deal of time. Because of the small number of attributes and the small random sample, it is not appropriate to call the result of this research significant. However, a basic model was developed that can be used to improve accuracy.

Acknowledgement

Last summer, I did some research reading work under the supervision of Associate Professor Julian Jang-Jaccard. I received really great support from Julian and INMS. Thanks to the financial support I received from INMS, I could fully focus on my academic research, and I benefited a great deal from this amazing opportunity.

The following is a general report of my summer research.

At the beginning of the summer, I examined the paper A Detailed Analysis of the KDD CUP 99 Data Set by M. Tavallaee et al. This gave me a basic idea of how to handle machine learning techniques.
The techniques covered were KNN and LVQ.

The main project followed the paper Why My App Got Deleted: Detection of Spam Mobile Apps by Suranga Seneviratne et al.

I have tried my best to keep the report simple yet technically correct. I hope I have succeeded in my attempt.

Reference

Appendix

Modified Data

Number of words (thousands)   Bigram/trigram   Identified as spam (b) / not (g)
0.084   0   b
0.18    0   b
0.121   0   b
0.009   1   b
0.241   0   b
0.452   0   b
0.105   1   b
0.198   0   b
0.692   1   b
0.258   1   b
0.256   1   b
0.225   0   b
0.052   0   b
0.052   0   b
0.021   0   b
0.188   1   b
0.188   1   b
0.092   1   b
0.098   0   b
0.188   1   b
0.161   1   b
0.107   0   b
0.375   0   b
0.195   0   b
0.112   0   b
0.11    1   g
0.149   1   g
0.368   1   g
0.22    1   g
0.121   1   g
0.163   1   g
0.072   1   g
0.098   1   g
0.312   1   g
0.282   1   g
0.229   1   g
0.256   1   g
0.298   0   g
0.092   0   g
0.189   0   g
0.134   1   g
0.157   1   g
0.253   1   g
0.12    1   g
0.34    1   g
0.57    1   g
0.34    1   g
0.346   1   g
0.126   1   g
0.241   1   g
0.162   1   g
0.084   0   g
0.159   0   g
0.253   1   g
0.231   1   g