Learning Multi-Label Topic Classification Of ATA News Articles


(in our case, an article), which result (in our case, a topic label) is most relevant to the query? A common weighting scheme used in search and information retrieval is term frequency–inverse document frequency (tf-idf), a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is used in particular in web search to rank pages by keyword. We construct a simple variant of this scheme for our problem.

The first element is the term frequency (tf), i.e. the number of times that token w occurs in an article a, summed across all the articles in a particular class:

$$\forall w,\quad \mathrm{tf}(w) \;=\; \sum_{\text{articles } a} \frac{\sum_{j=1}^{n(a)} \mathbf{1}\{w = x_j\}}{\sum_{j=1}^{n(a)} x_j}$$

Intuitively, because we are summing over all the articles in a given class, the tf-score captures both the frequency of a word in each article and the frequency of a word across articles in a given class; words that appear with high frequency across many articles in a class will be assigned particularly high tf-scores.

Next, the inverse document frequency (idf) is a measure of whether a particular term is common or rare across the whole corpus of articles. It is obtained by dividing the total number of words in the corpus by the count of the instances of the particular word in the data, and then taking the logarithm of that quotient:

$$\forall w,\quad \mathrm{idf}(w) \;=\; \log\!\left(\frac{\sum_{\text{articles } a}\sum_{\text{words } j} x_j^{(a)}}{\sum_{\text{articles } a} x_w^{(a)}}\right)$$

Intuitively, words that occur frequently in a specific class but infrequently in the corpus in general constitute high-information words for that class, while those that appear often across the corpus are less informative. Multiplying the tf-score of a word by its idf-score allows us to take both aspects into consideration. Thus, we compute for each token the tf-idf score:

$$\forall w,\quad \mathrm{tfidf}(w) \;=\; \mathrm{tf}(w)\cdot\mathrm{idf}(w).$$

For each class we short-listed the 100 most relevant words (the highest-information words for each class) and used only them to score our testing examples.
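As a concrete sketch, the per-class tf-idf computation and 100-word short-lists described above might look like the following Python. The function name `class_tfidf_shortlists`, the input format (token lists grouped by class), and the normalization details are our assumptions for illustration, not the authors' code.

```python
import math
from collections import Counter

def class_tfidf_shortlists(articles_by_class, shortlist_size=100):
    """Compute per-class tf-idf scores and keep the top-scoring words.

    articles_by_class: dict mapping a class label to a list of articles,
    where each article is a list of tokens (a hypothetical format).
    """
    # Corpus-wide counts for the idf term.
    corpus_counts = Counter()
    total_tokens = 0
    for articles in articles_by_class.values():
        for tokens in articles:
            corpus_counts.update(tokens)
            total_tokens += len(tokens)

    shortlists = {}
    for label, articles in articles_by_class.items():
        # tf: occurrences of w summed across the class's articles,
        # normalized by the total number of tokens in those articles.
        class_counts = Counter()
        class_tokens = 0
        for tokens in articles:
            class_counts.update(tokens)
            class_tokens += len(tokens)
        tfidf = {}
        for w, count in class_counts.items():
            tf = count / class_tokens
            # idf: log(total corpus tokens / occurrences of w in corpus).
            idf = math.log(total_tokens / corpus_counts[w])
            tfidf[w] = tf * idf
        # Keep only the highest-information words for this class.
        top = sorted(tfidf, key=tfidf.get, reverse=True)[:shortlist_size]
        shortlists[label] = {w: tfidf[w] for w in top}
    return shortlists
```

Note how a word like "year" that occurs in several classes receives a low idf and hence a low tf-idf, while class-specific words rise to the top of each short-list.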
For a given testing example, we compute a score per category: each time a token in the testing article also appears in the short-list of the category, we add the tf-idf score of that word to the total score for that category:

$$\mathrm{score}(c) \;=\; \sum_{\text{words } w \,\in\, \text{article} \,\cap\, \text{shortlist}(c)} \mathrm{tfidf}(w)$$

Figure 4 shows a plot of the top 100 tf-idf values for Business. The shape of the function reflects the intuition that one can capture very high information about a class with only a few words; in fact, the plot of the next 500 words shows a further exponential drop in tf-idf values.

Figure 4. Top 100 TFIDF values for the Business category

B. Learning a threshold

We now have a final score per category for our test article, but we still have to decide whether the article belongs to each class or not. For this we can use two different machine learning techniques: k-means, which we implemented, and logistic regression, which we did not implement but describe briefly below.

1) Threshold with k-means: The result of the scoring of a testing article often looks like this:

Figure 5. Idealized example of learning a selection threshold with k-means

Our goal is to cluster these points into two different categories: "high" scores and "low" scores. For the high scores we will predict 1, meaning that the article belongs to those classes, and for the low scores we will predict 0, meaning that it does not. An advantage of the k-means approach is that since threshold selection for each article is independent, we are able to select a threshold and make a reasonable prediction from the very first article we score. This feature of k-means is also satisfying from the perspective of finding a learning method that is more similar to the way human readers classify: we can predict scores immediately and independently of previous
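The per-category scoring and the two-cluster threshold can be sketched as follows. The names `score_article` and `kmeans_threshold` are hypothetical, and the one-dimensional two-means loop is a minimal illustration of the idea rather than the authors' implementation.

```python
def score_article(tokens, shortlists):
    """Per-category score: sum the tf-idf of each token occurrence
    that appears in that category's short-list."""
    return {label: sum(words[w] for w in tokens if w in words)
            for label, words in shortlists.items()}

def kmeans_threshold(scores, iters=20):
    """Split the category scores into a 'low' and a 'high' cluster
    with 1-D 2-means; predict 1 for categories in the high cluster."""
    values = list(scores.values())
    lo, hi = min(values), max(values)          # initial centroids
    for _ in range(iters):
        # Assign each score to its nearest centroid, then recompute means.
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - hi) < abs(v - lo)]
        if low:
            lo = sum(low) / len(low)
        if high:
            hi = sum(high) / len(high)
    return {label: int(abs(v - hi) < abs(v - lo))
            for label, v in scores.items()}
```

Because the clustering uses only the scores of the one article being classified, the threshold adapts per article, which is the independence property noted above.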

words such as 'year' and 'month' turned out to be largely responsible for the high rate of false positives. Indeed, if one of these words appears in an article, the article is very likely to be tagged Business, whereas the presence of this word actually doesn't mean much.

Figure 7. Top 20 TFIDF words and values for the Business category

Figure 8. Error determined using top 100 tf-idf words for each class, and tested on 500 articles

Because we wanted to avoid removing these words artificially from our top 100 tf-idf lists, and were concerned that other low-value words might take their place if we did, we devised a scoring scheme that places a heavier emphasis on the idf-score of a word, such that words with low idf-scores would be unlikely to have high tf-idf scores. In the original idf model, many of the high-tfidf but low-information words had extremely low idf-scores, reflecting their generally high frequency across the corpus. So we needed to modify the tfidf function such that words with low idf values would be mapped to very low overall tfidf scores. On the other hand, we wanted an idf function that ensured that words with high idf values stayed in O(log(inverse document frequency)). We handcrafted the following function with the desired properties. The resulting model is expressed by the formal equation:

$$\forall w,\quad \mathrm{idf}_\alpha(w) \;=\; \exp\!\left(-\,\alpha\,\frac{\sum_{\text{articles } a} x_w^{(a)}}{\sum_{\text{articles } a}\sum_{\text{words } j} x_j^{(a)}}\right)\cdot \log\!\left(\frac{\sum_{\text{articles } a}\sum_{\text{words } j} x_j^{(a)}}{\sum_{\text{articles } a} x_w^{(a)}}\right) \quad (1)$$

Near 0, the term inside the exponential determines the shape of the function, while near infinity, the term inside the log is prevalent. This approach yielded the following results:

Figure 9. Error determined using top 100 tf-idf words for each class, and tested on 500 articles

The results are better on average, but the errors are now evenly distributed across the classes.
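A minimal sketch of such a damped idf follows, assuming the exponential factor multiplies the usual logarithmic term; the exact functional form and the value of α here are reconstructions from the surrounding text, not the authors' exact constants.

```python
import math

def damped_idf(word_count, total_count, alpha=100.0):
    """Modified idf: the usual log(total/count) term multiplied by an
    exponential damping factor exp(-alpha * count/total). Very frequent
    words (low classic idf) are pushed toward zero, while rare words
    keep a score in O(log(total/count)). The form and alpha value are
    assumptions for illustration."""
    ratio = word_count / total_count          # document-frequency ratio
    return math.exp(-alpha * ratio) * math.log(total_count / word_count)
```

For a rare word (say 1 occurrence in a 10,000-token corpus) the damping factor is close to 1 and the score stays near the plain log; for a word covering a fifth of the corpus the score collapses to nearly zero.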
Nevertheless, the results are still far from the performance of Naive Bayes, which leaves us the opportunity for further investigation.

On the whole, our research validated the common approach of using binary classifiers to learn multi-label topic classifications for news articles. The tf-idf approach captures some interesting aspects of the intuition behind how people may classify news articles, but we were not able to lower the error produced by the tf-idf model sufficiently to make it practically competitive with the binary classification scheme. Future research might look into alternate methods for scoring functions based on tf-idf and the notion of finding high-information words to classify multi-label articles.


