
Hindawi
Wireless Communications and Mobile Computing
Volume 2021, Article ID 5553635, 8 pages
https://doi.org/10.1155/2021/5553635

Research Article
Coword and Cluster Analysis for the Romance of the Three Kingdoms

Chao Fan (1,2) and Yu Li (1,2)
1 The School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
2 Jiangsu Key Laboratory of Media Design and Software Technology, Jiangnan University, Wuxi 214122, China

Correspondence should be addressed to Chao Fan; fanchao@jiangnan.edu.cn

Received 1 March 2021; Revised 12 March 2021; Accepted 19 March 2021; Published 1 April 2021

Academic Editor: Shan Zhong

Copyright 2021 Chao Fan and Yu Li. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The Romance of the Three Kingdoms (RTK) is a classical Chinese historical novel by Luo Guanzhong. This paper establishes a research framework for analyzing the novel with coword and cluster analysis. First, we segment the full text of the novel and extract the names of historical figures in the RTK novel. Based on coword analysis, a social network of historical figures is constructed. We calculate several network features and perform cluster analysis. In addition, a modified clustering method using edge betweenness is proposed to improve the clustering result. Finally, both quantified and visualized results are displayed to confirm our approach.

1. Introduction

The Romance of the Three Kingdoms, written by Luo Guanzhong, is generally considered to be one of the four great classical novels in Chinese literature. It describes the turbulent years from the end of the Han dynasty to the Three Kingdoms (Wei, Shu, and Wu) era in Chinese history.
More than 1,000 characters are vividly portrayed in this historical novel.

In this research, the text of the original novel is divided into sentences. According to coword analysis, two words that appear in the same document have a certain intrinsic relationship. Thus, we count how often two names cooccur in a sentence. Each character name is treated as a node and each cooccurrence as a link, so that an undirected network can be established. Furthermore, various network features are computed to analyze the relationships among characters in the novel. Cluster analysis is employed to explore the hierarchical structure of RTK. Finally, an improved clustering algorithm that cuts high-betweenness edges is proposed, which achieves a better clustering result than the common approach.

This manuscript is organized as follows. Section 2 reviews related work. Data preparation is discussed in Section 3. Sections 4 and 5 present the network feature analysis, cluster analysis, experiments, and the analysis of results. Conclusions are drawn in Section 6.

2. Related Work

Early research on the RTK concentrated on qualitative analysis of aspects such as writing style, genealogy, and characters. Later, quantitative approaches were adopted to analyze the novel. Coword analysis is one important method of this kind; it was first devised by French scholars and introduced into the information science field by Callon [1]. According to the theory of coword analysis, there is a close connection between two words when they appear in the same sentence. More cooccurrences of the two words indicate a closer relationship between them. In this paper, we consider the cooccurrence of character names in a sentence of the RTK novel.

Numerous studies on literature analysis have been based on coword analysis. Ravikumar et al. [2] inspect 959 articles in scientometrics using the coword analysis approach and find that the topics of publications are shifting to new themes. In the medical literature, one study uses this tool to process publications spanning thirty years [3]. Another work focuses on past themes and future trends in medical tourism research [4]. Employing coword analysis, some researchers attempt to identify the themes and trends of main knowledge areas including engineering, health, public administration, and management [5]. Moreover, a coword network has been established to analyze the relationships of characters in the Dream of the Red Chamber [6]. Wang et al. build a similar network for the Romance of the Three Kingdoms [7].

After a social network is created based on coword analysis, cluster analysis is carried out with a hierarchical clustering algorithm. Two types of algorithm are commonly used to build the hierarchy. The divisive approach treats all data as one cluster and performs splits; it is used in many studies [8]. In contrast, agglomerative hierarchical clustering is a bottom-up method with many variants [9]. It merges the two most similar clusters at each step. The agglomerative method is used in this work because it can provide a visual expression of the clustering results.

3. Data Preparation

3.1. Building RTK Corpus and Preprocessing. As many copies of the novel can be downloaded from the Internet, we selected a high-quality text document (https://72k.us/file/22215238408791478) in Chinese characters and established the RTK corpus by cleaning the original data. Words with errors were corrected, and wrong punctuation was removed manually.

The raw text is preprocessed using the natural language processing toolkit ICTCLAS (http://ictclas.nlpir.org/). We acquired a name list of RTK characters from the Internet and added it to the dictionary of ICTCLAS. Then, lexical analysis is executed to segment Chinese sentences into words, among which the names of characters can be found.

3.2. Creation of Character Name Network. Based on coword analysis, an undirected network of character names can be created by counting the cooccurrences of two names in sentences. We treated a full name, its courtesy name, and its abbreviated name as one name. For example, "Cao Cao" is equal to "Cao Mengde" and "Mengde," which means the three names refer to the single person "Cao Cao."

The final network of character names has 1,133 nodes and 5,844 links. As depicted in Figure 1, the size of a node indicates the count of the character name in the novel, and the thickness of a link corresponds to the frequency with which two characters appear together.

Figure 1: A network of character names (top 80 in node frequency).

4. Network Feature Analysis

4.1. Degree Distribution. As the degree of a node is the number of links adjacent to it, the degree distribution is the probability distribution of these degrees. A power exponent γ can be used to describe the curve if the network's degree distribution follows a power law, $P(k) \sim k^{-\gamma}$.

For the network of RTK characters, the ten characters with the highest degree are Cao Cao, Liu Bei, Zhuge Liang, Sun Quan, Zhao Yun, Guan Yu, Yuan Shao, Sima Yi, Lv Bu, and Wei Yan. The average degree of the network is 10.31, and the degree distribution is illustrated in Figure 2. It appears to be a heavy-tailed distribution (see Figure 2(a)). As the data can be approximated by the linear function $y = -1.2864x + 2.5654$ on a log-log scale in Figure 2(b), we conclude that the degree distribution follows a power-law distribution.

Figure 2: Degree distribution and power-law degree distribution on a log-log scale (panels (a) and (b); P(k): fraction of nodes, k: node degree).

4.2. Average Shortest-Path Length. The shortest path between two nodes is a path on which the number of links is minimized. Accordingly, the length of the shortest path is the number of links that the path contains. The sum of all shortest-path lengths divided by the number of node pairs is the average shortest-path length.

The average shortest-path length of the RTK network is 3.1743. Hence, one character can be connected to another in about three steps on average, which means any two characters are within "three degrees of separation."

The length of the longest shortest path in the network is called the diameter. The RTK network's diameter is 9. One path of this length runs from Liu Ai to Zhang Shang: Liu Ai, Wang Li, Dong Zhao, Cao Hong, Cao Cao, Sima Yan, Yang Hu, Du Yu, Lu Jing, and Zhang Shang. The distribution of the shortest-path length between any two characters is shown in Figure 3. According to the figure, 47.63% of the shortest paths in the RTK network have length 3, and about 92.15% have a length between 2 and 4.

Figure 3: Distribution of shortest-path length.

4.3. Clustering Coefficient. A clustering coefficient [10, 11] measures the extent to which a network's nodes tend to cluster together. The clustering coefficient of node x is given by

$$C_x = \frac{2E_x}{k_x (k_x - 1)}. \tag{1}$$

$E_x$ is the number of existing links among the neighbors of node x. As $k_x$ is the degree of node x, $\frac{1}{2} k_x (k_x - 1)$ represents the number of potential links among node x's neighbors.
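As a minimal sketch of equation (1), the following Python function computes the clustering coefficient of a single node. The adjacency-set input format, the function name `local_clustering`, and the toy graph are assumptions made for illustration and are not taken from the RTK data; nodes with fewer than two neighbors are assigned 0 here by convention.

```python
from itertools import combinations


def local_clustering(adj, x):
    """Return C_x = 2 * E_x / (k_x * (k_x - 1)) for node x.

    adj maps each node to the set of its neighbors in an undirected
    graph (hypothetical input format). Nodes with fewer than two
    neighbors get 0.0 by convention.
    """
    neighbors = adj[x]
    k = len(neighbors)
    if k < 2:
        return 0.0
    # E_x: links that actually exist among the neighbors of x
    e_x = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * e_x / (k * (k - 1))


# Toy graph invented for illustration (not data from the novel)
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
coeffs = {node: local_clustering(toy, node) for node in toy}
```

In the toy graph, node A has three neighbors with one link among them, so its coefficient is 2 · 1 / (3 · 2) = 1/3, while B and C each reach 1 because their two neighbors are linked.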
The average value of $C_x$ over all $N$ nodes is the clustering coefficient of the whole network:

$$C = \frac{1}{N} \sum_{x} C_x. \tag{2}$$

A random network is produced by an Erdős-Rényi (ER) model using the same number of nodes and links as the RTK network. The comparison between the random network and the RTK network is shown in Table 1. The RTK network is a small-world network because it has a larger clustering coefficient as well as a smaller average shortest-path length compared with the random network.

Table 1: Comparison between RTK and random network.

                                RTK network    Random network
Number of nodes                 1,133          1,133
Number of links                 5,844          5,844
Average degree                  10.31          10.31
Average shortest-path length    3.1743         3.2702
Clustering coefficient          0.5306         0.0082

We choose the characters who clearly belong to the three groups of Wei, Shu, and Wu and calculate the network features of the three kingdoms separately. The results are summarized in Table 2.

Table 2: Comparison of three subnetworks and the whole network.

                                Shu       Wu        Wei       The whole network
Density                         …         …         …         0.0091
Clustering coefficient          …         …         0.6217    0.5306
Average shortest-path length    2.0563    2.3054    2.5953    3.1743
Diameter                        4         5         6         9

The character relationship networks within the three groups have high clustering coefficients and small average shortest-path lengths. Consequently, all three subnetworks are "small-world" networks. From Shu to Wu to Wei, the density and clustering coefficient of the subnetworks decrease sequentially, except for the clustering coefficient of Wu. In contrast, the average shortest-path length and diameter increase successively. This reflects a decrease in the closeness of the connections within the groups. In other words, the connections among characters in Wei are less close than those in Wu and Shu.

4.4. Density. The density of a network is the portion of all possible links that are actual connections. It can be calculated by formula (3), where N and E are the numbers of nodes and links:

$$d = \frac{2E}{N(N - 1)}. \tag{3}$$

The value is a fraction between 0 and 1. With N = 1,133 and E = 5,844, $d = 2 \times 5844 / (1133 \times 1132) \approx 0.0091$, so the RTK network is a sparse network.

4.5. Centrality. Centrality measures the importance of nodes. We consider degree centrality, betweenness centrality, and closeness centrality.

Degree centrality is a measure of centrality based on degree. A high-degree node is a local center within the network. Betweenness centrality expresses the extent to which a node falls on the shortest paths between other pairs of nodes. A node with a high betweenness is capable of controlling the interactions between two nonadjacent nodes [5]. Closeness centrality is a measure of the average shortest distance from a node to every other node. It evaluates how close a node is to all the other nodes [3].

The three centralities of characters in the RTK network are calculated separately. Table 3 gives the top ten characters for each centrality, with the centrality value listed in parentheses. From Table 3, we can find eight names listed under all three centralities: Cao Cao, Liu Bei, Zhuge Liang, Sun Quan, Zhao Yun, Guan Yu, Yuan Shao, and Sima Yi. They hold significant positions in the character network.

Table 3: Top 10 characters in rank with the highest centrality.

Ranking   Degree centrality       Betweenness centrality   Closeness centrality
1         Cao Cao (0.2094)        Cao Cao (0.1751)         Cao Cao (0.4528)
2         Liu Bei (0.2085)        Liu Bei (0.1304)         Liu Bei (0.4442)
3         Zhuge Liang (0.1714)    Zhuge Liang (0.1093)     Zhuge Liang (0.4313)
4         Sun Quan (0.1060)       Sun Quan (0.0695)        Sun Quan (0.4073)
5         Zhao Yun (0.0998)       Sima Yi (0.0430)         Guan Yu (0.3969)
6         Guan Yu (0.0972)        Zhao Yun (0.0413)        Zhao Yun (0.3963)
7         Yuan Shao (0.0813)      Liu Shan (0.0402)        Sima Yi (0.3924)
8         Sima Yi (0.0742)        Guan Yu (0.0375)         Wei Yan (0.3856)
9         Lv Bu (0.0716)          Yuan Shao (0.0369)       Yuan Shao (0.3842)
10        Wei Yan (0.0707)        Jiang Wei (0.0357)       Cao Ren (0.3824)

5. Cluster Analysis

5.1. Cooccurrence and Similarity Matrix. The cooccurrence matrix measures the frequency with which two characters appear together. A cooccurrence matrix of main characters in the RTK network is presented in Table 4. It is a symmetric matrix, and the entries on the diagonal show the frequencies with which the characters appear in the text.

Table 4: Cooccurrence matrix of main characters (rows and columns: Liu Bei, Cao Cao, Sun Quan, Zhuge Liang, Guan Yu, Zhang Fei).

Table 6: The clustering result of the RTK network (k is the final number of hierarchical clusters).

k     Precision   Recall    F score
11    43.83%      78.90%    56.35%
12    47.08%      78.90%    58.97%
13    71.10%      75.00%    73.00%
14    71.10%      75.00%    73.00%
15    87.66%      73.38%    79.89%
16    87.66%      62.99%    73.30%
17    87.66%      59.09%    70.60%
18    87.66%      50.97%    64.46%

Figure 4: A link with a high edge betweenness.

Table 7: The clustering result of the RTK network (k is the final number of hierarchical clusters).

Number of removals   Precision   Recall    F score
0 to 35 (step 5)     …           …         …
40                   88.96%      …         …
45                   88.96%      73.38%    80.42%
50                   89.94%      73.05%    80.62%
55                   89.94%      72.08%    80.02%
60                   47.73%      90.58%    62.52%

The cooccurrence of two characters cannot be used directly as the similarity because it is greatly affected by frequency. We normalize the cooccurrence matrix using the Ochiai coefficient [12] and obtain the similarity matrix. The Ochiai coefficient is defined by

$$K = \frac{n(A \cap B)}{\sqrt{n(A)\, n(B)}}. \tag{4}$$

As A and B are sets, $n(A)$ is the number of elements in A and $n(A \cap B)$ is the number of cooccurrences. The similarity matrix calculated by the Ochiai coefficient is described in Table 5.

Table 5: Ochiai similarity matrix of main characters (rows and columns: Liu Bei, Cao Cao, Sun Quan, Zhuge Liang, Guan Yu, Zhang Fei).
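The normalization in equation (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the matrix is a plain list of lists whose diagonal holds the name frequencies, the function name `ochiai_matrix` is hypothetical, and the counts are invented rather than taken from Table 4.

```python
import math


def ochiai_matrix(cooc):
    """Normalize a symmetric cooccurrence matrix with the Ochiai coefficient.

    cooc[i][i] is the frequency of name i and cooc[i][j] the number of
    cooccurrences of names i and j, so that
    K[i][j] = cooc[i][j] / sqrt(cooc[i][i] * cooc[j][j]).
    """
    n = len(cooc)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = math.sqrt(cooc[i][i] * cooc[j][j])
            # Guard against names that never occur
            sim[i][j] = cooc[i][j] / denom if denom > 0 else 0.0
    return sim


# Invented counts for three hypothetical names (not the values of Table 4)
cooc = [
    [100, 30, 10],
    [30, 36, 6],
    [10, 6, 25],
]
sim = ochiai_matrix(cooc)
```

With these invented counts, the first two names cooccur 30 times and appear 100 and 36 times, giving 30 / sqrt(100 · 36) = 0.5. The diagonal is always 1, which shows how the normalization removes the effect of raw frequency.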

Figure 5: The change of F score according to the number of removals.

5.2. Hierarchical Clustering

5.2.1. Clustering Algorithm. An agglomerative hierarchical clustering algorithm using the Ochiai similarity matrix is implemented to complete the task of cluster analysis. It is a bottom-up approach. Initially, each node is treated as a single cluster. At each step, the two clusters with the largest Ochiai similarity are combined into a new, bigger cluster. The algorithm stops when it reaches a preset threshold or when only one cluster is left. The similarity between two clusters is defined as the average similarity between each of their nodes.

5.2.2. Evaluation. The P-IP scores [13] are adopted to measure the clustering result. There are m character names and n clusters. Suppose $C_{ij}$ is the number of character names marked with label j for character name i, where $j = \arg\max_{k \in \{1, 2, \ldots, n\}} C_{ik}$. The precision and recall of character name i are given by

$$P_i = \frac{C_{ij}}{\sum_{l=1}^{m} C_{lj}}, \qquad R_i = \frac{C_{ij}}{\sum_{k=1}^{n} C_{ik}}. \tag{5}$$

Thus, the F score is calculated by

$$F_i = \frac{2 P_i R_i}{P_i + R_i}. \tag{6}$$

The overall precision, recall, and F score are the averages of the corresponding values. Moreover, the gold standard is built by marking each character name with a specific kingdom tag. For example, Cao Cao is tagged with "Wei" and Liu Bei is tagged with "Shu." In total, 308 character names with definite kingdom tags are obtained for cluster analysis.

5.2.3. Clustering Result. The result of hierarchical clustering is shown in Table 6. The F score achieves its best value of 79.89% when the number of clusters k is 15.

5.3. Improved Clustering Algorithm. In the RTK network, some characters play a vital role in the interconnections between different kingdoms, such as "Lu Su" between Wu and Shu and "Huang Gai" between Wu and Wei. These characters have a high betweenness according to the definition of betweenness (see Section 4.5).
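To illustrate why such bridging characters score highly, the sketch below computes unnormalized node betweenness with Brandes' algorithm on a toy graph in which a single node joins two tightly knit groups. The paper does not state which implementation it uses; the function name `betweenness`, the adjacency-set format, and the graph itself are assumptions for illustration only.

```python
from collections import deque


def betweenness(adj):
    """Unnormalized node betweenness for an unweighted, undirected graph.

    adj maps each node to the set of its neighbors (hypothetical format).
    Uses Brandes' algorithm: one breadth-first search per source node,
    followed by a backward accumulation of path dependencies.
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        dist = {s: 0}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        preds = {v: [] for v in adj}  # predecessors on shortest paths
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to s
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair was visited from both endpoints
    return {v: b / 2.0 for v, b in score.items()}


# Toy graph: two triangles joined only through the bridge node "X"
toy = {
    "a1": {"a2", "a3", "X"},
    "a2": {"a1", "a3"},
    "a3": {"a1", "a2"},
    "X": {"a1", "b1"},
    "b1": {"b2", "b3", "X"},
    "b2": {"b1", "b3"},
    "b3": {"b1", "b2"},
}
bc = betweenness(toy)
```

In this toy graph, every shortest path between the two triangles passes through X, so X obtains the highest score, while nodes such as a2 that lie on no shortest path between other pairs score 0.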
Further, node betweenness can be extended to "edge betweenness" [14]. A link with a high edge betweenness is often a bridge between different clusters (see the red link in Figure 4). Therefore, removing these high-betweenness links by setting their similarity to 0 reduces the intercluster similarity and eventually improves the clustering result. The removal operation can be introduced as a preprocessing step before conducting the cluster analysis.

The improved clustering algorithm using edge betweenness is executed, and the result is displayed in Table 7. When the number of removals is zero, the result is the baseline of the original algorithm. With an adequate number of removals, the F score reaches a peak of 80.87%. Nevertheless, removing too many links destroys the whole network and makes the F score decline dramatically (see Figure 5).

5.4. Analysis. Data visualization is also given to display the characteristics of historical figures in the RTK network. As hierarchical clustering can be depicted as a tree-based dendrogram, we visualize the character relationships in the RTK novel from Chapter 43 to Chapter 50, a period describing "the battle of Red Cliffs" (see Figure 6).

As can be seen from Figure 6, the dendrogram can be divided manually into six parts. H1 and H3 are groups containing characters from "Wu," like Sun Quan and Sun Ce. H2 encompasses main characters from "Shu" and "Wu" in the battle of Red Cliffs: Liu Bei, Guan Yu, Zhuge Liang, Zhou Yu, Lu Su, etc. However, there are two exceptions, Cao Cao and Cheng Yu, who appear here because they are highly connected with other main characters in the battle of Red Cliffs. Further, H1, H3, and H2 merge into a bigger cluster in the hierarchical clustering because these characters are from the alliance of "Wu" and "Shu" against Cao's army.

On the other hand, H5 is composed of characters from the large group "Wei," including Xiahou Dun, Xiahou Yuan, Cao Ren, and Cao Hong. H6 includes a few characters from "Shu" or "Wu." H4 is not a cluster, and it contains a number of characters from different kingdoms.

Figure 6: Dendrogram of clustering result for the period of "the battle of Red Cliffs" (leaves are character names, grouped into H1 to H6; axis from 0.0 to 1.0).

6. Conclusions

This paper developed a general framework for analyzing the character relationshi
