Concept Identification of Directly and Indirectly Related Mentions Referring to Groups of Persons

Unsupervised concept identification through clustering, i.e., identification of semantically related words and phrases, is a common approach to identify contextual primitives employed in various use cases, e.g., text dimension reduction, i.e., replacing words with their concepts to reduce vocabulary size, summarization, and named entity resolution. We demonstrate the first results of an unsupervised approach for the identification of groups of persons as actors extracted from a set of related articles. Specifically, the approach clusters mentions of groups of persons that act as non-named entity actors in the texts, e.g., "migrant families" = "asylum-seekers." Compared to our baseline, the approach keeps the mentions of geopolitical entities separated, e.g., "Iran leaders" != "European leaders," and clusters (in)directly related mentions with diverse wording, e.g., "American officials" = "Trump administration."


Introduction
Methods for concept identification seek to identify words and phrases that refer to the same semantic concept. As such, concept identification is a crucial task employed in various use cases, such as information summarization, information extraction, named entity resolution, and coreference resolution. While in some domains, e.g., medicine, semantic (dis)similarities are clearly distinct, in others, e.g., the news domain, phrases referring to groups of persons are often semantically highly related yet conceptually different, e.g., "American officials" and "Israeli officials" have similar roles but act as different actors. Identification of conceptually fine-grained groups of persons is a challenging task due to two key issues: first, high semantic relatedness of mentions that nevertheless perform conceptually different roles, e.g., "immigration lawyers" and "undocumented immigrants." Second, event-specific coreferential relations are often prone to high lexical diversity due to word choice and labeling [7], e.g., "Dreamers" and "DACA recipients." In this work, we propose an unsupervised concept identification approach that automatically extracts conceptually fine-grained clusters of related mentions referring to groups of people from a set of text documents. We narrow down our problem statement to news articles since word choice is especially subtle and rich in the news domain. The goal of our approach is to extract from news stories those group-actors that are the main content elements yet are missed by current coreference resolution and named entity recognition.

(The final authenticated version is available online at https://doi.org/10.1007/978-3-030-71292-1_40; arXiv:2107.00955v1 [cs.CL] 2 Jul 2021.)
Scholars have proposed supervised tasks where a model is trained to identify domain-specific concepts, e.g., reactions to drugs [15,17], by automatically labeling phrases with their respective concepts, e.g., persons or other named entities. Most frequently, concept identification is an unsupervised task to explore the relations between the words or phrases contained in a text [8-10,15]. Unsupervised methods use clustering, e.g., K-means [9], which finds patterns between the elements without prior knowledge. Such methods are typically integrated as preprocessing or intermediate steps so that their results can be used in downstream analysis steps. While less bound to the content of text datasets, clustering-based methods are more difficult to use because one has to find a clustering parameter configuration that yields suitable results for the dataset at hand.

Methodology
We propose an unsupervised clustering approach that identifies mentions directly referring to the same group of individuals in a given context, e.g., "asylum-seekers" and "Central American immigrants," and groups of individuals semantically related to countries or organizations as representatives thereof, i.e., indirectly coreferential, e.g., "White House officials" - "Trump administration." For the clustering itself, we employ the core principles of two clustering algorithms: 1) the OPTICS clustering algorithm [1], i.e., we form clusters by decreasing cluster density; 2) hierarchical clustering (HC) [14], i.e., we use the weighted average linkage criterion to merge clusters.

Mention extraction
A mention is a noun phrase (NP) automatically extracted from a parsed text, e.g., by CoreNLP [11]. We extract NPs no larger than 20 words. For each mention, we assign a representative phrase (RP), i.e., a shortened version of the phrase that includes only the most frequent dependency parsing components of an NP: heads of NPs, compounds, and adjectival and noun modifiers. We use unique RPs as clustering units, i.e., we assume that within a narrow article-based context, identical RPs of different mentions m_i share the same meaning rp_l = rp(m_i). To select mentions referring to groups of persons, we apply the entity type identification methodology proposed by Hamborg et al. [6] and keep all mentions of four entity types: (1) multiple-person NE ("person-nes"), e.g., "Republicans," (2) multiple-person non-NE ("person-nns"), e.g., "GOP leaders," (3) single-person non-NE ("person-nn"), e.g., "a Republican attorney," and (4) group of people ("group"), e.g., "Republican establishment." Fig. 1 depicts how these types form hypernym-hyponym relations. While "group" is the most general and aggregated type, "person-nn" is the type with the largest level of detail, i.e., the single instances of the groups. Due to the comparably balanced level of detail inherent to concepts of the types "person-nes" and "person-nns," we coin their mentions core mentions.
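The RP assignment described above can be sketched as follows. This is a minimal illustration: the token list with dependency labels and a `head` marker is a simplified stand-in for a real CoreNLP dependency parse.

```python
# Sketch: derive a representative phrase (RP) from a parsed noun phrase.
# Tokens are (text, dep) pairs; the paper uses CoreNLP, here we assume the
# parse is already available and use simplified dependency labels.

KEEP_DEPS = {"head", "compound", "amod", "nmod"}  # NP heads, compounds, adjectival/noun modifiers

def representative_phrase(np_tokens):
    """Keep only the dependency components that form the RP."""
    kept = [text for text, dep in np_tokens if dep in KEEP_DEPS]
    return " ".join(kept)

# "the newly arrived Central American migrants"
np_tokens = [
    ("the", "det"),           # determiner -> dropped
    ("newly", "advmod"),      # adverbial modifier -> dropped
    ("arrived", "amod"),      # adjectival modifier -> kept
    ("Central", "compound"),  # compound -> kept
    ("American", "amod"),     # adjectival modifier -> kept
    ("migrants", "head"),     # NP head -> kept
]
print(representative_phrase(np_tokens))  # -> "arrived Central American migrants"
```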

Pipeline
Our approach consists of six stages where the first identifies cluster cores and subsequent stages expand the clusters: (1) preprocessing, (2) identify cluster cores, (3) form cluster bodies, (4) add border mentions, (5) form non-core clusters, and (6) merge final clusters. Fig. 2 depicts the principle of the approach.

Preprocessing
In early experiments, we observed that clustering the unweighted mean word vector representations of RPs, i.e., the mean vectors of the vectorized phrases' words, yielded ineffective concept separation, e.g., the phrases "American people" and "Mexican people" were clustered into one concept although they refer to different nations. Conversely, two phrases can be coreferential but only in a narrow event-determined context, e.g., "young illegals" - "DACA recipients." To improve the effectiveness of clustering, we apply two modifications to the vector representation: (1) we employ a weighting scheme for the named entity (NE) components of the RPs, and (2) we calculate more than one similarity matrix to introduce more than one level of similarity between RPs.
Word vector weighting In the narrow article-specific context, word vector weighting [21] increases the semantic proximity in the vector space and facilitates the identification of semantic relatedness and coreferential relations (cf. Fig. 3). We represent a phrase as the mean of its weighted words' embeddings:

V(rp_k) = (1/n) * sum_{i=1..n} w_i * v(i),   (1)

where v(i) is the vector representation of the i-th word and w_i is the weight assigned to this word. We use word2vec [13] as a word embedding model due to its ability to represent both single words and multi-word phrases, resulting in more precisely defined positions of phrases in the vector space. A vector representation V(rp_k) depends on its relation to the rp_l to which a similarity value is calculated. The weight w_i for a word v_i in (1) is selected as follows:

w_i = wt if v_i = ne(rp_k) and NG_{ne(rp_k), ne(rp_l)} > 0, else w_i = 1,

where ne(rp_i) is the NE extracted from rp_i, e.g., ne("Congress members") = "Congress," NG is a controlling matrix that allows or restricts similarity calculations between phrases that contain NEs, and wt = 1.7.
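A minimal sketch of this weighted mean phrase embedding. The 3-dimensional vectors are toy stand-ins for word2vec embeddings, and the boolean `ng_allows` stands in for the NE-grid lookup; the weight wt = 1.7 is taken from the text.

```python
import numpy as np

# Sketch: weighted mean phrase embedding. NE components of an RP are
# up-weighted (wt = 1.7 in the paper) when the NE-grid allows the
# comparison; all other words keep weight 1.

WT = 1.7

def phrase_vector(words, vectors, ne_word, ng_allows):
    """Weighted average of the words' embeddings (eq. 1 in the text)."""
    weights = np.array([WT if (w == ne_word and ng_allows) else 1.0 for w in words])
    vecs = np.stack([vectors[w] for w in words])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()

# toy embeddings standing in for word2vec
vectors = {"Congress": np.array([1.0, 0.0, 0.0]),
           "members":  np.array([0.0, 1.0, 0.0])}
v = phrase_vector(["Congress", "members"], vectors, ne_word="Congress", ng_allows=True)
```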
An NE-grid NG determines which types of mentions can be merged. For example, if NG_{ne(rp_k), ne(rp_l)} = 0, then mentions of one geo-political entity (GPE) are not compared to mentions of another GPE, e.g., "French" != "North Korea." If a value of an NG cell NG_{ne(rp_k), ne(rp_l)} > 0, then NG favors merging the corresponding RPs, e.g., "U.S." = "Americans." The NE-grid is spanned across combined NE-chains Ch of two types: country + nationality (Ch_cn) and organization + persons (Ch_op). To construct NE-chains, we use the relations between terms in the semantic network ConceptNet [18]. We iterate over the extracted NEs and interlink them if their corresponding ConceptNet terms have a "SimilarTo" relation. Afterward, we restore full connectivity between the sub-chains, i.e., the restored connectivity of the extracted "United States"-"U.S." and "U.S."-"American" chains yields a chain ch_a = "United States"-"U.S."-"American." Based on the NE-chains, we construct the NE-grid NG:

NG_{ne_i, ne_j} = 1 if ne_i and ne_j belong to the same chain ch ∈ Ch_m, else 0,

where m = cn ∨ op.
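The chain construction and NE-grid lookup can be sketched with a union-find structure. The "SimilarTo" pairs below are hard-coded stand-ins for actual ConceptNet queries.

```python
# Sketch: build NE-chains by interlinking NEs that share a ConceptNet
# "SimilarTo" relation, restore full connectivity between sub-chains
# (union-find), and derive the NE-grid from chain co-membership.

similar_pairs = [("United States", "U.S."), ("U.S.", "American")]  # stand-in for ConceptNet lookups
nes = ["United States", "U.S.", "American", "North Korea"]

parent = {ne: ne for ne in nes}

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

for a, b in similar_pairs:
    union(a, b)  # interlink sub-chains into one fully connected chain

def ng(ne_a, ne_b):
    """NE-grid cell: 1 if both NEs lie on the same chain, else 0."""
    return 1 if find(ne_a) == find(ne_b) else 0

print(ng("United States", "American"))  # same chain -> 1
print(ng("American", "North Korea"))    # different chains -> 0
```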
Multiple similarity levels To create additional levels of similarity, we calculate three similarity matrices: 1) a head-similarity matrix SH, 2) a phrase-similarity matrix SP, and 3) a core-phrase similarity matrix SPC:

SH_{k,l} = cossim(v(h(rp_k)), v(h(rp_l))),
SP_{k,l} = cossim(V(rp_k), V(rp_l)) if cossim(V(rp_k), V(rp_l)) > thr_simrp, else 0,

where h_k = h(rp_k) is the head of a phrase, e.g., h("Congress members") = "members," cossim is cosine similarity, v(·)/V(·) is the vector representation of a word or phrase, thr_simrp = 0.4 is the threshold for the minimum RP similarity, and SPC is the subset matrix of SP restricted to the RPs that are core mentions. The output of the preprocessing step consists of three similarity matrices (SH, SP, SPC) that represent the similarity of RPs at three levels and an NE-grid NG that determines restriction rules for operations between mentions.

Identification of the cluster cores
We start clustering by identifying the cluster cores (CC), i.e., by clustering the core mentions' RPs (CRP) as the most distinctive among all RPs (see Sec. 3.1). Two core RPs crp_i and crp_j form a CC if they meet two requirements: (1) SPC_{crp_i,crp_j} > 0 and SH_{crp_i,crp_j} > 0, and (2) crp_i and crp_j are similar to a sufficient number of other core RPs according to the ratio matrix RM. Following OPTICS' principle of creating multiple similarity levels instead of one similarity metric, we form a ratio matrix RM for the core RPs. Each element in RM is the normalized count of the core RPs to which the two RPs at hand are both similar:

RM_{i,j} = (b(SPC_i) · b(SPC_j)) / |CRP|,

where b(·) is a binary representation of the values in a vector (1 if a cell value is larger than 0, else 0), and the threshold thr = min(0.7, max(0.5, log_{5000} |RP|)), i.e., the threshold is balanced based on the number of unique RPs: a larger number of RPs imposes stricter similarity requirements for the cluster cores.

Fig. 4: Identification of chains of related core representatives: this example yields two core clusters.
Finally, we iterate over the elements of RM and recursively collect chains of the interlinked CRPs, as shown in Fig. 4. A chain is considered complete once no other core RPs can be added to it.
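The recursive chain collection can be sketched as a connected-components traversal over RM; the matrix values and threshold below are illustrative:

```python
# Sketch: collect chains of interlinked core RPs. Starting from any core RP,
# we recursively pull in every core RP whose RM value to a chain member is
# above the threshold, until no more can be added. Each complete chain
# becomes one cluster core.

def collect_chains(rm, thr):
    n = len(rm)
    unvisited = set(range(n))
    chains = []
    while unvisited:
        stack = [unvisited.pop()]
        chain = set(stack)
        while stack:  # depth-first expansion of the chain
            i = stack.pop()
            linked = {j for j in unvisited if rm[i][j] >= thr}
            unvisited -= linked
            chain |= linked
            stack.extend(linked)
        if len(chain) > 1:  # a single RP does not form a cluster core
            chains.append(sorted(chain))
    return chains

# illustrative ratio matrix: RPs 0-2 are interlinked, RP 3 stands alone
rm = [[1.0, 0.8, 0.0, 0.0],
      [0.8, 1.0, 0.6, 0.0],
      [0.0, 0.6, 1.0, 0.0],
      [0.0, 0.0, 0.0, 1.0]]
print(collect_chains(rm, thr=0.5))  # -> [[0, 1, 2]]
```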

Forming of cluster bodies
To further extend the clusters, we form cluster bodies (CB) by expanding the identified cores with the unclustered RPs (Fig. 5). First, we assign an RP to a cluster core if the RP is similar to at least one of the core RPs and the merge is allowed by NG. Second, we intersect the cluster bodies with each other to check whether any non-core RPs belong to more than one CB. If so, we resolve each conflicting RP by calculating a normalized similarity score between rp_conf ∈ CB_i ∩ CB_j and the non-conflicting RPs of each CB, and choosing the CB with the largest similarity score:

CB_best = argmax_{i ∈ |CB|} sim(rp_conf, CB_i),

i.e., the similarity consists of the number of overlapping words between an RP and the clustered RPs and the sum of their cross-similarity values.

Adding border mentions
We define border mentions as the remaining RPs that are similar to at least two body RPs (Fig. 6). We add a border RP rp to a cluster body CB_i, forming a cluster C_i, if rp is similar to at least two RPs in CB_i and has the largest normalized similarity score to CB_i.

Form non-core clusters
Some unmerged RPs can form non-core clusters, i.e., they are similar to other RPs but do not meet the requirements to become core points (see Fig. 2). We form a non-core cluster around such an RP by grouping the remaining RPs similar to it.

Merging final clusters
When all clusters are formed, the final step of the pipeline is to check whether clusters can be further merged based on combined features of word counts and word embeddings. We create an extended list of modifiers, i.e., all of the previous ones (see Sec. 3.3) plus number and apposition modifiers. We compare the identified clusters according to the cosine similarity of their weighted vector representations using this extended list. Each cluster C_i is first represented with its counted RPs' lowercased lemmas L_i. We treat the clusters as documents and transform them into a TF-IDF representation [21]. Each cluster C_i is then represented as a TF-IDF-weighted average word embedding of its lemmas:

VC(C_i) = sum_{l ∈ L_i} t(l) · v(l) / sum_{l ∈ L_i} t(l),

where t(l) is the TF-IDF coefficient of a lemma l in a cluster C_i. We construct a cluster cross-similarity matrix SC, where each element is:

SC_{i,j} = cossim(VC(C_i), VC(C_j)).

Following the principle from Fig. 4, we identify chains of clusters, i.e., the final clusters that contain related mentions.
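A sketch of the TF-IDF-weighted cluster representation and the resulting cross-similarity. The embeddings and IDF values are toy stand-ins, and the exact TF-IDF variant and normalization are our assumptions.

```python
import numpy as np

# Sketch: a cluster is a "document" of its RPs' lowercased lemmas,
# represented as a TF-IDF-weighted average of word embeddings; clusters
# with high cosine similarity between these vectors are chained and merged.

def cluster_vector(lemmas, idf, emb):
    counts = {l: lemmas.count(l) for l in set(lemmas)}
    weights = {l: counts[l] * idf[l] for l in counts}  # TF-IDF weight t(l)
    total = sum(weights.values())
    return sum((w / total) * emb[l] for l, w in weights.items())

def cossim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy 2-d embeddings and IDF values
emb = {"migrant": np.array([1.0, 0.0]),
       "asylum": np.array([0.9, 0.2]),
       "lawyer": np.array([0.0, 1.0])}
idf = {"migrant": 1.0, "asylum": 1.5, "lawyer": 1.5}

vc_a = cluster_vector(["migrant", "migrant", "asylum"], idf, emb)
vc_b = cluster_vector(["asylum", "migrant"], idf, emb)
vc_c = cluster_vector(["lawyer"], idf, emb)
# vc_a and vc_b are far closer to each other than either is to vc_c,
# so the migrant-related clusters would be chained, the lawyer one not.
```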

Preliminary evaluation and Discussion
As a preliminary evaluation, we extracted concepts of (in)directly related mentions from five sets of event-related news articles with identical parameters and qualitatively analyzed the results. We used NewsWCL50 (N) [6] and ECB+ (e) [5] as datasets that fulfill this criterion for the text collection. Table 1 depicts examples of the identified concepts, i.e., clusters of related mentions, from a subset of the events of each dataset. The column with concept names contains manually created labels that summarize the automatically identified clusters of related mentions. The column "Mentions" contains the unique mentions of an identified cluster. Mentions are separated by keywords that indicate the stages at which the mentions were clustered.
The analysis of mentions indirectly referring to groups of people shows that the proposed clustering approach successfully separates mentions related to GPEs such as "Israeli officials" and "American officials." These mentions refer to different concepts but are quite similar due to the shared word "officials." The identified concepts from event N9 ("American officials," "Iranian regime," "Israeli officials," and "European leaders") show that the approach effectively separates mentions of multiple GPEs from the same text.
Clustering of directly referring mentions, e.g., the "Central American migrants" concept from event N6, resolves mentions such as "Central American transgender women," "asylum-seekers," "caravan," and "undocumented immigrants." This demonstrates that the proposed approach successfully clusters mentions that are exposed to context-specific coreference relations, i.e., none of these mentions are commonly known synonyms of each other. Moreover, the approach successfully separated the "Immigration lawyers" concept from the "Migrants" concept although the noun "immigration" is shared among the two, which makes these mentions semantically similar. On the contrary, the "Migrants" concept contains falsely clustered mentions that refer to the various supporters of the immigrant caravan. Separation of such mentions with semantically close yet conceptually different meanings remains the biggest challenge for the algorithm and requires improvements to the clustering approach.
To test whether a state-of-the-art clustering algorithm achieves similar concepts, we re-clustered the mentions from two exemplarily chosen events, N6 and N9 in Table 1, with hierarchical clustering (HC). Table 2 shows the results of HC with the average linkage criterion and cosine distance (using a threshold of 0.7) for both datasets. As in Table 1, we manually named the concepts that contained conceptually related mentions. While some of the mentions formed more narrowly defined, fine-grained concepts, HC also clustered conceptually different mentions and left approximately 25% of the input mentions unclustered ("NOT" clusters in Table 2).
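The HC baseline can be sketched as follows: a naive average-linkage implementation over cosine distances, cut at the 0.7 distance threshold used above (scipy's `linkage`/`fcluster` would serve the same purpose). The toy vectors stand in for mention embeddings.

```python
import numpy as np

# Sketch: agglomerative clustering with average linkage over cosine
# distances, stopping when no cluster pair is closer than thr = 0.7.

def cosine_dist(a, b):
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hc_average(vectors, thr=0.7):
    clusters = [[i] for i in range(len(vectors))]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average linkage: mean pairwise distance between clusters
                d = np.mean([cosine_dist(vectors[a], vectors[b])
                             for a in clusters[i] for b in clusters[j]])
                if d < thr and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters  # no mergeable pair below the threshold
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

# two near-identical mention vectors and one distinct vector
vecs = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])]
print(sorted(map(sorted, hc_average(vecs))))  # -> [[0, 1], [2]]
```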
The proposed clustering approach might be beneficial to cross-document coreference resolution (CDCR), i.e., the resolution of coreferential mentions of various entities across sets of related text documents. Entity types such as groups of people and mentions of GPEs are among the targets of CDCR. When implemented as part of a CDCR model, our concept identification approach can have a strong positive impact on the overall performance due to the resolution of coreferential mentions of high lexical diversity. Such mentions are typically subject to bias by word choice and labeling, i.e., they contain biased wording with polarized connotations and are typically coreferential only in the narrow context of a reported event.

Table 1: Results produced with the proposed concept identification approach. "N"/"e"+ID indicates a dataset and the internal ID of the events of each dataset.

Conclusion and Future work
We proposed a clustering approach to identify both direct mentions referring to groups of individuals and indirect person mentions related to geo-political entities (GPEs) or organizations, i.e., job titles that represent these entities. In our evaluation, we found that terms such as "American officials" were reliably resolved as mentions related to GPEs or organizations. Moreover, the approach capably clustered mentions that lack NE components while maintaining a fine-grained level of conceptualization among the clusters of these mentions. Further, the approach resolved mentions referring to groups of individuals that have highly context-dependent synonymous or coreferential relations, as opposed to universal synonyms. Thus, we think the approach is a robust solution for cross-document coreference resolution (CDCR), especially when employed on texts containing coreferential mentions with high lexical diversity. As future work, we seek to test the proposed approach with other word vector models, e.g., fastText [12] and ELMo [16], or phrase vector models [20], pretrained and fine-tuned on event-related news articles. We also seek to address current shortcomings, e.g., to resolve one-word mentions without modifiers, e.g., "officials," we plan to devise an additional word sense disambiguation step. Each particular occurrence of a one-word mention will be resolved based on the mention's context. Lastly, we will perform a quantitative analysis of the approach applied to CDCR, i.e., tested on state-of-the-art manually annotated CDCR datasets.