DADA: Dialect Adaptation via Dynamic Aggregation of Linguistic Rules

Dialect Adaptation at a Fine-Grained Level

1Harvard University, 2Georgia Institute of Technology, 3Stanford University
DADA
DADA dynamically composes adapters that handle specific features of dialectal variation, adapting an SAE model to various dialects by leveraging their commonalities. We train nearly 200 feature adapters to capture the linguistic differences between SAE and its dialect variants. These feature adapters can be composed flexibly to target different dialects.

Abstract

Existing large language models (LLMs), which mainly focus on Standard American English (SAE), perform significantly worse when applied to other English dialects. While existing mitigations tackle discrepancies for individual target dialects, they assume access to high-accuracy dialect identification systems; yet the boundaries between dialects are inherently flexible, making it difficult to sort language into discrete, predefined categories. In this paper, we propose DADA (Dialect Adaptation via Dynamic Aggregation), a modular approach that imbues SAE-trained models with multi-dialectal robustness by composing adapters that handle specific linguistic features. The compositional architecture of DADA allows for both targeted adaptation to specific dialect variants and simultaneous adaptation to various dialects. We show that DADA is effective for both single-task and instruction-finetuned language models, offering an extensible and interpretable framework for adapting existing LLMs to different English dialects.

Background

Current NLP tooling is often trained and evaluated on dominant language variants, such as Standard American English (SAE). This leads to a significant decline in performance when these tools are applied to non-SAE dialects (even LLMs are not exempt). Such performance disparities raise ethical concerns about the potential for racial disparities amid the rapid development of language technologies.

Existing research on mitigating this disparity has mainly focused on adaptation to individual dialects of interest. These methods require building separate systems for different dialects, as well as highly accurate dialect identification for real-world use. However, such systems are not yet available for many dialects and related languages.

Dialect and Non-Standard Linguistic Feature

Dialects can be described by a common set of linguistic rules (non-standard linguistic features) that capture the differences between SAE and various other English dialects, and each dialect expresses a subset of this feature space. Moreover, dialects are not deterministic patterns but ranges of acceptable use of these features, which speakers adjust based on social context. Dialects therefore do not fit neatly into predefined categories. For example, Negative Concord, a linguistic feature common in many English dialects, uses multiple negative words or phrases to reinforce a negative meaning (e.g., "He don't know nothing" versus SAE "He doesn't know anything").


To this end, our work proposes a method that can accommodate the diversity of dialect variants at a fine-grained level (non-standard linguistic features or linguistic rules).

Dialect Adaptation via Dynamic Aggregation

We introduce Dialect Adaptation via Dynamic Aggregation (DADA), a modular method for adapting an existing SAE-trained model to accommodate dialect variants at a fine-grained level, in three steps.


Step 1: Synthetic Dataset Construction

Previous work has identified a series of linguistic divergences and compiled Multi-VALUE, a collection of lexical and morphosyntactic transformation rules between SAE and 50 of its dialect variants, including Appalachian English (AppE), Chicano English (ChcE), Colloquial Singapore English (CollSgE), Indian English (IndE), and African American Vernacular English (AAVE), among others. For each transformation rule, we generate a corresponding synthetic dataset by applying that rule to every training example in the original training dataset.
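
As a rough illustration, below is a minimal Python sketch of this step, assuming MNLI as the task and a hypothetical apply_rule function standing in for a single Multi-VALUE transformation rule (the actual Multi-VALUE API may differ):

    from datasets import load_dataset

    def apply_rule(text, rule):
        # Hypothetical placeholder: rewrite `text` according to one lexical or
        # morphosyntactic transformation rule (e.g. negative concord).
        # In practice this would call into a Multi-VALUE style rule library.
        return text

    def build_synthetic_dataset(rule, split="train"):
        # Apply a single transformation rule to every example in the original
        # SAE training set (MNLI here), yielding one synthetic dataset per rule.
        mnli = load_dataset("multi_nli", split=split)
        return mnli.map(lambda ex: {
            "premise": apply_rule(ex["premise"], rule),
            "hypothesis": apply_rule(ex["hypothesis"], rule),
        })

    # e.g. negative_concord_mnli = build_synthetic_dataset("negative_concord")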

Step 2: Feature Adapter Training

Next, we train a feature adapter for each linguistic transformation rule on its corresponding synthetic dataset. Each trained feature adapter captures a specific type of linguistic difference between SAE and its dialect variants.
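
Below is a sketch of this step, assuming the Hugging Face adapters library (formerly adapter-transformers) and a RoBERTa backbone on MNLI; the adapter configuration and training setup are illustrative rather than the exact configuration used in the paper:

    import adapters
    from adapters import AdapterTrainer
    from transformers import AutoModelForSequenceClassification, TrainingArguments

    def train_feature_adapter(rule_name, synthetic_dataset):
        # SAE-trained backbone; 3 labels for MNLI.
        model = AutoModelForSequenceClassification.from_pretrained(
            "roberta-base", num_labels=3)
        adapters.init(model)

        # One bottleneck adapter per linguistic rule; train_adapter freezes the
        # backbone so only the adapter weights are updated.
        model.add_adapter(rule_name)
        model.train_adapter(rule_name)

        trainer = AdapterTrainer(
            model=model,
            args=TrainingArguments(output_dir=f"adapters/{rule_name}"),
            train_dataset=synthetic_dataset,  # assumed already tokenized
        )
        trainer.train()
        model.save_adapter(f"adapters/{rule_name}", rule_name)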

Step 3: Dynamic Aggregation

In real-world scenarios, however, multiple linguistic differences commonly co-occur within a single sentence, requiring the model to account for these distinct linguistic features simultaneously and to varying degrees.

Therefore, in the third step, we dynamically aggregate the trained feature adapters into the SAE-trained backbone model via an additional fusion layer. By training on a super-synthetic dataset, constructed by applying the full set of transformation rules to the original training data, the fusion layer learns a parameterized compositional mixture of feature adapters: it identifies which linguistic features apply to a given input and activates the corresponding feature adapters, thereby addressing the linguistic discrepancies between SAE and its dialect variants.
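
Under the same assumptions as above, a sketch of this step might look as follows: the feature adapters trained in Step 2 are loaded into the frozen backbone and combined through an AdapterFusion layer, which is the only component trained on the super-synthetic data. The rule names listed are an illustrative subset.

    import adapters
    from adapters.composition import Fuse
    from transformers import AutoModelForSequenceClassification

    rule_names = ["negative_concord", "drop_aux", "null_genitive"]  # illustrative subset

    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=3)
    adapters.init(model)

    # Load every feature adapter trained in Step 2 into the backbone.
    for name in rule_names:
        model.load_adapter(f"adapters/{name}", load_as=name, with_head=False)

    # Add an AdapterFusion layer over all feature adapters and train only that
    # layer (backbone and adapters stay frozen) on the super-synthetic dataset
    # that mixes all transformation rules.
    fusion = Fuse(*rule_names)
    model.add_adapter_fusion(fusion)
    model.train_adapter_fusion(fusion)
    model.set_active_adapters(fusion)
    # ...then train with AdapterTrainer as in Step 2.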

DADA Can Improve Multi-Dialectal Robustness!

We demonstrate how DADA enables the adaptation of an existing SAE model to multiple dialect variants, using the RoBERTa-Base model and the MNLI task as an example.


Compared to the standard SAE model trained on the original MNLI dataset (Baseline), DADA delivers significant performance improvements across all evaluated dialects, and even on SAE. Moreover, DADA performs comparably to the strong baseline of further adapter-tuning the SAE-trained model individually on dialect-specific training data (Individual). While that approach requires a perfect dialect identification system and multiple models, DADA uses a single model and does not rely on dialect identification, making it a simpler and more realistic option when the target dialect distribution is unknown.

DADA Can Be Task-Agnostic!

Recent LLMs are instruction-tuned for a wide range of tasks, which is orthogonal to our method and makes the two approaches easy to combine. Here, we demonstrate that DADA can be applied to instruction-tuned LLMs to improve their task-agnostic performance on dialects!


Surprisingly, although individual adapter tuning (Individual) improves performance on 4 of the 7 tasks, its overall average performance is actually below that of the SAE baseline. In contrast, DADA consistently outperforms both the SAE Baseline and Individual across all evaluated tasks.

DADA Has Great Interpretability!

We perform a correlation analysis between the 10 AAVE feature adapters and the linguistic features applied to the input data, and plot the results for layers 1, 3, 7, and 11 below.

[Figure: Correlation coefficients for AAVE adaptation between the feature adapters (columns) and the inputs to which specific linguistic features (rows) apply, in layers 1, 3, 7, and 11. Abbreviations are used for some features, e.g. "nc" for "negative_concord".]

We observe significant correlations in adapter utilization in the lower layers (0-3), while correlations in the middle and higher layers are negligible. This is consistent with intuition: the primary distinctions between SAE and its dialect variants are lexical and morphosyntactic features, which are mainly captured by the lower layers of the model. This analysis shows that DADA can detect which linguistic features are relevant to a given input and trigger the corresponding feature adapters, highlighting its interpretability!
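
As a minimal sketch, such a correlation analysis could be computed as follows, assuming we have recorded, for one layer, the fusion attention weight that each feature adapter receives for every input (fusion_scores) and a binary indicator of which linguistic features were applied to each input (feature_applied); the exact analysis in the paper may differ:

    import numpy as np
    from scipy.stats import pearsonr

    def correlation_matrix(fusion_scores, feature_applied):
        # fusion_scores:   (n_inputs, n_adapters) fusion attention weights at one layer
        # feature_applied: (n_inputs, n_features) 0/1 indicators per linguistic feature
        # Returns an (n_features, n_adapters) matrix of Pearson correlations.
        n_features = feature_applied.shape[1]
        n_adapters = fusion_scores.shape[1]
        corr = np.zeros((n_features, n_adapters))
        for i in range(n_features):
            for j in range(n_adapters):
                corr[i, j], _ = pearsonr(feature_applied[:, i], fusion_scores[:, j])
        return corr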

Ethics Statement

Previous linguistic work on dialectal features may not fully or accurately document the natural usage patterns of all existing dialects in terms of their linguistic rules. As a result, we acknowledge that our proposed method, which relies on these dialectal features from prior literature, may not account for some undocumented features of dialects. However, by curating more dialectal features, our method can easily be extended to a broader range of dialects. Additionally, because DADA is task-agnostic when applied to instruction-tuned models, it could be misused by malicious actors. To address this concern, we release DADA with a license that explicitly prohibits its use for deception, impersonation, mockery, discrimination, hate speech, targeted harassment, and cultural appropriation targeting dialect-speaking communities.

Acknowledgement

We would like to thank the anonymous reviewers and SALT lab members for their valuable feedback! This work was partially sponsored by the Defense Advanced Research Projects Agency (DARPA) and an NSF grant.

BibTeX


      @inproceedings{liu2023dada,
         title={DADA: Dialect Adaptation via Dynamic Aggregation of Linguistic Rules},
         author={Yanchen Liu and William Held and Diyi Yang},
         booktitle={Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
         year={2023}
      }