Publications
2024
- [SLT] Recursive Attentive Pooling for Extracting Speaker Embeddings from Multi-Speaker Recordings. Shota Horiguchi, Atsushi Ando, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, and Marc Delcroix. In IEEE Spoken Language Technology Workshop (SLT), Dec 2024.
This paper proposes a method for extracting a speaker embedding for each speaker from a variable-length recording containing multiple speakers. Speaker embeddings are crucial not only for speaker recognition but also for various multi-speaker speech applications such as speaker diarization and target-speaker speech processing. Despite the challenges of obtaining a single speaker’s speech without pre-registration in multi-speaker scenarios, most studies on speaker embedding extraction focus on extracting embeddings only from single-speaker recordings. Some methods have been proposed for extracting speaker embeddings directly from multi-speaker recordings, but they typically require preparing a model for each possible number of speakers or involve complicated training procedures. The proposed method computes the embeddings of multiple speakers by focusing on different parts of the frame-wise embeddings extracted from the input multi-speaker audio. This is achieved by recursively computing attention weights for pooling the frame-wise embeddings. Additionally, we propose using the calculated attention weights to estimate the number of speakers in the recording, which allows the same model to be applied to various numbers of speakers. Experimental evaluations demonstrate the effectiveness of the proposed method in speaker verification and diarization tasks.
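As a rough illustration of the idea described in this abstract, the following NumPy sketch pools frame-wise embeddings with recursively updated attention weights and uses the remaining attention mass as a crude stopping (speaker-counting) signal. The scoring function, suppression rule, and threshold are illustrative placeholders, not the configuration used in the paper.

```python
import numpy as np

def recursive_attentive_pooling(frames, max_speakers=4, stop_mass=0.05):
    """frames: (T, D) frame-wise embeddings of a multi-speaker recording."""
    T, D = frames.shape
    budget = np.ones(T)                           # remaining attention capacity per frame
    speaker_embs, attn_weights = [], []
    for _ in range(max_speakers):
        if budget.sum() / T < stop_mass:          # toy stopping rule / speaker counter
            break
        query = (budget / budget.sum()) @ frames  # focus on not-yet-explained frames
        scores = frames @ query / np.sqrt(D)
        w = np.exp(scores - scores.max()) * budget
        w = w / (w.sum() + 1e-8)
        speaker_embs.append(w @ frames)           # attention-pooled speaker embedding
        attn_weights.append(w)
        budget = np.clip(budget - w / (w.max() + 1e-8), 0.0, 1.0)
    return np.array(speaker_embs), np.array(attn_weights)

# toy usage: 200 frames drawn from two artificial "speaker" clusters
rng = np.random.default_rng(0)
frames = np.vstack([rng.normal(3, 1, (100, 16)), rng.normal(-1, 1, (100, 16))])
embs, weights = recursive_attentive_pooling(frames)
print(embs.shape, weights.shape)
```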
- [SLT] Investigation of Speaker Representation for Target-Speaker Speech Processing. Takanori Ashihara, Takafumi Moriya, Shota Horiguchi, Junyi Peng, Tsubasa Ochiai, Marc Delcroix, Kohei Matsuura, and Hiroshi Sato. In IEEE Spoken Language Technology Workshop (SLT), Dec 2024.
Target-speaker speech processing (TS) tasks, such as target-speaker automatic speech recognition (TS-ASR), target speech extraction (TSE), and personal voice activity detection (p-VAD), are important for extracting information about a desired speaker’s speech even when it is corrupted by interference speakers. While most studies have focused on the training schemes or system architectures for each specific task, the auxiliary network for embedding target speaker cues has not been investigated comprehensively in a unified cross-task evaluation. Therefore, this paper attempts to address a fundamental question: what is the preferred speaker embedding for TS tasks? To this end, for the TS-ASR, TSE, and p-VAD tasks, we compare pre-trained speaker encoders (i.e., self-supervised or speaker recognition models) that compute speaker embeddings from pre-recorded enrollment speech of the target speaker with ideal speaker embeddings derived directly from target speaker identity in the form of a one-hot vector. To further understand the property of ideal speaker embedding, we optimize it using a gradient-based approach to improve performance on the TS task. Our analysis unveils that 1) speaker verification performance is somewhat unrelated to TS task performances, 2) the one-hot vector outperforms enrollment-based ones, and 3) the optimal embedding depends on the input mixture.
- [INTERSPEECH] Factor-Conditioned Speaking-Style Captioning. Atsushi Ando, Takafumi Moriya, Shota Horiguchi, and Ryo Masumura. In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2024.
This paper presents a novel speaking-style captioning method that generates diverse descriptions while accurately predicting speaking-style information. Conventional learning criteria directly use original captions that contain not only speaking-style factor terms but also syntax words, which disturbs learning speaking-style information. To solve this problem, we introduce factor-conditioned captioning (FCC), which first outputs a phrase representing speaking-style factors (e.g., gender, pitch, etc.), and then generates a caption to ensure the model explicitly learns speaking-style factors. We also propose greedy-then-sampling (GtS) decoding, which first predicts speaking-style factors deterministically to guarantee semantic accuracy, and then generates a caption based on factor-conditioned sampling to ensure diversity. Experiments show that FCC outperforms the original caption-based training, and with GtS, it generates more diverse captions while keeping style prediction performance.
- [INTERSPEECH] SpeakerBeam-SS: Real-Time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling. Hiroshi Sato, Takafumi Moriya, Masato Mimura, Shota Horiguchi, Tsubasa Ochiai, Takanori Ashihara, Atsushi Ando, Kentaro Shinayama, and Marc Delcroix. In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2024.
Real-time target speaker extraction (TSE) is intended to extract the desired speaker’s voice from the observed mixture of multiple speakers in a streaming manner. Implementing real-time TSE is challenging as the computational complexity must be reduced to provide real-time operation. This work introduces to Conv-TasNet-based TSE a new architecture based on state space modeling (SSM) that has been shown to model long-term dependency effectively. Owing to SSM, fewer dilated convolutional layers are required to capture temporal dependency in Conv-TasNet, resulting in the reduction of model complexity. We also enlarge the window length and shift of the convolutional (TasNet) frontend encoder to reduce the computational cost further; the performance decline is compensated by over-parameterization of the frontend encoder. The proposed method reduces the real-time factor by 78% from the conventional causal Conv-TasNet-based TSE while matching its performance.
- [ICASSP] Streaming Active Learning for Regression Problems Using Regression via Classification. Shota Horiguchi, Kota Dohi, and Yohei Kawaguchi. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2024.
One of the challenges in deploying a machine learning model is that the model’s performance degrades as the operating environment changes. To maintain the performance, streaming active learning is used, in which the model is retrained by adding a newly annotated sample to the training dataset if the prediction of the sample is not certain enough. Although many streaming active learning methods have been proposed for classification, few efforts have been made for regression problems, which are often handled in the industrial field. In this paper, we propose to use the regression-via-classification framework for streaming active learning for regression. Regression-via-classification transforms regression problems into classification problems so that streaming active learning methods proposed for classification problems can be applied directly to regression problems. Experimental validation on four real data sets shows that the proposed method can perform regression with higher accuracy at the same annotation cost.
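A minimal sketch of this regression-via-classification recipe on a data stream, using scikit-learn and an entropy-based query rule as a stand-in for the paper's uncertainty criterion; the bin count, threshold, and the oracle annotator below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

K, y_min, y_max = 5, 0.0, 10.0
bin_edges = np.linspace(y_min, y_max, K + 1)
edges, mids = bin_edges[1:-1], (bin_edges[:-1] + bin_edges[1:]) / 2

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, (30, 3))
y_train = 5 + 3 * X_train[:, 0] + rng.normal(0, 0.3, 30)    # toy regression target
c_train = np.digitize(y_train, edges)                        # regression -> classes

clf = LogisticRegression(max_iter=1000).fit(X_train, c_train)

for _ in range(200):                                         # simulated stream
    x = rng.uniform(-1, 1, (1, 3))
    proba = clf.predict_proba(x)[0]
    y_hat = mids[clf.classes_[np.argmax(proba)]]             # class -> bin midpoint
    entropy = -np.sum(proba * np.log(proba + 1e-12))
    if entropy > 0.8:                                        # uncertain -> request annotation
        y_new = 5 + 3 * x[0, 0]                              # oracle stands in for an annotator
        X_train = np.vstack([X_train, x])
        c_train = np.append(c_train, np.digitize([y_new], edges)[0])
        clf = LogisticRegression(max_iter=1000).fit(X_train, c_train)
```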
- [CHiME] NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge. Naoyuki Kamo*, Naohiro Tawara*, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa, Alexis Plaquet, Takanori Ashihara, Tsubasa Ochiai, Masato Mimura, Marc Delcroix, Tomohiro Nakatani, Taichi Asami, and Shoko Araki. In The 8th International Workshop on Speech Processing in Everyday Environments (CHiME-2024), Sep 2024. (*) Equal contribution.
We present a distant automatic speech recognition (DASR) system developed for the CHiME-8 DASR track. It consists of a diarization-first pipeline. For diarization, we use end-to-end diarization with vector clustering (EEND-VC) followed by target speaker voice activity detection (TS-VAD) refinement. To deal with various numbers of speakers, we developed a new multi-channel speaker counting approach. We then apply guided source separation (GSS) with several improvements to the baseline system. Finally, we perform ASR using a combination of systems built from strong pre-trained models. Our proposed system achieves a macro tcpWER of 21.4 % on the dev set, which is a 57 % relative improvement over the baseline.
- [Preprint] Mamba-based Segmentation Model for Speaker Diarization. Alexis Plaquet, Naohiro Tawara, Marc Delcroix, Shota Horiguchi, Atsushi Ando, and Shoko Araki. arXiv:2410.06459, Oct 2024.
Mamba is a newly proposed architecture which behaves like a recurrent neural network (RNN) with attention-like capabilities. These properties are promising for speaker diarization, as attention-based models have unsuitable memory requirements for long-form audio, and traditional RNN capabilities are too limited. In this paper, we propose to assess the potential of Mamba for diarization by comparing the state-of-the-art neural segmentation of the pyannote pipeline with our proposed Mamba-based variant. Mamba’s stronger processing capabilities allow usage of longer local windows, which significantly improve diarization quality by making the speaker embedding extraction more reliable. We find Mamba to be a superior alternative to both traditional RNN and the tested attention-based model. Our proposed Mamba-based system achieves state-of-the-art performance on three widely used diarization datasets.
- [Preprint] Guided Speaker Embedding. Shota Horiguchi, Takafumi Moriya, Atsushi Ando, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, and Marc Delcroix. arXiv:2410.12182, Oct 2024.
This paper proposes a guided speaker embedding extraction system, which extracts speaker embeddings of the target speaker using speech activities of target and interference speakers as clues. Several methods for long-form overlapped multi-speaker audio processing are typically two-staged: i) segment-level processing and ii) inter-segment speaker matching. Speaker embeddings are often used for the latter purpose. Typical speaker embedding extraction approaches only use single-speaker intervals to avoid corrupting the embeddings with speech from interference speakers. However, this often makes speaker embeddings impossible to extract because sufficiently long non-overlapping intervals are not always available. In this paper, we propose using speaker activities as clues to extract the embedding of the speaker-of-interest directly from overlapping speech. Specifically, we concatenate the activity of target and non-target speakers to acoustic features before being fed to the model. We also condition the attention weights used for pooling so that the attention weights of the intervals in which the target speaker is inactive are zero. The effectiveness of the proposed method is demonstrated in speaker verification and speaker diarization.
- [Preprint] Alignment-Free Training for Transducer-Based Multi-Talker ASR. Takafumi Moriya, Shota Horiguchi, Marc Delcroix, Ryo Masumura, Takanori Ashihara, Hiroshi Sato, Kohei Matsuura, and Masato Mimura. arXiv:2409.20301, Sep 2024.
Extending the RNN Transducer (RNNT) to recognize multi-talker speech is essential for wider automatic speech recognition (ASR) applications. Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation. MT-RNNT is conventionally implemented using architectures with multiple encoders or decoders, or by serializing all speakers’ transcriptions into a single output stream. The first approach is computationally expensive, particularly due to the need for multiple encoder processing. In contrast, the second approach involves a complex label generation process, requiring accurate timestamps of all words spoken by all speakers in the mixture, obtained from an external ASR system. In this paper, we propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture. The target labels are created by appending a prompt token corresponding to each speaker at the beginning of the transcription, reflecting the order of each speaker’s appearance in the mixtures. Thus, MT-RNNT-AFT can be trained without relying on accurate alignments, and it can recognize all speakers’ speech with just one round of encoder processing. Experiments show that MT-RNNT-AFT achieves performance comparable to that of the state-of-the-art alternatives, while greatly simplifying the training process.
- [Preprint] Thresholding Data Shapley for Data Cleansing Using Multi-Armed Bandits. Hiroyuki Namba, Shota Horiguchi, Masaki Hamamoto, and Masashi Egi. arXiv:2402.08209, Feb 2024.
Data cleansing aims to improve model performance by removing a set of harmful instances from the training dataset. Data Shapley is a common theoretically guaranteed method to evaluate the contribution of each instance to model performance; however, it requires training on all subsets of the training data, which is computationally expensive. In this paper, we propose an iterative method to fast identify a subset of instances with low data Shapley values by using the thresholding bandit algorithm. We provide a theoretical guarantee that the proposed method can accurately select harmful instances if a sufficiently large number of iterations is conducted. Empirical evaluation using various models and datasets demonstrated that the proposed method efficiently improved the computational speed while maintaining the model performance.
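For intuition, here is a small sketch of a thresholding-bandit loop in the spirit described above: each training instance is an arm, pulling an arm draws one noisy Monte Carlo estimate of its contribution, and pulls are allocated to the instances whose means are least confidently separated from the threshold (an APT-style index). The sampler and constants are toy stand-ins, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = rng.normal(0.01, 0.02, 50)      # unknown per-instance contributions
threshold, budget = 0.0, 5000

def sample_contribution(i):
    # stand-in for: retrain on a random subset with/without instance i and
    # measure the change in validation performance
    return true_value[i] + rng.normal(0, 0.05)

n = len(true_value)
counts = np.ones(n)
means = np.array([sample_contribution(i) for i in range(n)])
for _ in range(budget):
    index = np.abs(means - threshold) * np.sqrt(counts)   # APT-style allocation index
    i = int(np.argmin(index))
    means[i] = (means[i] * counts[i] + sample_contribution(i)) / (counts[i] + 1)
    counts[i] += 1

harmful = np.where(means < threshold)[0]                  # candidates for data cleansing
print(f"{len(harmful)} instances flagged as harmful")
```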
2023
- [TASLP] Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors. Shota Horiguchi, Shinji Watanabe, Paola Garcia, Yuki Takashima, and Yohei Kawaguchi. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Jan 2023.
A method to perform offline and online speaker diarization for an unlimited number of speakers is described in this paper. End-to-end neural diarization (EEND) has achieved overlap-aware speaker diarization by formulating it as a multi-label classification problem. It has also been extended for a flexible number of speakers by introducing speaker-wise attractors. However, the output number of speakers of attractor-based EEND is empirically capped; it cannot deal with cases where the number of speakers appearing during inference is higher than that during training because its speaker counting is trained in a fully supervised manner. Our method, EEND-GLA, solves this problem by introducing unsupervised clustering into attractor-based EEND. In the method, the input audio is first divided into short blocks, then attractor-based diarization is performed for each block, and finally the results of each block are clustered on the basis of the similarity between locally-calculated attractors. While the number of output speakers is limited within each block, the total number of speakers estimated for the entire input can be higher than the limitation. To use EEND-GLA in an online manner, our method also extends the speaker-tracing buffer, which was originally proposed to enable online inference of conventional EEND. We introduce a block-wise buffer update to make the speaker-tracing buffer compatible with EEND-GLA. Finally, to improve online diarization, our method improves the buffer update method and revisits the variable chunk-size training of EEND. The experimental results demonstrate that EEND-GLA can perform speaker diarization of an unseen number of speakers in both offline and online inferences.
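The global/local idea can be pictured with a few lines of SciPy: attractors obtained independently in short blocks are clustered across blocks so that block-local speaker labels can be stitched into recording-level speakers. In this toy example the per-block attractors are random stand-ins for EEND-GLA outputs, and the clustering threshold is arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
global_speakers = rng.normal(0, 1, (3, 32))             # 3 true speakers
blocks = []
for _ in range(5):                                       # 5 blocks, at most 2 speakers each
    idx = rng.choice(3, size=2, replace=False)
    blocks.append(global_speakers[idx] + rng.normal(0, 0.05, (2, 32)))

attractors = np.vstack(blocks)                           # (5 blocks x 2 local attractors, 32)
dist = pdist(attractors, metric="cosine")
labels = fcluster(linkage(dist, method="average"), t=0.5, criterion="distance")
print(labels.reshape(5, 2))  # equal labels across blocks map to the same global speaker
```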
- [APSIPA ASC] Synthetic Data Augmentation for ASR with Domain Filtering. Tuan Vu Ho, Shota Horiguchi, Shinji Watanabe, Paola Garcia, and Takashi Sumiyoshi. In Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Nov 2023.
Recent studies have shown that synthetic speech can effectively serve as training data for automatic speech recognition models. Text data for synthetic speech is mostly obtained from in-domain text or generated text using augmentation. However, obtaining large amounts of in-domain text data with diverse lexical contexts is difficult, especially in low-resource scenarios. This paper proposes using text from a large generic-domain source and applying a domain filtering method to choose the relevant text data. This method involves two filtering steps: 1) selecting text based on its semantic similarity to the available in-domain text and 2) diversifying the vocabulary of the selected text using a greedy-search algorithm. Experimental results show that our proposed method outperforms the conventional text augmentation approach, with the relative reduction of word-error-rate ranging from 6% to 25% on the LibriSpeech dataset and 15% on a low-resource Vietnamese dataset.
- [INTERSPEECH] Spoofing Attacker Also Benefits from Large-Scale Self-Supervised Models. Aoi Ito* and Shota Horiguchi*. In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2023. (*) Equal contribution.
Large-scale pretrained models using self-supervised learning have reportedly improved the performance of speech anti-spoofing. However, the attacker side may also make use of such models. Also, since it is very expensive to train such models from scratch, pretrained models on the Internet are often used, but the attacker and defender may possibly use the same pretrained model. This paper investigates whether the improvement in anti-spoofing with pretrained models holds under the condition that the models are available to attackers. As the attacker, we train a model that enhances spoofed utterances so that the speaker embedding extractor based on the pretrained models cannot distinguish between bona fide and spoofed utterances. Experimental results show that the gains the anti-spoofing models obtained by using the pretrained models almost disappear if the attacker also makes use of the pretrained models.
- [INTERSPEECH] CAPTDURE: Captioned Sound Dataset of Single Sources. Yuki Okamoto, Kanta Shimonishi, Keisuke Imoto, Kota Dohi, Shota Horiguchi, and Yohei Kawaguchi. In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Aug 2023.
In conventional studies on environmental sound separation and synthesis using captions, sound datasets consisting of captions for multiple-source sounds were used for model training. However, when captions are collected for multiple-source sounds, detailed captions for each individual sound source cannot be collected. Therefore, it is difficult to extract only the single-source target sound with models trained on conventional captioned sound datasets. We constructed a dataset with captions for single-source sounds that can be used in various tasks that involve environmental sounds, such as environmental sound synthesis. Our dataset consists of 1,044 audio samples and 4,902 captions. We also conducted environmental sound extraction experiments using our dataset and evaluated the performance. The experimental results indicate that captions for a single-source sound are effective in extracting only the single-source target sound from the mixture sound.
- [SLT] Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization. Shota Horiguchi, Yuki Takashima, Shinji Watanabe, and Paola García. In IEEE Spoken Language Technology Workshop (SLT), Jan 2023.
Due to the high performance of multi-channel speech processing, we can use the outputs from a multi-channel model as teacher labels when training a single-channel model with knowledge distillation. To the contrary, it is also known that single-channel speech data can benefit multi-channel models by mixing it with multi-channel speech data during training or by using it for model pretraining. This paper focuses on speaker diarization and proposes to conduct the above bi-directional knowledge transfer alternately. We first introduce an end-to-end neural diarization model that can handle both single- and multi-channel inputs. Using this model, we alternately conduct i) knowledge distillation from a multi-channel model to a single-channel model and ii) finetuning from the distilled single-channel model to a multi-channel model. Experimental results on two-speaker data show that the proposed method mutually improved single- and multi-channel speaker diarization performances.
2022
- [TASLP] Encoder-Decoder Based Attractors for End-to-End Neural Diarization. Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, and Paola García. IEEE/ACM Transactions on Audio, Speech, and Language Processing, Mar 2022. 🏆 Itakura Prize Innovative Young Researcher Award.
This paper investigates an end-to-end neural diarization (EEND) method for an unknown number of speakers. In contrast to the conventional cascaded approach to speaker diarization, EEND methods are better in terms of speaker overlap handling. However, EEND still has a disadvantage in that it cannot deal with a flexible number of speakers. To remedy this problem, we introduce encoder-decoder-based attractor calculation module (EDA) to EEND. Once frame-wise embeddings are obtained, EDA sequentially generates speaker-wise attractors on the basis of a sequence-to-sequence method using an LSTM encoder-decoder. The attractor generation continues until a stopping condition is satisfied; thus, the number of attractors can be flexible. Diarization results are then estimated as dot products of the attractors and embeddings. The embeddings from speaker overlaps result in larger dot product values with multiple attractors; thus, this method can deal with speaker overlaps. Because the maximum number of output speakers is still limited by the training set, we also propose an iterative inference method to remove this restriction. Further, we propose a method that aligns the estimated diarization results with the results of an external speech activity detector, which enables fair comparison against cascaded approaches. Extensive evaluations on simulated and real datasets show that EEND-EDA outperforms the conventional cascaded approach.
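A compact PyTorch sketch of the EDA mechanism as summarized in this abstract: an LSTM encoder consumes the frame-wise embeddings, an LSTM decoder fed with zero vectors emits one attractor per step until an existence probability falls below a threshold, and diarization posteriors are the sigmoids of embedding-attractor dot products. Layer sizes, the 0.5 threshold, and keeping the final attractor are simplifications, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class EDA(nn.Module):
    def __init__(self, dim=256, max_speakers=10):
        super().__init__()
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.exist = nn.Linear(dim, 1)                # attractor existence probability
        self.max_speakers = max_speakers

    def forward(self, emb):                           # emb: (B, T, dim) frame-wise embeddings
        _, state = self.encoder(emb)                  # summarize the sequence
        zeros = emb.new_zeros(emb.size(0), 1, emb.size(2))
        attractors = []
        for _ in range(self.max_speakers):
            out, state = self.decoder(zeros, state)   # generate one attractor per step
            a = out[:, 0]
            attractors.append(a)
            if torch.sigmoid(self.exist(a)).mean() < 0.5:   # stopping condition
                break
        attractors = torch.stack(attractors, dim=1)          # (B, S, dim)
        # speaker activity posteriors: dot products of embeddings and attractors
        return torch.sigmoid(torch.einsum("btd,bsd->bts", emb, attractors))

eda = EDA(dim=64)
posteriors = eda(torch.randn(1, 500, 64))
print(posteriors.shape)    # (1, 500, S)
```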
- [INTERSPEECH] Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End ASR Models. Yuki Takashima, Shota Horiguchi, Shinji Watanabe, Paola Garcia, and Yohei Kawaguchi. In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2022.
In this paper, we present an incremental domain adaptation technique to prevent catastrophic forgetting for an end-to-end automatic speech recognition (ASR) model. Conventional approaches require extra parameters of the same size as the model for optimization, and it is difficult to apply these approaches to end-to-end ASR models because they have a huge amount of parameters. To solve this problem, we first investigate which parts of end-to-end ASR models contribute to high accuracy in the target domain while preventing catastrophic forgetting. We conduct experiments on incremental domain adaptation from the LibriSpeech dataset to the AMI meeting corpus with two popular end-to-end ASR models and found that adapting only the linear layers of their encoders can prevent catastrophic forgetting. Then, on the basis of this finding, we develop an element-wise parameter selection focused on specific layers to further reduce the number of fine-tuning parameters. Experimental results show that our approach consistently prevents catastrophic forgetting compared to parameter selection from the whole model.
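A minimal PyTorch sketch of the adaptation recipe described here, i.e., freezing the whole end-to-end ASR model and fine-tuning only the linear layers inside its encoder. The `model.encoder` attribute is an assumed naming convention, not a specific toolkit API.

```python
import torch.nn as nn

def encoder_linear_params(model: nn.Module):
    """Freeze everything, then re-enable only the encoder's nn.Linear layers."""
    for p in model.parameters():
        p.requires_grad = False
    for module in model.encoder.modules():        # assumes the model exposes .encoder
        if isinstance(module, nn.Linear):
            for p in module.parameters():
                p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

# usage (hypothetical): optimizer = torch.optim.Adam(encoder_linear_params(model), lr=1e-4)
```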
- [ICML] Rethinking Fano’s Inequality in Ensemble Learning. Terufumi Morishita, Gaku Morio, Shota Horiguchi, Hiroaki Ozaki, and Nobuo Nukaga. In International Conference on Machine Learning (ICML), Jul 2022.
We propose a fundamental theory on ensemble learning that evaluates a given ensemble system by a well-grounded set of metrics. Previous studies used a variant of Fano’s inequality of information theory and derived a lower bound of the classification error rate on the basis of the accuracy and diversity of models. We revisit the original Fano’s inequality and argue that the studies did not take into account the information lost when multiple model predictions are combined into a final prediction. To address this issue, we generalize the previous theory to incorporate the information loss. Further, we empirically validate and demonstrate the proposed theory through extensive experiments on actual systems. The theory reveals the strengths and weaknesses of systems on each metric, which will push the theoretical understanding of ensemble learning and give us insights into designing systems.
- [Odyssey] Improving the Naturalness of Simulated Conversations for End-to-End Neural Diarization. Natsuo Yamashita, Shota Horiguchi, and Takeshi Homma. In The Speaker and Language Recognition Workshop (Odyssey), Jun 2022.
This paper investigates a method for simulating natural conversation in the model training of end-to-end neural diarization (EEND). Due to the lack of any annotated real conversational dataset, EEND is usually pretrained on a large-scale simulated conversational dataset first and then adapted to the target real dataset. Simulated datasets play an essential role in the training of EEND, but as yet there has been insufficient investigation into an optimal simulation method. We thus propose a method to simulate natural conversational speech. In contrast to conventional methods, which simply combine the speech of multiple speakers, our method takes turn-taking into account. We define four types of speaker transition and sequentially arrange them to simulate natural conversations. The dataset simulated using our method was found to be statistically similar to the real dataset in terms of the silence and overlap ratios. The experimental results on two-speaker diarization using the CALLHOME and CSJ datasets showed that the simulated dataset contributes to improving the performance of EEND.
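The turn-taking-aware simulation can be pictured with a toy generator like the one below, which arranges utterances of two speakers by sampling a transition type between consecutive utterances (same-speaker pause, speaker change with a gap, speaker change with overlap). The transition inventory and distributions here are illustrative; the paper defines four transition types and derives their statistics from real conversations.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_utts=20):
    timeline, t, spk = [], 0.0, 0
    for _ in range(n_utts):
        dur = rng.uniform(1.0, 4.0)                    # utterance duration [s]
        timeline.append((spk, t, t + dur))
        kind = rng.choice(["hold", "switch_gap", "switch_overlap"], p=[0.3, 0.5, 0.2])
        if kind == "hold":                             # same speaker after a short pause
            t = t + dur + rng.exponential(0.5)
        elif kind == "switch_gap":                     # the other speaker after a gap
            spk, t = 1 - spk, t + dur + rng.exponential(0.3)
        else:                                          # the other speaker starts early (overlap)
            spk, t = 1 - spk, t + dur - rng.uniform(0.1, 1.0)
    return timeline

for spk, start, end in simulate()[:5]:
    print(f"spk{spk}: {start:.2f}-{end:.2f}")
```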
- [ICASSP] Multi-Channel End-to-End Neural Diarization with Distributed Microphones. Shota Horiguchi, Yuki Takashima, Paola García, Shinji Watanabe, and Yohei Kawaguchi. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2022.
Recent progress on end-to-end neural diarization (EEND) has enabled overlap-aware speaker diarization with a single neural network. This paper proposes to enhance EEND by using multi-channel signals from distributed microphones. We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input: spatio-temporal and co-attention encoders. Both are independent of the number and geometry of microphones and suitable for distributed microphone settings. We also propose a model adaptation method using only single-channel recordings. With simulated and real-recorded datasets, we demonstrated that the proposed method outperformed conventional EEND when a multi-channel input was given while maintaining comparable performance with a single-channel input. We also showed that the proposed method performed well even when spatial information is inoperative given multi-channel inputs, such as in hybrid meetings in which the utterances of multiple remote participants are played back from the same loudspeaker.
- [ICASSP] Environmental Sound Extraction Using Onomatopoeic Words. Yuki Okamoto, Shota Horiguchi, Masaaki Yamamoto, Keisuke Imoto, and Yohei Kawaguchi. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2022. 🏆 IEEE SPS Japan Student Conference Paper Award.
An onomatopoeic word, which is a character sequence that phonetically imitates a sound, is effective in expressing characteristics of sound such as duration, pitch, and timbre. We propose an environmental-sound-extraction method using onomatopoeic words to specify the target sound to be extracted. By this method, we estimate a time-frequency mask from an input mixture spectrogram and an onomatopoeic word using a U-Net architecture, then extract the corresponding target sound by masking the spectrogram. Experimental results indicate that the proposed method can extract only the target sound corresponding to the onomatopoeic word and performs better than conventional methods that use sound-event classes to specify the target sound.
2021
- [ASRU] Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors. Shota Horiguchi, Paola García, Shinji Watanabe, Yawen Xue, Yuki Takashima, and Yohei Kawaguchi. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2021.
Attractor-based end-to-end diarization is achieving comparable accuracy to the carefully tuned conventional clustering-based methods on challenging datasets. However, the main drawback is that it cannot deal with the case where the number of speakers is larger than the one observed during training. This is because its speaker counting relies on supervised learning. In this work, we introduce an unsupervised clustering process embedded in the attractor-based end-to-end diarization. We first split a sequence of frame-wise embeddings into short subsequences and then perform attractor-based diarization for each subsequence. Given subsequence-wise diarization results, inter-subsequence speaker correspondence is obtained by unsupervised clustering of the vectors computed from the attractors from all the subsequences. This makes it possible to produce diarization results of a large number of speakers for the whole recording even if the number of output speakers for each subsequence is limited. Experimental results showed that our method could produce accurate diarization results of an unseen number of speakers. Our method achieved diarization error rates of 11.84 %, 28.33 %, and 19.49 % on the CALLHOME, DIHARD II, and DIHARD III datasets, respectively, each of which is better than those of conventional end-to-end diarization methods.
- [INTERSPEECH] Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers. Yawen Xue, Shota Horiguchi, Yusuke Fujita, Yuki Takashima, Shinji Watanabe, Paola Garcia, and Kenji Nagamatsu. In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2021.
We propose a streaming diarization method based on an end-to-end neural diarization (EEND) model, which handles flexible numbers of speakers and overlapping speech. In our previous study, the speaker-tracing buffer (STB) mechanism was proposed to achieve a chunk-wise streaming diarization using a pre-trained EEND model. STB traces the speaker information in previous chunks to map the speakers in a new chunk. However, it only worked with two-speaker recordings. In this paper, we propose an extended STB for flexible numbers of speakers, FLEX-STB. The proposed method uses a zero-padding followed by speaker-tracing, which alleviates the difference in the number of speakers between a buffer and a current chunk. We also examine buffer update strategies to select important frames for tracing multiple speakers. Experiments on CALLHOME and DIHARD II datasets show that the proposed method achieves comparable performance to the offline EEND method with 1-second latency. The results also show that our proposed method outperforms recently proposed chunk-wise diarization methods based on EEND (BW-EDA-EEND).
- [INTERSPEECH] Semi-Supervised Training with Pseudo-Labeling for End-to-End Neural Diarization. Yuki Takashima, Yusuke Fujita, Shota Horiguchi, Shinji Watanabe, Paola Garcia, and Kenji Nagamatsu. In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2021.
In this paper, we present a semi-supervised training technique using pseudo-labeling for end-to-end neural diarization (EEND). The EEND system has shown promising performance compared with traditional clustering-based methods, especially in the case of overlapping speech. However, to get a well-tuned model, EEND requires labeled data for all the joint speech activities of every speaker at each time frame in a recording. In this paper, we explore a pseudo-labeling approach that employs unlabeled data. First, we propose an iterative pseudo-label method for EEND, which trains the model using unlabeled data of a target condition. Then, we also propose a committee-based training method to improve the performance of EEND. To evaluate our proposed method, we conduct model adaptation experiments using labeled and unlabeled data. Experimental results on the CALLHOME dataset show that our proposed pseudo-labeling method achieved a 37.4% relative diarization error rate reduction compared to a seed model. Moreover, we analyzed the results of semi-supervised adaptation with pseudo-labeling. We also show the effectiveness of our approach on the third DIHARD dataset.
- [ICASSP] End-to-End Speaker Diarization as Post-Processing. Shota Horiguchi, Paola García, Yusuke Fujita, Shinji Watanabe, and Kenji Nagamatsu. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2021.
This paper investigates the utilization of an end-to-end diarization model as post-processing of conventional clustering-based diarization. Clustering-based diarization methods partition frames into clusters of the number of speakers; thus, they typically cannot handle overlapping speech because each frame is assigned to one speaker. On the other hand, some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification. Although some methods can treat a flexible number of speakers, they do not perform well when the number of speakers is large. To compensate for each other’s weakness, we propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method. We iteratively select two speakers from the results and update the results of the two speakers to improve the overlapped region. Experimental results show that the proposed algorithm consistently improved the performance of the state-of-the-art methods across CALLHOME, AMI, and DIHARD II datasets.
- [SLT] End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection. Yuki Takashima, Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Paola Garcia, and Kenji Nagamatsu. In IEEE Spoken Language Technology Workshop (SLT), Jan 2021.
In this paper, we present a conditional multitask learning method for end-to-end neural speaker diarization (EEND). The EEND system has shown promising performance compared with traditional clustering-based methods, especially in the case of overlapping speech. In this paper, to further improve the performance of the EEND system, we propose a novel multitask learning framework that solves speaker diarization and a desired subtask while explicitly considering the task dependency. We optimize speaker diarization conditioned on speech activity and overlap detection that are subtasks of speaker diarization, based on the probabilistic chain rule. Experimental results show that our proposed method can leverage a subtask to effectively model speaker diarization, and outperforms conventional EEND systems in terms of diarization error rate.
- [SLT] Online End-to-End Neural Diarization with Speaker-Tracing Buffer. Yawen Xue, Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Paola Garcia, and Kenji Nagamatsu. In IEEE Spoken Language Technology Workshop (SLT), Jan 2021.
This paper proposes a novel online speaker diarization algorithm based on a fully supervised self-attention mechanism (SA-EEND). Online diarization inherently presents a speaker’s permutation problem due to the possibility to assign speaker regions incorrectly across the recording. To circumvent this inconsistency, we proposed a speaker-tracing buffer mechanism that selects several input frames representing the speaker permutation information from previous chunks and stores them in a buffer. These buffered frames are stacked with the input frames in the current chunk and fed into a self-attention network. Our method ensures consistent diarization outputs across the buffer and the current chunk by checking the correlation between their corresponding outputs. Additionally, we trained SA-EEND with variable chunk-sizes to mitigate the mismatch between training and inference introduced by the speaker-tracing buffer mechanism. Experimental results, including online SA-EEND and variable chunk-size, achieved DERs of 12.54% for CALLHOME and 20.77% for CSJ with 1.4 s actual latency.
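The core of the speaker-tracing idea can be shown in a few lines: the speaker permutation of a new chunk's outputs is chosen so that, on the buffered frames, they correlate best with the outputs already stored in the buffer. Buffer frame selection is omitted here (the paper studies several strategies), and the correlation criterion below is a simplified stand-in.

```python
import numpy as np
from itertools import permutations

def resolve_permutation(buffer_out, new_out_on_buffer):
    """Both arguments: (frames, speakers) activity posteriors on the buffered frames."""
    S = buffer_out.shape[1]
    best_perm, best_corr = None, -np.inf
    for perm in permutations(range(S)):
        corr = np.sum(buffer_out * new_out_on_buffer[:, perm])
        if corr > best_corr:
            best_perm, best_corr = perm, corr
    return best_perm

rng = np.random.default_rng(0)
buffer_out = rng.uniform(0, 1, (50, 2))
new_out = buffer_out[:, [1, 0]] + rng.normal(0, 0.05, (50, 2))   # speakers swapped in the new chunk
print(resolve_permutation(buffer_out, new_out))                  # -> (1, 0)
```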
- [SLT] Block-Online Guided Source Separation. Shota Horiguchi, Yusuke Fujita, and Kenji Nagamatsu. In IEEE Spoken Language Technology Workshop (SLT), Jan 2021.
We propose a block-online algorithm of guided source separation (GSS). GSS is a speech separation method that uses diarization information to update parameters of the generative model of observation signals. Previous studies have shown that GSS performs well in multi-talker scenarios. However, it requires a large amount of calculation time, which is an obstacle to the deployment of online applications. It is also a problem that the offline GSS is an utterance-wise algorithm so that it produces latency according to the length of the utterance. With the proposed algorithm, block-wise input samples and corresponding time annotations are concatenated with those in the preceding context and used to update the parameters. Using the context enables the algorithm to estimate time-frequency masks accurately only from one iteration of optimization for each block, and its latency does not depend on the utterance length but predetermined block length. It also reduces calculation cost by updating only the parameters of active speakers in each block and its context. Evaluation on the CHiME-6 corpus and a meeting corpus showed that the proposed algorithm achieved almost the same performance as the conventional offline GSS algorithm but with 32x faster calculation, which is sufficient for real-time applications.
- [DIHARD] The Hitachi-JHU DIHARD III System: Competitive End-to-End Neural Diarization and X-vector Clustering Systems Combined by DOVER-Lap. Shota Horiguchi, Nelson Yalta, Paola Garcia, Yuki Takashima, Yawen Xue, Desh Raj, Zili Huang, Yusuke Fujita, Shinji Watanabe, and Sanjeev Khudanpur. In The Third DIHARD Speech Diarization Challenge (DIHARD III), Jan 2021.
This paper provides a detailed description of the Hitachi-JHU system that was submitted to the Third DIHARD Speech Diarization Challenge. The system outputs the ensemble results of the five subsystems: two x-vector-based subsystems, two end-to-end neural diarization-based subsystems, and one hybrid subsystem. We refine each system and all five subsystems become competitive and complementary. After the DOVER-Lap based system combination, it achieved diarization error rates of 11.58 % and 14.09 % in Track 1 full and core, and 16.94 % and 20.01 % in Track 2 full and core, respectively. With their results, we won second place in all the tasks of the challenge.
2020
- [TPAMI] Significance of Softmax-based Features in Comparison to Distance Metric Learning-based Features. Shota Horiguchi, Daiki Ikami, and Kiyoharu Aizawa. IEEE Transactions on Pattern Analysis and Machine Intelligence, May 2020.
End-to-end distance metric learning (DML) has been applied to obtain features useful in many computer vision tasks. However, these DML studies have not provided equitable comparisons between features extracted from DML-based networks and softmax-based networks. In this paper, we present objective comparisons between these two approaches under the same network architecture.
- [INTERSPEECH] Utterance-Wise Meeting Transcription System Using Asynchronous Distributed Microphones. Shota Horiguchi, Yusuke Fujita, and Kenji Nagamatsu. In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Oct 2020.
A novel framework for meeting transcription using asynchronous microphones is proposed in this paper. It consists of audio synchronization, speaker diarization, utterance-wise speech enhancement using guided source separation, automatic speech recognition, and duplication reduction. Doing speaker diarization before speech enhancement enables the system to deal with overlapped speech without considering sampling frequency mismatch between microphones. Evaluation on our real meeting datasets showed that our framework achieved a character error rate (CER) of 28.7% by using 11 distributed microphones, while a monaural microphone placed at the center of the table had a CER of 38.2%. We also showed that our framework achieved a CER of 21.8%, which is only 2.1 percentage points higher than the CER in headset microphone-based transcription.
- [INTERSPEECH] End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors. Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, and Kenji Nagamatsu. In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Oct 2020.
End-to-end speaker diarization for an unknown number of speakers is addressed in this paper. Recently proposed end-to-end speaker diarization outperformed conventional clustering-based speaker diarization, but it has one drawback: it is less flexible in terms of the number of speakers. This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a flexible number of attractors from a speech embedding sequence. Then, the generated multiple attractors are multiplied by the speech embedding sequence to produce the same number of speaker activities. The speech embedding sequence is extracted using the conventional self-attentive end-to-end neural speaker diarization (SA-EEND) network. In a two-speaker condition, our method achieved a 2.69% diarization error rate (DER) on simulated mixtures and an 8.07% DER on the two-speaker subset of CALLHOME, while vanilla SA-EEND attained 4.56% and 9.54%, respectively. Under conditions with unknown numbers of speakers, our method attained a 15.29% DER on CALLHOME, while the x-vector-based clustering method achieved a 19.43% DER.
- [ICRA] Anticipating the Start of User Interaction for Service Robot in the Wild. Koichiro Ito, Quan Kong, Shota Horiguchi, Takashi Sumiyoshi, and Kenji Nagamatsu. In IEEE International Conference on Robotics and Automation (ICRA), Jun 2020.
A service robot is expected to provide proactive service for visitors who require its help. In contrast to passive service, e.g., providing service only after being spoken to, proactive service initiates an interaction at an early stage, e.g., talking to potential visitors who need the robot’s help in advance. This paper addresses how to anticipate the start of user interaction. We propose an approach using only a single RGB camera that anticipates whether a visitor will come to the robot for interaction or just pass it by. In the proposed approach, we (i) utilize the visitor’s pose information from captured images incorporating facial information, (ii) train a CNN-LSTM–based model in an end-to-end manner with an exponential loss for early anticipation, and (iii) during the training, the network branch for facial keypoints acquired as part of the human pose information is taught to mimic the branch trained with the face image from a specialized face detector with a human verification. By virtue of (iii), at the inference, we can run our model in an embedded system processing only the pose information without an additional face detector and typical accuracy drop. We evaluated the proposed approach on our collected real-world data with a real service robot and the publicly available JPL interaction dataset and found that it achieved accurate anticipation performance.
- [SemEval] Hitachi at SemEval-2020 Task 8: Simple but Effective Modality Ensemble for Meme Emotion Recognition. Terufumi Morishita*, Gaku Morio*, Shota Horiguchi, Hiroaki Ozaki, and Toshinori Miyoshi. In The Fourteenth Workshop on Semantic Evaluation (SemEval), Dec 2020. (*) Equal contribution.
Users of social networking services often share their emotions via multi-modal content, usually images paired with text embedded in them. SemEval-2020 task 8, Memotion Analysis, aims at automatically recognizing these emotions of so-called internet memes. In this paper, we propose a simple but effective Modality Ensemble that incorporates visual and textual deep-learning models, which are independently trained, rather than providing a single multi-modal joint network. To this end, we first fine-tune four pre-trained visual models (i.e., Inception-ResNet, PolyNet, SENet, and PNASNet) and four textual models (i.e., BERT, GPT-2, Transformer-XL, and XLNet). Then, we fuse their predictions with ensemble methods to effectively capture cross-modal correlations. The experiments performed on dev-set show that both visual and textual features aided each other, especially in subtask-C, and consequently, our system ranked 2nd on subtask-C.
- [CHiME] CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings. Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, and Neville Ryant. In The 6th International Workshop on Speech Processing in Everyday Environments (CHiME-2020), May 2020.
Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous CHiME-5 recordings except for accurate array synchronization. The material was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech. This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition (Track 1) and unsegmented multispeaker speech recognition (Track 2). Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open source baselines providing speech enhancement, speaker diarization, and speech recognition modules.
- [Preprint] Neural Speaker Diarization with Speaker-Wise Chain Rule. Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Jing Shi, and Kenji Nagamatsu. arXiv:2006.01796, Jun 2020.
Speaker diarization is an essential step for processing multi-speaker audio. Although an end-to-end neural diarization (EEND) method achieved state-of-the-art performance, it is limited to a fixed number of speakers. In this paper, we solve this fixed number of speaker issue by a novel speaker-wise conditional inference method based on the probabilistic chain rule. In the proposed method, each speaker’s speech activity is regarded as a single random variable, and is estimated sequentially conditioned on previously estimated other speakers’ speech activities. Similar to other sequence-to-sequence models, the proposed method produces a variable number of speakers with a stop sequence condition. We evaluated the proposed method on multi-speaker audio recordings of a variable number of speakers. Experimental results show that the proposed method can correctly produce diarization results with a variable number of speakers and outperforms the state-of-the-art end-to-end speaker diarization methods in terms of diarization error rate.
- [Preprint] End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-Label Classification. Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, and Kenji Nagamatsu. arXiv:2003.20966, Feb 2020.
The most common approach to speaker diarization is clustering of speaker embeddings. However, the clustering-based approach has a number of problems; i.e., (i) it is not optimized to minimize diarization errors directly, (ii) it cannot handle speaker overlaps correctly, and (iii) it has trouble adapting their speaker embedding models to real audio recordings with speaker overlaps. To solve these problems, we propose the End-to-End Neural Diarization (EEND), in which a neural network directly outputs speaker diarization results given a multi-speaker recording. To realize such an end-to-end model, we formulate the speaker diarization problem as a multi-label classification problem and introduce a permutation-free objective function to directly minimize diarization errors. Besides its end-to-end simplicity, the EEND method can explicitly handle speaker overlaps during training and inference. Just by feeding multi-speaker recordings with corresponding speaker segment labels, our model can be easily adapted to real conversations. We evaluated our method on simulated speech mixtures and real conversation datasets. The results showed that the EEND method outperformed the state-of-the-art x-vector clustering-based method, while it correctly handled speaker overlaps. We explored the neural network architecture for the EEND method, and found that the self-attention-based neural network was the key to achieving excellent performance. In contrast to conditioning the network only on its previous and next hidden states, as is done using bidirectional long short-term memory (BLSTM), self-attention is directly conditioned on all the frames. By visualizing the attention weights, we show that self-attention captures global speaker characteristics in addition to local speech activity dynamics, making it especially suitable for dealing with the speaker diarization problem.
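The permutation-free objective mentioned above is essentially a permutation-invariant binary cross-entropy; a minimal PyTorch sketch (brute-force over permutations, which is practical for small speaker counts) is shown below.

```python
import torch
import torch.nn.functional as F
from itertools import permutations

def pit_bce_loss(logits, labels):
    """logits, labels: (T, S) frame-wise speaker activities (multi-label)."""
    S = labels.shape[1]
    losses = [
        F.binary_cross_entropy_with_logits(logits, labels[:, list(p)])
        for p in permutations(range(S))
    ]
    return torch.min(torch.stack(losses))    # loss under the best speaker permutation

logits = torch.randn(100, 2)
labels = torch.randint(0, 2, (100, 2)).float()
print(pit_bce_loss(logits, labels))
```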
2019
- [ASRU] Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models. Naoyuki Kanda, Shota Horiguchi, Yusuke Fujita, Yawen Xue, Kenji Nagamatsu, and Shinji Watanabe. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2019.
This paper investigates the use of target-speaker automatic speech recognition (TS-ASR) for simultaneous speech recognition and speaker diarization of single-channel dialogue recordings. TS-ASR is a technique to automatically extract and recognize only the speech of a target speaker given a short sample utterance of that speaker. One obvious drawback of TS-ASR is that it cannot be used when the speakers in the recordings are unknown because it requires a sample of the target speakers in advance of decoding. To remove this limitation, we propose an iterative method, in which (i) the estimation of speaker embeddings and (ii) TS-ASR based on the estimated speaker embeddings are alternately executed. We evaluated the proposed method by using very challenging dialogue recordings in which the speaker overlap ratio was over 20%. We confirmed that the proposed method significantly reduced both the word error rate (WER) and diarization error rate (DER). Our proposed method combined with i-vector speaker embeddings ultimately achieved a WER that differed by only 2.1 % from that of TS-ASR given oracle speaker embeddings. Furthermore, our method can solve speaker diarization simultaneously as a by-product and achieved better DER than that of the conventional clustering-based speaker diarization method based on i-vector.
- [ASRU] End-to-End Neural Speaker Diarization with Self-Attention. Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu, and Shinji Watanabe. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2019.
Speaker diarization has been mainly developed based on the clustering of speaker embeddings. However, the clustering-based approach has two major problems; i.e., (i) it is not optimized to minimize diarization errors directly, and (ii) it cannot handle speaker overlaps correctly. To solve these problems, the End-to-End Neural Diarization (EEND), in which a bidirectional long short-term memory (BLSTM) network directly outputs speaker diarization results given a multi-talker recording, was recently proposed. In this study, we enhance EEND by introducing self-attention blocks instead of BLSTM blocks. In contrast to BLSTM, which is conditioned only on its previous and next hidden states, self-attention is directly conditioned on all the other frames, making it much more suitable for dealing with the speaker diarization problem. We evaluated our proposed method on simulated mixtures, real telephone calls, and real dialogue recordings. The experimental results revealed that the self-attention was the key to achieving good performance and that our proposed method performed significantly better than the conventional BLSTM-based method. Our method even outperformed the state-of-the-art x-vector clustering-based method. Finally, by visualizing the latent representation, we show that the self-attention can capture global speaker characteristics in addition to local speech activity dynamics. Our source code is available online at https://github.com/hitachi-speech/EEND.
- [INTERSPEECH] End-to-End Neural Speaker Diarization with Permutation-Free Objectives. Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, and Shinji Watanabe. In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2019.
In this paper, we propose a novel end-to-end neural-network-based speaker diarization method. Unlike most existing methods, our proposed method does not have separate modules for extraction and clustering of speaker representations. Instead, our model has a single neural network that directly outputs speaker diarization results. To realize such a model, we formulate the speaker diarization problem as a multi-label classification problem, and introduce a permutation-free objective function to directly minimize diarization errors without suffering from the speaker-label permutation problem. Besides its end-to-end simplicity, the proposed method also benefits from being able to explicitly handle overlapping speech during training and inference. Because of this benefit, our model can be easily trained/adapted with real-recorded multi-speaker conversations just by feeding the corresponding multi-speaker segment labels. We evaluated the proposed method on simulated speech mixtures. The proposed method achieved a diarization error rate of 12.28%, while a conventional clustering-based system produced a diarization error rate of 28.77%. Furthermore, the domain adaptation with real-recorded speech provided a 25.6% relative improvement on the CALLHOME dataset.
- [INTERSPEECH] Multimodal Response Obligation Detection with Unsupervised Online Domain Adaptation. Shota Horiguchi, Naoyuki Kanda, and Kenji Nagamatsu. In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2019.
Response obligation detection, which determines whether a dialogue robot has to respond to a detected utterance, is an important function for intelligent dialogue robots. Some studies have tackled this problem; however, they narrow their applicability by impractical assumptions or use of scenario-specific features. Some attempts have been made to widen the applicability by avoiding the use of text modality, which is said to be highly domain dependent, but it decreases the detection accuracy. In this paper, we propose a novel multimodal response obligation detector, which uses visual, audio, and text information for highly-accurate detection, with its unsupervised online domain adaptation to solve the domain dependency problem. Our domain adaptation consists of the weights adaptation of the logistic regression for every modality and an embedding assignment for new words to cope with the high domain dependency of text modality. Experimental results on the dataset collected at a station and commercial building showed that our method achieved high response obligation detection accuracy and was able to handle domain change automatically.
- [INTERSPEECH] Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition. Naoyuki Kanda, Shota Horiguchi, Ryoichi Takashima, Yusuke Fujita, Kenji Nagamatsu, and Shinji Watanabe. In The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2019.
In this paper, we propose a novel auxiliary loss function for target-speaker automatic speech recognition (ASR). Our method automatically extracts and transcribes target speaker’s utterances from a monaural mixture of multiple speakers’ speech given a short sample of the target speaker. The proposed auxiliary loss function attempts to additionally maximize interference speaker ASR accuracy during training. This will regularize the network to achieve a better representation for speaker separation, thus achieving better accuracy on the target-speaker ASR. We evaluated our proposed method using two-speaker-mixed speech in various signal-to-interference-ratio conditions. We first built a strong target-speaker ASR baseline based on the state-of-the-art lattice-free maximum mutual information. This baseline achieved a word error rate (WER) of 18.06% on the test set while a normal ASR trained with clean data produced a completely corrupted result (WER of 84.71%). Then, our proposed loss further reduced the WER by 6.6% relative to this strong baseline, achieving a WER of 16.87%. In addition to the accuracy improvement, we also showed that the auxiliary output branch for the proposed loss can even be used for a secondary ASR for interference speakers’ speech.
- INTERSPEECHGuided Source Separation Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ScenarioNaoyuki Kanda, Christoph Boeddeker, Jens Heitkaemper, Yusuke Fujita, Shota Horiguchi, Kenji Nagamatsu, and Reinhold Haeb-UmbachIn The Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep 2019
In this paper, we present Hitachi and Paderborn University’s joint effort for automatic speech recognition (ASR) in a dinner party scenario. The main challenges of ASR systems for dinner party recordings obtained by multiple microphone arrays are (1) heavy speech overlaps, (2) severe noise and reverberation, (3) very natural conversational content, and possibly (4) insufficient training data. As an example of a dinner party scenario, we have chosen the data presented during the CHiME-5 speech recognition challenge, where the baseline ASR had a 73.3% word error rate (WER), and even the best performing system at the CHiME-5 challenge had a 46.1% WER. We extensively investigated a combination of the guided source separation-based speech enhancement technique and a previously proposed strong ASR backend and found that a tight combination of these techniques provided substantial accuracy improvements. Our final system achieved WERs of 39.94% and 41.64% for the development and evaluation data, respectively, both of which are the best published results for the dataset. We also investigated training with data beyond the official small amount provided in the CHiME-5 corpus to assess the intrinsic difficulty of this ASR task.
- ICASSPAcoustic Modeling for Distant Multi-Talker Speech Recognition with Single- and Multi-Channel BranchesNaoyuki Kanda, Yusuke Fujita, Shota Horiguchi, Rintaro Ikeshita, Kenji Nagamatsu, and Shinji WatanabeIn IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019
This paper presents a novel heterogeneous-input multi-channel acoustic model (AM) that has both single-channel and multi-channel input branches. In our proposed training pipeline, a single-channel AM is trained first, then a multi-channel AM is trained starting from the single-channel AM with a randomly initialized multi-channel input branch. Our model uniquely uses the power of a complementary speech enhancement (SE) module while exploiting the power of a jointly trained AM and SE architecture. Our method was the foundation of the Hitachi/JHU CHiME-5 system that achieved the second-best result in the CHiME-5 competition, and this paper details various investigation results that we were not able to present during the competition period. We also evaluated and reconfirmed our method’s effectiveness on the AMI Meeting Corpus. Our AM achieved a 30.12% word error rate (WER) on the development set and a 32.33% WER on the evaluation set of the AMI Corpus, both of which are the best results ever reported to the best of our knowledge.
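A hedged PyTorch sketch of the two-stage pipeline described above: a single-channel model is trained first, and the multi-channel model starts from its parameters while only the new multi-channel input branch is randomly initialized. Module names, layer choices, and the simple sum used to fuse the two branches are illustrative assumptions, not the paper's architecture details.

```python
import torch
import torch.nn as nn

class SingleChannelAM(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, n_states=2000):
        super().__init__()
        self.input_branch = nn.Linear(feat_dim, hidden)       # single-channel branch
        self.backbone = nn.GRU(hidden, hidden, batch_first=True)
        self.output = nn.Linear(hidden, n_states)

    def forward(self, x):                                     # x: (B, T, feat_dim)
        h, _ = self.backbone(torch.relu(self.input_branch(x)))
        return self.output(h)

class MultiChannelAM(SingleChannelAM):
    def __init__(self, n_channels=4, feat_dim=80, hidden=256, n_states=2000):
        super().__init__(feat_dim, hidden, n_states)
        # Additional multi-channel input branch, randomly initialized.
        self.mc_branch = nn.Linear(n_channels * feat_dim, hidden)

    def forward(self, x_sc, x_mc):                            # x_mc: (B, T, C*feat_dim)
        # Fuse the single- and multi-channel branches before the shared backbone
        # (a plain sum here; the actual fusion may differ).
        h_in = torch.relu(self.input_branch(x_sc)) + torch.relu(self.mc_branch(x_mc))
        h, _ = self.backbone(h_in)
        return self.output(h)

# Stage 1: train the single-channel AM (training loop omitted).
sc_am = SingleChannelAM()
# Stage 2: initialize the multi-channel AM from the single-channel parameters;
# only the multi-channel branch remains randomly initialized.
mc_am = MultiChannelAM()
mc_am.load_state_dict(sc_am.state_dict(), strict=False)
```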
- WACVOmnidirectional Pedestrian Detection by Rotation Invariant TrainingMasato Tamura, Shota Horiguchi, and Tomokazu MurakamiIn IEEE Winter Conference on Applications of Computer Vision (WACV), Jan 2019
Recently, much progress has been made in pedestrian detection by utilizing the learning ability of convolutional neural networks (CNNs). However, due to the lack of omnidirectional images to train CNNs, few CNN-based detectors have been proposed for omnidirectional pedestrian detection. One significant difference between omnidirectional images and perspective images is that the appearance of pedestrians is rotated in omnidirectional images. A previous method dealt with this by transforming omnidirectional images into perspective images in the test phase. However, this method has significant drawbacks, namely, the computational cost and the performance degradation caused by the transformation. To address this issue, we propose a rotation invariant training method, which only uses randomly rotated perspective images without any additional annotation. With this method, existing large-scale datasets can be utilized. In the test phase, omnidirectional images can be used without the transformation. To group predicted bounding boxes, we also develop a bounding box refinement, which works better for our detector than non-maximum suppression. The proposed detector achieved state-of-the-art performance on four public benchmarks.
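As a rough sketch of the training-time augmentation idea, the code below rotates a perspective training image by a random angle and rotates the box centers accordingly. Keeping the box width and height unchanged is a simplification for illustration only, and the sign convention assumes PIL's counter-clockwise rotation in display coordinates.

```python
import math
import random
from PIL import Image

def random_rotate(image, boxes):
    """Rotate a PIL image and its (cx, cy, w, h) boxes by a random angle."""
    angle = random.uniform(0.0, 360.0)
    rotated = image.rotate(angle, expand=False)   # counter-clockwise rotation
    cx0, cy0 = image.width / 2.0, image.height / 2.0
    theta = math.radians(-angle)                  # image coords: y axis points down
    new_boxes = []
    for cx, cy, w, h in boxes:
        dx, dy = cx - cx0, cy - cy0
        rx = cx0 + dx * math.cos(theta) - dy * math.sin(theta)
        ry = cy0 + dx * math.sin(theta) + dy * math.cos(theta)
        new_boxes.append((rx, ry, w, h))          # size kept as-is (approximation)
    return rotated, new_boxes
```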
2018
- TMMPersonalized Classifier for Food Image RecognitionShota Horiguchi, Sosuke Amano, Makoto Ogawa, and Kiyoharu AizawaIEEE Transactions on Multimedia, Oct 2018
Currently, food image recognition tasks are evaluated against fixed datasets. However, in real-world conditions, there are cases in which the number of samples in each class continues to increase and samples from novel classes appear. In particular, dynamic datasets in which each individual user creates samples and continues the updating process often have content that varies considerably between different users, and the number of samples per person is very limited. A single classifier common to all users cannot handle such dynamic data. Bridging the gap between the laboratory environment and the real world has not yet been accomplished on a large scale. Personalizing a classifier incrementally for each user is a promising way to do this. In this paper, we address the personalization problem, which involves adapting to the user’s domain incrementally using a very limited number of samples. We propose a simple yet effective personalization framework, which is a combination of the nearest class mean classifier and the 1-nearest neighbor classifier based on deep features. To conduct realistic experiments, we made use of a new dataset of daily food images collected by a food-logging application. Experimental results show that our proposed method significantly outperforms existing methods.
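A minimal sketch of a personalized classifier in the spirit described above: deep features are stored per user, class means are updated incrementally, and prediction considers both the nearest class mean and the single nearest stored sample. The exact combination rule below (take whichever is closer) is an illustrative assumption, not necessarily the paper's rule.

```python
import numpy as np

class PersonalizedClassifier:
    def __init__(self):
        self.samples = []       # list of (feature, label) pairs for this user
        self.class_sums = {}    # label -> running sum of features
        self.class_counts = {}  # label -> number of samples

    def add(self, feature, label):
        """Incrementally register one labelled sample."""
        x = np.asarray(feature, dtype=float)
        self.samples.append((x, label))
        self.class_sums[label] = self.class_sums.get(label, 0.0) + x
        self.class_counts[label] = self.class_counts.get(label, 0) + 1

    def predict(self, feature):
        """Return the label of the nearest class mean or the nearest sample."""
        x = np.asarray(feature, dtype=float)
        # Nearest class mean (NCM).
        ncm = min(
            (np.linalg.norm(x - self.class_sums[c] / self.class_counts[c]), c)
            for c in self.class_sums
        )
        # 1-nearest neighbour (1-NN).
        nn = min((np.linalg.norm(x - f), c) for f, c in self.samples)
        return min(ncm, nn)[1]
```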
- ACMMMFace-Voice Matching Using Cross-Modal EmbeddingsShota Horiguchi, Naoyuki Kanda, and Kenji NagamatsuIn ACM International Conference on Multimedia (ACMMM), Oct 2018
Face-voice matching is the task of finding correspondences between faces and voices. Many studies in cognitive science have confirmed the human ability to perform face-voice matching. Such an ability is useful for creating natural human-machine interaction systems and in many other applications. In this paper, we propose a face-voice matching model that learns cross-modal embeddings between face images and voice characteristics. We constructed a novel FVCeleb dataset which consists of face images and utterances from 1,078 persons. These persons were selected from the MS-Celeb-1M face image dataset and the VoxCeleb audio dataset. In a two-alternative forced-choice matching task with an audio input and two face-image candidates of the same gender, our model achieved 62.2% and 56.5% accuracy on the FVCeleb dataset and a subset of the GRID corpus, respectively. These results are very similar to human performance reported in cognitive science studies.
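The two-alternative forced-choice evaluation reduces to a simple comparison once embeddings live in a shared space: pick the face whose embedding is more similar to the voice embedding. The sketch below uses cosine similarity as the comparison measure, which is an assumption for illustration; the learned encoders that produce the embeddings are omitted.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def choose_face(voice_embedding, face_embedding_a, face_embedding_b):
    """Return 0 if face A matches the voice better, otherwise 1."""
    sim_a = cosine_similarity(voice_embedding, face_embedding_a)
    sim_b = cosine_similarity(voice_embedding, face_embedding_b)
    return 0 if sim_a >= sim_b else 1
```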
- CHiMEThe Hitachi/JHU CHiME-5 System: Advances in Speech Recognition for Everyday Home Environments Using Multiple Microphone ArraysNaoyuki Kanda, Rintaro Ikeshita, Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu, Xiaofei Wang, Vimal Manohar, Nelson Enrique Yalta Soplin, Matthew Maciejewski, Szu-Jui Chen, Aswin Shanmugam Subramanian, Ruizhi Li, Zhiqi Wang, Jason Naradowsky, L. Paola Garcia-Perera, and Gregory SellIn The 5th International Workshop on Speech Processing in Everyday Environments (CHiME-2018), Sep 2018
This paper presents Hitachi and JHU’s efforts on developing a CHiME-5 system to recognize dinner party speech recorded by multiple microphone arrays. We newly developed (1) a way to apply multiple data augmentation methods, (2) residual bidirectional long short-term memory networks, (3) 4-ch acoustic models, (4) multiple-array combination methods, (5) a hypothesis deduplication method, and (6) a speaker adaptation technique for the neural beamformer. As a result, our best system in category B achieved a word error rate (WER) of 52.38% on the development set, corresponding to a 35% relative WER reduction from the state-of-the-art baseline. Our best system also achieved a WER of 48.20% on the evaluation set, which was the second-best result in the CHiME-5 competition.
2016
- Food Search Based on User Feedback to Assist Image-Based Food Recording SystemsSosuke Amano, Shota Horiguchi, Kiyoharu Aizawa, Kazuki Maeda, Masanori Kubota, and Makoto OgawaIn International Workshop On Multimedia Assisted Dietary Management (MADiMa), Oct 2016
Food diaries or diet journals are thought to be effective for improving the dietary lives of users. One important challenge in this field involves assisting users in recording their daily food intake. In recent years, food image recognition has attracted a considerable amount of research interest as a new technology to help record users’ food intake. However, there are so many types of food that it is unrealistic to expect a system to recognize all of them. In this paper, we propose an optimal combination of image recognition and interactive search to record users’ food intake. The image recognition generates a list of candidate names for a given food picture. The user chooses the name closest to the meal, which triggers an associative food search based on food contents, such as ingredients. We show that the proposed system efficiently assists users in maintaining food journals.
- ICIPThe Log-Normal Distribution of the Size of Objects in Daily Meal Images and Its Application to the Efficient Reduction of Object ProposalsShota Horiguchi, Kiyoharu Aizawa, and Makoto OgawaIn IEEE International Conference on Image Processing (ICIP), Sep 2016
In general, object-detection methods apply classifiers to pre-calculated object proposals. It is therefore important to minimize the number of proposals to achieve computational efficiency. In this paper, we show that the region size of food objects in recorded images of daily food follows a log-normal distribution, which is different from the distribution in widely used datasets collected by querying the names of dishes. We explain this characteristic using Gibrat’s law and construct a model of the region-size distribution of objects in images. We applied the model to the filtering of object proposals generated by selective search and edge boxes. We obtained a significant reduction of 40.6% in the number of hypotheses compared with conventional selective search, despite a decrease of only 0.007 in the Mean Average Best Overlap.
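A minimal sketch of using a log-normal region-size prior to filter proposals, in the spirit of the method above: fit a Gaussian to the log of relative region sizes observed in training images, then discard proposals whose log-size falls outside a central interval of that distribution. The interval width is an illustrative choice, not the paper's exact filtering criterion.

```python
import numpy as np

def fit_log_size_prior(region_sizes):
    """Fit mean/std of the log relative region size (box area / image area)."""
    logs = np.log(np.asarray(region_sizes, dtype=float))
    return logs.mean(), logs.std()

def filter_proposals(proposals, image_area, mu, sigma, n_std=2.0):
    """Keep (x, y, w, h) proposals whose log relative size lies within mu +/- n_std*sigma."""
    kept = []
    for (x, y, w, h) in proposals:
        log_size = np.log((w * h) / image_area)
        if abs(log_size - mu) <= n_std * sigma:
            kept.append((x, y, w, h))
    return kept
```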