Keyword Extraction of Scientific Journal Abstracts Using TF-IDF and KeyBERT Methods
https://doi.org/10.58466/aicoms.v4i2.1814
Keywords:
Cosine Similarity, Information Retrieval, Keyword Extraction, KeyBERT, Semantic Similarity, TF-IDF
Abstract
Keyword extraction is a core technique in natural language processing (NLP) that condenses a document, such as a scientific journal abstract, into its essential terms. This study compares the effectiveness of two keyword extraction methods, Term Frequency-Inverse Document Frequency (TF-IDF) and KeyBERT, at identifying significant keywords in a collection of scientific journal abstracts. The dataset consists of several scientific journal abstracts, each accompanied by manually assigned keywords that serve as the reference for evaluation. TF-IDF scores a word by its frequency in a document, weighted against how common the word is across the collection, while KeyBERT uses embeddings from the BERT transformer model and cosine similarity to select the keywords most semantically similar to the document. The results show that both methods reach a moderate level of similarity to the manual keywords, with semantic similarity values of 0.578 for KeyBERT and 0.469 for TF-IDF. These results suggest that both methods, used with machine learning and deep learning models, have significant potential for topic classification systems, particularly in information retrieval and text mining.
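The two scoring ideas named in the abstract can be illustrated with a minimal standard-library sketch. It is not the study's implementation: the smoothed IDF formula, the helper names (tokenize, tfidf_keywords, cosine_similarity, rank_candidates), the sample sentences, and the toy two-dimensional vectors that stand in for BERT embeddings are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, doc_index, top_n=3):
    """Return the top_n terms of docs[doc_index] ranked by TF-IDF.

    Assumption: a smoothed IDF, ln((1 + N) / (1 + df)) + 1, is used here;
    the abstract does not state which weighting variant the study applied.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    tf = Counter(tokenized[doc_index])
    total = sum(tf.values())
    scores = {
        term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(doc_vec, candidates, top_n=3):
    """Rank candidate phrases by cosine similarity to the document vector.

    `candidates` maps each phrase to its embedding. In a KeyBERT-style
    pipeline these vectors would come from a BERT-based encoder; here they
    are hypothetical toy values.
    """
    ranked = sorted(
        candidates,
        key=lambda phrase: -cosine_similarity(doc_vec, candidates[phrase]),
    )
    return ranked[:top_n]


if __name__ == "__main__":
    docs = [
        "keyword extraction summarizes the abstract with keyword ranking",
        "the transformer model encodes the abstract",
        "the retrieval system indexes the documents",
    ]
    print(tfidf_keywords(docs, 0, top_n=1))

    toy_doc_vec = [1.0, 0.0]
    toy_candidates = {"keyword extraction": [0.9, 0.1], "weather": [0.1, 0.9]}
    print(rank_candidates(toy_doc_vec, toy_candidates, top_n=1))
```

The same cosine_similarity function could also compare an extracted keyword's embedding with a manual keyword's embedding, which is one plausible way to obtain a semantic similarity score like those reported above, although the abstract does not describe the exact evaluation procedure.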
License
Copyright (c) 2025 Rayvin Suhartoyo, Valen Julyo, Hafiz Irsyad, Abdul Rahman

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


