
How to Do Topic Modelling and Cuisine Classification Using NLP

In this post, we explore how NLP (Natural Language Processing) can be used to determine the culinary origin of an unfamiliar dish. We take two approaches: cuisine classification based on ingredients, and topic modeling based on meal descriptions.

Firstly, we will delve into cuisine classification by examining the ingredients. We can employ NLP techniques to identify patterns and associations that align with specific world cuisines by analyzing the dish's composition. This method involves training a model on a dataset of labeled recipes from various cuisines. The model learns the distinctive ingredient combinations that characterize each cuisine, enabling it to make predictions on new, unseen dishes.

Additionally, we will explore topic modeling using meal descriptions. Meal descriptions provide insights into the cultural and contextual aspects of a dish. By extracting the latent topics in these descriptions, we can identify the key themes associated with different cuisines and infer a dish's likely culinary origin from them.

By combining these two approaches, we can enhance the accuracy and robustness of our cuisine classification system. Using NLP in this context opens up exciting possibilities for automatically identifying the culinary heritage of dishes and expanding our knowledge and appreciation of diverse world cuisines.

What is NLP?


Natural Language Processing (NLP) refers to the capability of artificial intelligence systems to comprehend, interpret, and manipulate human language the way humans do. The field aims to enable machines to understand and interact effectively with human language, whether in the form of spoken words or written text. NLP finds applications in many domains, including chatbots for customer service in industries like airlines and banking, spam filtering in email services such as Gmail, and voice-activated assistants like Siri on Apple devices.

NLP encompasses several vital components: speech recognition, which converts spoken language into written text; natural language understanding, which focuses on comprehending the meaning and intent behind human language; and text generation, which automatically produces coherent and contextually appropriate text.

In this project, we will explore the fascinating field of NLP and delve into various aspects of it. We will examine techniques and algorithms used in speech recognition, natural language understanding, and text generation. By gaining insights into these areas, we can better appreciate the capabilities of NLP and its potential to enhance human-computer interaction and enable a wide range of applications. So, let's embark on this journey into Natural Language Processing!

Methodology

Scraping the Website Data

For this project, we gathered essential data from two popular websites, "BBC Food" and "Epicurious." To accomplish this, we employed web scraping techniques using the BeautifulSoup library, which allowed us to extract information from the websites efficiently. As a result, we acquired a comprehensive dataset comprising more than 5,000 entries, encompassing ingredients, explanations, and cooking methods for various dishes.
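A minimal sketch of this scraping step is shown below. The URL and CSS selectors are hypothetical placeholders; the actual markup of BBC Food and Epicurious differs and changes over time, so the selectors would need to be adapted to each site.

```python
import requests
from bs4 import BeautifulSoup

def scrape_recipe(url):
    # Fetch the recipe page (a polite User-Agent and timeout are good practice).
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Hypothetical selectors -- adjust to the actual page structure.
    title = soup.select_one("h1").get_text(strip=True)
    ingredients = [li.get_text(strip=True) for li in soup.select("li.ingredient")]
    description = soup.select_one("p.description")
    return {
        "title": title,
        "ingredients": ingredients,
        "explanation": description.get_text(strip=True) if description else "",
    }
```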


Using the collected dataset, we developed a machine learning model tailored to the task. We used the data from the "Ingredients" column as the primary input for the model. Trained on this information, the model became adept at recognizing and analyzing the ingredients that appear in different dishes.

Data Processing

Before constructing the model, a data cleaning process was performed to ensure the quality and consistency of the dataset. Several steps were taken to clean the data effectively.

To begin, punctuation marks were removed from the text, and all letters were converted to lowercase. This step helps in standardizing the text and avoiding any discrepancies due to case sensitivity.

Next, numerical values indicating quantity were eliminated from the data since they are not relevant for our analysis. This ensures that the focus remains solely on the ingredients themselves.

Additionally, stopwords were removed from the text. Stopwords are commonly used words that do not contribute significant meaning to the overall context. By eliminating stopwords, we can reduce noise and focus on more meaningful words in the dataset.

By performing these data cleaning steps, we are able to create a cleaner and more streamlined dataset, which ultimately improves the accuracy and effectiveness of the machine learning model and topic modeling techniques applied to the data.
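The sketch below implements these cleaning steps, assuming NLTK's English stopword list (available after nltk.download("stopwords")); the exact pipeline used in the original work may differ slightly.

```python
import re
import string
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

def clean_text(text):
    text = text.lower()                                       # standardize case
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", "", text)                           # drop quantities
    tokens = [w for w in text.split() if w not in stop_words] # remove stopwords
    return " ".join(tokens)

clean_text("2 tbsp. Olive Oil, finely chopped")  # -> "tbsp olive oil finely chopped"
```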


To further refine the dataset and reduce the variety of word forms, NLTK's WordNetLemmatizer was employed. Lemmatization maps inflected forms to a common base form (for example, "tomatoes" becomes "tomato"), which shrinks the vocabulary and can positively impact the model's performance.
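A minimal example of this step (the WordNet data is available after nltk.download("wordnet")):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    # Lemmatize each token with the default noun part of speech.
    return " ".join(lemmatizer.lemmatize(word) for word in text.split())

lemmatize("tomatoes onions chopped")  # -> "tomato onion chopped"
```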


As part of the data preprocessing phase, an additional step was taken to remove rare words from the dataset, since words that appear infrequently or erroneously might have been collected during the web scraping process. To address this, the Counter class from Python's collections module was used to count the frequency of each word, and words below a frequency threshold were dropped.
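A sketch of this filter is shown below; the minimum-frequency threshold of 3 is an assumption for illustration, not a value stated in the post.

```python
from collections import Counter

def remove_rare_words(documents, min_count=3):
    # Count every token across the whole corpus.
    counts = Counter(word for doc in documents for word in doc.split())
    # Keep only words that appear at least min_count times.
    return [
        " ".join(word for word in doc.split() if counts[word] >= min_count)
        for doc in documents
    ]
```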

Exploratory Data Analysis

The graph illustrates the distribution of the target values, representing different world cuisines. It is evident that there is an imbalance in the dataset, where certain cuisines are more prevalent than others.

To address this issue and ensure a balanced representation of cuisines in the model, a strategy was implemented during the scraping process. Specifically, cuisines such as British and Irish, which exhibit significant similarities in terms of their culinary traditions, were grouped together as "British/Irish". Similarly, cuisines like Indian, Spanish, Pakistani, and Portuguese, which share commonalities in terms of ingredients and flavors, were combined as a single category.

By merging these similar cuisines, the dataset achieves a more balanced distribution among the target values. This is important for training the machine learning model, as it helps prevent bias towards overrepresented cuisines and ensures that all cuisines have a comparable impact on the learning process. Maintaining a balanced dataset enhances the model's ability to generalize and make accurate predictions across various cuisines.
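A sketch of this rebalancing step with pandas follows; the DataFrame contents, the column name "cuisine", and the exact groupings are assumptions based on the description above.

```python
import pandas as pd

# df stands in for the scraped recipe DataFrame; a tiny example for illustration.
df = pd.DataFrame({"cuisine": ["British", "Irish", "Indian", "Pakistani", "Mexican"]})

# Map closely related cuisines onto a shared label.
merge_map = {
    "British": "British/Irish",
    "Irish": "British/Irish",
}
df["cuisine"] = df["cuisine"].replace(merge_map)

# Inspect the class balance; a bar plot of these counts reproduces the graph above.
print(df["cuisine"].value_counts())
```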

[Figure: distribution of recipes across the target cuisines]

The word cloud visualization effectively depicts the relationship between cuisines and their corresponding ingredients in the "Ingredients" column. By examining the word cloud, it becomes evident that different cuisines have distinct ingredients, reflecting their unique culinary characteristics.

For anyone familiar with world cuisines, the generated word cloud aligns with expectations. It highlights specific ingredients commonly associated with each cuisine, allowing us to gain insights into the key components and flavors that define different culinary traditions. This visualization not only presents the ingredients in an aesthetically pleasing way but also sparks ideas about the unusual and noteworthy ingredients used in each cuisine.
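A sketch of how such clouds can be generated with the wordcloud package; the sample texts are invented placeholders, and the figure in the post may have been produced differently.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def plot_ingredient_cloud(texts, cuisine):
    # texts: cleaned ingredient strings for one cuisine.
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate(" ".join(texts))
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(cuisine)
    plt.show()

plot_ingredient_cloud(["soy sauce ginger garlic", "rice vinegar sesame oil"], "Chinese")
```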

[Figure: word clouds of common ingredients for each cuisine]

Cuisine Classification

During the data collection process, we created two important components: the target variable and the text column, which serves as a crucial feature for our machine learning model. However, in order for the machine to effectively understand and process the text column, we need to convert it into a numeric representation. There are several methods that can be employed for this purpose, and we will outline them before proceeding with the modeling phase.

CountVectorizer and TF-IDF (Term Frequency-Inverse Document Frequency)

To effectively use the text columns in machine learning algorithms, we need to convert them into numerical vectors. Two common approaches for text vectorization are CountVectorizer and TF-IDF.

CountVectorizer: This method creates a document matrix in which each row represents a document and each column represents a unique word in the corpus. Each cell holds the number of times that word appears in the document.

TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF weighs both the frequency of a word within a document (term frequency) and its rarity across the corpus (inverse document frequency). The resulting document matrix reflects the weighted importance of words in the documents.

To achieve the highest accuracy for our models, we applied both the CountVectorizer and TF-IDF tokenization methods. Additionally, we utilized n-grams, which consider sequences of words instead of single words, to capture more contextual information from the text data.
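The sketch below shows both vectorizers on a couple of toy documents; the n-gram range and the sample texts are illustrative rather than the exact values used in the post.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["garlic onion tomato basil", "soy sauce ginger garlic"]  # cleaned ingredient strings

# Raw counts over unigrams and bigrams (the n-gram range is illustrative).
count_vec = CountVectorizer(ngram_range=(1, 2))
X_counts = count_vec.fit_transform(docs)

# TF-IDF weighting; the best model below used TF-IDF without n-grams.
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)

print(count_vec.get_feature_names_out())
```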


We experimented with several NLP models; the results are summarized in the chart below. The Random Forest model suffered from overfitting, as indicated by the significant gap between its training and test accuracies.

Among the models tested, Multinomial Naive Bayes performed best, achieving a test accuracy of 74%. This model used the TF-IDF transformation without n-grams. Further tuning with Grid Search CV to explore various parameter combinations actually lowered the accuracy to 71%.

Therefore, the Multinomial Naive Bayes model with TF-IDF transformation emerged as the most effective in this project, offering satisfactory accuracy.

[Figure: training and test accuracy for each model]
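A sketch of this winning setup, with texts and labels standing in for the cleaned ingredient strings and cuisine labels; the split ratio and parameter grid are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# texts: cleaned ingredient strings; labels: cuisine names (assumed inputs).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# TF-IDF features feeding Multinomial Naive Bayes, as described above.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
pipe.fit(X_train, y_train)
print("Test accuracy:", pipe.score(X_test, y_test))

# Hyperparameter search as tried in the post (it did not improve accuracy there).
grid = GridSearchCV(pipe, {"nb__alpha": [0.1, 0.5, 1.0]}, cv=5)
grid.fit(X_train, y_train)
print("Best CV accuracy:", grid.best_score_)
```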

Topic Modelling


Topic modeling is an effective method for grouping documents based on their content. In this project, we utilized Latent Dirichlet Allocation (LDA), a popular technique for topic modeling. By applying LDA to the "Explanations" column, we aimed to understand the different topics related to the dishes.

Following a preprocessing approach similar to the one described earlier, we tokenized the text and kept only the nouns and adjectives. We then transformed the text into vectors using CountVectorizer and examined the resulting topics.
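A sketch of this pipeline follows; explanations stands in for the scraped description column, the topic count of three matches the text below, and the remaining vectorizer and LDA parameters are illustrative assumptions.

```python
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def nouns_and_adjectives(text):
    # Keep only tokens tagged as nouns (NN*) or adjectives (JJ*).
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return " ".join(word for word, tag in tagged if tag.startswith(("NN", "JJ")))

# explanations: the scraped dish descriptions (assumed input).
docs = [nouns_and_adjectives(text) for text in explanations]

vec = CountVectorizer(max_df=0.9, min_df=5)
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(X)

# Print the top ten words per topic to interpret the themes.
words = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-10:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```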

After evaluating different topic models, we found that the model with three topics yielded the most meaningful results. Here is a brief overview of the identified topics:

Topic 1: Ingredients and Cooking Techniques - This topic focuses on discussions related to various ingredients used in cooking, as well as different cooking methods and techniques employed in preparing the dishes.

Topic 2: Cultural and Regional Influences - This topic revolves around the cultural and regional aspects of different cuisines. It includes discussions about traditional cooking styles, local ingredients, and specific dishes associated with certain regions or cultures.

Topic 3: Flavor Profiles and Seasonings - This topic explores the flavor profiles of dishes, highlighting the use of specific seasonings, spices, and flavors to enhance the taste and aroma of the prepared meals.

By analyzing the topics generated by the LDA model, we can gain insights into the different aspects and themes present in the explanations of the dishes, helping us understand the content more effectively.

[Figure: LDA topic modeling output]

Topic 0 corresponds to Healthy Food.

Topic 1 corresponds to Desserts.

Topic 2 corresponds to Mexican Food.

Thanks a lot for reading our post! For more details, you can contact Actowiz Solutions now! Ask us about all your mobile app scraping and web scraping service requirements.
