Topic outline

  • Introduction

    Module Lecturer

    Mafas Raheem

    Data Scientist | Business Analyst | Senior Lecturer

    I am an academic, trainer and researcher specializing in Data Science & Business Analytics, with nearly 17 years of academic and industry experience. I hold an MSc in Data Science & Business Analytics and a Master of Business Administration, and I am currently reading for my PhD in machine learning (Natural Language Processing) at the Asia Pacific University of Technology & Innovation, Malaysia. I have published a significant number of indexed journal articles on Machine Learning and Data Science that address current business needs.

    I am actively involved in consulting on data analytics and machine learning projects for the business and retail domains, and I have worked on numerous data mining projects in Malaysia and overseas. My knowledge of statistics, together with my data mining and machine learning expertise, adds value in solving the contemporary business problems that SMEs face in market expansion. I also conduct training for data analysts and data science professionals in machine learning, data storytelling and business analysis.

    LinkedIn | Google Scholar

    Email: raheem@apu.edu.my

    Email Subject: CT052-3-M-ODL-NLP – your intake – your name – subject/request title
    Use only your official APU email for correspondence.

    Consultation:

    Refer to “Staff Consultation Hour” on APU Apspace to book appointments.


    Module Synopsis
    The module discusses various models and techniques in current NLP practice. It covers a broad range of topics in natural language processing, including word and sentence tokenization, text classification and sentiment analysis, spelling correction, information extraction, and parsing. It also introduces the underlying theory from probability, statistics, and machine learning that is crucial for the field, and covers fundamental algorithms such as n-gram language modelling, Naive Bayes and maxent classifiers. These theories and concepts will be delivered using relevant natural language processing libraries such as NLTK, TextBlob, VADER, langdetect and translate, along with Scikit-Learn to handle machine learning algorithms and related operations.


    Course Learning Outcomes (CLO)
    At the end of the course the students will be able to:

    CLO1     Demonstrate candidate natural language processing techniques for a problem in a specific domain (A3, PLO6)
    CLO2     Formulate text processing techniques for a real-world application (C6, PLO2)
    CLO3     Defend a proposed natural language processing system for a chosen problem (A4, PLO10)


    Course Outline
    [Image: course outline]

    Assessments
    In-course Assessment - 100%
    1. Report - 60%
    2. Demo Presentation - 40%

    References

    Recommended References
    These textbooks are available in the APU eLibrary.
    1. Campesato, O. (2020). Python 3 for machine learning. Mercury Learning & Information. ISBN-13: 9781683924937
    2. Liu, Z., Lin, Y., & Sun, M. (2020). Representation Learning for Natural Language Processing. Springer Singapore. ISBN-13: 9789811555732
    3. Patrick et al. (2020). Natural Language Processing. SAGE Publications Ltd. ISBN-13: 9781529749120
    4. Lane, H., Hapke, H., & Howard, C. (2018). Natural Language Processing in Action: Understanding, analyzing, and generating text with Python. 1st ed. Manning Publications. ISBN-13: 9781617294631
    5. Bird, S., Klein, E., & Loper, E. (2017). Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. 2nd ed. O'Reilly Media. ISBN-13: 9780596516499



  • Introduction to NLP


    Natural language processing (NLP) is a branch of artificial intelligence (AI), within computer science, concerned with giving computers the ability to understand text and spoken words in much the same way human beings can. NLP combines computational linguistics, the rule-based modeling of human language, with statistical, machine learning, and deep learning models.

    Learning Outcomes:

    1. Explain the specific areas of NLP
    2. Explain the difficulties/challenges of NLP

  • Lexical Analysis


    A regular expression is a sequence of characters that specifies a match pattern in text. Word tokenization is the process of splitting a large sample of text into words. Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of reducing them to a common base form correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

    Learning Outcomes:

    1. Explain the functionalities of Regular Expression.
    2. Explain the process of tokenization.
    3. Explain stemming and lemmatization.
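
    The three ideas above can be compared side by side in a minimal sketch that uses only the Python standard library. The function names, the suffix list and the tiny lemma dictionary are illustrative assumptions, not part of NLTK or of the module materials; a real stemmer or lemmatizer is considerably more careful.

```python
import re

def tokenize(text):
    """Word tokenization: a regular expression pulls out runs of letters/digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(token, suffixes=("ing", "ed", "es", "s")):
    """Crude stemming: chop the first matching suffix, keeping at least 3 characters."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Hypothetical, hand-made vocabulary standing in for a real lemma dictionary.
LEMMAS = {"mice": "mouse", "were": "be", "running": "run", "jumped": "jump"}

def lemmatize(token):
    """Lemmatization: look the word up and return its dictionary form if known."""
    return LEMMAS.get(token, token)

tokens = tokenize("The mice were running and the cats jumped.")
stems = [crude_stem(t) for t in tokens]
lemmas = [lemmatize(t) for t in tokens]
```

    Note how the stemmer turns "running" into the non-word "runn", while the dictionary lookup returns the lemma "run"; this is the trade-off between a cheap heuristic and vocabulary-based analysis.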


  • Part-of-speech (POS)


    Part-of-speech (POS) tagging is a process in natural language processing (NLP) where each word in a text is labeled with its corresponding part of speech. This can include nouns, verbs, adjectives, and other grammatical categories.

    Learning Outcomes:

    1. Explain POS tagging in NLP.

  • Parsing


    Parsing, also called syntax analysis or syntactic analysis, is the process of analyzing a string of symbols, whether in natural language, computer languages or data structures, according to the rules of a formal grammar.

    Learning Outcomes:

    1. Explain Parsing in NLP

  • Edit Distance


    In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other.

    Learning Outcomes:

    1. Explain Edit Distance and its process.
    2. Demonstrate Edit Distance
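
    As a demonstration, the sketch below computes the Levenshtein variant of edit distance with dynamic programming. It assumes that insertion, deletion and substitution each cost 1; some textbooks charge 2 for a substitution, which gives different totals. The function name is our own choice.

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions and substitutions (each cost 1)."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

distance = edit_distance("kitten", "sitting")
```

    Here "kitten" becomes "sitting" in three operations: substitute k with s, substitute e with i, and insert g.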

  • Language modeling (LM)


    Language modeling (LM) is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models analyze bodies of text data to provide a basis for their word predictions.

    Learning Outcomes:

    1. Explain the concept of Language modeling (LM).
    2. Demonstrate Language modeling (LM) using NLTK.
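
    Before turning to NLTK, it can help to see the idea in plain Python. The sketch below is a bigram model with maximum likelihood estimates and no smoothing, so any unseen bigram gets probability 0. The function names and the three-sentence toy corpus are illustrative assumptions.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count history words and bigrams, padding each sentence with <s> and </s>."""
    history_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        history_counts.update(tokens[:-1])  # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))
    return history_counts, bigram_counts

def bigram_probability(previous, word, history_counts, bigram_counts):
    """Maximum likelihood estimate of P(word | previous)."""
    if history_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / history_counts[previous]

def sentence_probability(sentence, history_counts, bigram_counts):
    """Probability of a sentence as the product of its bigram probabilities."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    probability = 1.0
    for previous, word in zip(tokens, tokens[1:]):
        probability *= bigram_probability(previous, word, history_counts, bigram_counts)
    return probability

corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
histories, bigrams = train_bigram_model(corpus)
p_am_given_i = bigram_probability("i", "am", histories, bigrams)   # 2 of 3 "i" are followed by "am"
p_sentence = sentence_probability("I am Sam", histories, bigrams)  # 2/3 * 2/3 * 1/2 * 1/2
```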

  • Text Classification


    Text classification is a machine learning technique that assigns a set of predefined categories to open-ended text. Text classifiers can be used to organize, structure, and categorize almost any kind of text.

    Learning Outcomes:

    1. Explain the task of Text Classification.
    2. Explain text classification using Naive Bayes classifier.
    3. Explain the concept of smoothing in Naive Bayes classifier.
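
    The role of smoothing is easiest to see in code. The sketch below is a multinomial Naive Bayes classifier with add-alpha (Laplace) smoothing, written with the standard library only; it is not the Scikit-Learn implementation. The function names and the four training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, alpha=1.0):
    """documents: list of (text, label) pairs. alpha=1.0 is Laplace smoothing."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in documents:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)

    total_docs = sum(label_counts.values())
    log_prior = {c: math.log(n / total_docs) for c, n in label_counts.items()}
    log_likelihood = {}
    for label in label_counts:
        # Adding alpha to every count keeps unseen (word, label) pairs above zero.
        denominator = sum(word_counts[label].values()) + alpha * len(vocabulary)
        log_likelihood[label] = {
            word: math.log((word_counts[label][word] + alpha) / denominator)
            for word in vocabulary
        }
    return log_prior, log_likelihood, vocabulary

def classify(text, log_prior, log_likelihood, vocabulary):
    """Return the label with the highest log posterior score."""
    scores = {}
    for label, prior in log_prior.items():
        score = prior
        for token in text.lower().split():
            if token in vocabulary:  # words never seen in training are ignored
                score += log_likelihood[label][token]
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set.
training = [
    ("great fun film", "pos"),
    ("loved the great acting", "pos"),
    ("boring and predictable plot", "neg"),
    ("no fun at all", "neg"),
]
log_prior, log_likelihood, vocabulary = train_naive_bayes(training)
prediction = classify("great acting", log_prior, log_likelihood, vocabulary)
```

    Without smoothing, "great" would have probability 0 under the "neg" class and its logarithm would be undefined; with alpha = 1 it receives a small but non-zero probability instead.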

  • Sentiment Analysis


    Sentiment analysis, also referred to as opinion mining, is an approach to natural language processing (NLP) that identifies the emotional tone behind a body of text. This is a popular way for organizations to determine and categorize opinions about a product, service or idea.

    Learning Outcomes:

    1. Explain the concept of sentiment analysis.
    2. Perform sentiment analysis using suitable machine learning algorithms

  • Entropy Classifiers


    The maximum entropy (maxent) classifier is a popular text classifier. It chooses the model parameters that maximize the entropy of the predicted category distribution, subject to the constraint that the feature expectations under the model match those observed in the training data.

    Learning Outcomes:

    1. Explain the concepts of Maximum entropy (maxent) classifier.
    2. Explain different types of entropy classifiers.

  • Information Retrieval

    Information retrieval is the science of searching for information in a document, searching for documents themselves, and searching for the metadata that describes data, as well as for databases of texts, images or sounds.

    Learning Outcome:

    1. Explain the concept of Information retrieval
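
    One simple way to search for documents is an inverted index with Boolean AND queries, sketched below. This is only one retrieval technique among many, chosen here for brevity; the function names and the three sample documents are illustrative assumptions.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(query, index):
    """Boolean AND retrieval: return ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical document collection.
documents = {
    1: "natural language processing with python",
    2: "information retrieval and search engines",
    3: "language models for information retrieval",
}
index = build_inverted_index(documents)
hits = search("information retrieval", index)
```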

  • Relation Extraction

    Relation Extraction is the task of predicting attributes and relations for entities in a sentence.

    Learning Outcome:

    1. Explain the concept of Relation Extraction