A language model is a mathematical model that describes a human language as a probability distribution over its vocabulary. To train a deep learning network to model a language, you need to identify the vocabulary and learn its probability distribution. You can't create the model from nothing: you need a dataset for your model to learn from. In this article, you'll learn about datasets used to train language models and how to source common datasets from public repositories.

Let's get started.

Datasets for Training a Language Model
Photo by Dan V. Some rights reserved.

A Good Dataset for Training a Language Model

A good language model should learn correct language usage, free of biases and errors. Unlike programming languages, human languages lack formal grammar and syntax. They evolve continuously, making it impossible to catalog all language variations. Therefore, the model should be trained from a dataset instead of crafted from rules.

Setting up a dataset for language modeling is challenging. You need a large, diverse dataset that represents the language's nuances. At the same time, it must be high quality, presenting correct language usage. Ideally, the dataset should be manually edited and cleaned to remove noise like typos, grammatical errors, and non-language content such as symbols or HTML tags.

Creating such a dataset from scratch is costly, but several high-quality datasets are freely available. Common datasets include:

- Common Crawl. A massive, continuously updated dataset of over 9.5 petabytes with diverse content. It's used by leading models including GPT-3, Llama, and T5. However, since it's sourced from the web, it contains low-quality and duplicate content, along with biases and offensive material. Rigorous cleaning and filtering are required to make it useful.
- C4 (Colossal Clean Crawled Corpus). A 750GB dataset scraped from the web. Unlike Common Crawl, this dataset is pre-cleaned and filtered, making it easier to use.
Still, expect potential biases and errors. The T5 model was trained on this dataset.
- Wikipedia. English content alone is around 19GB. It is massive yet manageable. It's well-curated, structured, and edited to Wikipedia standards. While it covers a broad range of general knowledge with high factual accuracy, its encyclopedic style and tone are very specific. Training on this dataset alone may cause models to overfit to this style.
- WikiText. A dataset derived from verified good and featured Wikipedia articles. Two versions exist: WikiText-2 (2 million words from hundreds of articles) and WikiText-103 (100 million words from 28,000 articles).
- BookCorpus. A few-GB dataset of long-form, content-rich, high-quality book texts. Useful for learning coherent storytelling and long-range dependencies. However, it has known copyright issues and social biases.
- The Pile. An 825GB curated dataset from multiple sources, including BookCorpus. It mixes different text genres (books, articles, source code, and academic papers), providing broad topical coverage designed for multidisciplinary reasoning. However, this diversity results in variable quality, duplicate content, and inconsistent writing styles.

Getting the Datasets

You can search for these datasets online and download them as compressed files. However, you'll need to understand each dataset's format and write custom code to read them. Alternatively, search for datasets in the Hugging Face repository at https://huggingface.co/datasets. This repository provides a Python library that lets you download and read datasets in real time using a standardized format.
Hugging Face Datasets Repository

If you haven't already, install the Hugging Face datasets library:

pip install datasets

Let's download the WikiText-2 dataset from Hugging Face, one of the smallest datasets suitable for building a language model:

import random
from datasets import load_dataset

# load the train split so len() reports the number of examples
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
print(f"Size of the dataset: {len(dataset)}")

# print a few samples, skipping empty lines and section headings
n = 5
while n > 0:
    idx = random.randint(0, len(dataset) - 1)
    text = dataset[idx]["text"].strip()
    if text and not text.startswith("="):
        print(f"{idx}: {text}")
        n -= 1

The output may look like this:

Size of the dataset: 36718
31776: The Missouri 's headwaters above Three Forks extend much farther upstream than ...
29504: Regional variants of the word Allah occur in both pagan and Christian pre @-@ ...
19866: Pokiri ( English : Rogue ) is a 2006 Indian Telugu @-@ language action film , ...
27397: The first flour mill in Minnesota was built in 1823 at Fort Snelling as a ...
10523: The music industry took note of Carey 's success . She won two awards at the ...

When you run this code for the first time, load_dataset() downloads the dataset to your local machine.
Ensure you have enough disk space, especially for large datasets. By default, datasets are downloaded to ~/.cache/huggingface/datasets.

All Hugging Face datasets follow a standard format. The dataset object is an iterable, with each item as a dictionary. For language model training, datasets typically contain text strings. In this dataset, text is stored under the "text" key. The code above samples a few elements from the dataset. You'll see plain text strings of varying lengths.

Post-Processing the Datasets

Before training a language model, you may want to post-process the dataset to clean the data. This includes reformatting text (clipping long strings, replacing multiple spaces with single spaces), removing non-language content (HTML tags, symbols), and removing unwanted characters (extra spaces around punctuation). The specific processing depends on the dataset and how you want to present text to the model.
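As a minimal sketch of such a cleaning pass (the regexes below are illustrative, not a complete cleaner; a real pipeline would be tailored to the dataset at hand), you could strip HTML tags, collapse whitespace, and tidy spacing around punctuation like this:

```python
import re

def clean_text(text: str) -> str:
    """A minimal, illustrative cleaning pass for raw dataset strings."""
    text = re.sub(r"<[^>]+>", " ", text)          # drop HTML tags
    text = re.sub(r"\s+", " ", text)              # collapse runs of whitespace
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)  # remove spaces before punctuation
    return text.strip()

samples = [
    "<p>Hello ,   world !</p>",
    "Multiple    spaces\tand\nnewlines .",
]
print([clean_text(s) for s in samples])
# ['Hello, world!', 'Multiple spaces and newlines.']
```

You would typically apply such a function over the whole dataset (for example with the datasets library's map() method) before tokenization.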
VIRAL: Apple's Limited-Edition 'iPhone Pocket' Accessory Brutally Mocked Over Rs 20,400 Price Tag | Viral News
Apple recently announced the "iPhone Pocket," a limited-edition accessory developed in cooperation with ISSEY MIYAKE, which quickly set off a storm of social media ridicule over its high price and unusual design. The accessory, described by the company as a "singular 3D-knitted construction," comes in two variants priced between around Rs 13,310 and Rs 20,400. Many social media users have compared the pricey accessory to a plain, cut-up sock, or at best a fabric crossbody bag.

Social Media Backlash/Memes

The prices, varying from Rs 13,310 for a short strap to Rs 20,400 for a long strap, were the source of much ridicule online. One user exclaimed, "The iPhone pocket everybody. Rs 20,400 for a cut-up sock. Apple people will pay anything for anything as long as it's Apple." Another user shared a picture and wrote, "This sock or 'iPhone Pocket' is for Rs 20,400," while another commented wryly on the price: "iPhone Pocket: Rs 13,310. Your gym sock: Rs 12."

"TWO hundred and thirty dollars. This feels like a litmus test for people who will buy/defend anything Apple releases," tech reviewer Marques Brownlee (@MKBHD) posted on November 11, 2025.

The reaction was everywhere, with many memes popping up; many featured pictures of the iPhone Pocket photoshopped to include googly eyes.

Design And Pricing Details

The iPhone Pocket, a collaboration with ISSEY MIYAKE, takes its inspiration from "the concept of a piece of cloth" and is crafted in Japan.

- Construction: The accessory is described as a "singular 3D-knitted construction designed to fit any iPhone."
- Short Strap: Costs Rs 13,310 (approx.) and comes in eight colours including lemon, mandarin, purple, and black.
- Long Strap: Costs Rs 20,400 and is available in three colours: sapphire, cinnamon, and black.
Design Philosophy: Yoshiyuki Miyamae, design director of MIYAKE DESIGN STUDIO, described the concept: "iPhone Pocket explores the concept of 'the joy of wearing iPhone in your own way.' The simplicity of its design echoes what we practice at ISSEY MIYAKE — the idea of leaving things less defined to allow for possibilities and personal interpretation."

Availability

The special-edition iPhone Pocket will be available to purchase from Friday, November 14, in select Apple Store locations in the US, and on apple.com in France, Greater China, Italy, Japan, Singapore, South Korea, the UK, and the US.
The Complete Guide to Model Context Protocol
In this article, you will learn what the Model Context Protocol (MCP) is, why it exists, and how it standardizes connecting language models to external data and tools. Topics we will cover include:

- The integration problem MCP is designed to solve.
- MCP's client-server architecture and communication model.
- The core primitives (resources, prompts, and tools) and how they work together.

Let's not waste any more time.

The Complete Guide to Model Context Protocol
Image by Editor

Introducing Model Context Protocol

Language models can generate text and reason impressively, yet they remain isolated by default. Out of the box, they can't access your files, query databases, or call APIs without additional integration work. Each new data source means more custom code, more maintenance burden, and more fragmentation.

Model Context Protocol (MCP) solves this by providing an open-source standard for connecting language models to external systems. Instead of building one-off integrations for every data source, MCP provides a shared protocol that lets models communicate with tools, APIs, and data.

This article takes a closer look at what MCP is, why it matters, and how it changes the way we connect language models to real-world systems. Here's what we'll cover:

- The core problem MCP is designed to solve
- An overview of MCP's architecture
- The three core primitives: tools, prompts, and resources
- How the protocol flow works in practice
- When to use MCP (and when not to)

By the end, you'll have a solid understanding of how MCP fits into the modern AI stack and how to decide if it's right for your projects.

The Problem That Model Context Protocol Solves

Before MCP, integrating AI into enterprise systems was messy and inefficient, because tying language models to real systems quickly runs into a scalability problem. Each new model and each new data source needs custom integration code — connectors, adapters, and API bridges — that doesn't generalize.
If you have M models and N data sources, you end up maintaining M × N unique integrations. Every new model or data source multiplies the complexity, adding more maintenance overhead.

MCP solves this by introducing a shared standard for communication between models and external resources. Instead of each model integrating directly with every data source, both models and resources speak a common protocol. This turns an M × N problem into an M + N one: each model implements MCP once, each resource implements MCP once, and everything can interoperate smoothly.

From M × N integrations to M + N with MCP. Image by Author.

In short, MCP decouples language models from the specifics of external integrations. In doing so, it enables scalable, maintainable, and reusable connections that link AI systems to real-world data and functionality.

Understanding MCP's Architecture

MCP implements a client-server architecture with specific terminology that's important to understand.

The Three Key Components

MCP Hosts are applications that want to use MCP capabilities. These are typically LLM applications like Claude Desktop, IDEs with AI features, or custom applications you've built. Hosts contain or interface with language models and initiate connections to MCP servers.

MCP Clients are the protocol clients created and managed by the host application. When a host wants to connect to an MCP server, it creates a client instance to handle that specific connection. A single host application can maintain multiple clients, each connecting to a different server. The client handles the protocol-level communication, managing requests and responses according to the MCP specification.

MCP Servers expose specific capabilities to clients: database access, filesystem operations, API integrations, or computational tools. Servers implement the server side of the protocol, responding to client requests and providing resources, tools, and prompts.
MCP Architecture. Image by Author.

This architecture provides a clean separation of concerns:

- Hosts focus on orchestrating AI workflows without concerning themselves with data source specifics
- Servers expose capabilities without knowing how models will use them
- The protocol handles communication details transparently

A single host can connect to multiple servers simultaneously through separate clients. For example, an AI assistant might maintain connections to filesystem, database, GitHub, and Slack servers concurrently. The host presents the model with a unified capability set, abstracting away whether data comes from local files or remote APIs.

Communication Protocol

MCP uses JSON-RPC 2.0 for message exchange. This lightweight remote procedure call protocol provides a structured request/response format and is simple to inspect and debug. MCP supports two transport mechanisms:

- stdio (Standard Input/Output): For local server processes running on the same machine. The host spawns the server process and communicates through its standard streams.
- HTTP: For networked communication. Uses HTTP POST for requests and, optionally, Server-Sent Events for streaming.

This flexibility lets MCP servers run locally or remotely while keeping communication consistent.

The Three Core Primitives

MCP relies on three core primitives that servers expose. They provide enough structure to enable complex interactions without limiting flexibility.

Resources

Resources represent any data a model can read. This includes file contents, database records, API responses, live sensor data, or cached computations. Each resource uses a URI scheme, which makes it easy to identify and access different types of data. Here are some examples:

- Filesystem: file:///home/user/projects/api/README.md
- Database: postgres://localhost/customers/table/users
- Weather API: weather://current/san-francisco

The URI scheme identifies the resource type. The rest of the path points to the specific data.
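To make the wire format concrete, here is a sketch of what reading one of these resources over JSON-RPC 2.0 could look like. The envelope fields (jsonrpc, id, method, params) follow JSON-RPC 2.0; the result payload shape is illustrative rather than quoted verbatim from the MCP specification:

```python
import json

# A client request asking the server to read one resource (illustrative)
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "resources/read",
    "params": {"uri": "file:///home/user/projects/api/README.md"},
}

# A matching server response; the exact payload shape is illustrative
response = {
    "jsonrpc": "2.0",
    "id": 1,  # echoes the request id so the client can pair them
    "result": {
        "contents": [
            {
                "uri": "file:///home/user/projects/api/README.md",
                "mimeType": "text/markdown",
                "text": "# API Project\n...",
            }
        ]
    },
}

# JSON-RPC messages are plain JSON on the wire (over stdio or HTTP)
wire = json.dumps(request)
print(wire)
```

Because the messages are plain JSON, the same exchange works unchanged over either transport, which is exactly the consistency the protocol is designed to provide.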
Resources can be static, such as files with fixed URIs, or dynamic, like the latest entries in a continuously updating log. Servers list available resources through the resources/list endpoint, and hosts retrieve them via resources/read. Each resource includes metadata, such as MIME type, which helps hosts handle content correctly — text/markdown is processed differently than application/json — and descriptions provide context that helps both users and models understand the resource.

Prompts

Prompts provide reusable templates for common tasks. They encode expert knowledge and simplify complex instructions. For example, a database MCP server can offer prompts like analyze-schema, debug-slow-query, or generate-migration. Each prompt includes the context necessary for the task. Prompts accept arguments: an analyze-table prompt can take a table name and include schema details, indexes, and foreign key relationships.
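As a rough sketch of the idea (the prompt name, template text, and helper function below are hypothetical illustrations, not part of the MCP specification), a parameterized prompt like analyze-table can be modeled as a template that expands its arguments into model-ready instructions:

```python
# Hypothetical sketch of a parameterized, MCP-style prompt template.
# The prompt name and template text are illustrative, not from the spec.
PROMPTS = {
    "analyze-table": (
        "You are a database expert. Analyze the table '{table}'.\n"
        "Consider its schema, indexes, and foreign key relationships,\n"
        "and report potential performance or integrity issues."
    ),
}

def render_prompt(name: str, **arguments: str) -> str:
    """Expand a named prompt template with the supplied arguments."""
    template = PROMPTS[name]
    return template.format(**arguments)

print(render_prompt("analyze-table", table="users"))
```

A real MCP server would additionally advertise the prompt's name, description, and argument list so hosts can discover it, but the core mechanic is this kind of argument-to-template expansion.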
Oppo Reno 15 Series Set To Launch In China On Nov 17: Expected Models, Specs, Features | Technology News
Oppo Reno 15 Series: Oppo has officially announced the launch date of its upcoming Reno 15 smartphone series in China. The lineup, which will include the Reno 15, Reno 15 Pro, and a new Reno 15 Mini, is set to debut on November 17 at 7pm local time (4:30pm IST). The launch will coincide with the brand's Double Eleven (11.11) shopping festival celebrations in the country.

Three Models in Lineup

The Reno 15 series will include three models — the standard Reno 15, the Reno 15 Pro, and the smaller Reno 15 Mini. Oppo has already listed the Reno 15 and Reno 15 Pro on its official e-shop, and pre-orders are currently open ahead of the launch event.

Colour Options and Storage Variants

According to the official listing, the Oppo Reno 15 will come in three colour options — Starlight Bow, Aurora Blue, and Canele Brown — and five RAM and storage configurations: 12GB + 256GB, 12GB + 512GB, 16GB + 256GB, 16GB + 512GB, and 16GB + 1TB.

The Oppo Reno 15 Pro, on the other hand, will be offered in Starlight Bow, Canele Brown, and Honey Gold colour options, with four RAM and storage configurations: 12GB + 256GB, 12GB + 512GB, 16GB + 512GB, and 16GB + 1TB.

Expected Display Sizes

According to reports, the Reno 15 Pro will feature a 6.78-inch 1.5K flat display, while the compact Reno 15 Mini could come with a 6.32-inch 1.5K screen. The standard Reno 15 is expected to sit between the two, with a 6.59-inch display.

Camera Specifications

The Reno 15 Pro and Reno 15 Mini are rumoured to feature triple rear camera setups. Both models may include a 200-megapixel Samsung ISOCELL HP5 primary sensor, a 50-megapixel ultrawide camera, and a 50-megapixel periscope lens. On the front, all models are expected to sport 50-megapixel selfie cameras for high-quality front photography.
The Oppo Reno 15 series launch event will take place on November 17, and the devices are already listed for pre-order in China.
Apple iOS 26.1 Is Here: Major Fixes, New Features And More – Check What's New | Technology News
Apple iOS 26.1 Details: Apple has rolled out iOS 26.1, the first major update since the launch of iOS 26 in September. The new update focuses on enhancing the overall experience with small yet meaningful upgrades, design improvements, and more intuitive controls. It's now available for all iOS 26-compatible iPhones and aims to address several issues users have pointed out, particularly around the Liquid Glass interface and the Lock Screen camera shortcut.

Major New Features And Fixes

One of the biggest highlights is the new Liquid Glass Transparency Toggle. There's a new option under Display and Brightness settings to switch between "Clear" and "Tinted" modes. The tinted option adds more opacity and contrast, making menus and buttons easier to read. It's a much-needed fix, as many users complained about poor visibility in iOS 26.

Another useful change is the ability to turn off the Lock Screen camera gesture. This stops the camera from accidentally opening when the phone is in your pocket or bag. You can disable it in the Camera settings without turning off the camera entirely. Apple has also added a "Slide to Stop" gesture for alarms and timers.

The update adds more language support for Apple Intelligence, which now understands Danish, Dutch, Turkish, and Vietnamese. AirPods Live Translation also supports new languages like Japanese, Korean, and Chinese, making conversations smoother for AirPods Pro 2, Pro 3, and AirPods 4 users.

Apple Music gets new gesture-based controls: you can now swipe left or right on the mini-player to skip tracks. The new AutoMix feature also works with AirPlay, allowing seamless transitions even on external speakers.

Visually, the interface looks neater. The Settings app and Home Screen folders now have left-aligned headers, and the Phone keypad uses the Liquid Glass effect for a modern look.
Safari gets a slightly wider tab bar, and the Photos app offers a redesigned video slider and editing tools. In terms of security, Apple has replaced the Rapid Security Response feature with a new automatic background security update toggle. This allows the devices to receive security patches without requiring a full system update.
Families Sue OpenAI Over Alleged Suicides, Psychological Harm Linked To ChatGPT: Report | Technology News
ChatGPT maker OpenAI is facing several new lawsuits from families who say the company released its GPT-4o model too early. They claim the model may have contributed to suicides and mental health problems, according to reports.

OpenAI, based in the US, launched GPT-4o in May 2024, making it the default model for all users. In August, it introduced GPT-5 as its next version. According to TechCrunch, GPT-4o reportedly had issues with being "too agreeable" or "overly supportive," even when users expressed harmful thoughts. The report said that four lawsuits blame ChatGPT for its alleged role in family members' suicides, while three others claim the chatbot encouraged harmful delusions that led some people to require psychiatric treatment.

The lawsuits also claim that OpenAI rushed safety testing to beat Google's Gemini to market. OpenAI has yet to comment on the report. Recent legal filings allege that ChatGPT can encourage suicidal people to act on their plans and inspire dangerous delusions. "OpenAI recently released data stating that over one million people talk to ChatGPT about suicide weekly," the report mentioned.

In a recent blog post, OpenAI said it worked with more than 170 mental health experts to help ChatGPT more reliably recognize signs of distress, respond with care, and guide people toward real-world support — reducing responses that fall short of its desired behavior by 65–80 percent. "We believe ChatGPT can provide a supportive space for people to process what they're feeling and guide them to reach out to friends, family, or a mental health professional when appropriate," it noted.
"Going forward, in addition to our longstanding baseline safety metrics for suicide and self-harm, we are adding emotional reliance and non-suicidal mental health emergencies to our standard set of baseline safety testing for future model releases," OpenAI added. (With inputs from IANS.)
How Much Does YouTube Pay Per 1,000 Views? Revenue Estimates From A YouTube Earning Calculator Will Leave You Shocked | Technology News
YouTube has grown into one of the world's largest and most profitable platforms for digital creators, offering people the chance to turn their creativity into a full-time career. Every day, millions of videos are uploaded across categories like entertainment, technology, education, gaming, and lifestyle. With such massive reach, YouTube has become a key source of income for influencers, vloggers, and businesses.

However, how much YouTube pays for videos or views depends on various factors. Many new YouTubers wonder how much the platform actually pays per 1,000 views, as earnings can vary widely. The amount depends on factors like video content type, viewer location, ad engagement, and the overall demand from advertisers within that niche.

How YouTube Earnings Work

YouTube pays creators through its YouTube Partner Program (YPP). To join the program, a channel must have at least 1,000 subscribers and 4,000 valid watch hours in the past 12 months. Once approved, creators can start earning money through ads that appear on their videos.

The payment is calculated based on CPM (Cost Per Mille), the amount advertisers pay per 1,000 ad impressions. However, creators don't receive the full CPM amount: YouTube keeps about 45% of the ad revenue, while the remaining 55% goes to the creator.

Average YouTube Pay per 1,000 Views

The amount YouTube pays per 1,000 views varies widely depending on several factors such as country, content type, audience demographics, and engagement. On average, creators can earn between $0.50 and $5 per 1,000 views.
- Entertainment and Vlogs: $0.50 – $2 per 1,000 views
- Tech and Gadgets: $2 – $4 per 1,000 views
- Finance and Business: $5 – $10 per 1,000 views
- Education and Tutorials: $1 – $4 per 1,000 views

Channels focusing on financial advice, business tips, or digital marketing earn more because advertisers in those categories pay higher rates. In contrast, general entertainment channels usually have lower ad rates due to broad audiences and less targeted ads.

YouTube Earning Calculator

A YouTube Earning Calculator is an online tool that helps estimate how much a creator might earn from their videos. Users simply enter the number of views, estimated CPM, and engagement rate to get an approximate earning figure. For example, if a channel gets 100,000 views with a CPM of $3, the total revenue would be around $300 before YouTube's share. After YouTube takes its 45% cut, the creator would earn approximately $165. While this tool gives a helpful estimate, the actual amount can differ depending on ad availability, viewer location, and the percentage of viewers who watch ads instead of skipping them.

Other Ways Creators Earn on YouTube

Apart from ad revenue, many creators earn money through:

- Channel memberships
- Super Chat and Super Stickers (during live streams)
- Brand sponsorships and collaborations
- Affiliate marketing
- Merchandise sales

YouTube does not pay a fixed amount per 1,000 views. Earnings depend on the content category, viewer engagement, and location. A YouTube Earning Calculator can help estimate potential income, but real earnings vary from channel to channel. For creators, focusing on quality content and building an engaged audience generates more revenue over time.
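The arithmetic behind such a calculator is straightforward. Here is a minimal sketch (the 55% creator share matches the split quoted above; all CPM figures are estimates, not guaranteed rates):

```python
def estimate_earnings(views: int, cpm: float, creator_share: float = 0.55) -> float:
    """Estimate creator revenue: (views / 1000) * CPM * creator's share."""
    gross = (views / 1000) * cpm
    return round(gross * creator_share, 2)

# The article's example: 100,000 views at a $3 CPM is $300 gross,
# leaving about $165 for the creator after YouTube's 45% cut.
print(estimate_earnings(100_000, 3.0))  # 165.0
```

Real payouts deviate from this estimate because only monetized, non-skipped ad impressions count toward CPM.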
iPhone 17e, iPhone 18 And More: Apple Is Likely To Launch THESE Products Next Year | Technology News
Apple 2026 Expected Product Lineup: Apple is preparing for one of its busiest years ever in 2026. According to media reports, the company is expected to launch at least 15 new products across its popular device lineup next year. This includes new iPhones, iPads, Macs, Apple Watches, and even smart home gadgets.

Apple will reportedly introduce a new iPhone 17e, a more affordable model in the iPhone 17 family. Additionally, Apple is expected to launch the 12th-generation iPad powered by the A18 chip and a new iPad Air running on the M4 chip. Both models are expected to bring faster performance and better battery efficiency.

Mac fans also have plenty to look forward to. According to reports, Apple is planning a new MacBook Air with the M5 chip, while the MacBook Pro lineup will feature the more powerful M5 Pro and M5 Max versions. The company may also launch new external displays, continuing to expand its professional-grade screen lineup.

Around March or April 2026, Apple is expected to roll out a revamped Siri with AI-powered upgrades. Later in the year, Apple may launch the Apple Watch Series 12 and the iPhone 18 series. The iPhone 18 Pro models are expected to use Apple's new C1 modem, marking a shift away from Qualcomm chips. There are also growing rumours about Apple's first foldable iPhone.

The reports suggest that Apple is also planning to refresh several other devices, including smart home security products, a Mac mini with an M5 chip, an updated Mac Studio, and an iPad mini with an OLED display.
GTA 6 Delayed Again — Fans Disappointed As Launch Pushed To November 2026 | Technology News
GTA 6 Release Date: Rockstar Games has officially confirmed that Grand Theft Auto 6 (GTA 6) will not be arriving as early as fans had hoped. The highly anticipated open-world game has been delayed by six months, with its new release date set for November 19, 2026.

Originally scheduled to launch on May 26, 2026, GTA 6's delay had already been the subject of online speculation and leaks. Rockstar made the announcement early Friday, confirming what many gamers had feared — another setback in the wait for one of the most anticipated titles in gaming history.

In a statement posted on X (@RockstarGames) on November 6, 2025, Rockstar Games apologised to fans for the delay and explained that the extra development time is needed to ensure the game meets the studio's high standards. "Grand Theft Auto VI will now release on Thursday, November 19, 2026. We are sorry for adding additional time to what we realize has been a long wait, but these extra months will allow us to finish the game with the level of polish you have come to expect and deserve," the company said.

GTA 6, the next major entry in the blockbuster franchise, will be released on PlayStation 5 and Xbox Series X|S consoles. While Rockstar has not confirmed a PC release yet, reports suggest it could arrive several months after the console version, similar to past releases.

The game is expected to feature a massive open world inspired by a fictional version of Miami (Vice City) and will reportedly include two main protagonists — a male and a female character, a first for the series.
Despite the disappointment among fans, many users on the internet believe that Rockstar's decision to delay the game could lead to a more polished and immersive experience. After all, the company's previous titles GTA V and Red Dead Redemption 2 were both delayed before release, and both went on to become some of the most successful games ever made.
7 Advanced Feature Engineering Tricks for Text Data Using LLM Embeddings
7 Advanced Feature Engineering Tricks for Text Data Using LLM Embeddings
Image by Editor

Introduction

Large language models (LLMs) are not only good at understanding and generating text; they can also turn raw text into numerical representations called embeddings. These embeddings are useful for incorporating additional information into traditional predictive machine learning models—such as those used in scikit-learn—to improve downstream performance.

This article presents seven advanced Python examples of feature engineering tricks that add extra value to text data by leveraging LLM-generated embeddings, thereby enhancing the accuracy and robustness of downstream machine learning models that rely on text, in applications such as sentiment analysis, topic classification, document clustering, and semantic similarity detection.

Common setup for all examples

Unless stated otherwise, the seven example tricks below make use of this common setup. We rely on Sentence Transformers for embeddings and scikit-learn for modeling utilities.

!pip install sentence-transformers scikit-learn -q

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a lightweight LLM embedding model; builds 384-dimensional embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")

1. Combining TF-IDF and Embedding Features

The first example shows how to jointly extract—given a source text dataset like fetch_20newsgroups—both TF-IDF and LLM-generated sentence-embedding features. We then combine these feature types to train a logistic regression model that classifies news texts based on the combined features, often boosting accuracy by capturing both lexical and semantic information.
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Loading data
data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
texts, y = data.data[:500], data.target[:500]

# Extracting features of two broad types
tfidf = TfidfVectorizer(max_features=300).fit_transform(texts).toarray()
emb = model.encode(texts, show_progress_bar=False)

# Combining features and training the ML model
X = np.hstack([tfidf, StandardScaler().fit_transform(emb)])
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Note: this is accuracy on the training data; use a held-out split for a fair estimate
print("Accuracy:", clf.score(X, y))
```

## 2. Topic-Aware Embedding Clusters

This trick takes a few sample text sequences, generates embeddings using the preloaded language model, applies k-means clustering on these embeddings to assign topics, and then concatenates each embedding with a one-hot encoding of its cluster identifier (its "topic class") to build a new feature representation. It is a useful strategy for creating compact topic meta-features.
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

texts = ["Tokyo Tower is a popular landmark.",
         "Sushi is a traditional Japanese dish.",
         "Mount Fuji is a famous volcano in Japan.",
         "Cherry blossoms bloom in the spring in Japan."]
emb = model.encode(texts)

# Cluster the embeddings into topics, then one-hot encode the cluster labels
topics = KMeans(n_clusters=2, n_init="auto", random_state=42).fit_predict(emb)
topic_ohe = OneHotEncoder(sparse_output=False).fit_transform(topics.reshape(-1, 1))

X = np.hstack([emb, topic_ohe])
print(X.shape)
```

## 3. Semantic Anchor Similarity Features

This simple strategy computes each text's similarity to a small set of fixed "anchor" (or reference) sentences used as compact semantic descriptors — essentially, semantic landmarks. Each column of the resulting similarity-feature matrix holds the similarity of the text to one anchor. The main value lies in allowing the model to learn relationships between a text's similarity to key concepts and a target variable, which is useful for text classification models.
```python
from sklearn.metrics.pairwise import cosine_similarity

anchors = ["space mission", "car performance", "politics"]
anchor_emb = model.encode(anchors)

texts = ["The rocket launch was successful.",
         "The car handled well on the track."]
emb = model.encode(texts)

# One similarity column per anchor
sim_features = cosine_similarity(emb, anchor_emb)
print(sim_features)
```

## 4. Meta-Feature Stacking via an Auxiliary Sentiment Classifier

For text associated with labels such as sentiments, the following feature-engineering technique adds extra value. A meta-feature is built from the prediction probability returned by an auxiliary classifier trained on the embeddings. Stacking this meta-feature with the original embeddings yields an augmented feature set that can improve downstream performance by exposing potentially more discriminative information than the raw embeddings alone.
A slight additional setup is needed for this example:

```python
!pip install sentence-transformers scikit-learn -q

from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

# Small dataset containing texts and sentiment labels
texts = ["I love this!", "This is terrible.", "Amazing quality.", "Not good at all."]
y = np.array([1, 0, 1, 0])

# Obtain embeddings from the embedder LLM
emb = embedder.encode(texts, show_progress_bar=False)

# Train an auxiliary classifier on the embeddings
X_train, X_test, y_train, y_test = train_test_split(
    emb, y, test_size=0.5, random_state=42, stratify=y
)
meta_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Leverage the auxiliary model's predicted probability as a meta-feature
meta_feature = meta_clf.predict_proba(emb)[:, 1].reshape(-1, 1)  # probability of positive class

# Augment the original embeddings with the meta-feature
# (do not forget to scale again for consistency)
scaler = StandardScaler()
emb_scaled = scaler.fit_transform(emb)
X_aug = np.hstack([emb_scaled, meta_feature])

print("emb shape:", emb.shape)
print("meta_feature shape:", meta_feature.shape)
print("augmented shape:", X_aug.shape)
print("meta clf accuracy on test slice:", meta_clf.score(X_test, y_test))