The Ministry of Electronics and Information Technology (MeitY) has warned the Elon Musk-owned social media platform X (formerly Twitter) to take immediate steps to disable all "obscene, nude, indecent, and sexually explicit" content created on the platform by the AI-powered tool Grok. The government has given the platform 72 hours to submit its Action Taken Report (ATR), failing which it will face legal action and lose its "safe harbour" protection under Section 79 of the Information Technology Act.

Grok Artificial Intelligence Misused To Target Women

The warning follows reports of misuse of the Grok AI tool. A four-page letter to the Chief Compliance Officer of X states that users are creating fake accounts to host derogatory images and videos of women. The letter notes that the tool is being prompted to "minimise clothing" and thereby sexualise women in pictures, which constitutes a grave disregard for their dignity and privacy.

"I would take this opportunity to thank Hon IT Minister for promptly taking note of my letter and for issuing a letter to X platform in the regard of AI led grok generating problematic content of women based on prompts that disrespect woman's dignity and violates their consent,…" Shiv Sena (UBT) MP Priyanka Chaturvedi posted on X on January 2, 2026.

Statutory Lapses And Legal Warnings

The ministry pointed to X's failure to meet its statutory due-diligence obligations under the Information Technology Act, 2000, and the IT Rules, 2021.

Safe Harbour in Jeopardy: Non-compliance could cost X its safe harbour status, leaving it liable for all third-party content published on its service.

Broad Legal Action: The government warned of legal action under the Bharatiya Nyaya Sanhita (BNS), the Indecent Representation of Women Act, and the POCSO Act (where children are involved).

Holistic Security Review Required

Beyond the initial takedowns, the Centre has asked X to conduct an urgent technical and governance review of Grok, covering:

Architecture Review: Examination of Grok's design and structure.

Prompt Filtering: Strengthening safeguards against the generation of offensive or illegal synthetic media.

Accountability: Disciplinary action, including permanent suspension, against users who misuse the AI tool.

Evidence Preservation: Blocking access to illegal material without "vitiating the evidence" that may be relevant to possible criminal proceedings.

Wider Crackdown On Digital Obscenity

The action was preceded by a letter from Rajya Sabha MP Priyanka Chaturvedi to IT Minister Ashwini Vaishnaw, flagging the growing trend of "digital undressing" of women on the platform. "There is an AI tool on the platform X, previously known as Twitter, which is being misused. When women share photographs on social media, especially on X, people are prompting this AI tool to digitally disrobe them,…" Chaturvedi said in a video statement carried by PTI.

IT Minister Vaishnaw reiterated on Friday that social networking sites bear responsibility for content on their platforms and that "intervention" is necessary to give all users a trustworthy internet.
Train Your Large Model on Multiple GPUs with Fully Sharded Data Parallelism
import dataclasses
import functools
import os

import datasets
import tokenizers
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim.lr_scheduler as lr_scheduler
import tqdm
from torch import Tensor
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from torch.distributed.checkpoint import load, save
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    get_state_dict,
    set_state_dict,
)
from torch.distributed.fsdp import (
    CPUOffloadPolicy,
    FSDPModule,
    MixedPrecisionPolicy,
    fully_shard,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.utils.data.distributed import DistributedSampler


# Build the model
@dataclasses.dataclass
class LlamaConfig:
    """Define Llama model hyperparameters."""
    vocab_size: int = 50000               # Size of the tokenizer vocabulary
    max_position_embeddings: int = 2048   # Maximum sequence length
    hidden_size: int = 768                # Dimension of hidden layers
    intermediate_size: int = 4*768        # Dimension of MLP's hidden layer
    num_hidden_layers: int = 12           # Number of transformer layers
    num_attention_heads: int = 12         # Number of attention heads
    num_key_value_heads: int = 3          # Number of key-value heads for GQA


class RotaryPositionEncoding(nn.Module):
    """Rotary position encoding."""

    def __init__(self, dim: int, max_position_embeddings: int) -> None:
        """Initialize the RotaryPositionEncoding module.

        Args:
            dim: The hidden dimension of the input tensor to which RoPE is applied
            max_position_embeddings: The maximum sequence length of the input tensor
        """
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        # compute a matrix of n\theta_i
        N = 10_000.0
        inv_freq = 1.0 / (N ** (torch.arange(0, dim, 2) / dim))
        inv_freq = torch.cat((inv_freq, inv_freq), dim=-1)
        position = torch.arange(max_position_embeddings)
        sinusoid_inp = torch.outer(position, inv_freq)
        # save cosine and sine matrices as buffers, not parameters
        self.register_buffer("cos", sinusoid_inp.cos())
        self.register_buffer("sin", sinusoid_inp.sin())

    def forward(self, x: Tensor) -> Tensor:
        """Apply RoPE to tensor x.

        Args:
            x: Input tensor of shape (batch_size, seq_length, num_heads, head_dim)

        Returns:
            Output tensor of shape (batch_size, seq_length, num_heads, head_dim)
        """
        batch_size, seq_len, num_heads, head_dim = x.shape
        device = x.device
        dtype = x.dtype
        # transform the cosine and sine matrices to 4D tensor and the same dtype as x
        cos = self.cos.to(device, dtype)[:seq_len].view(1, seq_len, 1, -1)
        sin = self.sin.to(device, dtype)[:seq_len].view(1, seq_len, 1, -1)
        # apply RoPE to x
        x1, x2 = x.chunk(2, dim=-1)
        rotated = torch.cat((-x2, x1), dim=-1)
        output = (x * cos) + (rotated * sin)
        return output


class LlamaAttention(nn.Module):
    """Grouped-query attention with rotary embeddings."""

    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads
        self.num_kv_heads = config.num_key_value_heads  # GQA: H_kv < H_q
        # hidden_size must be divisible by num_heads
        assert (self.head_dim * self.num_heads) == self.hidden_size
        # Linear layers for Q, K, V projections
        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(self.hidden_size, self.num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(self.hidden_size, self.num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)

    def reset_parameters(self):
        self.q_proj.reset_parameters()
        self.k_proj.reset_parameters()
        self.v_proj.reset_parameters()
        self.o_proj.reset_parameters()

    def forward(self, hidden_states: Tensor, rope: RotaryPositionEncoding, attn_mask: Tensor) -> Tensor:
        bs, seq_len, dim = hidden_states.size()
        # Project inputs to Q, K, V
        query_states = self.q_proj(hidden_states).view(bs, seq_len, self.num_heads, self.head_dim)
        key_states = self.k_proj(hidden_states).view(bs, seq_len, self.num_kv_heads, self.head_dim)
        value_states = self.v_proj(hidden_states).view(bs, seq_len, self.num_kv_heads, self.head_dim)
        # Apply rotary position embeddings
        query_states = rope(query_states)
        key_states = rope(key_states)
        # Transpose tensors from BSHD to BHSD dimension for scaled_dot_product_attention
        query_states = query_states.transpose(1, 2)
        key_states = key_states.transpose(1, 2)
        value_states = value_states.transpose(1, 2)
        # Use PyTorch's optimized attention implementation
        # setting is_causal=True is incompatible with setting explicit attention mask
        attn_output = F.scaled_dot_product_attention(
            query_states, key_states, value_states,
            attn_mask=attn_mask,
            dropout_p=0.0,
            enable_gqa=True,
        )
        # Transpose output tensor from BHSD to BSHD dimension, reshape to 3D, and then project output
        attn_output = attn_output.transpose(1, 2).reshape(bs, seq_len, self.hidden_size)
        attn_output = self.o_proj(attn_output)
        return attn_output


class LlamaMLP(nn.Module):
    """Feed-forward network with SwiGLU activation."""

    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        # Two parallel projections for SwiGLU
        self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.up_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.act_fn = F.silu  # SwiGLU activation function
        # Project back to hidden size
        self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)

    def reset_parameters(self):
        self.gate_proj.reset_parameters()
        self.up_proj.reset_parameters()
        self.down_proj.reset_parameters()

    def forward(self, x: Tensor) -> Tensor:
        # SwiGLU activation: multiply gate and up-projected inputs
        gate = self.act_fn(self.gate_proj(x))
        up = self.up_proj(x)
        return self.down_proj(gate * up)


class LlamaDecoderLayer(nn.Module):
    """Single transformer layer for a Llama model."""

    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.input_layernorm = nn.RMSNorm(config.hidden_size, eps=1e-5)
        self.self_attn = LlamaAttention(config)
        self.post_attention_layernorm = nn.RMSNorm(config.hidden_size, eps=1e-5)
        self.mlp = LlamaMLP(config)

    def reset_parameters(self):
        self.input_layernorm.reset_parameters()
        self.self_attn.reset_parameters()
        self.post_attention_layernorm.reset_parameters()
        self.mlp.reset_parameters()

    def forward(self, hidden_states: Tensor, rope: RotaryPositionEncoding, attn_mask: Tensor) -> Tensor:
        # First residual block: Self-attention
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        attn_outputs = self.self_attn(hidden_states, rope=rope, attn_mask=attn_mask)
        hidden_states = attn_outputs + residual
        # Second residual block: MLP
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states) + residual
        return hidden_states


class LlamaModel(nn.Module):
    """The full Llama model without any pretraining heads."""

    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.rotary_emb = RotaryPositionEncoding(
            config.hidden_size // config.num_attention_heads,
            config.max_position_embeddings,
        )
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleList([
            LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)
        ])
        self.norm = nn.RMSNorm(config.hidden_size, eps=1e-5)

    def reset_parameters(self):
        self.embed_tokens.reset_parameters()
        for layer in self.layers:
            layer.reset_parameters()
        self.norm.reset_parameters()

    def forward(self, input_ids: Tensor, attn_mask: Tensor) -> Tensor:
        # Convert input token IDs to embeddings
        hidden_states = self.embed_tokens(input_ids)
        # Process through all transformer layers, then the final norm layer
        for layer in self.layers:
            hidden_states = layer(hidden_states, rope=self.rotary_emb, attn_mask=attn_mask)
        hidden_states = self.norm(hidden_states)
        # Return the final hidden states
        return hidden_states


class LlamaForPretraining(nn.Module):
    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.base_model = LlamaModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def reset_parameters(self):
        self.base_model.reset_parameters()
        self.lm_head.reset_parameters()

    def forward(self, input_ids: Tensor, attn_mask: Tensor) -> Tensor:
        hidden_states = self.base_model(input_ids, attn_mask)
        return self.lm_head(hidden_states)


def create_causal_mask(batch: Tensor, dtype: torch.dtype = torch.float32) -> Tensor:
    """Create a causal mask for self-attention.

    Args:
        batch: Batch of sequences, shape (batch_size, seq_len)
        dtype: Data type of the mask

    Returns:
        Causal mask of shape (seq_len, seq_len)
    """
    batch_size, seq_len = batch.shape
    mask = torch.full((seq_len, seq_len), float("-inf"), device=batch.device, dtype=dtype) \
        .triu(diagonal=1)
    return mask


def create_padding_mask(batch: Tensor, padding_token_id: int, dtype: torch.dtype = torch.float32) -> Tensor:
    """Create a padding mask for a batch of sequences for self-attention.

    Args:
        batch: Batch of sequences, shape (batch_size, seq_len)
        padding_token_id: ID of the padding token
        dtype: Data type of the mask

    Returns:
        Padding mask of shape (batch_size, 1, seq_len, seq_len)
    """
    padded = torch.zeros_like(batch, device=batch.device, dtype=dtype) \
        .masked_fill(batch == padding_token_id, float("-inf"))
    mask = padded[:, :, None] + padded[:, None, :]
    return mask[:, None, :, :]
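The listing above breaks off before the distributed setup and the training loop. As a rough, illustrative sketch of where the imports are heading, the snippet below shows one common way to apply FSDP2's fully_shard to this model: shard each decoder layer as its own unit, then shard the root module, with a mixed-precision policy. The process-group initialization, the policy values, and the AdamW optimizer are assumptions for illustration, not the article's exact script.

# Illustrative FSDP sketch (assumed setup; launch with torchrun so RANK/LOCAL_RANK/WORLD_SIZE are set)
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = LlamaForPretraining(LlamaConfig()).to("cuda")

# Compute in bfloat16, accumulate gradient reductions in float32
mp_policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)

# Shard each transformer layer as its own FSDP unit, then the remaining top-level parameters
for layer in model.base_model.layers:
    fully_shard(layer, mp_policy=mp_policy)
fully_shard(model, mp_policy=mp_policy)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

Sharding per decoder layer keeps only one layer's parameters unsharded at a time during forward and backward, which is what gives FSDP its memory savings over plain data parallelism.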
Realme 16 Pro Series India Launch Confirmed: Check Expected Prices, Colours, And Key Specs | Technology News
Realme 16 Pro: The Realme 16 Pro series is set to launch in India in the first week of January 2026, and ahead of the official announcement, details about its pricing and variants have surfaced online. The lineup includes the Realme 16 Pro 5G and the Realme 16 Pro+ 5G, along with the upcoming Realme Pad 3 5G. While Realme has confirmed the launch date, pricing details are yet to be officially announced.

Expected Prices and Storage Variants

According to information shared by tech blogger Paras Guglani, the Realme 16 Pro 5G is expected to start at Rs 31,999 for the base variant with 8GB RAM and 128GB storage. The 8GB RAM + 256GB storage model will likely cost Rs 33,999, while the top variant with 12GB RAM and 256GB storage may be priced at Rs 36,999.

For the Realme 16 Pro+ 5G, the leaked pricing suggests a starting price of Rs 39,999 for the 8GB RAM + 128GB storage version. The 8GB RAM + 256GB variant is expected to be priced at Rs 41,999, while the top-end model with 12GB RAM and 256GB storage could cost Rs 44,999. Reports also suggest that buyers may receive additional merchandise when purchasing the phone offline.

Launch Timeline and Availability

The Realme 16 Pro series is scheduled to launch in India on January 6, 2026. The smartphones will be sold through Flipkart and the Realme India online store. They will be available in Master Gold and Master Grey colour options and feature the brand's new "Urban Wild" design language.

Key Specifications

Both smartphones in the series are confirmed to feature a large 7,000mAh Titan battery. The camera setup will include a LumaColor Image-powered triple rear camera system, led by a 200-megapixel primary sensor. The Realme 16 Pro+ 5G will be powered by the Snapdragon 7 Gen 4 chipset, while the Realme 16 Pro 5G will use the MediaTek Dimensity 7300-Max processor. As of now, Realme has not officially confirmed the leaked prices, and the final details are expected to be revealed at the official launch event.
OPPO Pad 5 Officially Confirmed To Launch In India Alongside OPPO Reno 15 Series; Check Expected Display, Camera, Price, And Other Specs | Technology News
OPPO Pad 5 Price In India: Chinese smartphone brand OPPO has confirmed that it will soon launch the OPPO Pad 5 in India. The tablet's India debut was spotted on a Flipkart microsite created for the upcoming OPPO Reno 15 series, where the OPPO Pad 5 is mentioned at the bottom. While OPPO has not officially announced the launch date yet, the listing strongly hints that the tablet will arrive alongside or around the Reno 15 series launch.

The OPPO Pad 5 has already been introduced in China, giving users an early idea of what to expect. In India, the Android tablet is likely to be available in Black and Pink colour options, although OPPO has not revealed the official names of these shades so far.

OPPO Pad 5 Specifications (Expected)

The OPPO Pad 5 is expected to sport a large 12.1-inch LCD display with an adaptive 120Hz refresh rate, going up to 144Hz for smoother visuals. The Android tablet could be powered by the MediaTek Dimensity 9400+ chipset, paired with up to 16GB of RAM and 512GB of internal storage for seamless multitasking. The tablet is likely to pack a massive 10,050mAh battery with 67W fast charging support, ensuring longer usage with quicker top-ups. On the software front, the OPPO Pad 5 is expected to run ColorOS 16 based on Android 16. For photography and video calls, it may feature an 8MP camera on both the front and rear.

OPPO Pad 5 Price in India (Expected)

In China, the OPPO Pad 5 is priced from CNY 2,599 (around Rs 32,000) for the base variant, while the top-end model costs nearly Rs 44,000. If OPPO follows similar pricing in India, the tablet will rival the Samsung Galaxy Tab S10 FE and the Apple iPad.
Train Your Large Model on Multiple GPUs with Tensor Parallelism
import dataclasses
import datetime
import os

import datasets
import tokenizers
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
import torch.optim.lr_scheduler as lr_scheduler
import tqdm
from torch import Tensor
from torch.distributed.checkpoint import load, save
from torch.distributed.checkpoint.default_planner import DefaultLoadPlanner
from torch.distributed.fsdp import FSDPModule, fully_shard
from torch.distributed.tensor import Replicate, Shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    PrepareModuleInput,
    RowwiseParallel,
    SequenceParallel,
    loss_parallel,
    parallelize_module,
)
from torch.utils.data.distributed import DistributedSampler

# Set default to bfloat16
torch.set_default_dtype(torch.bfloat16)
print("NCCL version:", torch.cuda.nccl.version())


# Build the model
@dataclasses.dataclass
class LlamaConfig:
    """Define Llama model hyperparameters."""
    vocab_size: int = 50000               # Size of the tokenizer vocabulary
    max_position_embeddings: int = 2048   # Maximum sequence length
    hidden_size: int = 768                # Dimension of hidden layers
    intermediate_size: int = 4*768        # Dimension of MLP's hidden layer
    num_hidden_layers: int = 12           # Number of transformer layers
    num_attention_heads: int = 12         # Number of attention heads
    num_key_value_heads: int = 3          # Number of key-value heads for GQA


class RotaryPositionEncoding(nn.Module):
    """Rotary position encoding."""

    def __init__(self, dim: int, max_position_embeddings: int) -> None:
        """Initialize the RotaryPositionEncoding module.

        Args:
            dim: The hidden dimension of the input tensor to which RoPE is applied
            max_position_embeddings: The maximum sequence length of the input tensor
        """
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        # compute a matrix of n\theta_i
        N = 10_000.0
        inv_freq = 1.0 / (N ** (torch.arange(0, dim, 2) / dim))
        inv_freq = torch.cat((inv_freq, inv_freq), dim=-1)
        position = torch.arange(max_position_embeddings)
        sinusoid_inp = torch.outer(position, inv_freq)
        # save cosine and sine matrices as buffers, not parameters
        self.register_buffer("cos", sinusoid_inp.cos())
        self.register_buffer("sin", sinusoid_inp.sin())

    def forward(self, x: Tensor) -> Tensor:
        """Apply RoPE to tensor x.

        Args:
            x: Input tensor of shape (batch_size, seq_length, num_heads, head_dim)

        Returns:
            Output tensor of shape (batch_size, seq_length, num_heads, head_dim)
        """
        batch_size, seq_len, num_heads, head_dim = x.shape
        device = x.device
        dtype = x.dtype
        # transform the cosine and sine matrices to 4D tensor and the same dtype as x
        cos = self.cos.to(device, dtype)[:seq_len].view(1, seq_len, 1, -1)
        sin = self.sin.to(device, dtype)[:seq_len].view(1, seq_len, 1, -1)
        # apply RoPE to x
        x1, x2 = x.chunk(2, dim=-1)
        rotated = torch.cat((-x2, x1), dim=-1)
        output = (x * cos) + (rotated * sin)
        return output


class LlamaAttention(nn.Module):
    """Grouped-query attention with rotary embeddings."""

    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads
        self.num_kv_heads = config.num_key_value_heads  # GQA: H_kv < H_q
        # hidden_size must be divisible by num_heads
        assert (self.head_dim * self.num_heads) == self.hidden_size
        # Linear layers for Q, K, V projections
        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(self.hidden_size, self.num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(self.hidden_size, self.num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)

    def forward(self, hidden_states: Tensor, rope: RotaryPositionEncoding, attn_mask: Tensor) -> Tensor:
        bs, seq_len, dim = hidden_states.size()
        # Project inputs to Q, K, V
        query_states = self.q_proj(hidden_states).view(bs, seq_len, self.num_heads, self.head_dim)
        key_states = self.k_proj(hidden_states).view(bs, seq_len, self.num_kv_heads, self.head_dim)
        value_states = self.v_proj(hidden_states).view(bs, seq_len, self.num_kv_heads, self.head_dim)
        # Apply rotary position embeddings
        query_states = rope(query_states)
        key_states = rope(key_states)
        # Transpose tensors from BSHD to BHSD dimension for scaled_dot_product_attention
        query_states = query_states.transpose(1, 2)
        key_states = key_states.transpose(1, 2)
        value_states = value_states.transpose(1, 2)
        # Use PyTorch's optimized attention implementation
        # setting is_causal=True is incompatible with setting explicit attention mask
        attn_output = F.scaled_dot_product_attention(
            query_states, key_states, value_states,
            attn_mask=attn_mask,
            dropout_p=0.0,
            enable_gqa=True,
        )
        # Transpose output tensor from BHSD to BSHD dimension, reshape to 3D, and then project output
        attn_output = attn_output.transpose(1, 2).reshape(bs, seq_len, self.hidden_size)
        attn_output = self.o_proj(attn_output)
        return attn_output


class LlamaMLP(nn.Module):
    """Feed-forward network with SwiGLU activation."""

    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        # Two parallel projections for SwiGLU
        self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.up_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.act_fn = F.silu  # SwiGLU activation function
        # Project back to hidden size
        self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)

    def forward(self, x: Tensor) -> Tensor:
        # SwiGLU activation: multiply gate and up-projected inputs
        gate = self.act_fn(self.gate_proj(x))
        up = self.up_proj(x)
        return self.down_proj(gate * up)


class LlamaDecoderLayer(nn.Module):
    """Single transformer layer for a Llama model."""

    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.input_layernorm = nn.RMSNorm(config.hidden_size, eps=1e-5)
        self.self_attn = LlamaAttention(config)
        self.post_attention_layernorm = nn.RMSNorm(config.hidden_size, eps=1e-5)
        self.mlp = LlamaMLP(config)

    def forward(self, hidden_states: Tensor, rope: RotaryPositionEncoding, attn_mask: Tensor) -> Tensor:
        # First residual block: Self-attention
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        attn_outputs = self.self_attn(hidden_states, rope=rope, attn_mask=attn_mask)
        hidden_states = attn_outputs + residual
        # Second residual block: MLP
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states) + residual
        return hidden_states


class LlamaModel(nn.Module):
    """The full Llama model without any pretraining heads."""

    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.rotary_emb = RotaryPositionEncoding(
            config.hidden_size // config.num_attention_heads,
            config.max_position_embeddings,
        )
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleList([
            LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)
        ])
        self.norm = nn.RMSNorm(config.hidden_size, eps=1e-5)

    def forward(self, input_ids: Tensor, attn_mask: Tensor) -> Tensor:
        # Convert input token IDs to embeddings
        hidden_states = self.embed_tokens(input_ids)
        # Process through all transformer layers, then the final norm layer
        for layer in self.layers:
            hidden_states = layer(hidden_states, rope=self.rotary_emb, attn_mask=attn_mask)
        hidden_states = self.norm(hidden_states)
        # Return the final hidden states
        return hidden_states


class LlamaForPretraining(nn.Module):
    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.base_model = LlamaModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self, input_ids: Tensor, attn_mask: Tensor) -> Tensor:
        hidden_states = self.base_model(input_ids, attn_mask)
        return self.lm_head(hidden_states)


def create_causal_mask(batch: Tensor, dtype: torch.dtype = torch.float32) -> Tensor:
    """Create a causal mask for self-attention.

    Args:
        batch: Batch of sequences, shape (batch_size, seq_len)
        dtype: Data type of the mask

    Returns:
        Causal mask of shape (seq_len, seq_len)
    """
    batch_size, seq_len = batch.shape
    mask = torch.full((seq_len, seq_len), float("-inf"), device=batch.device, dtype=dtype) \
        .triu(diagonal=1)
    return mask


def create_padding_mask(batch: Tensor, padding_token_id: int, dtype: torch.dtype = torch.float32) -> Tensor:
    """Create a padding mask for a batch of sequences for self-attention.

    Args:
        batch: Batch of sequences, shape (batch_size, seq_len)
        padding_token_id: ID of the padding token
        dtype: Data type of the mask

    Returns:
        Padding mask of shape (batch_size, 1, seq_len, seq_len)
    """
    padded = torch.zeros_like(batch, device=batch.device, dtype=dtype) \
        .masked_fill(batch == padding_token_id, float("-inf"))
    mask = padded[:, :, None] + padded[:, None, :]
    return mask[:, None, :, :]


# Generator function to create padded sequences of
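The listing stops just after the mask helpers (the final comment is cut off before the data-pipeline code). To make the destination of the tensor-parallel imports concrete, the snippet below is a minimal, illustrative sketch of a per-layer parallel plan built from ColwiseParallel and RowwiseParallel; the device-mesh setup, the plan entries, and names such as tp_mesh are assumptions for illustration rather than the article's actual plan, which, judging from the imports, likely also uses SequenceParallel and PrepareModuleInput. A real plan must additionally ensure the head counts divide the tensor-parallel degree and adjust the attention module's view() calls to per-rank head counts.

# Illustrative tensor-parallel plan (assumed setup; not the article's exact plan)
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
world_size = int(os.environ["WORLD_SIZE"])
tp_mesh = init_device_mesh("cuda", (world_size,))  # 1-D mesh over all ranks

model = LlamaForPretraining(LlamaConfig()).to("cuda")

for layer in model.base_model.layers:
    layer_plan = {
        # Attention: split heads across ranks (column-parallel Q/K/V, row-parallel output)
        # NOTE: with tp > 1 the forward's view() calls would need per-rank head counts
        "self_attn.q_proj": ColwiseParallel(),
        "self_attn.k_proj": ColwiseParallel(),
        "self_attn.v_proj": ColwiseParallel(),
        "self_attn.o_proj": RowwiseParallel(),
        # MLP: split the intermediate dimension (column-parallel gate/up, row-parallel down)
        "mlp.gate_proj": ColwiseParallel(),
        "mlp.up_proj": ColwiseParallel(),
        "mlp.down_proj": RowwiseParallel(),
    }
    parallelize_module(layer, tp_mesh, layer_plan)

The column-then-row pairing is the standard Megatron-style pattern: each rank computes a slice of the intermediate activations, and the row-parallel projection all-reduces the partial results back into a full hidden state.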
Elon Musk’s xAI To Expand Computing Capacity To 2 GW | Technology News
New Delhi: Tesla and SpaceX CEO Elon Musk's xAI company has purchased a third building near its existing Memphis sites in the US that will bring its artificial intelligence (AI) computing capacity to almost 2 gigawatts (GW). Elon Musk has already built one data centre in Memphis, known as Colossus, and is constructing a second centre nearby dubbed Colossus 2, according to multiple reports. The newly acquired building is in Southaven, Mississippi, and adjoins the Colossus 2 facility, according to reports citing people familiar with the matter.

"xAI has bought a third building called Macrohardrr," Musk posted on social media platform X, saying it will "take @xAI training compute to almost 2GW." A gigawatt is enough to provide electricity for about 7,50,000 US homes. Musk has publicly discussed plans to build the world's largest data centre for AI training and previously said Colossus 2 will eventually have 5,50,000 chips from Nvidia, costing tens of billions of dollars.

Moreover, Musk's xAI Holdings is reportedly in talks to raise new funding at around a $230 billion valuation. Musk owns a 53 per cent stake in xAI Holdings, worth $60 billion.

Elon Musk took a dig at Wikipedia in October, claiming Grokipedia, developed by xAI, will surpass the popular online encyclopedia "by several orders of magnitude in breadth, depth and accuracy." Grokipedia is an AI-powered encyclopedia that aims to challenge what Musk calls a "woke" and biased Wikipedia. He described Grokipedia as a "massive improvement over Wikipedia" and said it aligns with xAI's mission to help humanity better understand the universe.

Musk's net worth rose to nearly $750 billion after a US court reinstated Tesla stock options worth $139 billion. According to Forbes' billionaires index, this development has taken Musk closer to becoming the world's first trillionaire.
New Year’s Eve Tech Tip: How You Place Your Smartphone On Table Can Improve Privacy, Focus, Battery, And Mental Peace-Explained | Technology News
Face Down Phone Benefits: As we celebrate New Year's Eve and spend time with family and friends, smartphones have become an important part of our daily lives. Whether we are working, attending meetings, eating, or enjoying moments with loved ones, our phone is usually kept right in front of us. However, one small thing is often ignored: should the phone be kept with the screen facing up or down? This small habit may seem insignificant, but it directly affects our concentration, peace of mind, privacy, and digital health. Keeping the phone screen facing up has become a common mistake, and it causes many problems without us even noticing them.

Your Privacy Is At Risk When Screen Faces Up

When your phone is kept with the screen facing up, it can show your private information without you knowing it. Bank messages, OTPs, personal chats, or office notifications can be seen by people sitting nearby. Many times, we do not even notice what appeared on the screen or who may have seen it. Keeping the phone face down removes this risk immediately. At a time when digital privacy is becoming more important, this simple habit helps keep your personal information safe without any extra effort.

Notifications Make It Hard to Concentrate

One of the biggest strengths of a smartphone is notifications, but they can also become its biggest weakness. When the screen is facing up, every vibration or flash of light pulls your attention. Even if you do not plan to check the phone, your mind automatically shifts toward it. Keeping the phone face down removes this visual distraction and helps you stay focused on work, conversations, or studies. Over time, this habit teaches your brain that not every alert needs immediate attention.

How Screen Position Affects Your Mind

When a phone keeps lighting up in front of you, your mind stays in alert mode all the time. This can make you feel tired and restless. Keeping the phone face down tells your brain that the phone is not important at that moment. As a result, you feel more calm and relaxed. Whether you are with your family or sitting alone, this habit helps you pay more attention to what is happening around you.

Why Keeping the Phone Face Down Is Safer

The screen and camera are the most costly and sensitive parts of a smartphone. When the phone is kept with the screen facing up, water drops, tea or coffee, and food pieces can fall on it. The camera lens can also get damaged slowly by rubbing against the table. Keeping the phone face down protects both the screen and the camera. It also lowers the chance of the phone slipping or falling, especially on smooth tables.

How a Face-Down Phone Saves Your Battery

Every time the phone screen lights up and you unlock it, the battery gets used a little. When the phone is kept face down, notifications are less tempting, so you pick up the phone less often. This helps reduce screen time, makes the battery last longer, and puts less strain on your eyes. It also gives a small but healthy break to both the phone and the user.

Focus on Real Life, Not Your Phone

By keeping your phone face down, you take control instead of letting the phone control you. It also shows the people around you that you are giving them your full attention. Slowly, you start paying more attention to real-life moments instead of constant phone distractions.
This balance is the key to a healthy digital life, where technology helps you instead of taking over.
How Train Wi-Fi Works: Does Connection Get Lost At 120 km/hr? Check List Of Trains Offering Free Internet Service | Technology News
With more people relying on the internet while travelling, Wi-Fi on trains has become an important facility for passengers. Many wonder how the internet works on a moving train and whether high speed (sometimes above 100 kmph) affects the connection. Here's a simple explanation of how train Wi-Fi works and which trains currently offer this service.

How Train Wi-Fi Works

Train Wi-Fi does not come from satellites directly to passengers' phones. Instead, trains are fitted with special routers and antennas on the roof. These antennas connect to nearby mobile towers using 4G or 5G networks, just like a mobile phone does. Inside the train, this signal is distributed to passengers through internal Wi-Fi routers installed in coaches. The system automatically switches between mobile towers as the train moves, ensuring continuous internet access. This process is known as "handover" and happens within seconds.

Does Internet Stop at High Speeds?

Even at speeds of 100-130 kmph, Wi-Fi generally continues to work. Modern mobile networks are designed to support fast-moving users, such as those in trains or cars. However, brief slowdowns or disconnections can happen while passing through tunnels, remote areas, forests, or regions with weak network coverage. Internet speed may also reduce when many passengers are connected at the same time, especially during peak travel hours.

Which Trains Offer Wi-Fi in India?

Indian Railways provides Wi-Fi services under the RailWire program, operated by RailTel. Free Wi-Fi is available at over 6,000 railway stations across the country. Some premium trains and routes also offer onboard Wi-Fi, including:

Vande Bharat Express
Shatabdi Express
Rajdhani Express
Gatimaan Express
Selected Tejas Express routes

Future of Train Connectivity

Indian Railways is working to expand onboard Wi-Fi and improve signal strength using advanced LTE and upcoming 5G technologies. The goal is to offer smoother internet access for work, entertainment, and communication during long journeys.
Training a Tokenizer for Llama Model
The Llama family of models are large language models released by Meta (formerly Facebook). These decoder-only transformer models are used for generation tasks. Almost all decoder-only models nowadays use the Byte-Pair Encoding (BPE) algorithm for tokenization. In this article, you will learn about BPE. In particular, you will learn:

What BPE is compared to other tokenization algorithms
How to prepare a dataset and train a BPE tokenizer
How to use the tokenizer

Training a Tokenizer for Llama Model. Photo by Joss Woodhead. Some rights reserved.

Let's get started.

Overview

This article is divided into four parts; they are:

Understanding BPE
Training a BPE tokenizer with Hugging Face tokenizers library
Training a BPE tokenizer with SentencePiece library
Training a BPE tokenizer with tiktoken library

Understanding BPE

Byte-Pair Encoding (BPE) is a tokenization algorithm used to tokenize text into sub-word units. Instead of splitting text into only words and punctuation, BPE can further split the prefixes and suffixes of words so that prefixes, stems, and suffixes can each be associated with meaning in the language model. Without sub-word tokenization, a language model would find it difficult to learn that "happy" and "unhappy" are antonyms of each other.

BPE is not the only sub-word tokenization algorithm; WordPiece, the default for BERT, is another. A well-implemented BPE needs no "unknown" token in the vocabulary, and nothing is OOV (out of vocabulary) in BPE. This is because BPE can start with the 256 byte values (hence the name byte-level BPE) and then merge the most frequent pairs of tokens into new vocabulary entries until the desired vocabulary size is reached (a toy sketch of this merge loop appears after the dataset discussion below). Nowadays, BPE is the tokenization algorithm of choice for most decoder-only models. However, you do not want to implement your own BPE tokenizer from scratch. Instead, you can use tokenizer libraries such as Hugging Face's tokenizers, OpenAI's tiktoken, or Google's sentencepiece.

Training a BPE tokenizer with Hugging Face tokenizers Library

To train a BPE tokenizer, you need to prepare a dataset so the tokenizer algorithm can determine the most frequent pairs of tokens to merge. For decoder-only models, a subset of the model's training data is usually appropriate. Training a tokenizer is time-consuming, especially for large datasets. However, unlike a language model, a tokenizer does not need to learn the language context of the text, only how often tokens appear in a typical text corpus. While you may need trillions of tokens to train a good language model, you only need a few million tokens to train a good tokenizer.

As mentioned in a previous article, there are several well-known text datasets for language model training. For a toy project, you may want a smaller dataset for faster experimentation. The HuggingFaceFW/fineweb dataset is a good choice for this purpose. In its full size, it is a 15-trillion-token dataset, but it also has 10B, 100B, and 350B versions for smaller projects. The dataset is derived from Common Crawl and filtered by Hugging Face to improve data quality.
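Before looking at the data, here is the toy sketch of the merge loop promised above. It is purely illustrative: it works on characters rather than bytes, uses an arbitrary five-word corpus and five merges, and makes no attempt at efficiency; the libraries discussed in this article implement the same idea far more efficiently.

# Toy illustration of the BPE merge loop (for intuition only)
from collections import Counter

corpus = ["low", "lower", "lowest", "newest", "widest"]
words = [list(w) for w in corpus]  # start with one token per character

for _ in range(5):  # perform 5 merges
    # count every adjacent pair of tokens across the corpus
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    if not pairs:
        break
    (a, b), _count = pairs.most_common(1)[0]
    merged = a + b
    # replace every occurrence of the most frequent pair with the merged token
    new_words = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                out.append(merged)
                i += 2
            else:
                out.append(w[i])
                i += 1
        new_words.append(out)
    words = new_words
    print(f"merge {a!r}+{b!r} -> {merged!r}")

Each printed merge becomes a new vocabulary entry; a real tokenizer repeats this on byte-level pre-tokenized text until the target vocabulary size is reached. With that intuition in place, we can turn to the dataset.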
Below is how you can print a few samples from the dataset:

import datasets

dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)
count = 0
for sample in dataset:
    print(sample)
    count += 1
    if count >= 5:
        break

Running this code will print the following:

{'text': '|Viewing Single Post From: Spoilers for the Week of February 11th|\n|Lil||F…', 'id': '<urn:uuid:39147604-bfbe-4ed5-b19c-54105f8ae8a7>', 'dump': 'CC-MAIN-2013-20', 'url': 'http://daytimeroyaltyonline.com/single/?p=8906650&t=8780053', 'date': '2013-05-18T05:48:59Z', 'file_path': 's3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/war…', 'language': 'en', 'language_score': 0.8232095837593079, 'token_count': 142}
{'text': '*sigh* Fundamentalist community, let me pass on some advice to you I learne…', 'id': '<urn:uuid:ba819eb7-e6e6-415a-87f4-0347b6a4f017>', 'dump': 'CC-MAIN-2013-20', 'url': 'http://endogenousretrovirus.blogspot.com/2007/11/if-you-have-set-yourself-on…', 'date': '2013-05-18T06:43:03Z', 'file_path': 's3://commoncrawl/crawl-data/CC-MAIN-2013-20/segments/1368696381249/war…', 'language': 'en', 'language_score': 0.9737711548805237, 'token_count': 703}
…

For training a tokenizer (and even a language model), you only need the text field of each sample. To train a BPE tokenizer using the tokenizers library, you simply feed the text samples to the trainer.
Below is the complete code:

from typing import Iterator

import datasets
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders, normalizers

# Load FineWeb 10B sample (using only a slice for demo to save memory)
dataset = datasets.load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

def get_texts(dataset: datasets.Dataset, limit: int = 100_000) -> Iterator[str]:
    """Get texts from the dataset until the limit is reached or the dataset is exhausted"""
    count = 0
    for sample in dataset:
        yield sample["text"]
        count += 1
        if limit and count >= limit:
            break

# Initialize a BPE model: either byte_fallback=True or set unk_token="[UNK]"
tokenizer = Tokenizer(models.BPE(byte_fallback=True))
tokenizer.normalizer = normalizers.NFKC()
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True, use_regex=False)
tokenizer.decoder = decoders.ByteLevel()

# Trainer
trainer = trainers.BpeTrainer(
    vocab_size=25_000,
    min_frequency=2,
    special_tokens=["[PAD]", "[CLS]", "[SEP]", "[MASK]"],
    show_progress=True,
)

# Train and save the tokenizer to disk
texts = get_texts(dataset, limit=10_000)
tokenizer.train_from_iterator(texts, trainer=trainer)
tokenizer.save("bpe_tokenizer.json")

# Reload the tokenizer from disk
tokenizer = Tokenizer.from_file("bpe_tokenizer.json")

# Test: encode/decode
text = "Let's have a pizza party! 🍕"
enc = tokenizer.encode(text)
print("Token IDs:", enc.ids)
print("Decoded:", tokenizer.decode(enc.ids))
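Once saved, the tokenizer can be wired straight into a data pipeline. The short sketch below is illustrative rather than part of the article's script: it reloads the file saved above and turns on padding and truncation so that a batch of texts comes back as equal-length ID sequences with attention masks. The max_length of 2048 is an assumed example value, not something the article prescribes.

# Illustrative usage of the trained tokenizer for batched encoding
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("bpe_tokenizer.json")
pad_id = tokenizer.token_to_id("[PAD]")       # "[PAD]" was registered as a special token above
tokenizer.enable_padding(pad_id=pad_id, pad_token="[PAD]")
tokenizer.enable_truncation(max_length=2048)  # assumed context length for the example

encodings = tokenizer.encode_batch(["Let's have a pizza party!", "Short text"])
for enc in encodings:
    print(enc.ids)
    print(enc.attention_mask)  # 1 for real tokens, 0 for padding

The attention mask produced here is exactly what a padding-aware training loop needs to ignore the padded positions.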
Govt Releases White Paper On Democratising Access To AI Infrastructure | Technology News
New Delhi: The Office of the Principal Scientific Adviser (PSA) to the Government on Tuesday released a white paper on democratising access to Artificial Intelligence (AI) infrastructure. The white paper defines democratising access to AI infrastructure as making AI infrastructure (compute, datasets, and the model ecosystem) available and affordable so that it reaches a wide set of users. It refers to empowering a wide set of users to engage with and benefit from AI capabilities. When compute, datasets, and model tooling are broadly available, individuals and institutions expand what they can do, such as designing local-language tools and adapting assistive technologies.

The white paper has been prepared with inputs and feedback from domain experts and stakeholders, including the Niti Aayog, to foster informed deliberation and action in shaping India's AI policy and governance landscape.

"With AI becoming central to innovation and economic progress, access to compute, datasets, and model ecosystems must be made broad, affordable, and inclusive. These resources are concentrated in a few global firms and urban centres, limiting equitable participation," the office of the PSA said in a post on social media. "For India, democratising access means treating AI infrastructure as a shared national resource, empowering innovators across regions to build local-language tools, adapt assistive technologies, and create solutions aligned with India's diverse needs," it added.

The white paper highlights key enablers aligned with India's AI governance vision, including expanding access to high-quality, representative datasets; providing affordable and reliable computing resources; and integrating AI with Digital Public Infrastructure (DPI). Democratising access to AI infrastructure is critical for ensuring fair and equitable opportunities and benefits across the country, from villages to cities, and from small institutions and startups to industry.

Through tools and platforms like AIKosha, India AI Compute, and TGDeX, India's AI ecosystem is supporting innovation and services by increasing access. Further, dedicated government initiatives on infrastructure development and increasing access to data and computing resources would empower the IndiaAI Mission, line ministries, sectoral regulators, and state governments, the white paper said.