StarCoder2 – open source code completion models

StarCoder2 is a family of code generation models (3B, 7B, and 15B), trained on 600+ programming languages from The Stack v2 and some natural language text such as Wikipedia, Arxiv, and GitHub issues. The models use Grouped Query Attention, a context window of 16,384 tokens, with sliding window attention of 4,096 tokens. The 3B & 7B models were trained on 3+ trillion tokens, while the 15B was trained on 4+ trillion tokens. For more details check out the paper.
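To make the sliding window numbers concrete, here is a minimal sketch of which earlier tokens a given position can see. It assumes causal attention where the window counts the current token; real implementations may differ by one at the boundary, and the function name is just for illustration.

```python
def visible_positions(i: int, window: int = 4096) -> range:
    """Positions that token i may attend to under causal sliding-window attention.

    Assumption: the window includes the current token, so token i sees
    at most `window` positions ending at i.
    """
    start = max(0, i - window + 1)
    return range(start, i + 1)


# Early tokens see everything before them; later tokens see only the last 4,096.
print(len(visible_positions(100)))
print(len(visible_positions(16_383)))
```

Information from further back in the 16,384-token context can still reach a token indirectly, because each layer's window stacks on top of the previous layer's.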

StarCoder2 @ GitHub

StarCoder2 is a family of open LLMs for code and comes in 3 different sizes with 3B, 7B and 15B parameters. The flagship StarCoder2-15B model is trained on over 4 trillion tokens and 600+ programming languages from The Stack v2. All models use Grouped Query Attention, a context window of 16,384 tokens with a sliding window attention of 4,096 tokens, and were trained using the Fill-in-the-Middle objective.
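Fill-in-the-Middle means the model is given the code before and after a gap and asked to generate what goes in between. A rough sketch of how such a prompt is assembled follows. The three sentinel strings are the ones I understand the StarCoder family tokenizer to use; treat them as an assumption and check the tokenizer's special tokens before relying on them.

```python
# Assumed sentinel tokens; verify against the model's tokenizer.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange the code before and after the gap so the model generates the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = build_fim_prompt(
    prefix="def add(a, b):\n    ",
    suffix="\n\nprint(add(1, 2))\n",
)
print(prompt)
```

Whatever the model generates after the final sentinel is the proposed middle, which an editor plugin would splice back between the prefix and the suffix.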

StarCoder2 offers three model sizes: a 3 billion-parameter model trained by ServiceNow, a 7 billion-parameter model trained by Hugging Face, and a 15 billion-parameter model trained by NVIDIA using NVIDIA NeMo on NVIDIA accelerated infrastructure.

StarCoder2 @ Hugging Face


Notes on better search 8/18/2023

Goal: better, more focused search for www.cali.org.

In general, the plan is to scrape the site into a vector database, make the embeddings in that database available to Llama 2, and provide API endpoints for searching and finding things.

Hints and pointers.

  • Llama2-webui – Run any Llama 2 locally with gradio UI on GPU or CPU from anywhere
  • FastAPI – web framework for building APIs with Python 3.7+ based on standard Python type hints
  • Danswer – Ask Questions in natural language and get Answers backed by private sources. It makes use of
    • PostgreSQL – a powerful, open source object-relational database system
    • QDrant – Vector Database for the next generation of AI applications.
    • Typesense – a modern, privacy-friendly, open source search engine built from the ground up using cutting-edge search algorithms, that take advantage of the latest advances in hardware capabilities.

The challenge is to wire together these technologies and then figure out how to get it to play nice with Drupal. One possibility is just to build this with an API and then use the API to interact with Drupal. That approach also offers the possibility of allowing the membership to interact with the API too.
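As a note to self, the core search step can be sketched in a few lines of plain Python. This is a toy: word counts stand in for real embeddings, a dict stands in for the vector database, and the page paths and text are made up. A real build would swap in an embedding model, QDrant, and a FastAPI endpoint wrapped around `search`.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(query: str, pages: dict[str, str], top_k: int = 3) -> list[tuple[str, float]]:
    """Rank pages by similarity to the query, dropping pages with no overlap."""
    q = embed(query)
    scored = [(path, cosine(q, embed(text))) for path, text in pages.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(path, score) for path, score in scored[:top_k] if score > 0]


# Hypothetical scraped pages, keyed by path.
PAGES = {
    "/lessons": "interactive lessons on contracts and torts",
    "/podcasts": "podcasts about legal education technology",
    "/conference": "annual conference for law school technology staff",
}

print(search("contracts lessons", PAGES))
```

Keeping the ranking behind a single function is what makes the API-first approach workable: Drupal, or a member's own script, only needs to send a query and read back a list of paths and scores.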

GitHub :: free-response-scoring by David Colarusso

This repository shares code used to implement the methods described in Unsupervised Machine Scoring of Free Response Answers—Validated Against Law School Final Exams, presented at the Computational Legal Studies Conference, March 2022, hosted by the Center for Computational Law at Singapore Management University.

You can find links to all relevant content either in, or linked to from, the notebook titled Score Exams.

/free-response-scoring

Good alternative to LLM text comparison. Note: patent pending, Suffolk University.

Customizing GPT-3 for Your Application :: OpenAI

Developers can now fine-tune GPT-3 on their own data, creating a custom version tailored to their application. Customizing makes GPT-3 reliable for a wider variety of use cases and makes running the model cheaper and faster.

You can use an existing dataset of virtually any shape and size, or incrementally add data based on user feedback. With fine-tuning, one API customer was able to increase correct outputs from 83% to 95%. By adding new data from their product each week, another reduced error rates by 50%.
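In practice, "any shape and size" still meant converting the data into training examples. Below is a small sketch of that conversion, assuming the prompt/completion JSONL layout as I recall it from the GPT-3 fine-tuning docs of that era. The current API may expect something different, so check the docs before using this. The example pairs are made up.

```python
import json


def to_jsonl(pairs):
    """Turn (prompt, completion) pairs into one JSON object per line."""
    return "\n".join(
        json.dumps({"prompt": prompt, "completion": completion})
        for prompt, completion in pairs
    )


# Hypothetical training examples.
examples = [
    ("What is consideration in contract law?", " Something of value exchanged by the parties."),
    ("What is a tort?", " A civil wrong that causes harm or loss."),
]

print(to_jsonl(examples))
```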

Source: Customizing GPT-3 for Your Application

Adding Symphora, my WordPress blog, to the #fediverse via ActivityPub (Take 2)

Second try. Probably need to be subscribed first, then post.

Why just join a Mastodon instance if I can turn my entire blog into a node in the federated social network space? This is the first post that attempts this feat. If it works I’ll write up what I did to make it work. In the meantime you can follow these antics and more by following elmer@www.symphora.com and @emasters.

See you on the other side!

The prompt says “Start writing” so, sure, I’ll just start writing. I’m on my phone at the moment so writing is more like swiping, but it does have a certain cursive feel to it once you get going. Of course proofreading to avoid bad autocorrect is essential. It also helps to have something to say and that’s a bit subjective I think.

What I’m really looking for is whether or not I can post without a title. Sometimes I have an idea but get stuck staring at the “Add title” and lose my inspiration. Not having to think of a title is nice.

Scraping the Teknoids Mailman PiperMail Archive

Putting this here in case anyone finds themselves in need of something to scrape a Pipermail web archive of a Mailman mailing list. This bit of Python 3 is based on a bit of Python 2 I found at Scraping GNU Mailman Pipermail Email List Archives. The only changes I made from the original are to update some things to work in Python 3. It works well for my purposes, generating a single text file of the teknoids list archive from 2005 to today.

#!/usr/bin/env python3

import requests
from lxml import html
import gzip
from io import BytesIO

listname = 'teknoids'
url = 'https://lists.teknoids.net/pipermail/' + listname + '/'

# Fetch the archive index page and collect the links to the archive files.
response = requests.get(url)
tree = html.fromstring(response.text)

filenames = tree.xpath('//table/tr/td[3]/a/@href')

def emails_from_filename(filename):
    """Download one archive file and return its contents as bytes."""
    print(filename)
    response = requests.get(url + filename)
    if filename[-3:] == '.gz':
        # Gzipped archives are decompressed in memory.
        contents = gzip.GzipFile(fileobj=BytesIO(response.content)).read()
    else:
        contents = response.content
    return contents

contents = [emails_from_filename(filename) for filename in filenames]
# Pipermail indexes typically list the newest period first; reverse for oldest-first order.
contents.reverse()

contents = b"\n\n\n\n".join(contents)

with open(listname + '.txt', 'wb') as filehandle:
    filehandle.write(contents)
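If you want individual messages rather than one big file, here is a rough sketch for splitting the text back apart. It assumes each message starts with an mbox-style line beginning with "From ", which is how the Pipermail text archives I've seen are laid out. A body line that happens to start with "From " would be split incorrectly, so treat the result as approximate. The sample bytes are made up.

```python
import re


def split_messages(raw: bytes) -> list[bytes]:
    """Split archive text into messages at lines that begin with 'From '."""
    # Zero-width split just before each separator line (needs Python 3.7+).
    parts = re.split(rb"(?m)^(?=From )", raw)
    return [part for part in parts if part.strip()]


# Made-up sample in the same shape as the joined archive file.
sample = (
    b"From alice at example.edu  Mon Jan  3 10:00:00 2005\n"
    b"Subject: first\n\nhello\n"
    b"\n\n\n\n"
    b"From bob at example.edu  Tue Jan  4 11:00:00 2005\n"
    b"Subject: second\n\ngoodbye\n"
)

messages = split_messages(sample)
print(len(messages))
```

For the real file, read `teknoids.txt` in binary mode and pass the bytes to `split_messages`.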

My last post to teknoids: He’s dead, Jim. This list is getting a reboot

I just posted to the teknoids list letting everyone know I’m shutting down the list and replacing it with a Discourse forum at https://discourse.teknoids.net/. Here’s the text of the post:

As a few of you may have noticed things have been amiss with the list since late last year when Microsoft decided to put the list on some sort of irrevocable ban list. As a result messages are not being delivered to over half of the subscribers at law schools around the country. That’s nearly 300 people. More troubling to me is that virtually no one who stopped receiving messages even appears to have noticed that they aren’t getting messages anymore.

After trying many, many approaches to getting the ban lifted and staring at the apparent ambivalence of the list itself I’ve decided to shut down the list effective immediately. The mailing list has been around for 30 years and I’ve been the admin for many of those, so it wasn’t an easy decision to make. The list will no longer function after 5:00 PM ET today, Monday April 4, 2022.

Of course on the Internet nothing ever really goes away. The list archives will continue to be available. For those of you interested in continuing the conversation, growing the community, or just generally keeping in touch I’m launching a new website called The Teknoids List at https://discourse.teknoids.net/. The new Teknoids List is a Discourse-based discussion forum that is up and running now. I’d like to invite everyone on the list to head on over and create an account. Tell your friends, tell your neighbors, tell your colleagues! With some luck I hope we can grow the new site into the go-to place to discuss and discover the latest in tech + legal education.

If you have any questions or concerns please reply to me directly or, better, head on over to https://discourse.teknoids.net/ and we’ll talk them through.

Thanks,
Elmer
Chief Teknoid
https://discourse.teknoids.net/