Lurker: The Twitch chat scraper


Twitch has a lot of interesting conversations happening on its platform. You can get a general sense of people's opinions on upcoming game releases, game developers, other streamers, and even pizza. For some people, knowing the outcomes of these conversations can be valuable data.

In this post, I want to walk you through the first step of a data analysis system for taking this raw data and gaining insights from it. This first step is getting the data in the first place. To do this, we will take a look at how Twitch chat works and make a simple worker to act as a read-only client. After that, we will look at options for passing that data along so it can be saved in a central place.

The Plan

The plan is to check out how Twitch chat works without an account, and copy that functionality of the website in a standalone worker. From there we will format that data and send it wherever we need it. Speaking of where we need it, a message broker such as RabbitMQ means the worker does not have to care who ends up consuming the data. We will package the data into a message and rely on listeners of that queue to store and analyze it.

Twitch Chat

The best way to figure this out is to see what network requests happen when new chat messages come in. To do that, we will open our dev tools on a live Twitch channel… That’s weird, there doesn’t seem to be anything coming in related to chat. It has to be coming from somewhere, so let’s refresh the page.

Twitch IRC Websocket

There it is! Instead of one request per message, a websocket is opened between our browser and the URL irc-ws.chat.twitch.tv. The irc-ws part tells me they are using the IRC protocol over websockets. Let’s take a look inside this websocket:

Twitch websocket internals

Three things jump out at me here: PASS SCHMOOPIIE, which is a hard-coded password; NICK justinfan56017, which I am assuming is a generated username for people who aren’t logged in; and JOIN #chessbrah, which is us joining the channel that I selected. After that we see some data about the room and we start receiving messages. Let’s look at one of these messages. I have stripped this one of any data that identifies the user who sent it:
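To make the handshake concrete, here is a minimal sketch of a worker building those same three lines. The function names are mine, and the five-digit range for the guest number is an assumption based on the single example in the capture. How the lines reach the socket is left to whatever WebSocket client you use.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// anonymousNick builds a guest login shaped like the "justinfan56017" in the capture.
// The five-digit range is an assumption; the capture only shows one example.
func anonymousNick() string {
	return fmt.Sprintf("justinfan%d", 10000+rand.Intn(90000))
}

// handshakeLines returns the three lines the browser sent, in the same order.
func handshakeLines(nick, channel string) []string {
	// Accept "chessbrah" or "#ChessBrah"; channel names are lowercase logins.
	channel = strings.ToLower(strings.TrimPrefix(channel, "#"))
	return []string{
		"PASS SCHMOOPIIE", // password copied from the browser capture
		"NICK " + nick,
		"JOIN #" + channel,
	}
}

func main() {
	for _, line := range handshakeLines(anonymousNick(), "chessbrah") {
		fmt.Println(line)
	}
}
```

Keeping the handshake as plain strings means the same function works whether the transport is a websocket or a raw IRC connection.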

@badge-info=subscriber/27;badges=subscriber/24,premium/1;client-nonce=REDACTED;color=#9ACD32;display-name=REDACTED;emotes=;first-msg=0;flags=;id=REDACTED;mod=0;reply-parent-display-name=REDACTED;reply-parent-msg-body=REDACTED;reply-parent-msg-id=REDACTED;reply-parent-user-id=REDACTED;reply-parent-user-login=REDACTED;returning-chatter=0;room-id=REDACTED;subscriber=1;tmi-sent-ts=REDACTED;turbo=0;user-id=REDACTED;user-type= :REDACTED!REDACTED@REDACTED.tmi.twitch.tv PRIVMSG #chessbrah :@REDACTED he streamed yesterday

That is a lot of data for one message!

We have data on what badges the user has, what color their name is, their display name, and, at the very end, the message itself. If they used the reply feature, we also get data on what they were replying to: who sent the previous message and a copy of its text, though I think that copy will be truncated. If we are collecting all messages, we will be able to tie replies together with the reply-parent-msg-id field. We can also see flags for whether the user is a subscriber of this channel, a mod of this channel, or a Twitch Turbo subscriber.

There are other commands and messages that we can see in this stream of data, but for now we are going to focus on just this type of message and one other. That other message is the PING message. It seems that responding to this with a PONG is how Twitch knows we are still listening to the stream and that it should continue sending data.
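One detail worth knowing before relying on fields like the reply body: the key=value format is IRCv3 message tags, and that spec escapes semicolons, spaces, and backslashes inside values. The sketch below turns the tag section into a map and undoes that escaping. It assumes Twitch follows the IRCv3 rules here, and parseTags and tagUnescaper are names I made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// tagUnescaper reverses the escaping defined for IRCv3 message tag values.
var tagUnescaper = strings.NewReplacer(
	`\:`, ";",
	`\s`, " ",
	`\\`, `\`,
	`\r`, "\r",
	`\n`, "\n",
)

// parseTags turns "@key=value;key=value" into a map with unescaped values.
func parseTags(raw string) map[string]string {
	tags := make(map[string]string)
	for _, pair := range strings.Split(strings.TrimPrefix(raw, "@"), ";") {
		key, value, _ := strings.Cut(pair, "=")
		if key == "" {
			continue // skip empty segments such as a trailing semicolon
		}
		tags[key] = tagUnescaper.Replace(value)
	}
	return tags
}

func main() {
	// Made-up tag string in the same shape as the sample message.
	tags := parseTags(`@display-name=SomeUser;reply-parent-msg-body=is\she\slive?;subscriber=1`)
	fmt.Println(tags["reply-parent-msg-body"])
}
```

With the tags in a map, linking a reply to its parent is a lookup of reply-parent-msg-id against the id values already stored.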
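As a rough sketch of that keepalive, the helper below recognizes a PING line and builds the matching PONG by echoing back whatever followed the command. pongFor is a name I made up, and actually sending the reply is left to your socket code.

```go
package main

import (
	"fmt"
	"strings"
)

// pongFor returns the reply for a keepalive line, or false for any other line.
func pongFor(line string) (string, bool) {
	line = strings.TrimRight(line, "\r\n")
	if line != "PING" && !strings.HasPrefix(line, "PING ") {
		return "", false
	}
	// Echo back whatever followed PING, e.g. " :tmi.twitch.tv".
	return "PONG" + strings.TrimPrefix(line, "PING"), true
}

func main() {
	lines := []string{
		"PING :tmi.twitch.tv\r\n",
		":someone!someone@someone.tmi.twitch.tv PRIVMSG #chessbrah :PING",
	}
	for _, line := range lines {
		if reply, ok := pongFor(line); ok {
			fmt.Println(reply)
		}
	}
}
```

Checking the start of the line rather than searching for "PING" anywhere keeps a chat message that merely contains the word from triggering a reply.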

Code time!

If you want to follow along with the completed code, you can find it in the repo on GitHub.

First, let’s make a struct with the data we want:

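package models

import "time"

// Message holds the fields we keep from a single chat message.
type Message struct {
	Username    string // display-name tag
	UserId      string // user-id tag
	MessageText string
	IsSub       bool
	IsMod       bool
	IsTurbo     bool
	Channel     string    // channel name without the leading '#'
	DateSent    time.Time // set when the worker receives the message
}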

This has the basics. If we want more, we can always come back to this and add more fields. Now, let’s talk message parsing. The first thing I noticed is that the message uses semicolons as separators. The next thing I noticed is that we get the exact message that the user sent, meaning we can’t just split the entire string at the semicolons, as the user may have a semicolon in their message. However, we can split the metadata at the semicolons once we have dealt with the jumbled mess that follows the user-type key.

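// ParseMessage pulls the fields we care about out of one raw PRIVMSG line.
// It needs the strconv, strings, and time packages plus the models package.
// Lines that are not tagged PRIVMSGs come back with only DateSent set.
func ParseMessage(s string) models.Message {
	parsedMessage := models.Message{}
	// Time of receipt, not the time the user sent the message.
	parsedMessage.DateSent = time.Now().UTC()

	// SplitN stops at the first match, so a chat message that happens to
	// contain "user-type=" or "PRIVMSG" is not cut short.
	infoMessageSplit := strings.SplitN(s, "user-type=", 2)
	if len(infoMessageSplit) < 2 {
		return parsedMessage
	}
	trimmed := strings.SplitAfterN(infoMessageSplit[1], "PRIVMSG", 2)
	if len(trimmed) < 2 {
		return parsedMessage
	}
	// trimmed[1] now looks like " #channel :message text".
	message := strings.SplitN(trimmed[1], ":", 2)
	if len(message) < 2 {
		return parsedMessage
	}
	parsedMessage.MessageText = strings.Trim(message[1], "\r\n")
	parsedMessage.Channel = strings.Trim(message[0], " #")

	// The metadata is key=value pairs separated by semicolons.
	splitMessageInfo := strings.Split(infoMessageSplit[0], ";")

	for _, item := range splitMessageInfo {
		splitItem := strings.SplitN(item, "=", 2)
		if len(splitItem) < 2 {
			continue
		}
		switch splitItem[0] {
		case "display-name":
			parsedMessage.Username = splitItem[1]
		case "user-id":
			parsedMessage.UserId = splitItem[1]
		case "mod":
			parsedMessage.IsMod, _ = strconv.ParseBool(splitItem[1])
		case "subscriber":
			parsedMessage.IsSub, _ = strconv.ParseBool(splitItem[1])
		case "turbo":
			parsedMessage.IsTurbo, _ = strconv.ParseBool(splitItem[1])
		}
	}

	return parsedMessage
}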

That wasn’t terrible, I suppose. The first thing we do is set the time we received the message; it doesn’t have to be entirely accurate. Next, we split the message at user-type=. In the sample above, that tag is the last piece of metadata before the part that carries the sender, the channel, and the text the user actually typed. If we split the whole line at the semicolons instead, we would have to do some funky stuff to put a message containing semicolons back together. So now we have the metadata at infoMessageSplit[0] and the rest, including the actual message, at infoMessageSplit[1]. We then trim that down further until we get the raw message text and channel name, and we add those to our parsedMessage object.

The metadata parsing is a bit easier. It is key=value with semicolon separators between pairs, so we separate at those semicolons and run each key/value pair through a switch statement to check whether we want that data and, if so, where to put it.

Now we have a message parsed into an object that we can do things with! As I said before, I send this data to a RabbitMQ queue and save it to a database on the other side so I can (eventually) do searching and analysis. You can do whatever you wish with it.
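The switch is also where extra tags could be picked up later. For example, if the receive time ever turns out to be too loose, the tmi-sent-ts tag carries the send time as a Unix timestamp in milliseconds. Here is a small sketch of converting it; sentTime is a hypothetical helper, not something from the repo.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// sentTime converts a tmi-sent-ts value (Unix time in milliseconds) to UTC.
// It reports false when the value is missing or not a number.
func sentTime(value string) (time.Time, bool) {
	ms, err := strconv.ParseInt(value, 10, 64)
	if err != nil {
		return time.Time{}, false
	}
	return time.UnixMilli(ms).UTC(), true
}

func main() {
	// Made-up timestamp; the sample message in the post has this value redacted.
	if sent, ok := sentTime("1700000000000"); ok {
		fmt.Println(sent.Format(time.RFC3339))
	}
}
```

A caller could fall back to time.Now().UTC() whenever the second return value is false, which keeps the current behavior for lines without the tag.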
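Here is a minimal sketch of what that hand-off could look like without committing to a broker library. Publisher and memoryPublisher are hypothetical names; a real RabbitMQ client would sit behind the same interface. JSON is my assumption for the wire format, and the Message struct is repeated only so the example runs on its own.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Message mirrors the struct from earlier in the post.
type Message struct {
	Username    string
	UserId      string
	MessageText string
	IsSub       bool
	IsMod       bool
	IsTurbo     bool
	Channel     string
	DateSent    time.Time
}

// Publisher is a hypothetical interface; a broker client would implement it.
type Publisher interface {
	Publish(body []byte) error
}

// memoryPublisher stands in for a real broker and keeps payloads in memory.
type memoryPublisher struct {
	bodies [][]byte
}

func (m *memoryPublisher) Publish(body []byte) error {
	m.bodies = append(m.bodies, body)
	return nil
}

// send encodes the message as JSON and passes it to the publisher.
func send(p Publisher, m Message) error {
	body, err := json.Marshal(m)
	if err != nil {
		return err
	}
	return p.Publish(body)
}

func main() {
	pub := &memoryPublisher{}
	msg := Message{
		Username:    "SomeUser",
		MessageText: "he streamed yesterday",
		Channel:     "chessbrah",
		DateSent:    time.Now().UTC(),
	}
	if err := send(pub, msg); err != nil {
		fmt.Println("publish failed:", err)
		return
	}
	fmt.Println(string(pub.bodies[0]))
}
```

Hiding the broker behind a one-method interface keeps the parsing worker testable without a running queue.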

Thanks for reading. If I make another post about this project, I will add a link to it here.
