TCP Stream Aggregator

The TCP stream aggregator is a short Python script that speeds up the analysis of pcap files captured with Wireshark. It generates a plain text file containing the contents of every TCP connection in the capture. This file is easy to review and search from the command line using Vim or nano. Essentially, the script automates pulling the stream data out of the pcap file and cleans it up for readability. Without further delay, let’s examine the script line by line.

#!/usr/bin/python3
import sys
from subprocess import PIPE, Popen

def commandLine(command):
    process = Popen(
        args = command,
        stdout = PIPE,
        shell = True,
        universal_newlines = True
    )
    return process.communicate()[0]

pcapFile = sys.argv[1]
outputFileName = pcapFile.split('.')[0]+'_stream.txt'
outputFile = open(outputFileName,'w')
count = 0
while True:
    commandLineInput = 'tshark -r %s -z follow,tcp,ascii,%s'%(pcapFile,count)
    stream = commandLine(commandLineInput)
    stream = stream.split('===================================================================\n')[1]
    stream += '\n----------------------------\n'
    if 'Node 0: :0' not in stream:
        outputFile.write(stream)
    else:
        break
    count += 1
outputFile.close()

First, let’s examine the libraries and the “commandLine()” function, as they are interrelated. These imported libraries enable us to define the “commandLine(command)” function, which takes a command string, runs it as a subprocess using Popen, and returns the command’s output for use in the rest of the script.

import sys
from subprocess import PIPE, Popen

def commandLine(command):
    process = Popen(
        args = command,
        stdout = PIPE,
        shell = True,
        universal_newlines = True
    )
    return process.communicate()[0]

Now, let’s focus on the Popen call, which we assign to the variable “process” for convenience. We pass the “command” variable as the args parameter, so it is the command that gets run. We set stdout to PIPE, a special value that tells Popen to open a pipe to the child process so that our script can read its output instead of letting it print to the terminal. Setting shell to True runs the command string through the system shell, just as if we had typed it at the prompt. Finally, setting universal_newlines to True opens the pipe in text mode, so the output comes back as a regular string (with line endings normalized to “\n”) rather than as raw bytes.

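Finally, the function returns process.communicate()[0]. The communicate() method waits for the command to finish and returns a tuple of two items: (stdout, stderr). We only want the standard output, so we take position 0 of the tuple. Because we did not redirect stderr to a pipe, the second item is None here, and any error messages from the command go straight to the terminal.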
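To make the (stdout, stderr) tuple concrete, here is a small sketch that captures both streams. The helper name run_and_capture is hypothetical and not part of the original script. It runs the Python interpreter itself as the child process, so the example works without tshark installed.

```python
import sys
from subprocess import PIPE, Popen


def run_and_capture(args):
    """Run a command and return (stdout, stderr, exit_code) as text."""
    process = Popen(
        args,
        stdout=PIPE,              # capture standard output
        stderr=PIPE,              # also capture standard error (the original script does not)
        universal_newlines=True,  # text mode: str instead of bytes
    )
    out, err = process.communicate()  # waits for the child process to exit
    return out, err, process.returncode


# Hypothetical child command that writes one line to each stream.
child_code = "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"
out, err, code = run_and_capture([sys.executable, "-c", child_code])
print(repr(out), repr(err), code)
```

Capturing stderr and the exit code is optional, but it lets a script tell a failed tshark run apart from a capture that simply has no streams.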

In summary, this function is relatively straightforward. It takes a command as input and executes it in the shell. The output of that command is then processed by the rest of the script. Let’s delve into the details.

pcapFile = sys.argv[1]
outputFileName = pcapFile.split('.')[0]+'_stream.txt'
outputFile = open(outputFileName,'w')
count = 0

We imported the sys library so that the pcap file can be chosen when the script is run, avoiding the need to hardcode the file name within the script. To achieve this, we use sys.argv, a list provided by the sys module that holds the command-line arguments. sys.argv[0] holds the script name, so sys.argv[1] is the second item on the command line, that is, the first argument after the script name. For example, if we input the following in the terminal:

myPythonScript.py Somefile.pcap

“Somefile.pcap” would be the second item, and sys.argv[1] would yield its value.
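As a rough sketch of how the argument list is laid out, the hypothetical helper below (get_pcap_path is not in the original script) takes an argv-style list and returns the file name, exiting with a usage message when no file was given. In the real script you would pass it sys.argv.

```python
def get_pcap_path(argv):
    """Return the pcap path from an argv-style list, or exit with a usage message."""
    if len(argv) < 2:
        # argv[0] is always the script name, so fewer than 2 items means no file was given.
        raise SystemExit("usage: %s <file.pcap>" % argv[0])
    return argv[1]


# Simulated command line: "myPythonScript.py Somefile.pcap"
print(get_pcap_path(["myPythonScript.py", "Somefile.pcap"]))
```

Without a check like this, running the script with no arguments fails with an IndexError on sys.argv[1].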

Since this script analyzes pcap files, we store sys.argv[1] in the variable “pcapFile.” The next line of the script creates the name of the output file. To differentiate between the input and output files easily, we take the part of the file name before the first “.”, then append “_stream.txt” to it.

Next, we open the output file specified above, using the ‘w’ parameter to overwrite existing content or create a new file if it doesn’t exist. Lastly, we initialize the variable “count” to 0, which serves as a counter later in the script.
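One caveat with taking everything before the first “.” is that it misbehaves for paths such as ./captures/my.capture.pcap. A possible alternative, shown here only as a sketch, is os.path.splitext from the standard library, which removes just the final extension. The function name stream_output_name is an assumption for illustration.

```python
import os


def stream_output_name(pcap_path):
    """Replace the final extension of pcap_path with '_stream.txt'."""
    root, _extension = os.path.splitext(pcap_path)
    return root + "_stream.txt"


print(stream_output_name("Somefile.pcap"))
print(stream_output_name("./captures/my.capture.pcap"))

# For comparison, the first-dot approach yields an empty string for the same path,
# because the path itself begins with a dot.
print(repr("./captures/my.capture.pcap".split(".")[0]))
```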

while True:
    commandLineInput = 'tshark -r %s -z follow,tcp,ascii,%s'%(pcapFile,count)
    stream = commandLine(commandLineInput)
    stream = stream.split('===================================================================\n')[1]
    stream += '\n----------------------------\n'
    if 'Node 0: :0' not in stream:
        outputFile.write(stream)
    else:
        break
    count += 1
outputFile.close()

In the final section of the script, we employ a while loop. Since we don’t know the number of TCP connections in the pcap file, we need the script to continue running until all connections have been processed.

The first line of the while loop is where the real work begins. We define “commandLineInput” as the string “tshark -r %s -z follow,tcp,ascii,%s”. Note that the value after -z must not contain spaces, because the shell would split it into separate arguments. This string is passed to our commandLine function, as explained earlier. The command runs “tshark,” the command-line version of Wireshark, which reads our pcap file and prints the ASCII representation of a TCP connection captured in it.

Breaking it down further, “-r %s” instructs tshark to read the file named by the placeholder “%s.” The values for the two “%s” placeholders are supplied, in order, at the end of the line by “%(pcapFile,count).” The first “%s” is therefore replaced by the pcapFile variable, so tshark reads whichever pcap file we named when running the script (“Somefile.pcap” in “myPythonScript.py Somefile.pcap”).

The “-z follow,tcp,ascii,%s” option asks tshark to follow a single TCP stream and print its contents as ASCII text. The second placeholder is filled by the count variable, which is the index of the stream we want from the pcap file. As the program executes, count starts at 0 for the first stream and increments, moving from one TCP connection to the next until all connections have been processed.
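The command can also be built as a list of arguments instead of a single string. The sketch below is an assumption about how one might do that, not the author’s method: build_follow_command is a hypothetical name, and the -q flag (a real tshark option that suppresses the per-packet listing when reading a file) is an addition the original script does not use.

```python
def build_follow_command(pcap_file, stream_index):
    """Return the tshark invocation for one TCP stream as an argument list."""
    return [
        "tshark",
        "-r", pcap_file,                             # read packets from this capture file
        "-q",                                        # assumption: skip the per-packet listing
        "-z", "follow,tcp,ascii,%d" % stream_index,  # one comma-separated value, no spaces
    ]


print(build_follow_command("Somefile.pcap", 0))
```

A list like this can be passed to Popen without shell=True, which keeps file names containing spaces or shell metacharacters from being reinterpreted by the shell.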

    stream = commandLine(commandLineInput)
    stream = stream.split('===================================================================\n')[1]
    stream += '\n----------------------------\n'

In the subsequent section, we save the ASCII output from tshark in the variable “stream.” The output for each TCP connection contains more than we need. We are only interested in the content that follows the first long line of “=” symbols. Using the string’s split method, we break the output into a list at each “=” line and keep the element at index 1, which is the stream content between the opening and closing “=” lines. We then append a line of dashes, surrounded by newlines, to act as a delimiter between streams in the output file.
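The split step can be tried on its own without tshark. In the sketch below, sample_output is hand-written text that only approximates the layout of tshark’s follow output (the addresses and payload are made up), and extract_stream_body is a hypothetical helper that returns None instead of raising IndexError when no “=” line is present.

```python
SEPARATOR = "===================================================================\n"


def extract_stream_body(tshark_output):
    """Return the text after the first separator line, or None if there is none."""
    parts = tshark_output.split(SEPARATOR)
    if len(parts) < 2:
        return None  # e.g. tshark failed and printed nothing to stdout
    return parts[1]


# Hand-written sample, not real tshark output.
sample_output = (
    "1 0.000000 10.0.0.5 10.0.0.9 TCP 74 50000 80 [SYN]\n"
    + SEPARATOR
    + "Follow: tcp,ascii\n"
    + "Filter: tcp.stream eq 0\n"
    + "Node 0: 10.0.0.5:50000\n"
    + "Node 1: 10.0.0.9:80\n"
    + "GET / HTTP/1.1\n"
    + SEPARATOR
)

body = extract_stream_body(sample_output)
print(body + "\n----------------------------\n")
```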

    if 'Node 0: :0' not in stream:
        outputFile.write(stream)
    else:
        break
    count += 1
outputFile.close()

Now, for the home stretch. We employ a simple if statement within the while loop to decide when to write the tshark output to the output file and when to stop. The if statement searches for the string “Node 0: :0,” which tshark prints in place of a real address when asked for a stream index that does not exist, meaning there are no more connections to process. If the string is NOT found, the output is written to the file. Otherwise, if the string IS found, the loop is terminated.

Finally, we increment the count variable if we are going to iterate through the loop again. When we are no longer in the while loop (indicating the end of all streams), we close the outputFile using the “outputFile.close()” command, ensuring the newly created and formatted TCP stream file is ready for review.
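The loop logic can be exercised without a capture file by passing in the function that runs the command. This is a sketch under assumptions: collect_streams and fake_tshark are hypothetical names, and fake_tshark returns hand-written text that pretends the capture holds exactly two streams.

```python
SEPARATOR = "===================================================================\n"
END_MARKER = "Node 0: :0"
DELIMITER = "\n----------------------------\n"


def collect_streams(run_command, pcap_file):
    """Yield each stream's text until the end marker (or empty output) appears."""
    count = 0
    while True:
        output = run_command("tshark -r %s -z follow,tcp,ascii,%s" % (pcap_file, count))
        parts = output.split(SEPARATOR)
        if len(parts) < 2 or END_MARKER in parts[1]:
            return  # no such stream, or tshark produced no follow output
        yield parts[1] + DELIMITER
        count += 1


def fake_tshark(command):
    """Stand-in for commandLine(): pretends streams 0 and 1 exist."""
    index = int(command.rsplit(",", 1)[1])
    node = "10.0.0.5:%d" % (50000 + index) if index < 2 else ":0"
    return SEPARATOR + "Follow: tcp,ascii\nNode 0: %s\n" % node + SEPARATOR


streams = list(collect_streams(fake_tshark, "Somefile.pcap"))
print(len(streams))
```

In the real script, commandLine would take the place of fake_tshark, and the yielded strings could be written with outputFile.writelines(...) inside a with open(...) block so the file is closed even if an error occurs.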

This program covers a lot of ground, but there are many more ways to customize the script to suit specific needs. For instance, within the tshark call, we can name the two endpoints (IP address and port) of the connection we want to focus on. The modified option would resemble the following:

-z "follow,tcp,ascii,200.57.7.197:32891,200.57.7.198:2906"

Rather than analyzing every stream and outputting them all, we can ask tshark for only the stream between a specific pair of IP address and port endpoints. To avoid hardcoding the endpoints, we would introduce an additional argument when running the script in the terminal. The modified terminal input would resemble the following:

myPythonScript.py Somefile.pcap 127.0.0.1:1,127.0.0.1:100

To accommodate this change, we would use sys.argv[2] to capture the endpoint pair. In the script, we would replace the second placeholder value (count) with “endpoints” or a similar variable. This works because the last field of tshark’s -z follow option accepts either a stream index (count in our case) or a pair of ip:port endpoints. Other possible customizations include searching the streams for specific keywords and reporting only the streams that contain them.
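One way to support both forms is to validate the extra argument before building the -z value. The sketch below is an illustration only: follow_spec and the ENDPOINT_PAIR pattern are assumptions, and the pattern is deliberately loose (it checks the ip:port,ip:port shape, not whether the addresses are valid).

```python
import re

# Loose shape check: host:port,host:port with no spaces or extra commas.
ENDPOINT_PAIR = re.compile(r"^[^,\s]+:\d+,[^,\s]+:\d+$")


def follow_spec(selector):
    """Build the -z value from a stream index (e.g. "3") or an endpoint pair."""
    if selector.isdigit() or ENDPOINT_PAIR.match(selector):
        return "follow,tcp,ascii," + selector
    raise ValueError("expected a stream index or ip:port,ip:port, got %r" % selector)


print(follow_spec("0"))
print(follow_spec("200.57.7.197:32891,200.57.7.198:2906"))
```

Rejecting anything else up front also avoids passing arbitrary text from the command line into a shell command.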

Thank you for reading. If you would like to know more about tshark or the libraries used here, please see the following links for their documentation.

tshark – https://www.wireshark.org/docs/man-pages/tshark.html

subprocess, Popen – https://docs.python.org/3/library/subprocess.html#subprocess.Popen

sys – https://www.tutorialsteacher.com/python/sys-module
