Package sentspace

Sentspace 0.0.2 (C) 2020-2022 EvLab, MIT BCS. All rights reserved.

Homepage: https://sentspace.github.io/sentspace

For questions, email:

{gretatu,asathe} @ mit.edu

sentspace

About

sentspace helps users understand how their linguistic stimuli (specifically, sentences) are distributed relative to large corpora, using a collection of psycholinguistic datasets and linguistic features. Imagine you have collected a set of sentences for use in a language experiment, or generated sentences using an artificial neural network language model. How does your set of sentences compare to naturally occurring sentences? Along which dimensions do your sentences deviate from the norm? sentspace provides numerical estimates of these values and lets you visualize the resulting high-dimensional space in a web-based application.

Online interface: http://sentspace.github.io/hosted

Screencast video demo: https://youtu.be/a66_nvcCakw

CLI usage demo: see the demo image in the project README.

Documentation

Documentation is generated using pdoc3 and is available online at the project homepage linked above.

Usage

1. CLI

Example: get lexical and embedding features for stimuli from a CSV file containing 'sentence' and 'index' columns (an invocation sketch follows the help output below).

$ python3 -m sentspace -h
usage: 


positional arguments:
  input_file            path to input file or a single sentence. If supplying a file, it must be .csv .txt or .xlsx, e.g., example/example.csv

optional arguments:
  -h, --help            show this help message and exit
  -sw STOP_WORDS, --stop_words STOP_WORDS
                        path to delimited file of words to filter out from analysis, e.g., example/stopwords.txt
  -b BENCHMARK, --benchmark BENCHMARK
                        path to csv file of benchmark corpora For example benchmarks/lexical/UD_corpora_lex_features_sents_all.csv
  -p PARALLELIZE, --parallelize PARALLELIZE
                        use multiple threads to compute features? disable using `-p False` in case issues arise.
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        path to output directory where results may be stored
  -of {pkl,tsv}, --output_format {pkl,tsv}
  -lex LEXICAL, --lexical LEXICAL
                        compute lexical features? [False]
  -syn SYNTAX, --syntax SYNTAX
                        compute syntactic features? [False]
  -emb EMBEDDING, --embedding EMBEDDING
                        compute high-dimensional sentence representations? [False]
  -sem SEMANTIC, --semantic SEMANTIC
                        compute semantic (multi-word) features? [False]
  --emb_data_dir EMB_DATA_DIR
                        path to directory containing pre-downloaded embedding data (e.g., GloVe vectors)
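
If you drive the CLI from a script, the example described above might look like the following. This is a sketch: the 0/1 flag values follow the convention shown under Submodules below, example/example.csv is the bundled example file, and the output directory is an arbitrary choice.

import subprocess

# request lexical and embedding features for the bundled example CSV;
# results are written as TSV files under the chosen output directory
subprocess.run(
    ["python3", "-m", "sentspace",
     "-lex", "1", "-emb", "1",   # enable the lexical and embedding submodules
     "-of", "tsv",               # output format: tsv or pkl
     "-o", "out",                # output directory
     "example/example.csv"],
    check=True,
)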

2. As a library

Example: get embedding features in a script

import sentspace

s = sentspace.Sentence('The person purchased two mugs at the price of one.')
emb_features = sentspace.embedding.get_features(s)

Example: parallelize getting features for multiple sentences using multithreading

import sentspace

sentences = [
    'Hello, how may I help you today?',
    'The person purchased three mugs at the price of five!',
    "She's leaving home.",
    'This is an example sentence we want features of.'
             ]

# construct sentspace.Sentence objects from strings
sentences = [*map(sentspace.Sentence, sentences)]
# make use of parallel processing to get lexical features for the sentences
lex_features = sentspace.utils.parallelize(sentspace.lexical.get_features, sentences,
                                           wrap_tqdm=True, desc='Computing lexical features')

Installing

1. Install using Conda and Poetry

Prerequisites: conda

  1. Use your own conda environment or create a new one: conda create -n sentspace-env python=3.8 (if using your own, we will assume your environment is called sentspace-env)
  2. Activate it: conda activate sentspace-env
  3. Install poetry: curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python -
  4. Install polyglot dependencies using conda: conda install -c conda-forge pyicu morfessor icu -y
  5. Install the remaining packages using poetry: poetry install

If the installation gives you trouble after the above steps, you may need to refer to the polyglot install instructions, which describe how to obtain ICU, a dependency of polyglot.

To use sentspace after installation, make sure the conda environment is active and all packages are up to date (poetry install).

2. Install using a pre-built container (Singularity or Docker)

Requirements: Singularity or Docker.

Singularity:

singularity shell docker://aloxatel/sentspace:latest

Alternatively, run bash singularity-shell.sh from the root of the repo. This step can take a while the first time you run it, as it needs to download the image from Docker Hub and convert it to the Singularity image format (.sif); subsequent runs start quickly.

Docker: use the corresponding Docker commands with the same image (aloxatel/sentspace:latest).

Now you are inside the container and ready to run sentspace!

3. Manual install (use as a last resort)

On Debian/Ubuntu-like systems, follow the steps below. On other systems (RHEL, etc.), substitute commands and package names with appropriate alternates.

# optional (but recommended): 
# create a virtual environment using your favorite method (venv, conda, ...) 
# before any of the following

# install basic packages using apt (you likely already have these)
sudo apt update
sudo apt install python3.8 python3.8-dev python3-pip
sudo apt install python2.7 python2.7-dev 
sudo apt install build-essential git

# install ICU (Python bindings); pass the environment variables through sudo so apt sees them
sudo DEBIAN_FRONTEND="noninteractive" TZ="America/New_York" apt install python3-icu

# install ZS package separately (pypi install fails)
python3.8 -m pip install -U pip cython
git clone https://github.com/njsmith/zs
cd zs && git checkout v0.10.0 && pip install .

# install rest of the requirements using pip
cd .. # make sure you're in the sentspace/ directory
pip install -r ./requirements.txt
polyglot download morph2.en
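
As a quick sanity check after any of the install methods above (a minimal sketch, assuming the steps completed without errors; any short sentence works):

import sentspace

# construct a Sentence object and inspect its tokens;
# if this runs without errors, the core package is importable and functional
s = sentspace.Sentence('The installation works.')
print(s.tokens)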

Submodules

In general, each submodule implements a major class of features. You can run each module on its own by passing its flag with a value of 0 or 1:

python -m sentspace -lex {0,1} -syn {0,1} -emb {0,1} <input_file_path>

sentspace.lexical

Obtain lexical (word-level) features that do not depend on sentence context. These features are returned on a word-by-word level and also averaged at the sentence level so that each sentence has a corresponding value. Examples:

  • typical age of acquisition
  • n-gram surprisal, n = {1, 2, 3, 4}
  • etc. (a comprehensive list will be added)
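
Below is a minimal sketch of using this submodule directly. It mirrors how the pipeline source further down this page flattens the per-token feature dicts and averages them per sentence; the example sentence is arbitrary.

import pandas as pd
import sentspace

s = sentspace.Sentence('The person purchased two mugs at the price of one.')
token_features = sentspace.lexical.get_features(s)  # one feature dict per token

# flatten into a table and average the numeric features per sentence,
# as the pipeline does for each batch
token_df = pd.DataFrame(token_features)
sentence_df = token_df.groupby('sentence').mean(numeric_only=True)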

sentspace.syntax

Description pending

sentspace.embedding

Obtain high-dimensional representations of sentences using word-embedding and contextualized encoder models:

  • glove
  • Huggingface model hub (gpt2-xl, bert-base-uncased)
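
Below is a sketch of requesting specific models and pooling methods. The models_and_methods argument and the structure of the returned features mirror how the pipeline source further down this page calls this submodule; treat the exact model names and methods as illustrative.

import sentspace

s = sentspace.Sentence('The person purchased two mugs at the price of one.')
features = sentspace.embedding.get_features(
    s,
    models_and_methods=[({'gpt2-xl'}, {'last'}),
                        ({'bert-base-uncased'}, {'first'})],
)
# features['features'] maps model name -> pooling method -> a flat multi-indexed
# pandas Series with (layer, dim) as the index levels (per the pipeline's comments)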

sentspace.semantic

Multi-word features computed using partial or full sentence context (not implemented yet):

  • PMI (pointwise mutual information)
  • language model-based perplexity/surprisal

Contributing

Any contributions you make are greatly appreciated, and no contribution is too small.

  1. Fork the project on Github (how to fork)
  2. Create your feature/patch branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request (PR) and we will take a look asap!

Whom to contact for help

  • gretatu % mit ^ edu
  • asathe % mit ^ edu

(C) 2020-2022 EvLab, MIT BCS

Expand source code
'''
    ### Sentspace 0.0.2 (C) 2020-2022 [EvLab](evlab.mit.edu), MIT BCS. All rights reserved.

    Homepage: https://sentspace.github.io/sentspace

    For questions, email:
    
    `{gretatu,asathe} @ mit.edu`    

    .. include:: ../README.md
'''

from collections import defaultdict
from pathlib import Path

import sentspace.utils as utils
import sentspace.syntax as syntax
import sentspace.lexical as lexical
import sentspace.embedding as embedding

from sentspace.Sentence import Sentence

import pandas as pd
from functools import reduce 
from itertools import chain
from tqdm import tqdm


def run_sentence_features_pipeline(input_file: str, stop_words_file: str = None,
                                   benchmark_file: str = None, output_dir: str = None,
                                   output_format: str = None, batch_size: int = 2_000,
                                   process_lexical: bool = False, process_syntax: bool = False,
                                   process_embedding: bool = False, process_semantic: bool = False,
                                   parallelize: bool = True,
                                   # preserve_metadata: bool = True,
                                   syntax_port: int = 8000,
                                   limit: float = float('inf'), offset: int = 0,
                                   emb_data_dir: str = None) -> Path:
    """
    Runs the full sentence features pipeline on the given input according to
    requested submodules (currently supported: `lexical`, `syntax`, `embedding`,
    indicated by boolean flags).
        
    Returns an instance of `Path` pointing to the output directory resulting from this
    run of the full pipeline. The output directory contains Pickled or TSVed pandas 
    DataFrames containing the requested features.


    Args:
        input_file (str): path to input text file containing sentences
                            one per line [required]
        stop_words_file (str): path to text file containing stopwords to filter
                                out, one per line [optional]
        benchmark_file (str): path to a file containing a benchmark corpus to
                                compare the current input against; e.g. UD [optional]
        
        {lexical,syntax,embedding,semantic,...} (bool): compute submodule features? [False]
    """

    # lock = multiprocessing.Manager().Lock()

    # create output folder
    utils.io.log('creating output folder')
    output_dir = utils.io.create_output_paths(input_file,
                                              output_dir=output_dir,
                                              stop_words_file=stop_words_file)
    config_out = (output_dir / 'this_session_log.txt')
    # with config_out.open('a+') as f:
    #     print(args, file=f)

    utils.io.log('reading input sentences')
    sentences = utils.io.read_sentences(input_file, stop_words_file=stop_words_file)
    utils.io.log('---done--- reading input sentences')


    for part, sentence_batch in enumerate(tqdm(utils.io.get_batches(sentences, batch_size=batch_size, 
                                                                    limit=limit, offset=offset), 
                                               desc='processing batches', total=len(sentences)//batch_size+1)):
        sentence_features_filestem = f'sentence-features_part{part:0>4}'
        token_features_filestem = f'token-features_part{part:0>4}'


        ################################################################################
        #### LEXICAL FEATURES ##########################################################
        ################################################################################
        if process_lexical:
            utils.io.log('*** running lexical submodule pipeline')
            _ = lexical.utils.load_databases(features='all')

            if parallelize:
                lexical_features = utils.parallelize(lexical.get_features, sentence_batch,
                                                     wrap_tqdm=True, desc='Lexical pipeline')
            else:
                lexical_features = [lexical.get_features(sentence)
                                    for _, sentence in enumerate(tqdm(sentence_batch, desc='Lexical pipeline'))]

            lexical_out = output_dir / 'lexical'
            lexical_out.mkdir(parents=True, exist_ok=True)
            utils.io.log(f'outputting lexical token dataframe to {lexical_out}')

            # lexical is a special case since it returns dicts per token (rather than per sentence)
            # so we want to flatten it so that pandas creates a sensible dataframe from it.
            token_df = pd.DataFrame(chain.from_iterable(lexical_features))

            if output_format == 'tsv':
                token_df.to_csv(lexical_out / f'{token_features_filestem}.tsv', sep='\t', index=True)
                token_df.groupby('sentence').mean().to_csv(lexical_out / f'{sentence_features_filestem}.tsv', sep='\t', index=True)
            elif output_format == 'pkl':
                token_df.to_pickle(lexical_out / f'{token_features_filestem}.pkl.gz', protocol=5)
                token_df.groupby('sentence').mean().to_pickle(lexical_out / f'{sentence_features_filestem}.pkl.gz', protocol=5)
            else:
                raise ValueError(f'output format {output_format} not known')

            utils.io.log(f'--- finished lexical pipeline')


        ################################################################################
        #### SYNTAX FEATURES ###########################################################
        ################################################################################
        if process_syntax:
            utils.io.log('*** running syntax submodule pipeline')

            syntax_features = [syntax.get_features(sentence._raw, dlt=True, left_corner=True, identifier=sentence.uid,
                                                   syntax_port=syntax_port)
                                                                        # !!! TODO:DEBUG
                               for i, sentence in enumerate(tqdm(sentence_batch, desc='Syntax pipeline'))] 

            # put all features in the sentence df except the token-level ones
            token_syntax_features = {'dlt', 'leftcorner'}
            sentence_df = pd.DataFrame([{k: v for k, v in feature_dict.items() if k not in token_syntax_features}
                                        for feature_dict in syntax_features], index=[s.uid for s in sentence_batch])

            # output gives us dataframes corresponding to each token-level feature. we need to combine these
            # into a single dataframe
            # we use functools.reduce to apply the pd.concat function to all the dataframes and join dataframes
            # that contain different features for the same tokens
            token_dfs = [reduce(lambda x, y: pd.concat([x, y], axis=1, sort=False),
                                (v for k, v in feature_dict.items() if k in token_syntax_features))
                        for feature_dict in syntax_features]

            for i, df in enumerate(token_dfs):
                token_dfs[i]['index'] = df.index
            #     token_dfs[i].reset_index(inplace=True)

            dicts = [{k: v[list(v.keys())[0]] for k, v in df.to_dict().items()} for df in token_dfs]
            token_df = pd.DataFrame(dicts)
            token_df.index = token_df['index']
            # by this point we have merged dataframes with tokens along a column (rather than just a sentence)
            # now we need to stack them on top of each other to have all tokens across all sentences in a single dataframe
            # token_df = reduce(lambda x, y: pd.concat([x.reset_index(drop=True), y.reset_index(drop=True)]), token_dfs)
            # token_df = token_df.loc[:, ~token_df.columns.duplicated()]

            syntax_out = output_dir / 'syntax'
            syntax_out.mkdir(parents=True, exist_ok=True)
            utils.io.log(f'outputting syntax dataframes to {syntax_out}')

            if output_format == 'tsv':
                sentence_df.to_csv(syntax_out / f'{sentence_features_filestem}.tsv', sep='\t', index=True)
                token_df.to_csv(syntax_out / f'{token_features_filestem}.tsv', sep='\t', index=True)
            elif output_format == 'pkl':
                sentence_df.to_pickle(syntax_out / f'{sentence_features_filestem}.pkl.gz', protocol=5)
                token_df.to_pickle(syntax_out / f'{token_features_filestem}.pkl.gz', protocol=5)
            else:
                raise ValueError(f'unknown output format {output_format}')

            utils.io.log(f'--- finished syntax pipeline')
            

        # Calculate PMI
        # utils.GrabNGrams(sent_rows,pmi_paths)
        # utils.pPMI(sent_rows, pmi_paths)
        # pdb.set_trace()


        ################################################################################
        #### EMBEDDING FEATURES ########################################################
        ################################################################################            
        if process_embedding:
            utils.io.log('*** running embedding submodule pipeline')

            models_and_methods = [
                        # ({'glove.840B.300d'}, {'mean', 'median'}), 
                        # 'distilgpt2',
                        ({'gpt2-xl'}, {'last'}),
                        ({'bert-base-uncased'}, {'first'}),
                     ]
            
            vocab = None
            # do any of the requested models require static word embeddings (glove / word2vec)?
            if any('glove' in model or 'word2vec' in model for models, _ in models_and_methods for model in models):
                # get a vocabulary across all sentences given as input
                # as the first step, remove any punctuation from the tokens
                stripped_tokens = utils.text.strip_words(chain(*[s.tokens for s in sentence_batch]), method='punctuation')
                # assemble a set of unique tokens
                vocab = set(stripped_tokens)
                # make a spurious function call so that loading glove is cached for subsequent calls
                # TODO allow specifying which glove/w2v version 
                _ = embedding.utils.load_embeddings(emb_file='glove.840B.300d.txt',
                                                    vocab=(*sorted(vocab),),
                                                    data_dir=emb_data_dir)

            if False and parallelize:  # parallelized embedding extraction is currently disabled
                embedding_features = utils.parallelize(embedding.get_features,
                                                       sentence_batch, models_and_methods=models_and_methods,
                                                       vocab=vocab, data_dir=emb_data_dir,
                                                       wrap_tqdm=True, desc='Embedding pipeline')
            else:
                embedding_features = [embedding.get_features(sentence, models_and_methods=models_and_methods, 
                                                             vocab=vocab, data_dir=emb_data_dir)
                                      for i, sentence in enumerate(tqdm(sentence_batch, desc='Embedding pipeline'))]

            # misc. stat: count how many sentences in this batch have NO content words
            # (not to be confused with the number of content words per sentence)
            no_content_words = len(sentence_batch) - sum(any(s.content_words) for s in sentence_batch)

            utils.io.log(f'sentences without any content words: {no_content_words}/{len(sentence_batch)}; {no_content_words/len(sentence_batch):.2f}')

            embedding_out = output_dir / 'embedding'
            embedding_out.mkdir(parents=True, exist_ok=True)
            
            
            # now we want to output stuff from embedding_features (which is returned by the embedding pipeline)
            # into nicely formatted dataframes.
            # the structure of what is returned by the embedding pipeline is like so:
            #   gpt2-xl:
            #       last: [...] flat multiindexed Pandas series with (layer, dim) as the two indices
            #       mean: 
            #   glove:
            #       mean: [...] flat multiindexed Pandas series with trivially a single layer and 300d, so (1, 300) as the two indices
            # etc.

            # map each model in use to the set of pooling methods computed for it
            all_models_methods = {model_name: feature_dict['features'][model_name].keys() 
                                  for feature_dict in embedding_features 
                                    for model_name in feature_dict['features']}

            utils.io.log(f'models and pooling methods in use: {all_models_methods}')

            # we want to output BY MODEL
            for model_name in all_models_methods:
                # and BY METHOD
                for method in all_models_methods[model_name]:
                    # each `feature_dict` corresponds to ONE sentence

                    collected = []
                    for feature_dict in embedding_features:

                        # all the keys that contain information such as the sentence, UID, filters used etc,
                        # except for the actual representations obtained from various models.
                        # we need to know this so we can package all this information together with the outputs by model and method
                        metadata_keys = {*feature_dict.keys()} - {'features'} # setminus operator
                        # make a copy of the feature_dict for this sentence excluding the representations themselves
                        meta_df = {key: feature_dict[key] for key in metadata_keys}
                        meta_df.update({'model_name': model_name, 'aggregation': method})
                        meta_df = pd.DataFrame(meta_df, index=[feature_dict['index']])
                        meta_df.columns = pd.MultiIndex.from_product([['metadata'], meta_df.columns, ['']])

                        # model_name -> method -> reprs
                        pooled_reprs = feature_dict['features']
                        flattened_repr = pooled_reprs[model_name][method]

                        collected += [pd.concat([meta_df, flattened_repr], axis=1)]

                    # create further subdirectories by model and aggregation method
                    (embedding_out / model_name / method).mkdir(parents=True, exist_ok=True)

                    sentence_df = pd.concat(collected, axis=0)

                    utils.io.log(f'outputting embedding dataframes for {model_name}-{method} to {embedding_out}')
                    if output_format == 'tsv':
                        sentence_df.to_csv(embedding_out / model_name / method / f'{sentence_features_filestem}.tsv', sep='\t', index=True)
                        # token_df.to_csv(embedding_out / f'{token_features_filestem}.tsv', sep='\t', index=False)
                    elif output_format == 'pkl':
                        sentence_df.to_pickle(embedding_out / model_name / method / f'{sentence_features_filestem}.pkl.gz', protocol=5)
                        # token_df.to_pickle(embedding_out / f'{token_features_filestem}.pkl.gz', protocol=5)


            utils.io.log(f'--- finished embedding pipeline')

        # Plot input data to benchmark data
        #utils.plot_usr_input_against_benchmark_dist_plots(df_benchmark, sent_embed)

        if process_semantic:
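            # the semantic submodule (PMI, LM-based surprisal) is not implemented yet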
            pass


    ################################################################################
    #### \end{run_sentence_features_pipeline} ######################################
    ################################################################################
    return output_dir

Sub-modules

sentspace.Sentence
sentspace.embedding
sentspace.lexical
sentspace.package_lexical
sentspace.syntax
sentspace.utils
sentspace.vis

Functions

def run_sentence_features_pipeline(input_file: str, stop_words_file: str = None, benchmark_file: str = None, output_dir: str = None, output_format: str = None, batch_size: int = 2000, process_lexical: bool = False, process_syntax: bool = False, process_embedding: bool = False, process_semantic: bool = False, parallelize: bool = True, syntax_port: int = 8000, limit: float = inf, offset: int = 0, emb_data_dir: str = None) ‑> pathlib.Path

Runs the full sentence features pipeline on the given input according to requested submodules (currently supported: sentspace.lexical, sentspace.syntax, sentspace.embedding, indicated by boolean flags).

Returns an instance of Path pointing to the output directory resulting from this run of the full pipeline. The output directory contains Pickled or TSVed pandas DataFrames containing the requested features.

Args

input_file : str
    path to input text file containing sentences, one per line [required]
stop_words_file : str
    path to text file containing stopwords to filter out, one per line [optional]
benchmark_file : str
    path to a file containing a benchmark corpus to compare the current input against; e.g. UD [optional]

{lexical,syntax,embedding,semantic,…} (bool): compute submodule features? [False]
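
A minimal usage sketch (values are illustrative; example/example.csv is the bundled example input, and output_format must be one of the formats the pipeline writes, 'tsv' or 'pkl'):

from sentspace import run_sentence_features_pipeline

# compute lexical and embedding features for the example file and
# write per-batch DataFrames under the returned output directory
out_dir = run_sentence_features_pipeline('example/example.csv',
                                         output_format='tsv',
                                         process_lexical=True,
                                         process_embedding=True)
print(out_dir)  # pathlib.Path to the output directory for this run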
