Python One-Liner: Getting Only the First Match in a List Comprehension

Python’s list comprehensions are great, but I’ve found a new (to me) use for them: iterating over a list and returning the first match when there might be multiple possible matches. (To be more accurate, my solution uses a generator expression rather than a list comprehension.)

In other words, I’m emulating a loop that breaks on the first match. In code, that is:

val = None
for x in some_list:
    if match(x):
        val = x
        break

I could do this with a list comprehension and take the element at index 0:

val = [x for x in some_list if match(x)][0]

…but that means the whole list is created, and what if the list is large and/or the match() function is expensive? I’d really like to just stop looping when I’ve got a match. Generator expressions come in handy here:

# The built-in next() works in Python 2.6+ and 3; the .next() method is Python 2 only.
val = next(x for x in some_list if match(x))

Also note that in Python 3, functions like filter() return a lazy iterator (a filter object rather than a generator, but it is consumed the same way), so this is possible:

val = next(filter(match, some_list))

Now I’ve got the first matching value, and I don’t have to match against every item in the list. Hooray for functional Python.

Update:
As pointed out by Emanuel Hoogeveen in the comments, my solution above only works if there is a match, otherwise it raises a StopIteration exception. The following provides a default value if there is no match (from StackOverflow via Emanuel):

val = next((x for x in some_list if match(x)), None)
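
To see that the generator version really stops at the first match, here is a small sketch. The helper name `first_match` and the sample data are made up for illustration; the only technique used is the `next(generator, default)` idiom from above.

```python
def first_match(items, predicate, default=None):
    """Return the first item for which predicate(item) is true, else default."""
    return next((x for x in items if predicate(x)), default)


checked = []  # records which items the predicate was actually called on


def is_even(n):
    checked.append(n)
    return n % 2 == 0


print(first_match([1, 3, 4, 5, 6], is_even))  # 4
print(checked)                                # [1, 3, 4]
print(first_match([1, 3], is_even))           # None
```

The second print shows that 5 and 6 were never tested, which is the whole point when match() is expensive.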
Posted in Programming

RTSP Stream Dumping with VLC

I’m taking a language class at my university, and there are sound files online that we use for our homework. The sound files are streamable through some embedded Quicktime player, without the option for a direct download. So I just need to install some Quicktime plugin for my browse–screw that, I’m gonna find a way to download the files.

So the URL for the page I can stream files from looks like this:

http://.../chinese/CHI_030/index_textbook.php

So I tried removing the “index_textbook.php” part, and luckily they didn’t block me from getting a directory listing. I see there’s a “mov/” subdirectory, so I go there, and voila! all the files right in front of me. But why are they .mov files? Aren’t they supposed to be audio files? I download one and try opening it. Each time I open it, it has to buffer for a few seconds. Moreover, the file is only 84 bytes. Ah, so the .mov file is probably just a container with a reference to some network resource. I open it to find out:

rtsptext rtsp://.../chinese//CHI_030/mov/CHI_030_018.mov

Yep, that was it. But what is RTSP? Just some streaming protocol; nothing interesting. Now how do I get the actual audio data? Here’s where VLC comes to the rescue. I look up some documentation on dumping streams. Let’s try a simple command (note, the “vlc://quit” at the end makes sure it terminates when it’s done, otherwise it will hang, waiting for more input):

cvlc rtsp://.../chinese//CHI_030/mov/CHI_030_018.mov --sout=file/mov:test.mov vlc://quit

(a few minutes later)

And there it is, an audio file! The first time I got a file with lots of skips and gaps, so be sure that you dump the stream as it is stored, and transcode later. In this case, the file was an mp4 audio file wrapped in a mov container. If that wasn’t the problem, try running the command again and make sure nothing else on your computer is using a lot of bandwidth.
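
Since there was a whole directory of these reference files, the two manual steps (read the URL out of the tiny .mov file, then build the cvlc command) could be scripted. This is only a sketch under assumptions: the function names are mine, the example URL is a placeholder, the reference file is assumed to look like the one shown above ("rtsptext" followed by the URL), and the cvlc arguments are copied from the command above rather than checked against other VLC versions.

```python
def parse_reference(data):
    """Extract the RTSP URL from the bytes of an 'rtsptext' reference file."""
    parts = data.decode("ascii", errors="replace").split()
    if len(parts) < 2 or parts[0].lower() != "rtsptext":
        raise ValueError("not an rtsptext reference file")
    return parts[1]


def dump_command(url, out_path):
    """Build the cvlc argument list used above to dump a stream to a file."""
    return ["cvlc", url, "--sout=file/mov:" + out_path, "vlc://quit"]


# Placeholder reference file contents; the real URL is not reproduced here.
reference = b"rtsptext rtsp://example.invalid/audio/lesson.mov\n"
url = parse_reference(reference)
print(dump_command(url, "lesson.mov"))
```

The resulting list can be handed to subprocess.run() once per file; I left that call out so the sketch doesn't need VLC installed to run.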

Posted in Linux, Software

Time/date conversion at the command line

When I didn’t trust my mental ability to convert dates and times, I relied on online tools like http://worldtimeserver.org/, but this task can easily be done from the command line (and without an internet connection). For instance, I am now in Taipei, and I want to know what time it will be here when it is 4:30pm in Pacific time. A quick search revealed the following use of the `date` command:

date --date='TZ="America/Los_Angeles" 2012-9-17 16:30'
Tue Sep 18 07:30:00 CST 2012

Because the date is part of the input, the conversion also accounts for daylight saving time. Compare it with:

date --date='TZ="America/Los_Angeles" 2012-12-17 16:30'
Tue Dec 18 08:30:00 CST 2012

Now I see there is an example in the man page for `date`, along with other options.

[UPDATE]

The above only converts from some other timezone to the one your shell is set to. If you want to go the other way around, you need to change the timezone (temporarily) when the command is run. For example, if you are in the Pacific timezone and want to convert a time to Singapore time, do the following (all on one line):

TZ=":Singapore" date --date='TZ="America/Los_Angeles" 2013-04-02 16:30'
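
For comparison, the same conversions can be sketched in Python with the standard-library zoneinfo module (Python 3.9+). The helper name `convert` is mine, "Asia/Taipei" is my assumption for the shell timezone used in the examples above, and the code assumes the system timezone database (or the tzdata package) is available.

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def convert(naive_time, from_zone, to_zone):
    """Interpret naive_time as wall-clock time in from_zone; return it in to_zone."""
    aware = naive_time.replace(tzinfo=ZoneInfo(from_zone))
    return aware.astimezone(ZoneInfo(to_zone))


# The two date examples from above: same clock time, different DST state.
print(convert(datetime(2012, 9, 17, 16, 30), "America/Los_Angeles", "Asia/Taipei"))
# 2012-09-18 07:30:00+08:00
print(convert(datetime(2012, 12, 17, 16, 30), "America/Los_Angeles", "Asia/Taipei"))
# 2012-12-18 08:30:00+08:00
```

Here both zones are explicit arguments, so there is no need to override TZ for the "other direction" case.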
Posted in Linux

Accumulating dictionaries in Python

I often have a need to count tokens in a corpus. In Python, there are many ways to do this, but currently I most often use defaultdicts:

from collections import defaultdict

d = defaultdict(int)
for x in sequence:
    d[x] += 1

I would like to get rid of the for-loop and construct such a dictionary at once. I wrote a dict-derived class to do that, but it can do even more. But first, here is how I would do the above:

d = AccumulationDict(lambda x, y: x + y, [(x, 1) for x in sequence])

That’s it!

Notice how it takes a function as its first parameter. This is similar to how defaultdict takes a callable, but instead of taking a 0-arity callable, AccumulationDict takes a binary function. Whenever it “accumulates” a key-value pair for a key that already exists, the existing value and the new value are passed to this function, and the result becomes the new value in the dictionary. This function will most likely be addition (instead of the lambda expression, one could use operator.add), but it could be anything. If, say, you’re combining the probabilities of multiple events, you could use operator.mul.
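
The merging rule described above can be shown without the class. The following is a minimal sketch, not the AccumulationDict code itself: `accumulate_pairs` is a hypothetical helper that applies the same "combine on key collision" logic to a plain dict, and the sample data is invented.

```python
import operator


def accumulate_pairs(accumulator, pairs):
    """Build a dict from (key, value) pairs, merging repeated keys with accumulator."""
    result = {}
    for key, value in pairs:
        if key in result:
            # Key collision: combine the existing value with the new one.
            result[key] = accumulator(result[key], value)
        else:
            result[key] = value
    return result


tokens = "a b a c a".split()
print(accumulate_pairs(operator.add, [(t, 1) for t in tokens]))
# {'a': 3, 'b': 1, 'c': 1}

events = [("rain", 0.5), ("rain", 0.5), ("sun", 0.25)]
print(accumulate_pairs(operator.mul, events))
# {'rain': 0.25, 'sun': 0.25}
```

Swapping operator.add for operator.mul is the only change between counting and multiplying, which is why the accumulator is a parameter.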

I did not want to break KeyErrors, so accumulating is separate from getting and setting. This means you can still use __setitem__() and update() to reset the values of keys. Accumulation happens in the constructor, in dictionary addition, and with a new accumulate() function. accumulate() is identical to update() in interface, but uses the provided accumulator function to “merge” values when there are key collisions.

The code is below, but it represents a proof-of-concept and could be improved:

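class AccumulationDict(dict):
    """A dict that merges values for repeated keys with a binary function."""

    def __init__(self, accumulator, *args, **kwargs):
        # callable() is the idiomatic check; note that it cannot verify the arity.
        if not callable(accumulator):
            raise TypeError('Accumulator must be a binary function.')
        self.accumulator = accumulator
        self.accumulate(*args, **kwargs)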

    def __additem__(self, key, value):
        if key in self:
            self[key] = self.accumulator(self[key], value)
        else:
            self[key] = value

    def __add__(self, other):
        result = AccumulationDict(self.accumulator, self)
        result.accumulate(other)
        return result

    def accumulate(self, *args, **kwargs):
        for arg in args:
            if isinstance(arg, list):
                for (key, value) in arg:
                    self.__additem__(key, value)
            elif isinstance(arg, dict):
                for (key, value) in arg.items():
                    self.__additem__(key, value)
            else:
                raise TypeError('Argument must be of type list or dict.')
        for key in kwargs:
            self.__additem__(key, kwargs[key])
Posted in Programming, Software

Simple logging in Bash scripts

I couldn’t find much mention of logging utilities for Linux shell scripting (namely Bash), so I wrote my own fairly quickly. I wanted several functions for various levels of logging (info, debug, warning, errors, etc), and a way to adjust what levels can be displayed. I followed the fairly standard convention of using numeric values for these levels and setting a “verbosity” level. If you know of an existing solution for Bash, let me know in the comments. Anyway, here’s the main idea:

exec 3>&2 # logging stream (file descriptor 3) defaults to STDERR
verbosity=2 # default to show warnings
silent_lvl=0
err_lvl=1
wrn_lvl=2
inf_lvl=3
dbg_lvl=4

notify() { log $silent_lvl "NOTE: $1"; } # Always prints
error() { log $err_lvl "ERROR: $1"; }
warn() { log $wrn_lvl "WARNING: $1"; }
inf() { log $inf_lvl "INFO: $1"; } # "info" is already a command
debug() { log $dbg_lvl "DEBUG: $1"; }
log() {
    if [ $verbosity -ge $1 ]; then
        # Expand escaped characters, wrap at 70 chars, indent wrapped lines
        echo -e "$2" | fold -w70 -s | sed '2~1s/^/  /' >&3
    fi
}

I added line wrapping and indenting so users don’t have to manually put in line-breaks to get nicer looking outputs, but there is the drawback that it makes the output harder to grep. Perhaps I should add an option to disable the line wrapping.
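
To make the idea concrete outside of Bash, here is a rough Python sketch of the same logic, including the wrap on/off switch mentioned above. It is not a port of the script: the names `format_message` and `log` and the `wrap` option are my own, textwrap counts the indent as part of the 70 columns (fold does not), and the escape expansion done by echo -e is left out.

```python
import sys
import textwrap

# Same numeric levels as the Bash variables above.
LEVELS = {"NOTE": 0, "ERROR": 1, "WARNING": 2, "INFO": 3, "DEBUG": 4}


def format_message(label, message, verbosity=2, wrap=True, width=70):
    """Return the text to log, or None if the level is filtered out."""
    if verbosity < LEVELS[label]:
        return None
    text = f"{label}: {message}"
    if not wrap:
        return text  # one line per message, easier to grep
    # Wrap at `width` columns and indent continuation lines by two spaces.
    return textwrap.fill(text, width=width, subsequent_indent="  ")


def log(label, message, stream=sys.stderr, **options):
    """Write the message to stream unless its level is filtered out."""
    text = format_message(label, message, **options)
    if text is not None:
        print(text, file=stream)


log("WARNING", "disk usage is high")       # shown at the default verbosity
log("DEBUG", "this is hidden by default")  # filtered out
log("DEBUG", "now visible", verbosity=4)
```

Keeping the formatting separate from the writing makes the wrapping behavior easy to test or switch off.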

Here is a longer example with some simple option parsing (apologies for using getopt; I’m currently looking for a better solution).

Posted in Linux, Software

Lightweight Music

I was getting tired of the bloat and memory usage of Rhythmbox, so I was searching for a new music player/manager. After trying Muine and Decibel and not being totally satisfied, I finally (re)found MPD, the Music Player Daemon.

Once I figured out how to set it up (not too hard, but more work than usual), it’s been my favorite application for playing music. While GUI clients exist, I’ve been enjoying the command-line client MPC, as I can do things like this:

mpc listall | grep Sigur | mpc add

…which greps my whole music library for files with “Sigur” in their path (mpc listall prints file paths, not titles), then adds them to the current playlist.

I also wanted to set up some key combos, so I didn’t have to execute commands in a terminal all the time. I used XBindKeys to bind some keys to mpc commands. For instance, the following bindings toggle play/pause status (with Ctrl + the multimedia play button), skip to the next song (with Ctrl + multimedia next button), stop playback (with Ctrl + Alt + multimedia play button), and show the currently playing song in a notification window (with Super + multimedia play button):

"mpc toggle"
  m:0x14 + c:172
  Control+Mod2 + XF86AudioPlay

"mpc next"
  m:0x14 + c:171
  Control+Mod2 + XF86AudioNext

"mpc stop"
  m:0x1c + c:172
  Control+Alt+Mod2 + XF86AudioPlay

"notify-send -u normal "$(mpc current)""
  m:0x50 + c:172
  Mod2+Mod4 + XF86AudioPlay

And what about memory usage? It’s currently playing a song and only using 8 MB. For comparison, I just fired up Rhythmbox and while idle it uses 47 MB.

Posted in Linux, Software

Tomboy and note-sharing

Other than as a place to quickly jot down ideas, phone numbers, etc., Tomboy’s most common purpose for me is as a study aid for my coursework. The ability to link the different topics, concepts, and people that I learn about is very useful. Of course, I’m not the only student in my classes, and I introduced my study group to Tomboy and they love it. But now we have a new problem: note sharing. Since we are learning the same material, it would make sense if we could sync our notes together. And now we have another problem: we don’t want to share all of our notes, just the relevant ones.

Note-sharing is not a new idea, but it seems not to have been implemented yet. The basic ideas are that users of Tomboy should be able to get notes from others, give others their own notes, possibly sync them, resolve conflicts (as with synchronization), and limit which notes are shared. Nobody seems to have mentioned real-time collaborative editing, but I don’t think it’s a good idea at this point, either.

I think not all that much needs to change to allow sharing. Essentially, one must be able to synchronize subsets of notes, rather than all notes, and synchronize them with different (perhaps multiple) locations. Notebooks have been implemented, and they do a decent job of grouping notes, but they don’t add all that much functionally. I think users should be able to sync single notes, notebooks, or all notes with various locations. By doing so, I could keep all of my notes synchronized with my Ubuntu One account, and also sync my notebook for course notes to another server. My classmates can have an account with this other server, and by syncing with it we are sharing notes. If there’s a conflict, Tomboy should solve it in the simplest way possible (maybe prefix the conflicting lines with the username of the author?), or allow a user to launch an external diffing tool, such as Meld or WinDiff.

I’m sure there are issues I haven’t thought about too hard, such as how to deal with links to notes that don’t exist (e.g. outside of a shared notebook), what happens when users change, say, notebook names, Tomboy version or plugin mismatches, etc. But on the surface it seems that expanding synchronization to allow for syncing subsets of notes will allow for simple sharing.

Posted in Software

Traditional and Simplified Chinese in LaTeX

I’ve been in the habit of using LaTeX’s CJK environment across a whole document to allow me to insert, for example, Japanese anywhere I like. However, if you want to have more than one language (not covered by the same font) in the same document (such as both traditional and simplified Chinese, Japanese and Korean, etc), prepare for trouble. You cannot have one CJK environment for the whole document unless you get a font that can handle all the code points (like this elusive, half-free Cyberbit font that seems to be a pain to install).

I found, however, that there is another (and perhaps the intended) way to do it! You can create a command for each individual CJK environment you will need, and use them as needed. For example:

\newcommand{\zht}[1]{\begin{CJK}{UTF8}{bsmi}#1\end{CJK}}
\newcommand{\zhs}[1]{\begin{CJK}{UTF8}{gbsn}#1\end{CJK}}
\newcommand{\zh}[4]{\zht{#1}/\zhs{#2} (\emph{#3}) ``#4''
}

The \zht command is for Traditional Chinese characters, \zhs is for Simplified, and \zh uses both (e.g., to define a word using both variants in hanzi, a transliteration, and a gloss). For example,

\zh{藝術}{艺术}{\yi4 \shu4}{art}

will produce

藝術/艺术 (yì shù) “art”

This works great, except that you have to use one of these tags every time you want to switch to a different character set.

NB: When using this method the very first line is not displayed. I got around this by having a dummy line near the top of the document. For example:

\zht{}  % Dummy environment to get around display bug.
\zht{藝術} % Now this will be displayed.
Posted in LaTeX

glot

What started as an attempt to make a desktop application for CEDICT turned into an ambitious attempt to create an omniformat dictionary database and interface. glot aims to be both a backend for managing and querying dictionaries of any (electronic) format–even those over network protocols like DICT–and also an intelligent interface for querying a massive amount of data.

Goals of the project include:

  • allowing plugins for import and export formats and display settings
  • allowing users to query not just by word, but also by source and target languages, dictionary, and more
  • running glot as a desktop application, command-line application, or web server

These goals are far-reaching, but likely attainable.

The project is currently run by me and my good friend Mike. We’ve just started, so there isn’t much to show yet, but we’ll be developing along the “release early, release often” mantra, so features will be added incrementally.

Posted in Programming

LaTeX, Python, and CairoPlot

CairoPlot is a Python module that uses the Cairo graphics package to produce great-looking charts easily. The results look really nice and are much simpler to create than many other packages, but up until now it has been suboptimal for use with LaTeX documents. I’ve been talking with the package maintainer and filing bugs about these inadequacies, and my concerns were quickly addressed. See the CairoPlot Launchpad page to view the bugs filed against it.

I also recently found a blog post about embedding Python in LaTeX files [UPDATE: post appears to have gone offline. Here is a backup copy of python.sty, thanks to Steve Checkley]. Using this with CairoPlot, it is easy to put the chart-producing code directly in my .tex files and compile. This removes the extra step of writing separate Python scripts to produce these charts. Here is some sample code from a paper I’m currently writing:

\begin{python}
import cairoplot
cairoplot.vertical_bar_plot(
  'dat/initial-grammar-stats.ps',
  [ [0.87, 0.82], [0.83, 0.50], [0.70, 0.11] ],
  340, 280, background = None, border = 10, grid = True,
  x_labels = ['Parses', 'Generates', 'Generates Original'],
  y_labels = ['%d%%' % i for i in range(0,110,10)],
  y_bounds = (0,1) )
print(r'\includegraphics{dat/initial-grammar-stats.ps}')
\end{python}

I do have some gripes about python.sty, though. I have to remember to run the latex command with the “--shell-escape” option. Also, it produces files for the Python code and for its stderr and stdout output, and I don’t particularly care for the directory getting cluttered up with temporary files. Because of these annoyances, I might forgo the python.sty method and just keep a Python script that generates all the charts.

Posted in Software, Uncategorized