Client authors frequently download entries the wrong way. There are several ways of getting entries from the servers, each suited to a different purpose.
First, support Unicode (UTF-8).
If you write a client and release it at all, it will be used by people who
need Unicode support. LiveJournal.com, and other LiveJournal installations, have a large
community of users that do not necessarily keep their journal in English.
The Russian community is huge, for example, and their journals require Unicode
to post/view the entries.
An example journal backup tool named
jbackup.pl
is available in the SVN repository. It shows how to download entries
and comments from the servers correctly and safely.
In general, there are four methods of downloading entries with the
getevents protocol mode: lastn,
syncitems, one,
and day. These four methods are specified in the
selecttype variable of the getevents
call.
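A minimal sketch of how a client might assemble the arguments for a getevents call. The helper name getevents_request and the "mode" key are assumptions made for illustration. Only selecttype, its four values, lastsync, and ver come from the text above.

```python
# The four selecttype values listed above.
SELECT_TYPES = ("lastn", "syncitems", "one", "day")


def getevents_request(selecttype, **extra):
    """Build the arguments for a getevents call (hypothetical helper)."""
    if selecttype not in SELECT_TYPES:
        raise ValueError(f"unknown selecttype: {selecttype!r}")
    # ver=1 tells the server that this client understands Unicode.
    request = {"mode": "getevents", "ver": 1, "selecttype": selecttype}
    # Mode-specific variables are passed through unchanged.
    request.update(extra)
    return request


print(getevents_request("syncitems", lastsync="2007-01-01 00:00:00"))
```

Rejecting an unknown selecttype on the client side saves a round trip to the server for a request that cannot succeed.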
lastn.
This is most useful when you are showing the user a snapshot of
their recent entries, or when you only need their most recently posted
entry: to verify that the entry you just posted was accepted, or to allow
the user to edit it.
You should not use this mode to download an entire journal. You cannot specify a huge number (such as a number greater than fifty), so this mode cannot return an entire journal unless the journal holds only a few dozen entries.
day.
This is useful for people who are writing calendars and want to get entries on
a day that the user has clicked on. This should be used in conjunction with the
getdaycounts protocol mode to figure out when the user
has posted and then to get entries on that particular date.
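The calendar workflow above can be sketched as follows. The function day_request is hypothetical, the shape of day_counts (a mapping from date to entry count, already parsed from a getdaycounts reply) is an assumption, and the year, month, and day variable names should be checked against the protocol reference before use.

```python
from datetime import date


def day_request(clicked, day_counts):
    """Return getevents arguments for the clicked day, or None if it is empty.

    day_counts maps datetime.date to the number of entries on that date,
    as parsed from a getdaycounts reply (the parsing is not shown here).
    """
    if day_counts.get(clicked, 0) == 0:
        return None  # nothing was posted that day, so skip the request
    return {
        "mode": "getevents",
        "ver": 1,
        "selecttype": "day",
        # year/month/day are assumed variable names for the day mode.
        "year": clicked.year,
        "month": clicked.month,
        "day": clicked.day,
    }


counts = {date(2007, 3, 4): 2, date(2007, 3, 5): 0}
print(day_request(date(2007, 3, 4), counts))
print(day_request(date(2007, 3, 6), counts))  # None
```

Checking the counts first means the client only contacts the server for days that actually have entries.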
You should never use this mode for
enumerating someone's journal and downloading their entries, nor
when you are going to re-upload the data. Always use syncitems for that.
If you do not specify a version (the ver variable), the
server will assume the client does not understand Unicode.
If, for some reason (a non-Unicode client, for example), the server is unable
to send you a particular entry, it will instead send you the text
“(cannot be shown)” in place of the entry's subject and body.
It does not tell you it has done this,
so you may end up thinking that is the user's real entry and overwrite
whatever they had.
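Because the server gives no other signal, a cautious client can check for the stand-in text before re-uploading anything. This is only a sketch: the function names are hypothetical, the "subject" and "event" field names are assumptions about how your client stores an entry, and the exact placeholder wording may differ between installations.

```python
PLACEHOLDER = "(cannot be shown)"  # text quoted above; exact wording may vary


def looks_like_placeholder(entry):
    """True if the subject or body is the server's stand-in text."""
    fields = (entry.get("subject", ""), entry.get("event", ""))
    return any(text.strip() == PLACEHOLDER for text in fields)


def safe_to_resubmit(entry):
    """Refuse to upload an entry that may have lost its real content."""
    return not looks_like_placeholder(entry)


print(safe_to_resubmit({"subject": "Trip notes", "event": "We left at dawn."}))
print(safe_to_resubmit({"subject": "(cannot be shown)", "event": "(cannot be shown)"}))
```

A user could genuinely write this text, so the check can give a false positive. That only costs a skipped upload, which is cheaper than overwriting a real entry.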
one.
When you want to download a handful of entries scattered about, you can use
this mode to get them. It is usually safe to download an entry with this
mode and then to re-submit it to the server. Example: you use
getdaycounts to show a calendar, then you use the
day mode to show entries for that day,
then you use this mode to get the real entry for editing.
syncitems.
If you are trying to download someone's entire journal,
this is the mode to use. This mode is the only way you
can account for edits that the user has made to their entries without using
your client. This is also the most efficient way of downloading entries,
because the server will send you a bunch at a time (say, 100). This
mode is used in conjunction with the appropriately titled
syncitems client protocol mode.
The syncitems client protocol mode returns a list of
events modified/created/deleted after lastsync time, while
getevents using selecttype
syncitems returns the actual events.
The entries are returned in order of modification. So, in 2007 if you go back
and edit an entry from 1999, it will show up when you do a sync and specify a
lastsync of 2007. This is the only way to account for edits
that the user makes on the web site or with another client.
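A toy model of the ordering described above, run entirely on the client side. The item dictionaries (item, action, time) and the action strings are illustrative assumptions about the reply shape, and changed_since is a hypothetical helper that imitates what the server does with lastsync.

```python
def changed_since(sync_items, lastsync):
    """Return items modified after lastsync, oldest modification first."""
    newer = [it for it in sync_items if it["time"] > lastsync]
    return sorted(newer, key=lambda it: it["time"])


journal = [
    # Written in 1999 but edited in 2007, so it carries the 2007 edit time.
    {"item": "L-12", "action": "update", "time": "2007-03-04 09:15:00"},
    {"item": "L-950", "action": "create", "time": "2006-11-30 18:00:00"},
    {"item": "L-951", "action": "create", "time": "2007-01-02 08:30:00"},
]

recent = changed_since(journal, "2007-01-01 00:00:00")
print([it["item"] for it in recent])  # ['L-951', 'L-12']
```

The old entry L-12 appears in the result because the comparison uses the time of the last change, not the date the entry was written for.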
If you want to download and re-submit a particular group of entries, perhaps
within a particular time period, use syncitems.
Download the entire journal, then re-upload the subset you want to change.
A user may have used the site for a few years, writing many entries.
If you download day by day, you will be hitting the server once for every
day that the user has had a journal, whether or not they posted. A day-by-day
download might take over a thousand separate requests, while a full
syncitems download would only be about ten.
Using syncitems substantially reduces the number of hits to the server.
This is considerate, and also means your bot is not likely to get
itself banned for not being smart.
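The request counts above can be checked with a little arithmetic. The dates and the entry count are made-up inputs, and the batch size of 100 is the "say, 100" figure from the text rather than a guaranteed value.

```python
import math
from datetime import date

BATCH_SIZE = 100  # the "say, 100" figure from the text, not a guaranteed value


def day_mode_requests(first_day, last_day):
    """One request per calendar day, whether or not anything was posted."""
    return (last_day - first_day).days + 1


def syncitems_mode_requests(entry_count, batch_size=BATCH_SIZE):
    """One getevents request per batch of entries."""
    return math.ceil(entry_count / batch_size)


print(day_mode_requests(date(2004, 1, 1), date(2006, 12, 31)))  # 1096
print(syncitems_mode_requests(1000))  # 10
```

The second figure counts only the getevents batches. The calls to the syncitems protocol mode that build the item list add a few more requests, which still leaves the total far below the day-by-day figure.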
Here is a pseudo-code example of how to use this mode properly to download someone's entire journal.
send client request “syncitems” with the “lastsync” variable not specified
get list of items back from request, save items into list for processing later
while size_of_list < sync_total {
    find most recent time in list
    call “syncitems” again, but set “lastsync” to most recent time
    push result items onto list
}
iterate through list and remove items that do not start with “L-”
    (L means “log” which is a journal entry)
create hash of journal itemids with data { downloaded => 0, time => whatever sync_X_time was }
while (any item in hash has downloaded == 0) {
    find the oldest “time” in this hash for items that have downloaded == 0
    …decrement this time by one second.
    mark this item as downloaded (so we don't use the same time twice and loop forever)
    send client request “getevents” with selecttype set to syncitems,
        lastsync set to oldest time minus 1 second
    mark each item you get back as downloaded in your hash
    put the entries you got into storage somewhere.
}
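The same procedure can be sketched in Python. This is not the jbackup.pl implementation. The transport function call(mode, **args), the reply keys (syncitems, total, events, itemid, item, time), and the timestamp layout are all assumptions, and make_fake_server is an in-memory stand-in so the sketch can run without a network.

```python
from datetime import datetime, timedelta

TIME_FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def one_second_before(stamp):
    moment = datetime.strptime(stamp, TIME_FMT) - timedelta(seconds=1)
    return moment.strftime(TIME_FMT)


def download_journal(call):
    """Download every entry; call(mode, **args) is a hypothetical transport."""
    # Step 1: page through syncitems until the whole item list is known.
    reply = call("syncitems", ver=1)
    total = reply["total"]
    times = {}  # item name such as "L-12" -> last modification time
    while True:
        batch = reply["syncitems"]
        for item in batch:
            times[item["item"]] = item["time"]
        if not batch or len(times) >= total:
            break
        reply = call("syncitems", ver=1, lastsync=max(times.values()))

    # Step 2: keep journal entries only ("L-" items) and fetch them in batches.
    pending = {name: t for name, t in times.items() if name.startswith("L-")}
    entries = {}
    while pending:
        oldest = min(pending, key=pending.get)
        # Mark the item as handled before the call so it is never picked again.
        lastsync = one_second_before(pending.pop(oldest))
        reply = call("getevents", ver=1, selecttype="syncitems", lastsync=lastsync)
        for event in reply["events"]:
            entries[event["itemid"]] = event
            pending.pop(f"L-{event['itemid']}", None)
    return entries


def make_fake_server(mod_times, batch=3):
    """In-memory stand-in for the server, for illustration only."""
    def call(mode, **args):
        since = args.get("lastsync", "")
        rows = sorted((t, i) for i, t in mod_times.items() if t > since)[:]
        page = rows[:batch]
        if mode == "syncitems":
            items = [{"item": f"L-{i}", "action": "update", "time": t} for t, i in page]
            return {"syncitems": items, "count": len(page), "total": len(rows)}
        return {"events": [{"itemid": i, "event": f"entry {i}"} for t, i in page]}
    return call


fake = make_fake_server({i: f"2007-01-01 00:00:{i:02d}" for i in range(1, 8)})
journal = download_journal(fake)
print(sorted(journal))  # [1, 2, 3, 4, 5, 6, 7]
```

Passing the transport in as a function keeps the loop logic separate from the wire format, so the same loop can be tested against the fake server and later pointed at a real one.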
You will have to call syncitems and
getevents several times each to get the data you need.
This is not a problem if you do it smartly. Also note that the server keeps
track of the times you use when you call getevents, and if
you start specifying the same time repeatedly (infinite loop) then
your client will be given an error message “Perhaps the
client is broken?”, or similar. Last, remember to set
ver to 1!
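One way to catch a runaway loop on the client side, before the server has to complain, is to remember every lastsync value already sent. The class below is a hypothetical helper, not part of any LiveJournal library.

```python
class LastsyncGuard:
    """Remember the lastsync values already sent and reject repeats."""

    def __init__(self):
        self._seen = set()

    def check(self, lastsync):
        if lastsync in self._seen:
            # Sending this again would repeat a request the server has seen.
            raise RuntimeError(f"lastsync {lastsync!r} was already used")
        self._seen.add(lastsync)
        return lastsync


guard = LastsyncGuard()
print(guard.check("2007-01-01 00:00:00"))
print(guard.check("2007-01-01 00:00:03"))
```

Call check on each lastsync value just before sending getevents. A repeated value then stops the client with a local error rather than another request.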